[ILUG] SA statistics again
Niall O Broin
niall at linux.ie
Sat Feb 5 01:19:24 GMT 2005
Thanks to contributions from a couple of people the other day I came up with
this little script to produce a small report on the Bayes DB:

echo Spam Assassin Bayes Statistics
echo ""
echo Bayes Token Count
echo "Total Ham Spam"
sa-learn --dump |awk '{count += 1; if ($0 > 0.5) spam+=1; \
if ($0 < 0.5) ham+=1} END {print count "\t" ham "\t" spam}'
echo ""
echo -n "Number of ham messages learnt from: "
sa-learn --dump magic |awk '/nham/ {print $3}'
echo -n "Number of spam messages learnt from: "
sa-learn --dump magic |awk '/nspam/ {print $3}'

which runs at the end of a script which sa-learns spam placed in folders by
humans during the day. After doing its nightly run, it reported as follows:

Spam Assassin Bayes Statistics

Bayes Token Count
Total Ham Spam
140114 78443 61671

Number of ham messages learnt from: 2109
Number of spam messages learnt from: 1387

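One thing worth noting about the token-count step in the script: it compares the whole line ($0) with 0.5, which awk would normally treat as a string comparison because the line as a whole is not a number, and it counts every line of the dump, the "non-token data" lines included. A minimal sketch of a stricter variant follows. It assumes that each token line of the dump starts with the token's probability in column 1 and that the metadata lines all contain the text "non-token data"; the function name bayes_token_counts is made up for illustration.

```shell
#!/bin/sh
# Sketch only: count Bayes tokens by the probability in column 1.
# Assumption: token lines start with a probability, and metadata lines
# contain the text "non-token data".
bayes_token_counts() {
    awk '
        /non-token data/ { next }      # skip the metadata ("magic") lines
        { count++ }                    # every remaining line is one token
        $1 > 0.5 { spam++ }            # numeric comparison on the first field
        $1 < 0.5 { ham++ }             # tokens at exactly 0.5 count as neither
        END { printf "%d\t%d\t%d\n", count, ham, spam }
    '
}

# Intended use on a live system:
#   sa-learn --dump | bayes_token_counts
```

Because tokens at exactly 0.5 are counted in the total but in neither column, the ham and spam figures from this variant need not add up to the total.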
I then fed sa-learn something over 1000 pieces of ham, and now the same script
gives me:

Spam Assassin Bayes Statistics

Bayes Token Count
Total Ham Spam
153518 10 153508

Number of ham messages learnt from: 2850
Number of spam messages learnt from: 0

AARGH! What the hell has happened there? It has apparently forgotten about ALL
the spam messages it ever learnt from and, on top of that, some 78000 ham
tokens have become spam tokens.

Straight sa-learn --dump magic now gives

0.000 0 3 0 non-token data: bayes db version
0.000 0 0 0 non-token data: nspam
0.000 0 2850 0 non-token data: nham
0.000 0 153508 0 non-token data: ntokens
0.000 0 1091609393 0 non-token data: oldest atime
0.000 0 1107564300 0 non-token data: newest atime
0.000 0 1107564852 0 non-token data: last journal sync atime
0.000 0 1107564590 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 17827 0 non-token data: last expire reduction count

whereas sa-learn --dump magic from the databases as of 19:00 last night
(retrieved from the warm standby box) gives

0.000 0 3 0 non-token data: bayes db version
0.000 0 1342 0 non-token data: nspam
0.000 0 2096 0 non-token data: nham
0.000 0 138010 0 non-token data: ntokens
0.000 0 1106096390 0 non-token data: oldest atime
0.000 0 1107544172 0 non-token data: newest atime
0.000 0 1107538029 0 non-token data: last journal sync atime
0.000 0 1107478750 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 5589 0 non-token data: last expire reduction count

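To see at a glance which of the metadata fields differ between two such dumps, something along these lines might help. It is only a sketch: it assumes both dumps have been saved to files, that the value of interest is always in column 3 (as in the nham and nspam lines used by the report script), and that the field name is whatever follows "non-token data: ". The function name compare_magic is invented here.

```shell
#!/bin/sh
# Sketch only: list the metadata fields whose value differs between two
# saved "sa-learn --dump magic" outputs.
# Assumption: the value is in column 3 and the field name follows the
# text "non-token data: " on each line.
compare_magic() {
    awk '
        /non-token data: / {
            value = $3                         # remember the value first
            sub(/^.*non-token data: /, "")     # $0 is now just the field name
            if (FNR == NR) {
                old[$0] = value                # first file: store old value
            } else if ((old[$0] "") != (value "")) {
                print $0 ": " old[$0] " -> " value
            }
        }
    ' "$1" "$2"
}

# Intended use, with hypothetical file names:
#   compare_magic magic-standby.txt magic-now.txt
```

Given the standby dump as the first file and the current one as the second, it prints one line per changed field in the form "nspam: 1342 -> 0" and stays silent for fields that are identical in both.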
Can anyone shed any light on this?
--
Niall