[ILUG] SA statistics again

Niall O Broin niall at linux.ie
Sat Feb 5 01:19:24 GMT 2005


Thanks to contributions from a couple of people the other day I came up with 
this little script to produce a small report on the Bayes DB:



echo Spam Assassin Bayes Statistics
echo ""
echo Bayes Token Count
echo "Total     Ham     Spam"
sa-learn --dump |awk '{count += 1; if ($1 > 0.5) spam+=1; \
if ($1 < 0.5) ham+=1} END {print count "\t" ham "\t" spam}'
echo ""
echo -n "Number of ham messages learnt from: "
sa-learn --dump magic |awk '/nham/ {print $3}'
echo -n "Number of spam messages learnt from: "
sa-learn --dump magic |awk '/nspam/ {print $3}'

which runs at the end of a script which sa-learns spam placed in folders by 
humans during the day. After doing its nightly run, it reported as follows:

Spam Assassin Bayes Statistics

Bayes Token Count
Total   Ham     Spam
140114  78443   61671

Number of ham messages learnt from: 2109
Number of spam messages learnt from: 1387


I then fed sa-learn something over 1000 pieces of ham, and now the same script 
gives me:

Spam Assassin Bayes Statistics

Bayes Token Count
Total   Ham     Spam
153518  10      153508

Number of ham messages learnt from: 2850
Number of spam messages learnt from: 0


AARGH! What the hell has happened there? It has apparently forgotten about 
ALL the spam messages it ever learnt from and, conversely, 78000 ham tokens 
have become spam tokens.
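
One detail in that report: the Total (153518) is exactly 10 more than the Spam 
count (153508), and plain sa-learn --dump prints the non-token "magic" lines as 
well, each with 0.000 in the first column. So the 10 "ham" tokens may not be 
tokens at all. A minimal sketch of a count that compares the first field 
numerically and skips the magic lines follows. It reads dump text on stdin so 
it can be tried on a saved dump; the function name count_tokens is made up for 
illustration.

```shell
# count_tokens: hypothetical helper, reads "sa-learn --dump" output on stdin
# and prints "total<TAB>ham<TAB>spam" for real tokens only.
count_tokens() {
    awk '
        /non-token data:/ { next }   # skip the magic lines
        { count++ }                  # every remaining line is one token
        $1 > 0.5 { spam++ }          # first field is the token probability
        $1 < 0.5 { ham++ }
        END { printf "%d\t%d\t%d\n", count, ham, spam }
    '
}
```

It would be used as: sa-learn --dump data | count_tokens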

Straight sa-learn --dump magic now gives

0.000    0          3          0  non-token data: bayes db version
0.000    0          0          0  non-token data: nspam
0.000    0       2850          0  non-token data: nham
0.000    0     153508          0  non-token data: ntokens
0.000    0 1091609393          0  non-token data: oldest atime
0.000    0 1107564300          0  non-token data: newest atime
0.000    0 1107564852          0  non-token data: last journal sync atime
0.000    0 1107564590          0  non-token data: last expiry atime
0.000    0    1382400          0  non-token data: last expire atime delta
0.000    0      17827          0  non-token data: last expire reduction count

whereas sa-learn --dump magic from the databases as of 19:00 last night 
(retrieved from the warm standby box) gives

0.000    0          3          0  non-token data: bayes db version
0.000    0       1342          0  non-token data: nspam
0.000    0       2096          0  non-token data: nham
0.000    0     138010          0  non-token data: ntokens
0.000    0 1106096390          0  non-token data: oldest atime
0.000    0 1107544172          0  non-token data: newest atime
0.000    0 1107538029          0  non-token data: last journal sync atime
0.000    0 1107478750          0  non-token data: last expiry atime
0.000    0    1382400          0  non-token data: last expire atime delta
0.000    0       5589          0  non-token data: last expire reduction count
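
To make the difference between the two dumps easier to see, here is a rough 
sketch that prints the change in nspam, nham and ntokens between two saved 
"sa-learn --dump magic" outputs. The function name magic_diff and the file 
names are placeholders, not anything SpamAssassin provides.

```shell
# magic_diff OLD NEW: hypothetical helper. OLD and NEW are files holding
# saved "sa-learn --dump magic" output; prints old -> new and the delta.
magic_diff() {
    awk '
        /non-token data:/ && $NF ~ /^(nspam|nham|ntokens)$/ {
            name = $NF
            if (FNR == NR)
                old[name] = $3       # still reading the first (old) file
            else
                printf "%-8s %10d -> %10d (%+d)\n", name, old[name], $3, $3 - old[name]
        }
    ' "$1" "$2"
}
```

Run as, say, magic_diff standby-magic.txt live-magic.txt. On the two dumps 
above the deltas work out to -1342 for nspam, +754 for nham and +15498 for 
ntokens.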



Can anyone shed any light on this?




-- 
Niall


