For most of my business-related mail, I use Outlook 2000 on a Windows 2000 box. I had installed the Spambayes client for Outlook 2000 to combat increasing amounts of spam, stemming from one stupid email to a Usenet group. While Spambayes did a wonderful job of detecting spam, I still had to download the garbage to my Windows client before the spam could be detected. Also, if I was checking my personal email via the web, I had difficulty seeing the real mail amongst all the spam.
What I needed was a POP3 client that would check the POP server
every so often, and delete the spam before I even saw it. Since the
hydrus web and mail server was up 24x7, I had a platform to support
this putative POP3 spam purger. I searched on the web, but couldn't
find a ready-made solution. [N.B. I subsequently noticed that
Spambayes came with an sb_imapfilter.py script, which
offered the same kind of facilities I was after.]
To start development and testing, I downloaded Spambayes onto my
Debian-based desktop machine. The installation was slightly
complicated because it required the Python distutils
modules. These are not installed as part of the standard Debian
Python package, but come with the additional python23-dev
package.
Writing a Python script to act as a POP3 client seemed easy enough,
using the Python-supplied poplib
module; the key was how
easy it would be to integrate the spam detection capabilities of
Spambayes.
As it turned out, it was remarkably easy; a testament to Python and the developers of Spambayes. I thought that it would not be possible to copy the training database from Windows to UNIX and use it. I was wrong - it worked just fine. For the purposes of research, I did create a training database from scratch using the existing set of ham and spam messages from Outlook. See below.
The first time I tested the POP client, I found that the scores reported by Spambayes were not as high as I expected. I tracked this down to the fact that the downloaded mails did not begin with an appropriate Unix-style "From" line. Once I forced this line onto the front of each mail, before passing it to Spambayes, the scores were more reasonable.
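As a rough illustration of that fix, the sketch below joins the lines of a retrieved message and puts a "From" line in front of them. It is written for Python 3, where poplib hands back each line as bytes; the function name add_mbox_from_line and the placeholder sender "xxx" are made up for the example, and the latin-1 decoding is just an assumption that avoids decode errors.

```python
import time


def add_mbox_from_line(lines, sender="xxx", when=None):
    """Return the message as one string, preceded by a 'From ' line.

    `lines` is a list of bytes objects, such as the second element of
    the tuple returned by poplib's retr(). `sender` is a placeholder.
    """
    if when is None:
        when = time.localtime()
    # e.g. "From xxx Thu Jan  1 00:00:00 1970"
    header = "From %s %s" % (sender, time.asctime(when))
    # latin-1 maps every byte to a character, so decoding cannot fail
    body = [line.decode("latin-1") for line in lines]
    return "\n".join([header] + body)


if __name__ == "__main__":
    sample = [b"Subject: hello", b"", b"Just testing."]
    print(add_mbox_from_line(sample))
```

The resulting string is what would be handed to the Spambayes scorer in place of the raw message text.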
The next task was to port the solution to FreeBSD. There didn't
seem to be a port for Spambayes, so I copied across the installation
tar file from the Debian machine. There were no problems this time with
missing distutils modules. I copied the spam database I
had created under Debian, and started testing. One immediate
problem surfaced: Python could not open the spam database,
complaining that the module _bsddb could not be found. It
appeared I needed to install another port to support the BSD
database formats, databases/py-bsddb, which offers a set
of Python wrappers for the BSD databases. Note that this is an older
version of the Python wrappers; the newer version can be found in
databases/py-bsddb3. The database was in the older format because
I had used Python 2.3 on Debian.
Once the Python script seemed to work, all that remained was to create a crontab to run it at regular intervals (every 4 hours to start with).
(Again, sb_imapfilter.py seems to offer just this capability.)
Here's the current source code for the popdespam.py script. Help yourself if you find it useful.
    #!/usr/local/bin/python
    """
    NAME
        popdespam.py

    SYNOPSIS
        popdespam.py [-n] [-v]

        -n  do not actually delete any spam
        -v  verbose mode

    DESCRIPTION
        Deletes spam from a POP3 server, using scoring provided by Spambayes.
        A summary of the messages found and deleted is displayed.
    """

    import getopt
    import time
    import poplib
    import sys
    from spambayes import hammie

    #########################################
    # key parameter settings
    #########################################
    server=""        # name of pop server
    username=""      # pop server username
    password=""      # pop server password
    spamdb=""        # location of spambayes database
    verbose=False
    dodelete=True
    seen_file=""     # file to store last message number seen
    max_size=100000  # ignore messages greater than this number of bytes
    #########################################

    # read command line options (if any)
    try:
        opts,args = getopt.getopt(sys.argv[1:],'vn')
        for o,v in opts:
            if o == '-v':
                verbose = True
            elif o == '-n':
                dodelete = False
    except getopt.GetoptError,e:
        print "%s: illegal argument: %s" % (sys.argv[0],e.opt)
        sys.exit(1)

    h = hammie.open(spamdb)

    p = poplib.POP3(server)
    status = p.user(username)
    if verbose: print status
    status = p.pass_(password)
    if verbose: print status
    stat = p.stat()
    if verbose: print "# Messages:",stat[0],"; # bytes",stat[1]
    nmsgs = stat[0]

    # get highest message seen in last run
    try:
        seen = int(open(seen_file).read())
    except IOError:
        seen = 0

    # if # of messages in mailbox now is less than seen, assume mailbox
    # has been emptied since last run, therefore all messages must be
    # scanned
    if nmsgs < seen:
        seen = 0
    if verbose: print "last seen:",seen

    msg_list = p.list()[1]
    if verbose: print msg_list

    ndel = 0
    current_time = time.asctime(time.localtime(time.time()))

    try:
        for i in msg_list:
            msg_index = i.split()
            i = int(msg_index[0])
            size = int(msg_index[1])
            if i <= seen: continue
            if size > max_size: continue
            msgt = p.retr(i)
            if verbose: print "Message:",i,"; # bytes:",msgt[2],
            msg = "From xxx "+current_time
            for line in msgt[1]:
                msg = msg+"\n"+line
            spamprob = h.score(msg)
            if verbose: print "Score:",spamprob
            if spamprob>.9 and dodelete:
                ndel += 1
                p.dele(i)
    except poplib.error_proto,e:
        print "[%s] (%s) poplib error: %s" % (current_time,server,e)
        p.quit()
        sys.exit(0)

    status = p.quit()
    if verbose: print status
    print "[%s] (%s) %4d msgs; %4d deleted" % (current_time,server,nmsgs,ndel)

    # save seen message number
    open(seen_file,mode="w").write("%d" % (nmsgs - ndel,))
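The rules the script uses to decide which messages to download (skip anything already seen, skip anything over the size limit, rescan everything if the mailbox has shrunk) can be pulled out into a small function that is easy to test without a POP server. The following is only a sketch in Python 3, not a drop-in replacement: the function name messages_to_scan is hypothetical, and it uses the length of the listing where the script uses the message count from stat().

```python
def messages_to_scan(listing, seen, max_size):
    """Pick the message numbers worth downloading and scoring.

    `listing` holds entries of the form b"<number> <size>", such as the
    second element of the tuple returned by poplib's list().
    """
    # Fewer messages than last time: assume the mailbox was emptied
    # since the last run and start again from the first message.
    if len(listing) < seen:
        seen = 0
    selected = []
    for entry in listing:
        fields = entry.split()
        number = int(fields[0])
        size = int(fields[1])
        # Skip messages scored on a previous run, and oversized ones.
        if number <= seen or size > max_size:
            continue
        selected.append(number)
    return selected


if __name__ == "__main__":
    print(messages_to_scan([b"1 500", b"2 200000", b"3 700"], 1, 100000))
```

Keeping this decision separate from the network calls means the bookkeeping around the "seen" file can be checked with plain lists.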
Since my existing training database was on Windows, I decided to see how easy it would be to create a new one on the Debian machine, using the mail messages from my Windows mail client. It was relatively easy (if tedious) to export all the Outlook messages as text files and transfer them to Debian. However, the signature used to detect the start of a mail message in Unix mailboxes, "^From name Day Mon nn HH:MM:SS Year" (where "^" indicates start of line), did not exist in the mail exported from Outlook. I had to spend some time creating the right headers, using Emacs and the existing information in the file.
To create a training database, use a command of the form:
sb_mboxtrain.py -d sb.db -g ham.txt -s spam.txt
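The header-fixing step could probably be scripted instead of done by hand. The sketch below uses Python 3's standard mailbox module, which writes its own "From" separator line in front of each message it adds. The function name build_mbox is hypothetical, the messages are assumed to be available as bytes, and it is an assumption that sb_mboxtrain.py accepts the resulting file.

```python
import mailbox
import os
import tempfile


def build_mbox(message_texts, mbox_path):
    """Append each message (bytes) to an mbox file at `mbox_path`.

    The mailbox module supplies the 'From ' separator line for any
    message that does not already start with one.
    """
    box = mailbox.mbox(mbox_path)
    try:
        for text in message_texts:
            box.add(text)
        box.flush()
    finally:
        box.close()
    return mbox_path


if __name__ == "__main__":
    # Two tiny stand-ins for messages exported from the mail client.
    exported = [b"Subject: a\n\nfirst\n", b"Subject: b\n\nsecond\n"]
    target = os.path.join(tempfile.mkdtemp(), "ham.txt")
    build_mbox(exported, target)
    with open(target, "rb") as handle:
        print(handle.read().decode("ascii"))
```

The separator carries a generic sender and the current time rather than the original envelope details, which should not matter if the file is only used for training.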