Deleting spam from a POP3 server

For most of my business-related mail, I use Outlook 2000 on a Windows 2000 box. I had installed the Spambayes client for Outlook 2000 to combat increasing amounts of spam, all stemming from one stupid email to a Usenet group. While Spambayes did a wonderful job of detecting spam, I still had to download the garbage to my Windows client before it could be detected. Also, when I checked my personal email via the web, I had difficulty seeing the real mail amongst all the spam.

What I needed was a POP3 client that would check the POP server every so often and delete the spam before I even saw it. Since the hydrus web and mail server was up 24x7, I had a platform to support this putative POP3 spam purger. I searched the web, but couldn't find a ready-made solution. [N.B. I subsequently noticed that Spambayes came with an sb_imapfilter.py script, which offered the same kind of facilities I was after, though for IMAP rather than POP3.]

To start development and testing, I downloaded Spambayes onto my Debian-based desktop machine. The installation was slightly complicated because it required the Python distutils modules. These are not installed as part of the standard Debian Python package, but come with the additional python23-dev package.

Writing a Python script to act as a POP3 client seemed easy enough, using the Python-supplied poplib module; the key was how easy it would be to integrate the spam detection capabilities of Spambayes.

As it turned out, it was remarkably easy; a testament to Python and the developers of Spambayes. I thought that it would not be possible to copy the training database from Windows to UNIX and use it. I was wrong - it worked just fine. For the purposes of research, I did create a training database from scratch using the existing set of ham and spam messages from Outlook. See below.

The first time I tested the POP client, I found that the percentages reported by Spambayes were not as high as I expected. I tracked this down to the fact that the downloaded mails did not start with an appropriate "From " line, the line that marks the start of each message in a UNIX mailbox. Once I forced this line onto the front of each mail before passing it to Spambayes, the scores were more reasonable.
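
As an illustration of that fix, here is a minimal sketch of joining the lines returned by a POP3 RETR into a single string led by a "From " line. It is written for Python 3, where poplib returns bytes, and is not the code from the script itself; the function name with_from_line and the sender and when parameters are hypothetical names chosen for this example.

```python
import time

def with_from_line(lines, sender="xxx", when=None):
    """Join POP3 RETR lines into one string, led by an mbox-style "From " line.

    `lines` is the list of message lines as returned by poplib's retr().
    Under Python 3 these are bytes, so they are decoded leniently here.
    `sender` and `when` are placeholders; `when` is a struct_time or None.
    """
    stamp = time.asctime(when) if when is not None else time.asctime()
    text_lines = [
        line.decode("latin-1") if isinstance(line, bytes) else line
        for line in lines
    ]
    return "\n".join(["From %s %s" % (sender, stamp)] + text_lines)
```

The resulting string is what would be handed to the scorer. The sender name is a dummy value; only the shape of the line matters.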

The next task was to port the solution to FreeBSD. There didn't seem to be a port for Spambayes, so I copied across the installation tar file from the Debian machine. No problems this time with missing distutils modules. I copied the spam database I had created under Debian, and started testing. One immediate problem surfaced: Python could not open the spam database, complaining that the module _bsddb could not be found. It appeared I needed to install another port to support the BSD database formats: databases/py-bsddb, which offers a set of Python wrappers for the BSD databases. Note that this is an older version of the Python wrappers; the newer version can be found in databases/py-bsddb3. The database was in the older format because I had created it with Python 2.3 on Debian.

Once the Python script seemed to work, all that remained was to create a crontab entry to run it at regular intervals (every 4 hours to start with).

To Be Done

Popdespam.py source code

Here's the current source code for the popdespam.py script. Help yourself if you find it useful.

#!/usr/local/bin/python
"""
    NAME
        popdespam.py

    SYNOPSIS
        popdespam.py [-n] [-v]

        -n    do not actually delete any spam
        -v    verbose mode

    DESCRIPTION
        Deletes spam from a POP3 server, using scoring
        provided by Spambayes.  A summary of the messages found and
        deleted is displayed.

"""
import getopt
import time
import poplib
import sys
from spambayes import hammie

#########################################
# key parameter settings
#########################################
server=""            # name of pop server
username=""          # pop server username
password=""          # pop server password
spamdb=""            # location of spambayes database
verbose=False
dodelete=True
seen_file=""         # file to store last message number seen
max_size=100000      # ignore messages greater than this number of bytes
#########################################

# read command line options (if any)
try:
    opts,args = getopt.getopt(sys.argv[1:],'vn')
    for o,v in opts:
        if o == '-v':
            verbose = True
        elif o == '-n': dodelete = False
except getopt.GetoptError,e:
    print "%s: illegal argument: %s" % (sys.argv[0],e.opt)
    sys.exit(1)

h = hammie.open(spamdb)
p = poplib.POP3(server)

status = p.user(username)
if verbose: print status
status = p.pass_(password)
if verbose: print status

stat = p.stat()
if verbose: print "# Messages:",stat[0],"; # bytes",stat[1]
nmsgs = stat[0]

# get highest message seen in last run
try:
    seen = int(open(seen_file).read())
except (IOError, ValueError):
    # no seen file yet, or its contents are not a number
    seen = 0

# if # of messages in mailbox now is less than seen, assume mailbox
# has been emptied since last run, therefore all messages must be
# scanned
if nmsgs < seen:
    seen = 0

if verbose: print "last seen:",seen

msg_list = p.list()[1]
if verbose: print msg_list

ndel = 0
current_time = time.asctime(time.localtime(time.time()))
try:
    for i in msg_list:
        msg_index = i.split()
        i = int(msg_index[0])
        size = int(msg_index[1])
        if i <= seen: continue
        if size > max_size: continue

        msgt = p.retr(i)
        if verbose: print "Message:",i,"; # bytes:",msgt[2],
        # prepend an mbox-style "From " line, then join the message lines
        msg = "\n".join(["From  xxx "+current_time] + msgt[1])
        spamprob = h.score(msg)
        if verbose: print "Score:",spamprob
        if spamprob>.9 and dodelete:
            ndel += 1
            p.dele(i)
except poplib.error_proto,e:
    print "[%s] (%s) poplib error: %s" % (current_time,server,e)
    p.quit()
    sys.exit(1)   # non-zero exit status to signal the failure

status = p.quit()
if verbose: print status
print "[%s] (%s) %4d msgs; %4d deleted" % (current_time,server,nmsgs,ndel)
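# save seen message number (skipped on a -n dry run, so that a later
# real run still scans these messages)
if dodelete:
    f = open(seen_file,"w")
    f.write("%d" % (nmsgs - ndel,))
    f.close()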
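
The selection logic in the script (skip anything already seen, skip anything too large, and start again from scratch if the mailbox has shrunk) can be pulled out into one small function that is easy to test without a POP server. This is only a sketch written for Python 3, not part of the script above; the function name messages_to_scan is hypothetical, and it assumes the LIST lines have the usual "number size" form.

```python
def messages_to_scan(list_lines, seen, max_size):
    """Return the message numbers that still need to be scored.

    `list_lines` is the second element of poplib's list() response,
    one "number size" entry per message (bytes under Python 3).
    `seen` is the highest message number handled by the previous run.
    """
    # Fewer messages than last time: assume the mailbox was emptied,
    # so every message has to be scanned again.
    if len(list_lines) < seen:
        seen = 0

    wanted = []
    for entry in list_lines:
        fields = entry.split()
        number, size = int(fields[0]), int(fields[1])
        if number <= seen:
            continue          # already handled by an earlier run
        if size > max_size:
            continue          # too large to be worth downloading
        wanted.append(number)
    return wanted
```

For example, with three messages on the server of which the second is oversized, a previous seen value of 1 leaves only message 3 to be scored.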

Training a UNIX version of Spambayes with Windows mail messages

Since my existing training database was on Windows, I decided to see how easy it would be to create a new one on the Debian machine, using the mail messages from my Windows mail client. It was relatively easy (if tedious) to export all the Outlook messages as text files and transfer them to Debian. However, the signature used to detect the start of a mail message in UNIX mailboxes, "^From name Day Mon nn HH:MM:SS Year" (where "^" indicates start of line), did not exist in the mail exported from Outlook. I had to spend some time creating the right headers by hand, using Emacs and the information already in each file. To create a training database, use a command of the form:

  sb_mboxtrain.py -d sb.db -g ham.txt -s spam.txt
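
I added the separator lines by hand in Emacs, but the same job could be scripted. The following is a rough sketch in Python 3, assuming each exported message is available as a string that already starts with its mail headers; the function name to_mbox and the sender and when parameters are hypothetical. It puts a "From name Day Mon nn HH:MM:SS Year" line in front of each message and quotes any body line that would itself look like a separator.

```python
import time

def to_mbox(messages, sender="name", when=None):
    """Concatenate message strings into UNIX mailbox (mbox) text.

    Each message gets a leading "From <sender> <date>" separator line.
    `sender` and `when` are placeholder values; `when` is a struct_time
    or None for the current time.
    """
    stamp = time.asctime(when) if when is not None else time.asctime()
    parts = []
    for text in messages:
        # Normalise Windows line endings and drop trailing blank lines.
        body = text.replace("\r\n", "\n").rstrip("\n")
        # Quote lines starting with "From " so they are not mistaken
        # for the start of another message.
        lines = [
            ">" + line if line.startswith("From ") else line
            for line in body.split("\n")
        ]
        parts.append("From %s %s\n%s\n\n" % (sender, stamp, "\n".join(lines)))
    return "".join(parts)
```

The text returned for the ham messages would be written to ham.txt and the text for the spam messages to spam.txt, ready for the command above.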