Mail Matching Research

Statement of Problem

Yes, there's an awful lot of unsolicited email, and most of it comes from email addresses that have been found by crawling the web and extracting them from old and current documents. Our good-natured willingness to share information has made us victims of unsolicited emails. After a few years of using kjw@rightsock.com, I currently get about 10 of these messages per day. Below, I describe a fairly effective, albeit computationally expensive, method of discovering these messages.

Suggested solutions

Some suggest that you change your email address every few months or years, but that's terribly inconvenient. I'm also just plain stubborn and do not want to change; it would cause me to lose contact with distant friends, because they may not get the message that I've changed my email address.

Some also suggest that you not put your email address anywhere that some system might pick it up. This is also impractical for those of us who want to participate in various venues of public discourse. Furthermore, it does not help when the mailing lists themselves get spammed. The bug-cvs@gnu.org list is a huge example of this - it probably gets 2-3 completely inappropriate unsolicited emails per day.

Some also use various public services like hushmail, but I don't trust anyone else to go through my email.

Some pay to have the problem managed for them - Earthlink has a hugely active and marketed feature of their service that tackles this problem. However, they aren't my ISP, nor would I want them. They don't provide any of the services that I want to use; I'm not a generic end user. This is a good solution, but just not for me.

Existing solutions

Mail header analysis

Several systems look at the mail headers and try to figure out whether an email was really meant for the recipient. I created a set of header analysis filters, but they only catch about 30-50% of the unwanted messages. The order of the rules is very important, and is also partially to blame for why my particular filters do not catch very much. Here's what I use:

This works relatively well, but the headers keep changing and these rules require a lot of maintenance. Too much for my busy schedule these days.

Proposed mail content matching algorithm

Step one. Trick the senders of these emails into sending messages to a separate email address. This address should NEVER be used for anything legitimate, because everything it receives will be used as input to the mail removal algorithm. For example, I use trawler@rightsock.com.

Step two. Use the incoming mails as a database. Edit them into the 'internal' direct comparison format. This means stripping off things that make the algorithm break easily and that probably won't make a difference to the match, such as leading ">"'s and various symbols. Then run the bodies of the messages through 'fmt -1', which leaves one word per line, and store them.
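A minimal Python sketch of this normalization step follows. It stands in for the 'fmt -1' pipeline rather than reproducing it. The function name normalize_body is hypothetical, and two choices are my own assumptions rather than anything stated above: "various symbols" is taken to mean every non-word character, and words are lowercased.

```python
import re
from typing import List

def normalize_body(body: str) -> List[str]:
    """Reduce a message body to a list of bare words, in order."""
    words = []
    for line in body.splitlines():
        # Drop leading quote markers such as ">", ">>" or "> >".
        line = re.sub(r"^[>\s]+", "", line)
        for token in line.split():
            # Assumption: "various symbols" means all non-word characters.
            token = re.sub(r"\W", "", token)
            if token:
                # Assumption: case is ignored when comparing.
                words.append(token.lower())
    return words

def to_stored_form(body: str) -> str:
    """One word per line, the same shape that 'fmt -1' produces."""
    return "\n".join(normalize_body(body))
```

Storing one word per line is what lets a line-by-line comparison tool behave as a word-by-word one.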

Step three. Apply the database against the desired mailbox. Use diff (a line-by-line comparison algorithm), and convince it to output only one symbol per whitespace-separated word: a + for a word that is in file A but not in B, a - for a word that is in B but not in A, or an = for a word that matches identically in both files. Count the characters, and you can calculate how sequentially similar a message in the inbox is to a message in the database.
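The comparison can be sketched in Python as below. This swaps in difflib.SequenceMatcher from the standard library for the diff command, so its matching may differ from diff's in edge cases. The function name similarity is hypothetical, and the exact ratio is an assumption: the text above says only to count the characters, so here the score is the share of the inbox message's characters that sit in words matched, in order, against the database message.

```python
from difflib import SequenceMatcher

def similarity(db_words, inbox_words):
    """Return a score in [0, 1] for two lists of words.

    Assumption: score = characters in matched inbox words
                        / all characters in the inbox message.
    """
    total = sum(len(w) for w in inbox_words)
    if total == 0:
        return 0.0
    # autojunk=False so frequent words are not silently ignored.
    matcher = SequenceMatcher(None, db_words, inbox_words, autojunk=False)
    matched = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":  # the "=" case: words identical in both
            matched += sum(len(w) for w in inbox_words[j1:j2])
    return matched / total
```

Dividing by the inbox message's length rather than the combined length means a short legitimate reply quoting a long unwanted message would still score high; whether that is desirable is a design choice left open here.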


I tried this with a test set of about 1000 messages, and I was amazed that this simple algorithm resulted in a COMPLETELY bi-modal distribution. Either the messages were 0-30% alike or 90-100% alike. There were virtually no messages in the middle (fewer than 10!). One could very easily set the threshold at 70% and catch virtually everything.
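To check whether another mailbox shows the same gap, the scores can be split at the threshold and the middle band counted. This is only a sketch: both function names are hypothetical, scores are assumed to be fractions between 0.0 and 1.0, and the default bounds simply restate the 30%, 70% and 90% figures above.

```python
def split_by_threshold(scores, threshold=0.70):
    """Partition scores into (below threshold, at or above threshold)."""
    low = [s for s in scores if s < threshold]
    high = [s for s in scores if s >= threshold]
    return low, high

def count_in_gap(scores, lo=0.30, hi=0.90):
    """Count the scores that fall strictly between the two modes."""
    return sum(1 for s in scores if lo < s < hi)
```

If count_in_gap stays small relative to the number of messages, any threshold inside the gap should separate the two groups about equally well.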

The problem is that this method requires input, and that input may come after the database has been applied to the mailbox. Therefore, the process must be reapplied. But to try and do this to the entire mailbox (my main inbox typically runs 200-600 messages backed up) is incredibly slow. Therefore, one must incrementally reapply the algorithm in two ways:

  1. When a message gets added to the database, apply only that message to the mailboxes. This filters after the fact of receipt, and means there will be a delay in removal, but much activity happens overnight, so by morning, the cleanup should have already happened.
  2. When a message gets added to the mailbox, compare just that one message to the database. Therefore, a list of all messages already processed needs to be kept. IMAP may be of some help here, since it assigns a unique ID to each message in a mailbox.
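The two incremental cases might be organized as in the sketch below. The class and method names are hypothetical, everything is held in memory, and the comparison is passed in as a function so the sketch does not depend on any particular scoring. A real version would persist the processed IDs, for example by keying on the IDs an IMAP server hands out.

```python
class IncrementalMatcher:
    """Keeps the database and mailbox in step without full re-scans."""

    def __init__(self, score, threshold=0.70):
        self.score = score        # callable(db_msg, inbox_msg) -> 0.0..1.0
        self.threshold = threshold
        self.database = []        # messages received at the trap address
        self.mailbox = {}         # unique ID -> message already processed
        self.flagged = set()      # IDs judged to be unwanted

    def add_to_database(self, db_msg):
        """Case 1: apply only the new database entry to the mailbox."""
        self.database.append(db_msg)
        for uid, msg in self.mailbox.items():
            if uid in self.flagged:
                continue
            if self.score(db_msg, msg) >= self.threshold:
                self.flagged.add(uid)

    def add_to_mailbox(self, uid, msg):
        """Case 2: compare only the new message to the whole database."""
        if uid in self.mailbox:
            return  # already processed, nothing to redo
        self.mailbox[uid] = msg
        if any(self.score(d, msg) >= self.threshold for d in self.database):
            self.flagged.add(uid)
```

Each arrival costs one pass over the other collection, instead of comparing every database entry against every mailbox message again.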

Have I done anything more than test out my theory? No, since I don't yet have a reliable input. I hope that by publishing the address trawler@rightsock.com here on this page, some crawler will pick it up, and over the next few months the database should start to get fed. I could also reply to known unsolicited emails, masquerading as the trawler account, but that takes time that I haven't decided to dedicate yet. Anyway, once the database gets a good, regular, and maintenance-free continuous feeding, I will start putting it to use.

Further needs

MIME, honeypot, more

last modified - 2002.12.06 kjw
created - 2002.12.06 kjw