Spam mail is, by definition, mail that has been sent unsolicited and in bulk to many people at one time. A spam filter is software that processes incoming mail and blocks such spam.
Spam filtering software comes in various types, which can be broadly classified as client-side filters and server-side filters.
This filter, written as a Python script, is a server-side spam filter that can be used to filter all mail coming to the server.
The spam filter follows various rules and principles to mark a particular mail as spam. The current filter uses a rule that checks the percentage of HTML content in each line; if the percentage is over 60%, the mail is marked as spam.
However, this may not be a very effective way to either block spam or prevent legitimate mail from being marked as spam.
For example, a plain-text spam mail will easily pass through this filter, while a legitimate newsletter or an HTML-based invitation will be blocked because it contains more than 60% HTML content.
To further improve the effectiveness of the spam filter, the following techniques could be implemented.
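The original script is not shown here, so the following is only a minimal sketch of one possible reading of the 60% rule: it measures how much of each line is taken up by HTML tags and flags the mail if any single line exceeds the limit. The names html_ratio and is_html_spam, the tag-matching regular expression and the "any line" interpretation are all assumptions made for illustration.

```python
import re

# Crude match for anything that looks like an HTML tag (assumption, not a full parser).
TAG_PATTERN = re.compile(r"<[^>]+>")
HTML_THRESHOLD = 0.60  # the 60% limit described in the text


def html_ratio(line):
    """Return the fraction of characters in the stripped line that belong to HTML tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    tag_chars = sum(len(tag) for tag in TAG_PATTERN.findall(stripped))
    return tag_chars / len(stripped)


def is_html_spam(body, threshold=HTML_THRESHOLD):
    """Flag the mail if any single line is more than `threshold` HTML (assumed reading)."""
    return any(html_ratio(line) > threshold for line in body.splitlines())
```

With this reading, is_html_spam("Buy cheap pills now") is False while is_html_spam("<b>hi</b>") is True, which reproduces both weaknesses described above: plain-text spam passes and short, tag-heavy legitimate lines are blocked.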
Black lists:
The spam filter can be programmed to look up a database of known spam email addresses to see whether the sender of a mail is on the black list and, if so, block the mail as spam. Email users can mark a particular sender as a spammer and add the address to the black list; from then on, any mail from that address will be treated as spam.
However, spammers nowadays use many different email addresses so that their mails don’t get blocked. Another approach is to also store the sender’s IP address in the black list, which should make the filter more effective against spammers who keep changing addresses.
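A black list that stores both addresses and IP addresses could be sketched as follows. The Blacklist class, its method names and the use of in-memory sets in place of a real database are assumptions made to keep the example self-contained.

```python
class Blacklist:
    """In-memory stand-in for a black list database (hypothetical helper)."""

    def __init__(self):
        self.addresses = set()
        self.ips = set()

    def report_spammer(self, address, ip=None):
        """Record a sender reported by a user, optionally with the sending IP."""
        self.addresses.add(address.strip().lower())
        if ip:
            self.ips.add(ip)

    def is_blocked(self, address, ip):
        """Block the mail if either the address or the IP has been reported."""
        return address.strip().lower() in self.addresses or ip in self.ips
```

For example, after report_spammer("spam@example.com", "203.0.113.7"), a mail from a new address sent through 203.0.113.7 is still blocked, which is the benefit of storing the IP as well.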
White list:
A white list is the reverse of a black list: valid email addresses or IP addresses are entered into a white list database, and the spam filter checks whether the sender’s email address or IP is listed there and, if so, treats the mail as legitimate. One drawback of such a system is that it allows mail only from known senders and would not let mail from any new source come in.
Subscribing to Real time Spam Black Lists (RBL):
Real-time spam black lists are usually community-driven central databases of known spammers and spam-related activity. RBLs constantly update their databases of known spammers and of the IP addresses of servers from which spam originates. The Python script can subscribe to such a list and block any mail that originates from a listed IP address. This avoids the need to constantly update our local black list with new spammers or spammers who have changed their information. Since RBLs are mostly community driven, they tend to contain up-to-date spam information.
Some of the RBLs we can subscribe to are: Spamhaus Block List (SBL), Arbitrary Black Hole List (ABL), VOX DNSBL etc.
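Many RBLs are published over DNS: the client reverses the octets of the sender's IPv4 address, appends the list's zone name and performs an ordinary lookup, and a successful answer means the address is listed. The sketch below follows that convention. The zone name dnsbl.example.org is a placeholder, not a real list, and the function names are hypothetical; a real deployment would take the zone and its usage policy from the chosen list's documentation.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query: reversed IPv4 octets followed by the list zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the list answers for this IP, False if the name does not resolve."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except OSError:  # socket.gaierror (a subclass of OSError) signals "no such name"
        return False
    return True
```

The resolve parameter exists only so the lookup can be replaced in tests. Some lists return special answer codes for error conditions, so production code would usually inspect the returned address rather than treat every answer as a listing.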
Language Specific Filtering:
The Python script can check the language of each email and, depending on the user’s native language setting, block mails that do not match it. This is suited to blocking mail originating from countries like China and Korea, which account for a large share of the spam a user receives.
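True language detection needs a trained model or a dedicated library, so the sketch below substitutes a much simpler technique: it looks at the writing script of the letters in the mail, using the standard unicodedata module, and checks whether enough of them belong to the scripts the user reads. The function names, the "LATIN" default and the 0.5 cut-off are assumptions for illustration.

```python
import unicodedata


def script_share(text, script_prefixes):
    """Return the fraction of letters whose Unicode name starts with one of the prefixes."""
    prefixes = tuple(script_prefixes)
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(prefixes))
    return hits / len(letters)


def matches_user_language(text, allowed_prefixes=("LATIN",), min_share=0.5):
    """Accept the mail when at least `min_share` of its letters use an allowed script."""
    if not any(ch.isalpha() for ch in text):
        return True  # no letters to judge; leave the decision to other rules
    return script_share(text, allowed_prefixes) >= min_share
```

A user who reads Korean could pass allowed_prefixes=("LATIN", "HANGUL"). Note that this cannot tell apart languages that share a script, such as English and French.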
Country Specific Filtering:
A technique used by some spam filtering software is to identify the countries from which most spam is generated and simply block mail originating from those countries. However, this is far from foolproof, and a very large percentage of legitimate mail ends up blocked or marked as spam.
Content Specific Filtering:
The script can scan the subject and body of each mail and search for words that hint at the email being spam.
The list of keywords can be stored in a database and constantly updated with keywords and phrases based on the nature of the spam being sent out. While implementing a content-specific filter, it’s important to list each keyword along with its variations. For example, spammers advertising Viagra often misspell the word on purpose to bypass spam filters.
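One way to cover deliberate misspellings without listing every variation is to normalise the text before matching. The sketch below is an assumption-laden illustration: the substitution table, the two-word keyword set and the function names are all hypothetical, and a real filter would load its keywords from the database mentioned above.

```python
import re

# Characters spammers commonly swap in for letters (illustrative, not exhaustive).
SUBSTITUTIONS = str.maketrans(
    {"1": "i", "!": "i", "@": "a", "4": "a", "0": "o", "3": "e", "$": "s"}
)
KEYWORDS = {"viagra", "discount"}  # hypothetical keyword list


def normalize(text):
    """Lower-case the text, undo common character swaps and drop separator punctuation."""
    text = text.lower().translate(SUBSTITUTIONS)
    return re.sub(r"[^a-z\s]", "", text)


def find_keywords(subject, body):
    """Return the sorted list of known keywords found in the subject or body."""
    words = set(normalize(subject + " " + body).split())
    return sorted(words & KEYWORDS)
```

Here find_keywords("Cheap V1@gra", "Big d!sc0unt today") returns ['discount', 'viagra'], and a spelling such as "V.i.a.g.r.a" is caught because the dots are removed during normalisation. The trade-off is that digits in ordinary text are rewritten too, so the normalised text should be used only for matching.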
Blocking Anonymous Sender:
The script can read the header information of each mail, and any mail where the sender’s name is absent or appears to be obscured can be marked as spam. In most cases a spammer doesn’t want to reveal his true identity and hence tries various ways to hide information such as his email address or server IP address. The script can look for such obscured headers and block those mails as spam.
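A header check of this kind can be sketched with Python's standard email package. What counts as "obscured" is a judgment call; the version below assumes only the simplest case, a From header that is missing or does not contain a plausible address, and the function name has_anonymous_sender is hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def has_anonymous_sender(raw_mail):
    """Return True when the From header is missing or holds no usable address."""
    message = message_from_string(raw_mail)
    _display_name, address = parseaddr(message.get("From", ""))
    if "@" not in address:
        return True
    local_part, _, domain = address.rpartition("@")
    # Assumption: a real sender has a non-empty local part and a dotted domain.
    return not local_part or "." not in domain
```

A mail starting with "From: Alice <alice@example.com>" passes, while one with no From header, or with "From: undisclosed", is flagged.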
Rule Based Scoring System:
This is one of the most popular methods of filtering out spam. The technique involves defining a set of rules and allocating a score to each occurrence. For example, the script can allocate +2 points for every occurrence of the word “Discount” or +5 points for every occurrence and variation of the word “Viagra”. After running through all the rules and calculating the points each mail scores, the script can decide whether the mail should be marked as spam.
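The scoring idea can be sketched in a few lines. The two rules use the point values from the example in the text; the regular expression for Viagra variations and the threshold of 5 points are assumptions, since no cut-off is stated.

```python
import re

# (pattern, points) pairs; the point values follow the example in the text.
RULES = [
    (re.compile(r"discount", re.IGNORECASE), 2),
    (re.compile(r"v[i1!][a@]gr[a@]", re.IGNORECASE), 5),  # assumed set of variations
]
SPAM_THRESHOLD = 5  # hypothetical cut-off, to be tuned against real mail


def score_mail(text):
    """Add up the points for every occurrence of every rule."""
    return sum(points * len(pattern.findall(text)) for pattern, points in RULES)


def is_spam(text, threshold=SPAM_THRESHOLD):
    """Mark the mail as spam once its total score reaches the threshold."""
    return score_mail(text) >= threshold
```

With these rules, "Discount! Huge discount on V1agra" scores 2 + 2 + 5 = 9 and is marked as spam, while a mail that mentions a discount once scores 2 and passes.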
Using Bayes’ Theorem:
Bayesian filtering is emerging as one of the most reliable methods of preventing spam. Bayes’ theorem lets the filter estimate the probability that a mail is spam given the words it contains, based on how often those words appeared in spam and in legitimate mail in the past. Adding a Bayesian algorithm to the Python script is more involved, because the filter would need to be trained separately for individual mailboxes. Practical results have shown that a spam filter based on Bayes’ theorem improves over time and can effectively block up to 99% of spam while marking only a very low percentage of legitimate mail as spam.
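As a rough illustration, a naive Bayes classifier (the simplest filter built on Bayes' theorem) can be written with the standard library alone. This is a sketch under stated assumptions, not a production design: the class name, whitespace tokenisation and add-one smoothing are all choices made here, and one instance would be trained per mailbox on mails its user has labelled.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Per-mailbox word-frequency model (hypothetical helper)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from one mail labelled 'spam' or 'ham' (legitimate)."""
        self.word_counts[label].update(text.lower().split())
        self.mail_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words) using log-probabilities and add-one smoothing."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how common this class has been so far.
            log_score = math.log((self.mail_counts[label] + 1) / (total_mails + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Smoothed likelihood of the word within this class.
                log_score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            log_scores[label] = log_score
        # Convert the two log scores back into a normalised probability.
        highest = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - highest)
        ham = math.exp(log_scores["ham"] - highest)
        return spam / (spam + ham)
```

Each call to train shifts the word statistics, which is why such a filter improves as the user keeps labelling mail. An untrained filter returns 0.5, meaning it has no opinion yet.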
Thus, as the methods and techniques of sending spam evolve, the spam filter will need to combine one or more of these methods to keep spam out of mailboxes. Beyond that, the spam filter needs to evolve constantly and be scalable enough to incorporate new features or rules to prevent spam effectively.