
Blog Spam Solution

June 22nd, 2005

I should provide a wrap-up on how the blog spam is being handled, in case this could help others with an onerous amount of spam on their sites. I seem to have found a solution that allows for most spam to be blocked or moderated. So far, despite thousands of attempts, not one spam comment or trackback has gotten through the defenses on this site since they were erected. It’s still early in the game, but I have a feeling that this solution will work. I should note that this applies only to people using Movable Type to administer their blogs.

First, a word about the different kinds of spam. To begin with, there is blog comment spam. This is usually generated by marketers whose automated scripts seek out the comment-posting scripts used by blogs, and then use them to generate hundreds or even thousands of spam-laden comments on the blog. The spam takes the form of links, either in the comment body or in the URL link allowed under the visitor’s name at the bottom of the comment. The links are meant not just to bring people from the blog to the advertising site but, just as much if not more, to generate “Google Juice,” as more external links to a site will boost its all-important ranking in search engine result lists.

Second, there is TrackBack spam. A “TrackBack” is when someone refers to someone else’s blog post in a posting on their own site; their blog software will send a ‘ping’ to the other person’s site, informing them that their post was referenced in another blog. Often blog software will create lists of TrackBacks with a link back to the source. Spammers, again trying to get Google Juice wherever they can, send out fake pings so they can get free links again.

And finally, there is referral spam (also called “referrer spam,” often spelled “referer spam”). People who keep blogs like to know what attention they’re getting. Their web hosts usually provide statistical data, showing how many people visit their site, which pages they go to, and which sites are sending people to theirs via links. For example, if macsurfer.com creates a link on their page to one of my blog posts, people will follow that link to my site. Their arrival brings with it a string of data about them, including which site they came from via a link. This is a referral, which is counted by the stats on my site. I can see how many people followed the link from MacSurfer. Some people, happy to receive links, will obligingly set up top-referrer lists on their web pages, with links back to the nice people who sent visitors their way, as a way of saying ‘thanks.’ Spammers again take advantage of this by generating bogus automated ‘visits’ that carry falsified link data, making it look like hundreds of visitors came to the blog from the spam site, when in reality, no link and no visitors actually exist. But this faked referral spam will get them on the referrer list, and give them even more Google Juice.

Referral spam is loathed by site owners because it goes a long way toward making site statistics useless. After all, if you want to know who is really visiting your site and you see thousands of hits from texas-holdem, porn, pharmaceutical and other commercial sites, you know the stats aren’t telling you the true story. With all that spam, it’s next to impossible to know exactly how much traffic is legitimate. Fortunately, a lot of referral spam originates from the same IP address, and so even hundreds of spams will likely count as only a few “visitors” in the main part of your stats, making them a bit more reliable, though still altered and uncertain. But the list of referrers, the list of sites with links to yours, is utterly trashed.

Oftentimes these tricks will not even get the spammers what they want; bloggers have more or less stopped posting top-referrer and trackback listings. But spamming in massive quantities is so cheap that spammers don’t care if the return is minimal; it’s still worth their few bucks because it will likely earn them enough to pay for the spamming and then some. And for the spammers, turning up the volume doesn’t cost much more, if anything at all, and so the volume of spam being sent keeps increasing. As bloggers try to block the spam, spammers think of new ways around the defenses.

The current programs to block and filter spam usually work from blacklists. However, a blacklist only knows what you’ve banned in the past; it can’t know what a new spammer looks like until you tell it, unless the spammer is really obnoxious. So the new spammer, or the ever-changing new face of the spammer, is allowed through the defenses at first, before the blog admin catches him and puts him on the blacklist. Once he is on the blacklist, future attempts to spam are automatically blocked. The new apps add a few tricks to detect spam, and so the new solutions mean that very little spam, if indeed any, gets through to appear on the blog at all, even for a short time.

So, what are the solutions?

First: MT-Blacklist. This is a long-standing spam filter that I’ve used for more than a year now. MT-Blacklist maintains a list of text strings (either words, strings with wild cards, or URLs) that the user has designated as spamworthy. Every new comment and trackback ping that comes in to the blog gets checked against this list, and is blocked if a match is found. Furthermore–and this is what wowed me the first time I used it–it has the ability to “de-spam,” which means that even if new spam has gotten past the filters, if you add a text string from the offending spam messages, MT-Blacklist could access all the archived comments and purge all the spam from there as well. These two combined features–blocking and de-spamming–made controlling spam completely possible.
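
As a rough illustration of the two features described above (this is a sketch of the general idea, not MT-Blacklist’s actual code), the Python below checks text against a list of patterns and then “de-spams” an archive of stored comments. The function names, the sample patterns, and the sample comments are all made up for the example.

```python
import re

def compile_blacklist(patterns):
    """Compile each blacklist entry as a case-insensitive regular expression."""
    return [re.compile(p, re.IGNORECASE) for p in patterns]

def is_spam(text, blacklist):
    """Return True if any blacklist pattern occurs anywhere in the text."""
    return any(rx.search(text) for rx in blacklist)

def despam(comments, blacklist):
    """Split already-stored comments into (kept, purged) using the current blacklist."""
    kept = [c for c in comments if not is_spam(c, blacklist)]
    purged = [c for c in comments if is_spam(c, blacklist)]
    return kept, purged

# Hypothetical entries: a plain word, a wildcard-style pattern, and a domain.
blacklist = compile_blacklist([r"texas-holdem", r"cheap-\w+-pills", r"spam-example\.com"])

# Hypothetical archive of comments that were posted before the entries were added.
archive = [
    "Nice post, thanks for the write-up!",
    "Play Texas-Holdem now at http://spam-example.com/",
    "Buy cheap-blue-pills today",
]
kept, purged = despam(archive, blacklist)
```

Blocking is just `is_spam` applied to each incoming comment or ping; de-spamming is the same check run again over what has already been stored, which is why adding one string can clean out a whole batch.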

Unfortunately, the de-spamming feature is missing in the new version. Movable Type itself has a comment filter in version 3, which allows you to single out comments by a common email, name, or IP address, after which you can auto-check all of them and de-spam. This is useful, but only if a large batch of first-time spam that gets through shares the same email, name, or IP address across its comments. All too often, that is not the case. If Movable Type could add a filter for strings inside the comment text itself, that would plug that hole and bring back full de-spamming power.

However, as the author of MT-Blacklist himself pointed out here in a recent comment, MT-Blacklist, especially when used in conjunction with SpamLookup, blocks so much of the spam that de-spamming may never be needed. Still, it would be nice to have around just in case.

The new MT-B has some handy new features that make catching that spam possible. One of them is the ability to filter comments with too many URLs; a common practice among spammers is to leave a comment with dozens of links. Since no real commenters do this, it’s an excellent way to filter out the spammers. MT-B will also now block comments or trackbacks which are duplicates of prior entries, another good filter against spam. MT-B will also automatically update the blacklist from the master blacklist for the app, meaning you get protection even before the spam reaches your site.
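
The URL-count and duplicate checks are simple enough to sketch. The Python below is only an illustration of the technique, not how MT-B implements it; the threshold of three URLs, the URL pattern, and the class and function names are assumptions chosen for the example.

```python
import re

# Simplified URL pattern, good enough for counting links in a comment body.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
MAX_URLS = 3  # hypothetical threshold; a real plug-in would make this configurable

def too_many_urls(text, limit=MAX_URLS):
    """Return True if the text contains more than `limit` URLs."""
    return len(URL_RE.findall(text)) > limit

def normalize(text):
    """Lower-case and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())

class DuplicateFilter:
    """Remembers comment bodies already seen and flags repeats."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, text):
        key = normalize(text)
        if key in self._seen:
            return True
        self._seen.add(key)
        return False
```

A comment that trips either check would be blocked or held for moderation rather than posted.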

The second element to block spam, as I mentioned above, is SpamLookup. A new MT plug-in, it does something similar to MT-Blacklist, using IP addresses of spammers from a master list to help block spam. This method is called DNSBL. This plug-in also uses a good number of other defenses, including multiple-URL blocks, domain blacklist checks, and your own blacklist of text strings to force-moderate or block a comment or trackback ping, similar to MT-Blacklist.
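
For the curious, a DNSBL lookup works by reversing the octets of the visitor’s IP address, appending the blacklist’s DNS zone, and asking DNS whether that name exists. The sketch below shows the general mechanism in Python; it is not SpamLookup’s code, the zone name `dnsbl.example.org` is a placeholder, and the function names are invented for the example.

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNSBL hostname for an IPv4 address: reversed octets plus the zone."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % ip)
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """Return True if the DNSBL zone has an entry for this IP address.

    A successful lookup means "listed". Any resolver failure, including a
    plain "no such name" answer, is treated here as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

# Hypothetical zone name; substitute the DNSBL service you actually use.
query = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Treating every lookup failure as “not listed” errs on the side of letting comments through, which seems the safer default when moderation is there to catch the remainder.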

I have both MT-Blacklist and SpamLookup installed. I am not sure which one filters before the other, but I suspect MT-Blacklist does. It catches pretty much all the comment spam. SpamLookup then gets all the trackback spam. Together, they make it extremely unlikely that much spam will get through. And what does get through–maybe one out of every thousand or so spams–gets moderated, where I can give it the boot manually, adding it to the blacklist–which brings us to:

Third, the blog software itself, Movable Type 3.1. The upgrade from version 2 to 3 brought with it very important blog spam safeguards: moderated posts and comment registration. Before, even with MT-Blacklist doing the efficient job it did, the first-time spams I mentioned before would get through, and sometimes stay up on my site for hours before I could spot them and take them down, especially if they hit during the nighttime here in Japan. Post moderation means that no spam gets through to the blog at all; I might have to moderate it if it doesn’t get filtered, but at least it never gets posted publicly, which is a big thing if you want to de-incentivize spammers. In addition, the TypeKey registration allows commenters to post comments without having to wait for moderation, putting the onus of spam control partially on that side of the equation, and allowing legitimate commenters to bypass the moderation in a way spammers can’t.

So these tools combined make it possible to block pretty much all of the comment and trackback spam. What remains is the referral spam. As yet, there is no fix-all for this problem, but it appears that a patch for the statistics program “AwStats” (my favorite) is doing a fairly good job at keeping most of the junk out. This patch will not de-spam your stats, but it will block any incoming referrer spam by using the MT-B blacklist file as a filter. The spammers may hit your site, but the hits don’t get counted in AwStats.
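
The idea behind the patch can be sketched as follows. This is an illustration in Python rather than the patch itself (AwStats is a Perl program), and it assumes a blacklist file with one pattern per line and `#` starting a comment; check your own blacklist.txt before relying on that. The function names and sample data are made up.

```python
import re

def load_patterns(lines):
    """Turn blacklist lines into compiled patterns, skipping blanks and # comments."""
    patterns = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            patterns.append(re.compile(entry, re.IGNORECASE))
    return patterns

def count_referrers(referrers, patterns):
    """Count hits per referrer, leaving out any referrer that matches the blacklist."""
    counts = {}
    for ref in referrers:
        if any(rx.search(ref) for rx in patterns):
            continue  # blacklisted referrer: never reaches the stats
        counts[ref] = counts.get(ref, 0) + 1
    return counts

# Hypothetical blacklist contents and referrer hits.
patterns = load_patterns([
    "# sample blacklist",
    "texas-holdem",
    "",
    r"spam-example\.com  # fake casino site",
])
counts = count_referrers(
    [
        "http://macsurfer.com/",
        "http://texas-holdem.spam-example.com/",
        "http://macsurfer.com/",
        "http://www.spam-example.com/pills",
    ],
    patterns,
)
```

As with the patch, this only filters what comes in from now on; hits already counted in the stats stay there.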

After having used it for a few weeks now, it seems to be doing its job. In May, before I got the patch, the top 25 spammers left more than 10,000 referral spams in my logs. In the first five days of June, the top 26 referrers were all spammers, and had hit me a total of 867 times between those top spammers, averaging out to about 175 new top-26 spams each day. Since the patch was installed, only 21 of the top 26 are spammers (5 legitimate referrers got back on the list!), and those spammers have a total of only 1434 hits–a total of 567 new hits over the course of 16 days, or just 35 new top-26 spams each day, meaning the major referral spam has been cut by about 80%. Some referral spam is getting through, but mostly it’s just the first-time stuff, before it gets added to the blacklist. Next month, with the patch in place from the beginning, I will be interested to see how little spam gets through at all.
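
The arithmetic behind that “about 80%” figure, using only the numbers quoted above, works out like this (the variable names are just labels for the example):

```python
# Figures quoted in the post.
hits_before_patch = 867      # top-26 spam hits in the first five days of June
days_before_patch = 5
hits_total_now = 1434        # top-26 spam hits as of this writing
days_since_patch = 16

new_hits = hits_total_now - hits_before_patch      # hits added since the patch
rate_before = hits_before_patch / days_before_patch
rate_after = new_hits / days_since_patch
reduction = 1 - rate_after / rate_before           # fraction of spam cut

print(f"new hits since patch: {new_hits}")
print(f"per day before: {rate_before:.1f}, after: {rate_after:.1f}")
print(f"reduction: {reduction:.0%}")
```

That gives 567 new hits, roughly 173 per day before the patch against roughly 35 per day after, which is where the 80% comes from.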

The down side to the AwStats patch (aside from the fact that you need AwStats in the first place) is that it has very little documentation. Because of this, it was unclear to me how the patch would be applied. AwStats itself is a server-level app, meaning that I can’t patch the Perl script that makes AwStats run. However, it seems that only the “.conf” file for AwStats needs to be changed. Again, the project site is unclear, but seems to indicate that a tool called “GNU Patch” is needed to install it. How that works, I have no idea. All I know is that the patch must be worked into the AwStats .conf file, and that the pathway to the MT-B blacklist.txt file must also be added… somewhere.

Fortunately, a support technician at my web host (might as well give Surpass Hosting a plug, seeing as how they did me a favor) helpfully volunteered to install the patch, and apparently did the job right, as it seems to be working. Unless you know how to use the GNU patch tool and stuff like that, you might want to ask your web host for the same favor. The file that needs to be patched is titled “awstats.domain.com.conf” (replace “domain.com” with your domain name and extension), located in “tmp/awstats” in your home directory. The pathway to the MT-Blacklist file is probably “/home/youraccountname/public_html/blacklist.txt”, though you should check that first. MT-B v. 1.6 allowed you to see the pathway, but I can’t seem to find the pathway listing in version 2.
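
Whoever ends up editing the .conf file, it is worth keeping a copy of the original first. A minimal sketch in Python, assuming you have shell or script access to the account; the `.bak` suffix and the function name are arbitrary choices for the example, and the path in the comment is a placeholder to adjust.

```python
import shutil
from pathlib import Path

def backup_conf(conf_path):
    """Copy the file to '<name>.bak' in the same directory and return the backup path."""
    src = Path(conf_path)
    dst = src.with_name(src.name + ".bak")
    shutil.copy2(src, dst)  # copy2 also preserves the file's timestamps
    return dst

# Example call (placeholder path; replace domain.com with your own domain):
# backup_conf(Path.home() / "tmp" / "awstats" / "awstats.domain.com.conf")
```

If the patched file misbehaves, copying the `.bak` file back over it restores the original settings.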

So, to sum up: using Movable Type 3.1, MT-Blacklist 2, SpamLookup, and a patch for AwStats, I’ve been able to solve–finally!–most of my spam headaches.

Now to wonder about what the spamming scumbags will come up with next…

Categories: BlogTech
  1. June 22nd, 2005 at 22:32 | #1

    Luis, thank you for taking the time to write this exhaustive and extremely useful post about blog spam. I’ve installed SpamLookup but unfortunately I cannot install the awstats patch since my host maintains that. Oh well..

  2. BlogD
    June 22nd, 2005 at 22:40 | #2

    Roy: what do you mean by “my host maintains that”? Have you tried to FTP your account, go into the directory above “public_html” and open the “tmp” directory, then the “awstats” directory? The .conf file should be there.

    Also, have you tried asking your host to patch it for you? Their tech people probably understand how that should be done. Or are they the type to pretty much ignore anything but the basic keep-you-on-the-air maintenance?

  3. June 23rd, 2005 at 02:22 | #3

    Good lord! You are correct. I’ll give it a go later on. Thanks!

  4. BlogD
    June 23rd, 2005 at 02:26 | #4

    Just be sure to save a backup of the original .conf file, just in case! FTP as ASCII text of course.
