Blog Verification

Posted by Justin on October 22, 2006
Computers, The "I figured it out" Dept., Websites

I noticed a spam comment on one of my newest stories and was a little surprised, so I looked into my logs and figured out a few things…

3.54 (Score)
0.5: Comment has no URL in content (but one author URL)
1.04: Commenter granularity (based on URL): 3 old comment(s) (karma avg: 5.180000), 0 recent comment(s) (karma avg: 0.000000).
2: Trackback Source Site (http://webtekconcepts.com/2006/10/17/google-code-search-krugle-or-koders/trackback/) does contain Blog URL domain (webtekconcepts.com).

5 hours, 49 minutes
2006-10-17 10:00:43GMT
848290 Blog Verifica
Author: 848290 Blog Verification
E-mail:
IP: 82.146.98.203
URL: http://webtekconcepts.com/2006/10/17/google-code-search-krugle-or-koders/trackback/
Google Code Search, Krugle or Koders…. 848290 Blog Verification… 848290…

Traceroute:
xxxxxxxhost [/xxxxxxxxxxxxx]# traceroute 82.146.98.203
traceroute to 82.146.98.203 (82.146.98.203), 30 hops max, 38 byte packets
1…..[Removed]
2…..[Removed]
3…..[Removed]
4 g3-4.core01.iad01.atlas.cogentco.com (69.31.30.2)
5 v3496.mpd01.dca01.atlas.cogentco.com (154.54.5.45)
6 t2-4.mpd03.jfk02.atlas.cogentco.com (154.54.6.14)
7 g13-0-0.core02.jfk02.atlas.cogentco.com (154.54.5.233)
8 p2-0.core01.lon01.atlas.cogentco.com (66.28.4.190)
9 p4-0.core01.bru01.atlas.cogentco.com (130.117.1.157)
10 XS4ALL.demarc.cogentco.com (130.117.19.42)
11 ge-1-0-22-0703.bru-ix-d02.ipv4.evonet.be (82.146.112.18)
12 195.144.72.173 (195.144.72.173)
13 195.144.72.205 (195.144.72.205)
14 195.144.72.206 (195.144.72.206)
15 082-146-098-203.dyn.adsl.xs4all.be (82.146.98.203)

Whois: EVONET Belgium Internet routing

I did a lot of searching and found others who had the same problem, but their spam had odd domains with ######.ro in them. Mine didn’t, which was a little weird. Tracing all the IPs listed on different forums showed that one provider, Evonet in Belgium, is the underlying cause. The culprit is apparently one of their DSL customers running software that checks out blogs.

Digging into the comment a little more (i.e., actually getting into MySQL and looking over the data), the comments are trackbacks with a user agent of WWW-Mechanize/1.20. WWW::Mechanize is a Perl module for scripted web browsing that is commonly used for screen scraping (a Python port exists as well).
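If you would rather not eyeball the rows in MySQL, the same check can be sketched in a few lines of Python. This is only an illustration: the rows below are made-up sample data, and the `agents_for` helper and the (type, user agent) layout are assumptions, not the actual comment table.

```python
from collections import Counter

# Hypothetical sample rows as (comment_type, user_agent).
# These are NOT real rows from the blog's database.
rows = [
    ("trackback", "WWW-Mechanize/1.20"),
    ("trackback", "WWW-Mechanize/1.20"),
    ("comment", "Mozilla/5.0"),
]

def agents_for(rows, comment_type):
    """Count user agents among rows of the given comment type."""
    return Counter(agent for kind, agent in rows if kind == comment_type)

trackback_agents = agents_for(rows, "trackback")
# Print the agents behind trackbacks, most frequent first.
print(trackback_agents.most_common())
```

If one agent dominates the trackbacks the way WWW-Mechanize did here, that is a good hint they all come from the same tool.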

Now, how does this affect you? Screen scraping usually comes from a scraping site. This is where a site scrapes yours, puts your content on theirs and calls it good. This may not seem that bad until you find out they’ve got some type of advertising on their site. Their whole purpose is to steal the traffic that is supposed to come to your website. They sign up for AdSense and the like in hopes of grabbing enough content to make their page rank higher than yours when someone searches.

Oddly, there is a movement to stop this type of crap by logging the sites that are Made For Advertising (MFA). Check out http://www.topmfa.com/

As of yet, I still have not found another site with the exact same content as any of my posts. On top of this, the URL is all messed up in most cases, which tells me that someone is either still building up content or has really misconfigured the software.

Oddly enough, screen scraping and MFA blogs are linked to my recent experiment on link blogs that automatically link to you when tags they want show up.

There are a few ways to get rid of this kind of crap. The easiest is to block the IP range 82.146.98.*. That will stop the blog verification stuff cold; I haven’t had any make it through since. You can also block the user agent WWW-Mechanize, but I wouldn’t do this, since people actually use it for valid stuff and you could alienate some of your visitors. Oddly, my stats rate Mechanize as Unknown, but Spam Karma picks it up… Either way, whichever you choose, the spam will be gone until something changes.
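For anyone who wants to do the IP check in their own code rather than in the server config, here is a minimal sketch using Python's standard `ipaddress` module. It assumes the 82.146.98.* wildcard is the same thing as the network 82.146.98.0/24; the `is_blocked` name is just something I made up for the example.

```python
import ipaddress

# Assumption: the "82.146.98.*" wildcard written as a /24 network.
BLOCKED_NETWORK = ipaddress.ip_network("82.146.98.0/24")

def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside the blocked range."""
    try:
        return ipaddress.ip_address(ip) in BLOCKED_NETWORK
    except ValueError:
        # Not a parseable IP address, so it cannot match the range.
        return False

print(is_blocked("82.146.98.203"))   # the spammer's address from the log
print(is_blocked("82.146.112.18"))   # an Evonet router outside the range
```

Note that this only covers the one /24. Other addresses at the same provider, like the router at hop 11 of the traceroute, are not matched, which is the point: the block stays as narrow as possible.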

An alternate way: some people have figured out that CAPTCHAs work really well for preventing this type of stuff. I’m not into them, and at the moment, I don’t mind people putting in their own websites, sigs, etc. Maybe later, when I start getting more traffic, I’ll change my mind.