Looking through the logs of my web server last night I noticed some odd behaviour: a complete spidering of tamonten.com in a couple of minutes, all from one IP address. No biggy, it's just a bot, but the UA string was that of a normal browser and at no point did it request robots.txt.
After a bit of analysis I find that the IP address belongs to Cyveillance Inc, a "Cyber Intelligence" company in Washington, and they seem to have a /27 range. A quick log grep later and I find they're spidering me about once a month, but at four seconds per pass it doesn't look like they're stopping to read anything ( ; _ ; ). Interestingly, well for me at least, they grab the full pages every time, never checking for changes. It seems bandwidth is cheaper than storage for them.
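For anyone wanting to do the same kind of grep on their own logs, here's a minimal sketch of the range check using Python's stdlib `ipaddress` module. The `154.41.2.0/27` block is a made-up stand-in, not Cyveillance's actual range; substitute whatever whois turns up for you.

```python
# Does a request IP fall inside the suspect /27? (32 addresses: .0 through .31)
# The network below is a placeholder for illustration only.
import ipaddress

suspect_net = ipaddress.ip_network("154.41.2.0/27")

def from_suspect_range(ip: str) -> bool:
    """True if the given dotted-quad falls inside the suspect block."""
    return ipaddress.ip_address(ip) in suspect_net

print(from_suspect_range("154.41.2.17"))  # inside the /27
print(from_suspect_range("154.41.2.40"))  # .32 and up fall outside a /27
```

Feed it each client IP from your access log and count the hits that land in the block.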
Now I can't see them being up to no good; in fact they've got some interesting stuff on their blog, and if they're doing some sort of proactive malware scan (a la MS's HoneyMonkey) I can understand the UA masquerading and not just using the If-Modified-Since header.
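For the curious, the polite-spider version is just a conditional GET: send If-Modified-Since and let the server answer 304 Not Modified instead of shipping the whole page again. A quick sketch with Python's urllib (the URL and date are stand-ins; the request is built here but not actually sent):

```python
# Build a conditional GET request; a server that honours it replies
# 304 Not Modified when the page hasn't changed, saving the body transfer.
from urllib.request import Request

req = Request(
    "http://tamonten.com/",
    headers={"If-Modified-Since": "Sat, 29 Oct 1994 19:43:31 GMT"},
)
# urllib normalises header names to "If-modified-since" internally.
print(req.get_header("If-modified-since"))
```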
I know I'm not doing anything nefarious here, so as far as I'm concerned they're wasting my bandwidth. In my digging it seems a lot of people are of the same opinion.
Which is where I get to the point of this post.
Http:BL is a nifty service from Project Honeypot that makes me get a little geek boner for many reasons. 1) Distributed collection of behaviour is always good. 2) The security community seem to have written plugins for everyfuckingthing. 3) The way you get the data, over DNS, is _inspired_.
It's a classic using-one-thing-for-a-totally-different-thing-while-keeping-all-the-pros-of-the-first-thing type of thing. Downstream caching: check. Built-in expiry time for the data: check. Don't have to prat about with extra firewall rules for updates: check.
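To show why the DNS trick is so neat, here's a sketch of the lookup as I read the Project Honeypot docs: you resolve `<your access key>.<visitor IP, octets reversed>.dnsbl.httpbl.org`, and a listed visitor comes back as a fake A record `127.<days since last activity>.<threat score>.<visitor type bitfield>`. The access key below is a placeholder; a real one comes from signing up with the project.

```python
# Shape of an http:BL lookup (no network calls here; resolving the name
# is just socket.gethostbyname(name), and NXDOMAIN means "not listed").
ACCESS_KEY = "abcdefghijkl"  # placeholder key for illustration

def httpbl_query_name(visitor_ip: str) -> str:
    """Build the hostname to resolve for a visitor's IP, DNSBL-style."""
    reversed_ip = ".".join(reversed(visitor_ip.split(".")))
    return f"{ACCESS_KEY}.{reversed_ip}.dnsbl.httpbl.org"

def decode_httpbl(answer: str) -> dict:
    """Unpack a 127.x.y.z answer into its three fields."""
    first, days, threat, visitor_type = (int(o) for o in answer.split("."))
    if first != 127:
        raise ValueError("not an http:BL response")
    return {
        "days_since_last_activity": days,
        "threat_score": threat,
        # bitfield: 1 = suspicious, 2 = harvester, 4 = comment spammer
        "is_harvester": bool(visitor_type & 2),
        "is_comment_spammer": bool(visitor_type & 4),
    }

print(httpbl_query_name("192.0.2.55"))
# → abcdefghijkl.55.2.0.192.dnsbl.httpbl.org
print(decode_httpbl("127.3.62.6"))  # type 6 = harvester + comment spammer
```

Because the answer is an ordinary DNS record, every resolver between you and them caches it, and the record's TTL gives the data a built-in expiry for free.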