Check for fake googlebot scrapers
Wed, 10/05/2011 - 12:13 - sandip
I noticed a bot scraping the site using a fake Googlebot user-agent string.
Here is a one-liner that can detect the IPs to ban:
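$ awk 'tolower($0) ~ /googlebot/ {print $1}' /var/www/httpd/access_log | grep -v '^66\.249\.71\.' | sort | uniq -c | sort -n
It does a case-insensitive awk search for the keyword "googlebot" in the Apache log file, drops IPs that start with "66.249.71." (a range that belongs to Google), and prints the remaining IPs with their hit counts, sorted in ascending order. The dots in the grep pattern are escaped and the pattern is anchored so that it only matches that literal prefix. Google also crawls from other ranges, so treat the output as a list of candidates and validate each IP before banning it.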
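The same count can be wrapped in a small function so the log path and the trusted prefix are not hard-coded. This is only a sketch: the function name googlebot_hits and its arguments are made up for illustration, and it assumes the client IP is the first field of each log line, as in the one-liner above.

```shell
# Hypothetical helper: count "googlebot" requests per client IP.
# Usage: googlebot_hits LOGFILE [TRUSTED_PREFIX]
# Assumes the client IP is the first whitespace-separated field.
googlebot_hits() {
    log=$1
    prefix=${2:-66.249.71.}
    awk -v prefix="$prefix" '
        # Case-insensitive match on the whole line; skip IPs that begin
        # with the trusted prefix (literal string compare, not a regex).
        tolower($0) ~ /googlebot/ && index($1, prefix) != 1 { hits[$1]++ }
        END { for (ip in hits) print hits[ip], ip }
    ' "$log" | sort -n
}
```

Calling googlebot_hits /var/www/httpd/access_log prints one "count IP" pair per line, lowest count first. Doing the prefix check inside awk with index() avoids the regex pitfalls of filtering with grep afterwards.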
You can validate the IPs with:
IP=66.249.71.37 ; reverse=$(dig -x "$IP" +short | grep -E '\.googlebot\.com\.$') ; ip=$([ -n "$reverse" ] && dig "$reverse" +short) ; [ "$IP" = "$ip" ] && echo "$IP GOOD" || echo "$IP FAKE"
Replace the IP value with the one you want to check.
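For checking more than one IP, the same reverse-then-forward lookup can be split into two small functions. This is a rough sketch, not a definitive check: the names is_googlebot_host and verify_googlebot_ip are hypothetical, only the first PTR record is considered, and, like the one-liner above, it accepts only hostnames under googlebot.com. It needs dig and working DNS.

```shell
# Hypothetical helper: succeed if a PTR hostname ends in ".googlebot.com".
# A trailing dot, as printed by dig, is stripped before matching.
is_googlebot_host() {
    case "${1%.}" in
        *.googlebot.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: forward-confirmed reverse DNS for one IP.
# 1. Reverse lookup (PTR) of the IP.
# 2. The hostname must be under googlebot.com.
# 3. A forward lookup of that hostname must return the original IP.
verify_googlebot_ip() {
    ip=$1
    host=$(dig -x "$ip" +short | head -n 1)
    is_googlebot_host "$host" || return 1
    dig "$host" +short | grep -qxF "$ip"
}
```

For example, verify_googlebot_ip 66.249.71.37 && echo GOOD || echo FAKE gives the same kind of answer as the one-liner. Matching the forward result with grep -qxF also covers hostnames that resolve to more than one address.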
Google Tricks and hacks by d00m
Thu, 06/03/2004 - 13:30 - himanshu
Google.com is undoubtedly the most popular search engine in the world. It offers multiple search features, like the ability to search images and newsgroups. However, its true power lies in its powerful commands, which can be used and misused. I am writing this article on the basis of my experience using Google and trying out ideas when I am bored. Now enough lecturing... let's get down to business.