Check for fake googlebot scrapers
Wed, 10/05/2011 - 12:13 - sandip
I noticed a bot scraping the site with a fake Googlebot user-agent string.
Here is a one-liner that detects the IPs to ban:
$ awk 'tolower($0) ~ /googlebot/ {print $1}' /var/www/httpd/access_log | grep -v '^66\.249\.71\.' | sort | uniq -c | sort -n
It runs a case-insensitive awk search for the keyword "googlebot" in the Apache log file, drops IPs starting with "66.249.71." (a range that belongs to Google), and prints the remaining IPs with their hit counts, sorted in ascending order.
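To see what the pipeline produces, here is a minimal sketch that runs it against a few made-up log lines. The log entries and the function name fake_googlebot_hits are hypothetical, the Google prefix is matched with an anchored, escaped pattern, and a final awk only normalises the padding that uniq -c adds.

```shell
# Print "count ip" for every non-Google IP whose log line mentions googlebot.
fake_googlebot_hits() {
    awk 'tolower($0) ~ /googlebot/ {print $1}' "$1" |
        grep -v '^66\.249\.71\.' |
        sort | uniq -c | sort -n |
        awk '{print $1, $2}'
}

# Hypothetical sample log (documentation-range IPs, invented requests).
log=$(mktemp)
cat > "$log" <<'EOF'
66.249.71.37 - - [05/Oct/2011:12:00:01 -0400] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
203.0.113.9 - - [05/Oct/2011:12:00:02 -0400] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"
203.0.113.9 - - [05/Oct/2011:12:00:03 -0400] "GET /b HTTP/1.1" 200 512 "-" "GoogleBot/2.1"
198.51.100.4 - - [05/Oct/2011:12:00:04 -0400] "GET /c HTTP/1.1" 200 512 "-" "googlebot"
192.0.2.7 - - [05/Oct/2011:12:00:05 -0400] "GET /d HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux)"
EOF

suspects=$(fake_googlebot_hits "$log")
rm -f "$log"
echo "$suspects"
# 1 198.51.100.4
# 2 203.0.113.9
```

The genuine Google address and the ordinary browser hit are both filtered out, so only the two suspect IPs remain.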
You can validate the IPs with:
IP=66.249.71.37 ; reverse=$(dig -x $IP +short | grep googlebot.com) ; ip=$(dig $reverse +short) ; [ "$IP" = "$ip" ] && echo $IP GOOD || echo $IP FAKE
Replace the IP value with the one you want to check.
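If there are many IPs to check, the same reverse-then-forward lookup can be wrapped in a function. This is only a sketch: the names resolve_ptr, resolve_a and check_googlebot are hypothetical, and the two resolver wrappers exist purely so the dig calls can be swapped out.

```shell
# Thin wrappers around dig so the lookups can be replaced if needed.
resolve_ptr() { dig -x "$1" +short; }
resolve_a()   { dig "$1" +short; }

# Print "<ip> GOOD" when the PTR name ends in googlebot.com and that name
# resolves back to the same IP; print "<ip> FAKE" otherwise.
check_googlebot() {
    _ip=$1
    _host=$(resolve_ptr "$_ip" | grep 'googlebot\.com\.$' | head -n 1)
    if [ -n "$_host" ] && resolve_a "$_host" | grep -qxF "$_ip"; then
        echo "$_ip GOOD"
    else
        echo "$_ip FAKE"
    fi
}

# Usage (needs dig and network access; addresses are placeholders):
#   for addr in 66.249.71.37 203.0.113.9; do check_googlebot "$addr"; done
```

Unlike the one-liner, the function never runs the forward lookup when the reverse lookup returns no googlebot.com name.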
Find files used for htauth
Wed, 04/06/2011 - 15:39 - sandip
The following lists all files used for Apache authentication under the /var/www/html path:
find /var/www/html -name .htaccess | xargs awk '{sub(/^[ \t]+/,"")};/File/{print $2}' | sort | uniq
Here is the breakdown:
find /var/www/html -name .htaccess
Find all files named ".htaccess" at path "/var/www/html"
xargs awk '{sub(/^[ \t]+/,"")};/File/{print $2}'
The search output is piped via xargs to awk, which strips leading whitespace (spaces and tabs) from each line and prints only the second field of lines containing the text "File".
sort | uniq
The awk output is then piped through sort and uniq, leaving a de-duplicated list of the files used for Apache authentication.
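The breakdown above can be tried safely on a throwaway directory. In this sketch the function name htauth_files, the temporary tree, and the /etc/httpd paths inside the sample .htaccess files are all made up for illustration. It uses find's -exec ... {} + instead of xargs, an assumption on my part that avoids running awk when no .htaccess file is found.

```shell
# List the distinct files named by *File directives in .htaccess files under $1.
htauth_files() {
    find "$1" -name .htaccess \
        -exec awk '{sub(/^[ \t]+/,"")}; /File/ {print $2}' {} + |
        sort | uniq
}

# Hypothetical test tree with two .htaccess files sharing one password file.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
printf '  AuthUserFile /etc/httpd/.htpasswd\nAuthType Basic\n' > "$demo/a/.htaccess"
printf '\tAuthUserFile /etc/httpd/.htpasswd\nAuthGroupFile /etc/httpd/.htgroup\n' > "$demo/b/.htaccess"

result=$(htauth_files "$demo")
rm -rf "$demo"
echo "$result"
# /etc/httpd/.htgroup
# /etc/httpd/.htpasswd
```

The shared password file appears once even though two .htaccess files reference it.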
In place variable substitution with AWK
Thu, 02/04/2010 - 12:21 - sandip
The content of the input file becomes stdin for both rm and awk. rm ignores the input and removes the file, but its file descriptor remains open until both commands have completed. awk processes this "nameless" file and writes a new file under the same name:
{ rm "$CSV_FILE" && awk -F',' -v OFS=',' -v stid="$ST_ID" '$1 ~ stid {gsub(/&/,"",$7) }1' > "$CSV_FILE"; } < "$CSV_FILE"
"-F" and "-v OFS" set the input and output field separators to a comma. Without OFS, awk rebuilds every modified record with spaces between the fields.
"-v stid" sets an awk variable from the shell script variable.
"gsub" replaces every occurrence of & in the 7th field with "". Use "sub" to replace only the first occurrence, or GNU Awk's gensub for more elaborate substitutions.
"1" is a shortcut that means "print the current record".
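A small end-to-end run makes the trick easier to follow. This sketch uses an invented two-row CSV and an invented ST_ID value. It also passes -v OFS=',' so that the record awk rebuilds after gsub keeps its commas; without it the modified row would be written with spaces between fields.

```shell
# Hypothetical input: seven comma-separated fields, "&" in the 7th.
CSV_FILE=$(mktemp)
ST_ID=1002
cat > "$CSV_FILE" <<'EOF'
1001,a,b,c,d,e,R&D
1002,a,b,c,d,e,Q&A
EOF

# The group's stdin is opened first, so awk still reads the old content
# after rm unlinks the name; "> $CSV_FILE" then creates a fresh file.
{ rm "$CSV_FILE" && awk -F',' -v OFS=',' -v stid="$ST_ID" \
    '$1 ~ stid {gsub(/&/,"",$7)}1' > "$CSV_FILE"; } < "$CSV_FILE"

result=$(cat "$CSV_FILE")
rm -f "$CSV_FILE"
echo "$result"
# 1001,a,b,c,d,e,R&D
# 1002,a,b,c,d,e,QA
```

Only the row whose first field matches ST_ID is changed. The other row passes through untouched.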
Week of Month
Thu, 01/08/2009 - 10:24 - sandip
Here is a simple one-liner that gets the week of the month via awk from `cal` output:
$ cal | awk -v date="`date +%d`" '{ for( i=1; i <= NF ; i++ ) if ($i==date) { print FNR-2} }'
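The one-liner subtracts 2 from the line number because the first two lines of the calendar are the month title and the weekday header. The sketch below shows that on a fixed calendar, so the result does not depend on today's date. The function name week_of_month is hypothetical, and the October 2011 text is typed in by hand in the Sunday-first layout that cal typically prints.

```shell
# Read a cal-style calendar on stdin and print the week row of day $1.
week_of_month() {
    awk -v date="$1" '{ for (i = 1; i <= NF; i++) if ($i == date) print FNR - 2 }'
}

# Hand-written calendar for October 2011 (1 October was a Saturday).
calendar='    October 2011
Su Mo Tu We Th Fr Sa
                   1
 2  3  4  5  6  7  8
 9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31'

week=$(printf '%s\n' "$calendar" | week_of_month 5)
echo "$week"
# 2
```

Day 5 sits on the fourth line of the calendar, so the function reports week 2. Day 31 falls on the eighth line and would give week 6.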