AWK

Pattern scanning and processing language

Check for fake googlebot scrapers

I noticed a bot scraping using fake GoogleBot useragent string.

Here is a one liner that can detect the IPs to ban:

$ awk 'tolower($0) ~ /googlebot/ {print $1}' /var/www/httpd/access_log | grep -v 66.249.71. | sort | uniq -c | sort -n

It does a case-insensitive awk search for keyword "googlebot" from apache log file removing IPs with "66.249.71." which belongs to google and prints the output in a sorted hit count.

You can validate the IPs with:

IP=66.249.71.37 ; reverse=$(dig -x $IP +short | grep googlebot.com) ; ip=$(dig $reverse +short) ; [ "$IP" = "$ip" ] && echo $IP GOOD || echo $IP FAKE

Replace the IP value with the one you want to check.

Find files used for htauth

Below will list all of the files that are used for apache authentication in /var/www/html file path:

find /var/www/html -name .htaccess | xargs awk '{sub(/^[ \t]+/,"")};/File/{print $2}' | sort | uniq

Here is the breakdown:

find /var/www/html -name .htaccess

Find all files named ".htaccess" at path "/var/www/html"

xargs awk '{sub(/^[ \t]+/,"")};/File/{print $2}'

The search output gets piped via xargs to awk, deleting leading whitespace (spaces and tabs) from front of each line and output is of only the second field of lines containing the text "File".

sort | uniq

Awk output is further piped through sort and uniq which results in the files being used for apache authentication.

In place variable substitution with AWK

The content of the input file becomes stdin for rm and awk. rm ignores the input and removes the file, but its file descriptor remains open until both commands, rm and awk is complete. AWK process this "nameless" file and creates a new file:

  { rm $CSV_FILE && awk -F',' -v stid="$ST_ID" '$1 ~ stid {gsub(/&/,"",$7)}1' > $CSV_FILE; } < $CSV_FILE

  • "-v" sets the awk variable that is passed in via shell script variable.
  • "gsub" replaces with "" all occurrence of & in the 7th field. Use "sub" for single/first occurrence substitution or GNU Awk's gensub for more articulated substitutions.
  • "1" is a shortcut which means print the current record:

Week of Month

Here is a simple one liner to get the week of month via awk from a `cal` output:

$ cal | awk -v date="`date +%d`" '{ for( i=1; i <= NF ; i++ ) if ($i==date) { print FNR-2} }'

Comment