A client's website (of 3,000 pages) has recently seen 1.3 million pages indexed.
We need to hunt for the reason why, and the fastest solution is actually to go through the access_log files in Apache and determine where GoogleBot is getting into (what is likely to be) an infinite loop caused by some bad coding... by the client's web developer.
Quick into action goes Nemek, who requests that the coders deliver a grep output from the log file, as we have SSH access, rather than him downloading the whole (huge) raw log file.
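Something along these lines would normally do the trick (a rough sketch only: the log path is illustrative, $7 is the requested URL in a typical Apache log line, and it assumes the log records the user agent at all... which is where this story is heading):

grep "Googlebot" access_log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

The hit count next to each URL makes a runaway URL pattern fairly obvious.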
A 2010 post by Ian Lurie at Portent explains some excellent ways to use this data, so I won't cover that here; you can check it out on archive.org here. This post is a quick primer to show that you've got to have things set up in the first place to have the data to work with.
It seems that when the server was set up, the log file format simply didn't include the referrer and user agent data.
The httpd.conf includes the standard LogFormat directive for "combined" log data...
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
... but it wasn't being used....
<VirtualHost *:80>
    ServerName CLIENT.com.au
    ServerAlias CLIENT.com.au
    DocumentRoot CLIENT/htdocs
    CustomLog CLIENT/logs/access_log common
    ErrorLog CLIENT/logs/error_log
</VirtualHost>
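The "common" format is the problem: it doesn't capture the Referer or User-Agent headers at all. In a stock Apache install it's typically defined along these lines, which is exactly the data (and nothing more) we were seeing:

LogFormat "%h %l %u %t \"%r\" %>s %b" common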
First, rotate the logs with this quick script (all on one line over SSH) so the logs start afresh:
gunzip access_log_archive.gz; cat access_log >> access_log_archive; echo -n > access_log; gzip access_log_archive;
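Broken out with comments, that's (same commands; it assumes an access_log_archive.gz from a previous rotation is already sitting alongside the log):

gunzip access_log_archive.gz             # unpack the existing archive
cat access_log >> access_log_archive     # append the current log to it
echo -n > access_log                     # truncate the live log in place; Apache keeps appending to the (now empty) file
gzip access_log_archive                  # re-compress the archive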
Edit the vhosts.conf file to replace "common" with "combined" and then restart Apache.
service httpd restart
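After that edit, the CustomLog line in the vhost above simply reads:

CustomLog CLIENT/logs/access_log combined

and new entries start carrying the referrer and user agent data we need.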
Do you have any custom log formats for SEO? Add them in the comments below...
PS. In the time it has taken to write this post, Nemek has found the offending script spewing URLs at GoogleBot. I'm off to edit some PHP code.