You are reading an archived post from the first version of my blog. I've started fresh, and the new design and content is now at boxofchocolates.ca

Slice and Dice your Access Logs

July 16, 2004

I like knowing where my readers are coming from, so I look at the referrers for my sites. Referrers give you a better sense of your readership: not that you're identifying specific readers, but you can certainly see what other sites people are reading, which can offer insight into the kinds of things they read elsewhere.

While many software packages exist to track and create statistics, they don't always provide the detail we would like to see. Using some command line tools (don't be afraid…) you can pull some very fine details from your access logs.

Typical stats packages include analog, webalizer, awstats, and various commercial products like WebTrends, Coast, or others. Shaun Inman’s latest version of ShortStat has picked up a bit of momentum in the blogosphere recently as well – I’ve installed it for this blog, and I quite like it.

These tools produce some nice statistical rollups of your site as a whole; this is one of their greatest strengths. However, one thing that Shaun's ShortStat and the others don't do is provide detail on the referrers for a specific page. I want to be able to ask: what are the referrers for this particular article or post? Which referrer sent the most traffic to my site this month? Which IP addresses are reading my Atom feed?

ShortStat looks like it could be customized to add this functionality, but at this point, I’ll stick to my “old-fashioned” technique.

Command-Line and Conquer

When it comes to command line usage and doing simple things, *nix based machines come with some nice functional tools as part of the distribution. AWK is your friend, and can give you everything you need to get more information out of your logs. To do this on a Windows machine, you may want to install Cygwin, which provides a Linux-like environment on Windows along with many of its power tools. I assume this would also work on Mac OS X, though I can't test it, since my brand new PowerBook doesn't exist.

Here’s the command that I use to check referrers in a log file for a specific resource (article or blog post). It is quite a bit simpler than it looks (mind the wrap in these examples – they are all on one line):

awk '/GET \/path\/to\/file/ {print $11}' /path/to/logfile | sort | uniq -c | sort -nr

When I run this locally on my machine, it looks like this:

awk '/GET \/articles\/hiddeninformation/ {print $11}' /data/accesslogs/wats.ca | sort | uniq -c | sort -nr

Let's take it from the top:

awk commands are basically formatted like this: pattern {action}

In the command above I've told the computer to start the awk engine, look for a certain pattern (in this case GET /path/to/file) and then process each matching line, printing the 11th field. Note that the pattern we are looking for is "GET /path/to/file", but since the pattern is a regular expression delimited by slashes, the slashes inside it need to be "escaped" (hence all the \) so that the engine doesn't think we've reached the end of the expression.
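To see the pattern and action in isolation, here's a self-contained run against a single made-up log line in Combined Log Format (the IP, path, and referrer are invented for the example):

```shell
# A made-up sample log line in Apache Combined Log Format
line='203.0.113.9 - - [16/Jul/2004:01:07:41 -0400] "GET /path/to/file HTTP/1.1" 200 15251 "http://example.com/blogroll" "Mozilla/5.0"'

# Pattern: /GET \/path\/to\/file/   Action: {print $11}
echo "$line" | awk '/GET \/path\/to\/file/ {print $11}'
# prints "http://example.com/blogroll" (quotes included)
```

A line that doesn't match the pattern simply produces no output, which is why the command skims a whole logfile and keeps only the hits for the one resource you care about.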

The final component before the first “pipe” (the | character) is the path to the logfile you want to process.

That results in a listing of all the referrers for that particular resource in that file. The problem is that it's one big long list that doesn't really tell us anything meaningful on its own. That's where the extra commands after the pipes come in.

The next command is "sort", another useful and pretty straightforward command line tool. Now we have a long list sorted by referrer. There are, however, some duplicates (at least we hope so!), and we certainly don't want to count them by hand.

Filter the duplicates with "uniq", adding the "-c" parameter. Now we have a list that contains no duplicates, and the "-c" (for count) prefixes each line with the number of times that unique referrer occurred.

Almost there. We now have a list that includes the counts; for the final piece of processing, we use "sort -nr", which sorts the results numerically in descending order so the busiest referrers come first.
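To see what the sort | uniq -c | sort -nr chain does on its own, here's a toy run on three hypothetical referrers, one of which appears twice:

```shell
# Three made-up referrers; one duplicate
printf '%s\n' \
  'http://example.com/a' \
  'http://example.org/b' \
  'http://example.com/a' \
  | sort | uniq -c | sort -nr
# prints each referrer prefixed by its count, highest count first
```

The first sort groups the duplicates together (uniq only collapses adjacent identical lines), uniq -c counts them, and the second sort orders by that count.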

Et voilà. Instant feedback on where the traffic is coming from on a particular page. Just use it with caution: it can be addictive!
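The same recipe answers the other questions from earlier with small tweaks. For example, to see which IP addresses are fetching a feed, print field 1 (the client address) instead of field 11. The feed path and the log lines below are made up for the sake of the demo:

```shell
# Three made-up log lines; field 1 is the client IP, and /atom.xml is a hypothetical feed path
printf '%s\n' \
  '203.0.113.9 - - [16/Jul/2004:01:07:41 -0400] "GET /atom.xml HTTP/1.1" 200 512 "-" "FeedReader/1.0"' \
  '198.51.100.7 - - [16/Jul/2004:02:14:02 -0400] "GET /atom.xml HTTP/1.1" 200 512 "-" "FeedReader/1.0"' \
  '203.0.113.9 - - [16/Jul/2004:03:30:10 -0400] "GET /atom.xml HTTP/1.1" 200 512 "-" "FeedReader/1.0"' \
  | awk '/GET \/atom\.xml/ {print $1}' | sort | uniq -c | sort -nr
# 203.0.113.9 requested the feed twice, so it sorts to the top
```

Against a real log you'd drop the printf and point awk at the file instead, exactly as in the main command: awk '/GET \/atom\.xml/ {print $1}' /path/to/logfile | sort | uniq -c | sort -nr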

Notes and Thoughts

The field you print depends on the log format you are using: my server is configured to use the Apache Combined Log Format, which puts the referrer in position 11 of each line. Here's a sample logfile entry from one of our recent articles, Contradictions in Accessibility – Hidden Information:

24.211.220.168 - - [16/Jul/2004:01:07:41 -0400] "GET /articles/hiddeninformation/63 HTTP/1.1" 200 15251 "http://v1.boxofchocolates.ca" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7) Gecko/20040628 Firefox/0.9.1"

Note that "position 11" is essentially determined by the spaces: by default, awk splits each line into fields at whitespace.
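You can see that splitting at work by printing a few fields from the sample entry above (the user agent string is shortened here to keep the line readable):

```shell
# The sample entry from above; awk numbers whitespace-separated fields $1, $2, ...
line='24.211.220.168 - - [16/Jul/2004:01:07:41 -0400] "GET /articles/hiddeninformation/63 HTTP/1.1" 200 15251 "http://v1.boxofchocolates.ca" "Mozilla/5.0"'

# Field 1 is the client IP, field 9 the status code, field 11 the referrer
echo "$line" | awk '{print $1, $9, $11}'
# prints: 24.211.220.168 200 "http://v1.boxofchocolates.ca"
```

Count the spaces up to the referrer and you'll land on 11, quotes and all; this is also why a referrer containing a space would throw the field numbering off.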

While Apache is the most commonly used web server out there, there's no reason you couldn't apply this to IIS log files as well (you may need to adjust the field number to match that log format), though I'll freely admit that I haven't tried it.

You may also want to append "| more" to the command to keep the stats from flying off the screen.
