Easy site analytics with no JavaScript, using GoAccess and shell scripts

Posted by Nathan Pilkenton

Last updated Aug. 1, 2021, 2:54 p.m. UTC

Site analytics without the JavaScript

When I was building Chekkin, I needed some basic analytics to track how many people were coming to the site. Though I've used Google Analytics in the past, I decided not to go back to it for two reasons:

  • I was trying to keep the site as lightweight as possible
  • Since I was putting anonymity at the core of Chekkin, relying on Google for analytics felt wrong

Searching revealed plenty of alternatives, but what ultimately caught my eye was GoAccess. It's free and open source, and since it works by analyzing log files, there's no need for any JavaScript.

There were two major drawbacks with GoAccess. One was that, since my server rotates log files, I'd eventually lose any stats older than ~60 days. The other was that, out of the box, GoAccess has no way to filter by date. It simply processes everything in the log files you give it, so it's not easy to analyze traffic from only the last day, week, or month.

Fortunately, both were pretty easy to take care of with a couple of bash scripts, as you'll see below!


Archiving and combining log files

First, we'll need to get all the site log files we want to analyze. As mentioned above, my log files are rotated, but I didn't want to lose stats on old traffic.

As a solution, I wrote a short script to run daily and maintain one big log archive called combinedlogs.log. It concatenates the latest two rotating files with the existing big file, and then uses awk to strip out any duplicate rows. It's not elegant, but it's simple.

'log_archive.bash'
---
#!/bin/bash

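# make sure the archive exists, so the very first run doesn't fail on mv
touch combinedlogs.log
mv combinedlogs.log oldlogs.log

# merge the two newest rotated files with the old archive;
# awk '!n[$0]++' prints each distinct line only the first time it appears
cat /var/log/www.example.com.access.log /var/log/www.example.com.access.log.1 oldlogs.log | awk '!n[$0]++' > combinedlogs.log

rm oldlogs.log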
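
To see why the awk one-liner removes duplicates, here is a minimal sketch run on made-up data. The dedupe_logs helper, the temporary files, and the sample lines are all hypothetical and only illustrate the idiom; they are not part of the archive script.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): merge log files and
# drop exact duplicate lines, keeping the first occurrence of each.
dedupe_logs() {
	# n[$0]++ evaluates to 0 (false) the first time a line is seen, so
	# !n[$0]++ is true only for first occurrences, and awk prints just those.
	cat "$@" | awk '!n[$0]++'
}

# Tiny demonstration with made-up lines in a temporary directory
tmpdir=$(mktemp -d)
printf 'line A\nline B\n' > "$tmpdir/access.log"
printf 'line B\nline C\n' > "$tmpdir/oldlogs.log"

deduped=$(dedupe_logs "$tmpdir/access.log" "$tmpdir/oldlogs.log")
echo "$deduped"
rm -r "$tmpdir"
```

Here "line B" appears in both files but is printed only once, so the output is line A, line B, line C. Note that awk keeps every distinct line in memory, which should be fine for a small site but may become a concern for very large archives.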


Filtering logs by date for GoAccess

Now, to actually analyze the logs. (If you haven't already, you'll need to install GoAccess.)

To filter for only recent traffic, I created another bash script that runs locally on my machine. Conveniently, it can start by running the archive script from above over ssh, which gives us an up-to-date combined log file to download and analyze:

'logs.bash'
---
#!/bin/bash

# run the log archive script from above
ssh [user]@[server] "bash log_archive.bash"

# download the latest combined log file
scp [user]@[server]:combinedlogs.log .

Then, we'll take an optional argument: an integer for the number of days of history to analyze.

If we run the script with no argument, we'll assume we want to see all traffic:

if [ $# -eq 0 ]; then

	goaccess combinedlogs.log --ignore-crawlers --anonymize-ip -o full_report.html --log-format=COMBINED

	open full_report.html

(Note: I've also passed two optional flags. --ignore-crawlers ignores traffic from some common bots, and --anonymize-ip "sets the last octet of IPv4 user IP addresses and the last 80 bits of IPv6 addresses to zeros" to provide some additional user privacy.)

On the other hand, if an argument is provided, we'll use grep to filter for only traffic that's happened within that number of days:

else

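	# gdate is GNU date (installed with coreutils on macOS); grep -F treats each date as a fixed string,
	# and > overwrites recentlogs.log so repeated runs don't pile up duplicate lines
	for i in $(seq 0 "${1:-8}"); do gdate -d "-$i days" +"%d/%b/%Y"; done | grep -F -f /dev/fd/0 combinedlogs.log > recentlogs.log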

	goaccess recentlogs.log --ignore-crawlers --anonymize-ip -o recent_report.html --log-format=COMBINED

	open recent_report.html

fi
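
The date filter can be tried in isolation. Below is a minimal sketch, assuming GNU date is available either as gdate (macOS with coreutils) or as date (most Linux systems). The recent_dates helper and the two sample log lines are made up for illustration.

```shell
#!/bin/bash
# Use gdate where it exists (GNU coreutils on macOS); otherwise assume the
# system date is GNU date, as on most Linux distributions.
if command -v gdate >/dev/null 2>&1; then DATE=gdate; else DATE=date; fi

# Hypothetical helper: print one dd/Mon/YYYY string per day, from today back
# to N days ago (N+1 lines in total). LC_ALL=C keeps month names in English,
# matching the timestamps in the access log.
recent_dates() {
	local days=$1 i
	for i in $(seq 0 "$days"); do
		LC_ALL=C "$DATE" -d "-$i days" +"%d/%b/%Y"
	done
}

# Made-up sample log: one request from today, one from the year 2000
tmpdir=$(mktemp -d)
today=$(LC_ALL=C "$DATE" +"%d/%b/%Y")
{
	echo "203.0.113.5 - - [$today:10:00:00 +0000] \"GET / HTTP/1.1\" 200 512 \"-\" \"Mozilla/5.0\""
	echo "203.0.113.9 - - [01/Jan/2000:10:00:00 +0000] \"GET / HTTP/1.1\" 200 512 \"-\" \"Mozilla/5.0\""
} > "$tmpdir/combinedlogs.log"

# -F treats each date as a fixed string; -f /dev/fd/0 reads patterns from stdin
recent=$(recent_dates 7 | grep -F -f /dev/fd/0 "$tmpdir/combinedlogs.log")
echo "$recent"
rm -r "$tmpdir"
```

Only the request dated today survives the filter. One thing this makes visible: seq 0 7 yields eight dates (today plus the seven days before it), so an argument of 7 covers slightly more than a calendar week.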

That's it! Now, to see basic analytics, I just run bash logs.bash from my project folder. And if I want to see only traffic from the past week, I can run bash logs.bash 7.


Comparison with Google Analytics

Of course, this approach has tradeoffs. It certainly has some benefits, several of which I discussed above:

  • Less invasive for end users
  • No extra JavaScript to load
  • Counts every visitor, including those using ad- or script-blockers
  • Free (many common alternatives to Google Analytics are paid subscription services)
  • Easy to re-use for other projects

But these benefits also come at a cost:

  • Less detail: you can't see statistics like bounce rate or analyze how users navigate your site
  • No real-time analytics (note: GoAccess can be run real-time, but this approach in particular doesn't allow it)
  • May be harder to filter out bots from the statistics (the --ignore-crawlers flag in GoAccess does a decent job, but it seems like a few bots slip past)
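
On the last point, one possible workaround (not something I've described above, just a rough sketch) is to pre-filter the log with grep before handing it to GoAccess. The strip_bots helper, the keyword list, and the sample lines below are all assumptions and would need tuning against real traffic.

```shell
#!/bin/bash
# Hypothetical helper: drop any log line containing a bot-like keyword.
# The keyword list is an assumption, not an exhaustive bot signature list.
strip_bots() {
	grep -v -i -E 'bot|crawler|spider' "$1"
}

# Made-up sample log: one browser request and one request from a fake bot
tmpdir=$(mktemp -d)
{
	echo '203.0.113.5 - - [01/Aug/2021:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"'
	echo '203.0.113.9 - - [01/Aug/2021:10:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"'
} > "$tmpdir/recentlogs.log"

filtered=$(strip_bots "$tmpdir/recentlogs.log")
echo "$filtered"
rm -r "$tmpdir"
```

Because this matches anywhere in the line, it would also drop ordinary requests for paths such as /robots.txt, so treat it as a blunt instrument rather than a precise filter.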

For now, I'm very happy with this solution. If traffic picks up to a point where I need the extra features, I'll probably switch to something like Simple Analytics. But until then, GoAccess is an awesome way to track basic site analytics, without compromising user privacy.