This post originally appeared in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite.

So, the database is slow - why?

There can be several reasons for this. Perhaps a few very heavy queries are bogging down the database. In this case, you’d typically set up slow query logging and find them in the slow.log. However, sometimes the reason is simply lots of frequent queries.

The simple query ran twice a second and completed in 30ms during beta testing. However, the database has grown, and the query contains clauses not caught by the indexes you set up initially, so by now it takes more than 0.5s to run. It may be possible to catch it in the slow-query log by tuning down the threshold for which queries to log, but quite often one doesn’t want to log every query (due to IO performance and disk usage), and analyzing the log file can be non-trivial when there are gigabytes of it.

Enter snapshot logging. Simply capture the process list at a regular interval, e.g. twice a minute; now it’s easy to do statistics on what queries the database is most busy with - it’s just a matter of counting the number of times we see the different queries!

MariaDB/MySQL/Percona

I usually throw this into the crontab:

* * * * * { for i in 1 2 ; do sleep $((RANDOM/1092)) ;  date --rfc-3339=seconds; mysql -e "show processlist" ; done ; } >> /var/log/mysql-processlist.log 2>&1

The random sleep (between 0 and 30 seconds, as $RANDOM is in the 0-32767 range) is to avoid biases due to the monitoring always running exactly before or during cron jobs that access the database.

Log rotation is nice to have - this can be thrown into the logrotate configuration:

/var/log/mysql-processlist.log {
    create 640 mysql adm
    rotate 30
    daily
    compress
    dateext
    dateformat -%Y-%m-%d
}

And to analyze the data - usually something like this:

grep -v Sleep /var/log/mysql-processlist.log | \
  grep -vE '^Id[[:space:]]+User[[:space:]]+Host' | \
  grep -v 'show processlist' | \
  grep -vE '^20[0-9][0-9]-' | \
  cut -f8 | \
  perl -pe 's/\x27.*?(\x27|$)/\x27asdf\x27/g; s/20[0-9][0-9]-\d\d-\d\d/YYYY-xx-xx/g; s/\d{3,10}/XXX/g; s/ = \d+/ = XXX/g' | \
  sort | \
  uniq -c | \
  sort -nr | less

If the list is relatively short, the queries at the top should probably be investigated closer. If it has a long tail, the Perl regexps in the pipeline above should probably be refined to group similar queries together.
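As an example of such refinement: if many queries differ only in the contents of an IN-list, a hypothetical extra perl stage like this (inserted anywhere before the sort) would collapse them into one group:

perl -pe 's/\bin \([^)]*\)/in (LIST)/gi'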

Postgres

To get snapshot logging in Postgres, we’ll need the track_activities boolean configuration parameter turned on (it is on by default). track_activity_query_size can also be tuned upwards, as queries longer than this limit get truncated in the process list.
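In postgresql.conf that could look something like this (a sketch; note that changing track_activity_query_size requires a restart):

track_activities = on
track_activity_query_size = 4096    # default is 1024; longer queries get truncated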

Postgres is sweet - the process list is available in true database format. Why log to a log file when one can log it to a database table? If one knows a bit of SQL, it’s much easier to do analysis on data in a database than on a log file.

Set it up like this:

$ sudo -u postgres psql
postgres=# create database meta_pg_monitoring;
postgres=# \c meta_pg_monitoring;
meta_pg_monitoring=# select now(),* into pg_stat_activity_log from pg_stat_activity;

Then, going forward, use some crontab entry like this (as postgres user):

* * * * * { for i in 1 2 ; do sleep $((RANDOM/1092)) ;  psql -c 'insert into pg_stat_activity_log select now(),* from pg_stat_activity;' meta_pg_monitoring ; done ; } > /dev/null 2>&1

Simple statistics:

meta_pg_monitoring=# select count(*),current_query from pg_stat_activity_log where not waiting group by current_query  order by count(*);
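Note that the column names here are the ones from older Postgres versions - from 9.2, current_query was renamed to query (and a state column was added), and from 9.6 the waiting boolean was replaced by wait_event. On a newer version, a rough equivalent of the statistics query would be something like:

meta_pg_monitoring=# select count(*),query from pg_stat_activity_log where state='active' group by query order by count(*);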

You’d probably want to tune the statistics query by using regexps (the substring function) on the current_query column, restricting on the now timestamp, adding some indexes and what-not. Partial indexes are allowed in Postgres:

meta_pg_monitoring=# create index pg_stat_activity_log_nonwaiting on pg_stat_activity_log(now) where not waiting;
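For instance, grouping on just the beginning of the query text and restricting to the last hour could look something like this (a sketch - the 60-character cutoff is arbitrary):

meta_pg_monitoring=# select count(*),substring(current_query for 60) as query_start from pg_stat_activity_log where not waiting and now > now() - interval '1 hour' group by query_start order by count(*);

This is also exactly the kind of query the partial index above is meant to speed up.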

Cleanup job (also as the postgres user):

10 * * * * psql -c "delete from pg_stat_activity_log where now+'2 days'<now();" meta_pg_monitoring

Note: look out for long-lasting transactions! If now()-xact_start is high on a connection, the transaction was started but never completed (with commit or rollback); this may cause a big performance drop!
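A quick way to spot such connections in the live process list could be something like this (a sketch - it sorts the longest-running transactions first):

meta_pg_monitoring=# select now()-xact_start as xact_age,* from pg_stat_activity where xact_start is not null order by xact_age desc limit 10;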

Apache

We usually log the elapsed time in the access log, hence the snapshots aren’t that important for us. Still, they may be useful - e.g. a request that never completes won’t be logged in the access log at all.

Snapshots can be created by enabling the mod_status module. The snapshot comes in HTML format, which is not so handy to work with (appending ?auto to the URL gives a machine-readable plain-text version instead).

* * * * * sleep $((RANDOM/512)) ; wget -qO /var/log/apache/server-status/$(date +\%s).html http://localhost/server-status

Since this creates one file for each snapshot, ordinary log rotation cannot be used … but this one ought to do the trick (adjust according to where your Apache log files are):

0 0 * * * cd /var/log/apache/ && mv server-status server-status.$(date +\%F) && mkdir server-status && tar cjf server-status.$(date +\%F).tar.bz2 --remove-files server-status.$(date +\%F) && find /var/log/apache/ -name 'server-status*tar.bz2' -mtime +5 -delete
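If the snapshots are fetched from http://localhost/server-status?auto instead (the machine-readable format mentioned above), a rough overview of worker utilization over time can be pulled out with something like this (a sketch, assuming the file layout from the cron jobs above and GNU date):

for f in /var/log/apache/server-status/*.html ; do
  # the file name is the epoch timestamp set by the snapshot cron job
  echo "$(date -d @"$(basename "$f" .html)" '+%F %T') $(grep '^BusyWorkers' "$f")"
done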

System process list

I’d very much recommend installing atop, though by default it logs only every ten minutes - that’s far too seldom when researching performance trouble: a server may be totally clogged down for a whole minute and idle for the other nine, and atop will show 10% resource utilization.
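The logging interval can be lowered in the atop defaults file. A sketch of how that may look on Debian-based systems (the file location and variable names differ between distributions and atop versions):

# /etc/default/atop
LOGINTERVAL=60        # seconds between log samples, down from the default 600
LOGGENERATIONS=28     # days of log history to keep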

Experiences, summarized

There are two classes of trouble where snapshot logging can be extremely useful: finding the culprit of performance problems, and doing post-incident investigations.

When it comes to performance problems, if you log all the requests (which one typically does with the web server) and have all the important statistics (particularly timing statistics) available there, then there is not so much need for snapshot logging. However, if the complete log is huge, analyzing the snapshot logs may be a lot faster and a lot easier than analyzing the full logs.

As for incidents where the service has become unavailable or severely degraded, the cause may be found in requests that never finished, and those may not be found in the request log - hence snapshot logging can be useful for seeing exactly what the server was working on at the moment things went wrong.

In my experience, atop and snapshots of the database process list have in many cases been extremely useful, while I very seldom look into the snapshot logs of the web server process lists.

Tobias Brox

Senior Systems Consultant at Redpill Linpro

Tobias started working as a developer when he finished his degree at The University of Tromsø. He joined Redpill Linpro as a system administrator a decade ago, and has embraced working with our customers and maintaining/improving our internal tools.
