This post appeared originally in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite
So, the database is slow - why?
There can be several reasons for this. Perhaps a few very heavy queries are bogging down the database. In this case, you’d typically set up slow query logging and find them in the slow.log. However, sometimes the reason is simply lots of frequent queries.
The simple query is run twice a second and completed in 30ms during the beta testing. However, the database has grown and the query contains clauses not catched by indexes you set up initially, so by now it takes more than 0.5s to run. It may be possible to catch it in the slow-query log by tuning down the threshold for what queries to run, but quite often one doesn’t want to log every query (due to IO-performance and disk usage) and analyzing the log file can be non-trivial when there are gigabytes of it
Enter snapshot logging. Simply capture the process list in a regular interval i.e. twice a minute; now it’s easy to do statistics on what queries the database is most busy with - it’s just a matter of counting the number of times we see the different queries!
MariaDB/MySQL/Percona
I usually throw this into the crontab:
* * * * * { for i in 1 2 ; do sleep $((RANDOM/1092)) ; date --rfc-3339=seconds; mysql -e "show processlist" ; done ; } >> /var/log/mysql-processlist.log 2>&1
The random sleep is to avoid biases due to the monitoring always running exactly before or during cronjobs that accesses the database.
Log rotation is nice to have - this can be thrown into the logrotate configuration:
/var/log/mysql-processlist.log {
create 640 mysql adm
rotate 30
daily
compress
dateext
dateformat -%Y-%m-%d
}
And to analyze the data - usually something like this:
cat /var/log/mysql-processlist.log | \
grep -v Sleep | \
grep -vE '^\| Id.*User.*Host' | \
grep -v 'show processlist' | \
grep -vE '^20[12][0-9]-[01]' | \
cut -f8 | \
perl -pe 's/\'"'"'.*?(\'"'"'|$)/'"'"'asdf'"'"'/; s/201[0-9]-\d\d-\d\d/YYYY-xx-xx/g; s/\d{3,10}/XXX/g; s/ = \d+/ = XXX/g' | \
sort | \
uniq -c | \
sort -nr | less
If the list is relatively short, probably the queries at the top should be investigated closer. If it has a long tail, probably the Perl regexps at the top should be fixed up better to group up similar queries.
Postgres
To get snapshot logging in postgres, we’ll need to set the
track_activites
boolean configuration parameter on.
track_activity_query_size
can also be tuned upwards.
Postgres is sweet, the process list is available in true database format. Why log to a log file when one can log it to a database table? If one knows a bit of SQL, it’s much easier to do analysis on data in a database than on a log file.
Set it up like this:
$ sudo -u postgres psql
postgres=# create database meta_pg_monitoring;
postgres=# \c meta_pg_monitoring;
meta_pg_monitoring=# select now(),* into pg_stat_activity_log from pg_stat_activity;
Then, going forward, use some crontab entry like this (as postgres user):
* * * * * { for i in 1 2 ; do sleep $((RANDOM/1092)) ; psql -c 'insert into pg_stat_activity_log select now(),* from pg_stat_activity;' meta_pg_monitoring > /dev/null 2>&1 ; done ; } >> /var/log/mysql-processlist.log 2>&1
Simple statistics:
meta_pg_monitoring=# select count(*),current_query from pg_stat_activity_log where not waiting group by current_query order by count(*);
You’d probably want to tune this by using regexps (the substring function) on the current_query, restricting the now-timestamp, adding some indexes and what-not. Partial indexes is allowed in postgres:
meta_pg_monitoring=# create index pg_stat_activity_log_nonwaiting on pg_stat_activity_log(now) where not waiting;
cleanup job:
10 * * * * psql -c delete from pg_stat_activity_log where now+'2 days'<now();
Note: look out for long-lasting transactions! If now()-xact_start is high on a connection, the transaction was started but never completed (with commit or rollback), this may cause a big performance drop!
Apache
We usually log elapse time in the access log, hence the snapshots aren’t that important for us. Still they may be useful, i.e. if the request never gets completed they won’t be logged in the access log.
Snapshots can be created by enabling the server-status server module. The snapshot will come in HTML-format, which is not so handy to work with.
* * * * * sleep $((RANDOM/512)) ; wget -qO /var/log/apache//server-status/$(date +\%s).html http://localhost/server-status
Since this creates one file for each snapshot, ordinary log rotation cannot be used … but this one ought to do the trick (adjust according to where your Apache log files are):
0 0 * * * cd /var/log/apache/ ; mv server-status server-status.$(date +\%F) ; mkdir server-status ; tar cjf server-status.$(date +\%F).tar.bz2 --remove-files server-status.$(date +\%F) ; find /var/log/apache/ -name 'server-status*tar.bz2' -mtime +5 -delete ;
System process list
I’d very much recommend installing atop, though by default it logs every tenth minute - that’s far too seldom when researching performance troubles, the servers may be totally clogged down for a whole minute and idle for nine minutes, and atop will show 10% resource utilization.
Experiences, summarized
There are two classes of troubles where the snapshot logging can be extremely useful, it’s for finding the culprit of performance problems, and it’s for doing post-incident investigations.
When it comes to performance problems, if you log all the requests (which one typically does with the web server) and have all the important statistics (particularly timing statistics) available there, then it’s not so much need for snapshot logging. However, if the complete log is huge, analyzing the snapshot logs may be a lot faster and a lot easier than analyzing the full logs.
As for incidents where the service has become unavailable or severely degraded, the cause may be found in requests that never finished, and those may not be found in the request log, hence snapshot logging can be useful for seeing exactly what the server was working with at the moment things went wrong.
In my experience, atop and snapshots of the database process list has in many cases been extremely useful, while I very seldom look into the snapshot logging of the web server process lists.
Why automate Ansible
Ansible can be used for many things. There are only a few things I have on my bucket list of things I would like to do, where Ansible cannot help me.
One of my most urgent things to handle was the increasing complexity of Ansible, its configuration and in particular the role development. As I got deeper into Ansible, more and more factors needed to be taken into consideration when setting up a role: the role structure, linting issues, molecule ... [continue reading]