Executive summary
- TCP connections may freeze due to network troubles, spontaneous reboots or failovers. The application should be robust enough to handle it.
- The concept of “TCP keepalive” is one way of solving it. Linux does not come with any knobs to turn it on globally, and by default it’s turned off - but there are workarounds.
- “TCP keepalive” needs to be configured according to the network; one may want slightly different configuration for communication between two servers in the same server center and communications towards a Mars rover.
Problems happen
Redundancy is essential in environments where uptime is important, but it’s not a “silver bullet”. There are countless examples of redundant systems where the “failover” turned out to be a “failtogether” (to mention one, the Viking Sky incident was very close to being a major disaster).
We’re running some MariaDB/Galera database clusters with a service IP address and keepalived to ensure the service address will be moved to another node should anything go wrong with one node in the cluster. We also wrap this together as a “Database as a Service” product. The other day a node in the database cluster had a hiccup. The failover worked as it should and shifted the load to another node. However, this caused problems on the (non-redundant) application side, with the TCP-connection apparently frozen. I was a bit surprised that this wasn’t on some list of “known problems with failover that everyone ought to know about”.
Briefly, this is what happened:
- The client had an open TCP-connection to a server.
- The client sent a time-consuming request to the server.
- Failover happened and another server took over the IP-address. (This problem is not limited to failover setups - the story would be the same if a stand-alone server suddenly rebooted without a clean shutdown.)
- The other server (or the freshly rebooted server) knows nothing about the old TCP-connection.
- The client is still waiting for the data from the request - data it will never get - so it will wait forever.
The problem only happens if the client is waiting for data from the server.
Swap points two and three above, and the server will be kind enough to inform the client (RST) that something has happened, allowing the application to either establish a new connection to the server and retry the transaction - or fail and be restarted by systemd.
During our maintenance windows it’s reasonable to believe that MySQL goes down in a controlled manner prior to failover. Then the server will send FIN (“Goodbye, I’m out of here!”) to the clients just before failover, allowing the application to reconnect or be restarted by systemd.
That’s two conditions that both have to be true: the failover must happen without a clean shutdown, and it has to happen just when the application is waiting for data from the server. Since such sudden failovers are pretty rare events, and since applications usually don’t send long-lasting requests to the database all the time, this is a bit of a corner case.
Reproducing the problem
On MariaDB and MySQL, it’s possible to simulate a long-running query by doing SELECT sleep(10). In PostgreSQL the equivalent is SELECT pg_sleep(10).
In our setup the failover can easily be triggered by systemctl restart keepalived. Alternatively, dropping traffic from the database server using iptables or nft, e.g. iptables -A INPUT -j DROP -s $DB_IP, should cause the same behaviour.
We’d expect feedback from the database after ten seconds, but instead we end up waiting indefinitely for the results.
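The underlying behaviour can also be illustrated without a database at all. Below is a rough, illustrative sketch in C - a toy client, not our actual application: it connects and then blocks in recv(). If the peer disappears without sending FIN or RST, and keepalive is off, the call blocks forever.

/* Toy illustration: connect to a server, then block waiting for data.
 * If the peer vanishes without sending FIN or RST (failover, power loss,
 * dropped traffic), recv() will block forever - keepalive is off. */
#include <stdio.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s host port\n", argv[0]);
        return 1;
    }

    struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
    struct addrinfo *res;
    int rc = getaddrinfo(argv[1], argv[2], &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }
    freeaddrinfo(res);

    char buf[1024];
    ssize_t n = recv(fd, buf, sizeof buf, 0);  /* hangs here if the server silently disappears */
    printf("recv returned %zd\n", n);
    close(fd);
    return 0;
}

Point it at a server that won’t answer immediately, add the iptables rule above while it waits, and it will just sit there - exactly the situation the database client ends up in.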
Solutions
I don’t see any way to fix this from the server side, but there are three ways to fix it at the application side:
- If using some library/software component for connection pooling/proxying, then there is a chance that it will be smart enough to discover such problems and automatically reestablish all connections.
- It’s usually sane to set up some timeouts at the application side, causing the application to assume something is wrong if the database server hasn’t responded within a certain time period.
- TCP keepalive - it basically involves sending empty packets over the otherwise idle connections and checking that ACK packets come in return - and after a failover, when no ACKs come back, the connection will be shut down. Most operating systems support this.
Connection pooling is quite common. Also, quite a few of our customers are running web solutions - web servers would typically time out requests after three minutes. Perhaps a server thread would actually become frozen, but we would probably not notice. Single-threaded batch servers seem to be one of the few corner cases where things just stop working due to this bug.
With the exception of web requests, timeouts for database queries need to be tuned carefully - there is no “one value fits all”. If a query takes 49 hours to run, it would be very uncool if the connection got closed due to a timeout after 48 hours. TCP keepalive packets also need some configuration, and the configuration depends most of all on the network connectivity. Unless the TCP packets are transported over RFC 1149 or on tapes (pretty safe assumption), the keepalive timeout can be set in minutes or even seconds, even if the SQL queries may take hours to run.
I can think of three disadvantages of TCP keepalive:
- Arguably it’s a feature that TCP-connections can survive long-lasting network glitches - like, I’m happy if I can put my laptop into suspend mode, open it again some hours later, and still have all my SSH connections alive.
- It’s a feature that TCP-connections can survive on unreliable networks. With keepalive turned on, there is a risk that the TCP-connections will be shut down too early.
- Keepalive packets cause extra traffic, stealing from the available bandwidth. On server-to-server communication this extra bandwidth should be negligible. Corner cases may exist, like on the 300 baud connection from that Mars rover, or on a server having a million open, idle TCP-connections (one of our customers has tens of thousands of open TCP-connections, but that’s quite rare).
As with timeouts, the keepalive settings may need to be tweaked. Aggressive keepalive settings may cause excess bandwidth consumption and TCP-connections being dropped too often. With too lenient settings, the application may stay frozen for an unacceptably long time. With cabled networking, servers located in well-run server centers, few network hops between application and database server, single-digit millisecond round-trip times, and excessive amounts of free bandwidth, we can have quite aggressive settings.
While TCP keepalive is almost certainly a good idea, it’s off by default in Linux and cannot be turned on globally; it has to be enabled by the application. The default configuration is also more suitable for the Mars rover than for server-to-server communication within the same (or nearby) server center(s).
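For reference, enabling keepalive from the application on Linux looks roughly like the sketch below (enable_keepalive is just an illustrative helper name, and the timing values are examples only, matching the aggressive settings discussed above). The application would call it on the socket it has opened towards the database.

/* Sketch: enable TCP keepalive on an already-created socket on Linux.
 * The timing values are examples only and must be tuned to the network. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int enable_keepalive(int fd)
{
    int enable = 1;
    int idle = 7;   /* seconds of idleness before the first probe    */
    int intvl = 1;  /* seconds between probes                        */
    int cnt = 3;    /* unanswered probes before the kernel gives up  */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof enable) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) < 0)
        return -1;
    return 0;
}

Once the per-socket options are set, the global kernel defaults (two hours before the first probe) no longer apply to that connection.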
The LD_PRELOAD workaround
Keepalive may be enabled by the application when initiating the connection, but sometimes it’s not possible, not easy or not desirable to modify the application. On Linux it’s possible to override any library function - all that is needed is a little bit of programming and setting an environment variable before starting the application. My colleague quickly wrote up some C code for overriding the int socket(int, int, int) function, letting it set the appropriate flags for doing keepalive - but he was reinventing the wheel; the code already exists and is even available from the standard Debian/Ubuntu package repositories. So basically, all that should be needed is an apt-get install libkeepalive0 and adding some lines to the systemd service definition:
[Service]
Environment="LD_PRELOAD=/usr/lib/libkeepalive.so"
Environment="KEEPIDLE=7"
Environment="KEEPINTVL=1"
Environment="KEEPCNT=3"
With the setup above, it will take a minimum of three and a maximum of ten seconds from the moment the TCP-connection is broken until it’s terminated (up to seven seconds of idle time before the first probe, plus three probes at one-second intervals). Those settings are quite aggressive - while it works out between the application server and the database server, if the application server has TCP-traffic going to other parties - like, say, mobile clients - then the settings above are most likely too aggressive.
For some reason, I didn’t get the environment variables above to work. I didn’t do more research into why, but tossing the configuration into the global Linux network configuration solved my problems:
cat > /etc/sysctl.d/tcp_keepalive.conf
net.ipv4.tcp_keepalive_time = 7
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 3
^D
sudo systemctl restart systemd-sysctl
Beware that the settings above will be valid for all applications using keepalive. In real production with TCP-connections going to arbitrary third parties, I’d recommend much higher values - 80, 20 and 4 sound about right, giving a worst case of 80 + 4×20 = 160 seconds before a dead connection is detected - or even higher numbers if expecting many long-lasting, important TCP-connections from/to mobile clients.
If one cannot alter the environment and/or wants a global fix, then toss /usr/lib/libkeepalive.so into /etc/ld.so.preload.
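For the curious, the idea behind such an LD_PRELOAD shim is roughly the sketch below - an illustration only, not the actual libkeepalive source, with the file names made up and the values hard-coded where the real library reads them from the environment:

/* Sketch of an LD_PRELOAD shim: wrap socket() and turn on keepalive for
 * every TCP socket the application creates. Build roughly like this:
 *   gcc -shared -fPIC -o keepalive_shim.so keepalive_shim.c -ldl
 * and start the application with LD_PRELOAD pointing at the result. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int socket(int domain, int type, int protocol)
{
    /* Look up the real socket() the first time we are called. */
    static int (*real_socket)(int, int, int);
    if (!real_socket)
        real_socket = (int (*)(int, int, int)) dlsym(RTLD_NEXT, "socket");

    int fd = real_socket(domain, type, protocol);

    /* Only touch TCP stream sockets; leave everything else alone. */
    if (fd >= 0 &&
        (domain == AF_INET || domain == AF_INET6) &&
        (type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC)) == SOCK_STREAM) {
        int enable = 1, idle = 7, intvl = 1, cnt = 3;
        setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof enable);
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle);
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt);
    }
    return fd;
}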
Why not just set it on the server side?
The latest versions of both MySQL and MariaDB support turning on TCP keepalive in the configuration file. For a short moment I thought “why not just configure it once and for all on the server side?” Well, if keepalive were a part of the TCP protocol and negotiated when setting up the TCP connection, then this could probably be solved at the server side - but that’s not the way it works. Keepalives configured on the server side will only help the server detect that the client has gone dead; they will not help the client detect that the server is dead.
More reading
- TCP Keepalive HOWTO (from 2007, but still relevant)
- Libkeepalive on Sourceforge
Credits
I received lots of support from Kjetil Homme and Tore Anderson while investigating and fixing this problem.