This one I'm about to describe is a serious one, yet there has been a very quiet response from Percona bug team at Launchpad, where I first submitted it (because we use Percona server here). If you have read my other posts you know by now that I am a laz... err, busy guy so bear with me if only today I have verified that this is, in fact, an upstream bug and have raised a bug to Oracle too.
It affects 5.6.33, latest version at the time of this article.
Anyways, the bug started to manifest itself in the form of complete stalls of our main cluster, which was happening at different times and apparently in a way that was unrelated to the amount of traffic hitting the master. When stalling, system CPU time was topping 100% of total available CPUs, no wonder that the server was unresponsive...
Complete stalls are not nice: everything is frozen and you can't even check what's going on, as MySQL would not respond to any command. In some cases this just lasted few seconds, in some other we had stalls which lasted well over 1 minute .
We have the habit to run a script that takes snapshots of "SHOW ENGINE INNODB STATUS" every 10 seconds on each master, and saves them to files; we use these to do post-mortem analysis of incidents that hit the production (traffic spikes and the like). But this time, the snapshots didn't give any useful information: the DB was just running smoothly, low traffic, no sign of anything suspicious. Few second later, total stall. Then smooth again. Weird.
I had to find a way to catch the bug while it was biting. I decided to craft to a small script to enable the general log for 60 seconds, then turn it off again and run pt-query-digest on the generated log, printing a digest of top offenders. Then I waited for the next occurrence, patiently watching sysstat output. As soon as the system CPU time skyrocketed again, I ran my script that captured 60 seconds of full database traffic.
I was really surprised to find that in the sampled time, the top query by number of invocations was SHOW WARNINGS, with about 10,000 invocations per second.
Whoa!!! What the hell is that? Why are production applications executing SHOW WARNINGS in the first place? And not only that, but the amount of them was INSANE!!
Well, it was now obvious that the server stall was actually caused by a flood of SHOW WARNINGS. Not so obvious was the reason for such large amount of this statement to arrive on our master.
Lo and behold, turns out that there is a neat feature in Hibernate, the ORM that we use here, that automatically causes a SHOW WARNINGS to be issued after each and every statement that is executed. To have this enabled it is sufficient to set the Hibernate log level to "warning". D'oh!!!
Of course, Hibernate's setting was the source of all troubles. We asked the devs for a quick fix and after bringing Hibernate's log level back to "error", the problem disappeared.
Still, the behaviour of MySQL when a large amount of SHOW WARNINGS is received was not proper: it would be very easy for anyone to stall any server by just flooding it with these statements, which do not require any special privilege. I felt that this is a serious bug (after all, it has caused downtime in production for us) so I decided to find a way to reproduce it, and to raise a bug as soon as I succeeded.
So I went to our 32 core test server and set up a sysbench test using this simple piece of LUA code:
pathtest = string.match(test, "(.*/)") or ""
dofile(pathtest .. "./sysbench/tests/db/common.lua")
Then I ran sysbench in order to try and reproduce the stall. The command line I used at first was the following:
./sysbench/sysbench --mysql-host=localhost --mysql-user=sysbench --mysql-password=sbtest --mysql-db=sbtest --test=show_warnings.lua --max_requests=0 --max_time=20 --num-threads=30 --mysql-engine-trx=yes run
To my dismay, this did not produce the desired effect: there was no stall and the server was able to perform approximately 560,000 SHOW WARNINGS per second using 30 cores out of the 32 available on the server (keep in mind I was also running sysbench on the same machine).
This was really head scratching to me, as I was pretty sure that this statement was the root cause of our incidents. So I decided to look at the connection string used by the offending application and noticed that there was link compression enabled. This was a common setup throughout the company, and was in place for historical reasons, to save bandwidth when we were on the old non-gigabit infrastructure.
I decided to add compression to the sysbench test and try again:
./sysbench/sysbench --mysql-host=localhost --mysql-user=sysbench --mysql-password=sbtest --mysql-db=sbtest --test=show_warnings.lua --max_requests=0 --max_time=20 --num-threads=30 --mysql-engine-trx=yes --mysql-compression=on run
Needless to say, that did the trick!! System CPU time skyrocketed and used over 80% of the 32 cores of the test server. Now, this is absolutely weird - what can protocol compression have to do with SHOW WARNINGS? Moreover, throughput fell down from 560k to 48k qps.
I have filed this as a bug on Percona launchpad long ago, but there has been very little activity on it, as you can see.
I feel that this bug is a serious one, as Hibernate is widely used at MySQL based shops, and there may be other ORMs that have similar options that can flood poor MySQL with thousands of SHOW WARNINGS per second.
Since I have verified that this is an upstream bug, today I have filed it on bugs.mysql.com as well, hopefully it will receive the attention it deserves.