Tired of hearing complaints from developers who couldn't join tables because they lived on different servers, I decided to give MariaDB's multi-source replication a try to see if I could leverage this neat feature.
At the time, we had five main clusters in a classic multi-master active/standby configuration, each with a few slaves under it. I wanted to create a "superslave" that would contain the dataset from all five clusters, so that developers could connect to it and join at will.
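For context, MariaDB sets up multi-source replication by giving each replication stream its own connection name. A sketch of what such a superslave configuration might look like (host names, credentials, and binlog coordinates below are illustrative, not the real ones):

```sql
-- Each master gets its own named connection on the MariaDB slave.
CHANGE MASTER 'cluster1' TO
  MASTER_HOST='cluster1-master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;

CHANGE MASTER 'cluster2' TO
  MASTER_HOST='cluster2-master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;

-- Start every configured connection, then check them all:
START ALL SLAVES;
SHOW ALL SLAVES STATUS\G
```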
The initial creation of the MariaDB superslave was easy. After installing the binaries, I just bootstrapped it with an xtrabackup copy of our main cluster and set up replication. Everything went fine, as expected.
Then I realized that I couldn't use xtrabackup to bring the datasets from the other masters into the superslave, as the neat xtrabackup tool (which I love) only works at the tablespace level. So, armed with lots of patience and time, I imported each dataset into the superslave using mysqldump (which can also record the master position) and, after approximately two weeks, I had the superslave running with an impressive (at least for me) 7 TB dataset.
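The dump-and-load step for each cluster looked something along these lines (host names and file names are illustrative); `--master-data=2` records the master's binlog coordinates as a comment in the dump, which is what lets you point replication at the right position afterwards:

```shell
# Dump one cluster's dataset, recording the master position as a
# comment in the dump file (--master-data=2).
mysqldump --host=cluster2-master.example.com \
          --single-transaction --master-data=2 \
          --all-databases > cluster2.sql

# Find the coordinates recorded in the dump:
head -n 30 cluster2.sql | grep 'CHANGE MASTER'

# Load the dataset into the superslave:
mysql --host=superslave.example.com < cluster2.sql
```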
Everything ran just fine for some days, then suddenly I started getting replication errors for queries coming from one particular master. Basically, the slave couldn't find a row that had been deleted just fine on the master. My first-aid response was to run pt-table-checksum (yes, you can run that against a multi-source slave if you are careful), and I was surprised to find hundreds of chunks with differences for that table, considering that we had had no failures whatsoever since the superslave was bootstrapped.
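For reference, a typical pt-table-checksum invocation looks something like this (host, user, and schema names are illustrative; on a multi-source slave the care required is making sure the checksum queries replicate down the connection you intend to verify):

```shell
# Checksum the suspect table from the master it belongs to; results
# are written to the percona.checksums table and replicate to slaves.
pt-table-checksum h=cluster2-master.example.com,u=checksum_user \
    --databases=logs --tables=suspect_table \
    --no-check-binlog-format
```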
I decided to sync the table anyway.
To my surprise, another checksum run shortly after the sync still reported a large number of differences. This was a real head-scratcher, especially considering that the problem only appeared on one specific table out of the thousands we had imported from the various clusters.
I started to suspect that some odd bug was biting me and that this table, for some reason, was the only one triggering it. It was your average small logging table, nothing fancy, but I noticed that a datetime column was part of the primary key. Still, just by looking at it, I had no idea what could cause the inconsistency between the master and the slave for this particular table.
I decided to peruse the binary logs to see what was passed from master to slave.
At that time, we were using the MIXED replication format, which is, by the way, the default.
Using mysqlbinlog -v I printed out INSERT statements from one of the most recent binary log files from the master, with the intention of comparing the values passed via replication with the actual values as found on the MariaDB slave.
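The invocation was along these lines (the binlog file name and table name are illustrative); with MIXED format, these inserts were logged as plain statements, and `-v` additionally decodes any row-based events into commented pseudo-SQL:

```shell
# Decode a recent binlog from the master and pull out the inserts
# hitting the suspect table.
mysqlbinlog -v mysql-bin.000123 | grep -i 'INSERT INTO suspect_table'
```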
I grepped the mysqlbinlog output for the INSERT statements on this table and noticed that all of them contained fractional seconds. This is actually quite common with Java ORMs and libraries: they like to pass millisecond information. However, I knew that DATETIME and TIMESTAMP columns in MySQL do not store this information by default: you have to create the column with the desired precision, otherwise the fractional part is discarded.
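Since MySQL 5.6.4, the precision has to be declared explicitly on the column for fractional seconds to survive; the table names here are just for illustration:

```sql
-- Without a precision, DATETIME stores whole seconds only.
CREATE TABLE t_no_frac (dt DATETIME);

-- With an explicit precision, the fractional part is kept.
CREATE TABLE t_millis  (dt DATETIME(3));  -- millisecond precision
CREATE TABLE t_micros  (dt DATETIME(6));  -- microsecond precision
```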
At least, that is what I believed before actually comparing rows of this table between the master and the slave...
Comparing the MySQL master and the MariaDB slave, I found that, for some of the rows, there was a one-second difference in the datetime column, which was part of the multi-column primary key.
That explained the replication errors: the primary keys on master and slave were, in fact, different, and as a result, DELETE statements from the master couldn't find the row on the slave!
It was easy enough to correlate this data drift to a difference in fractional-second handling between the MySQL master and the MariaDB slave: by looking at some of the rows on both master and slave, it was clear that the difference only existed when the fractional part was greater than .500 - the master was rounding to the next second, while the slave was always truncating.
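The two behaviours are easy to model: MySQL 5.6 rounds the fractional part to the nearest second on insert, while MariaDB 10.0, with its 5.5-era datetime code, simply drops it. A minimal Python sketch of the difference (function names are mine, for illustration only):

```python
from datetime import datetime, timedelta

def mysql_56_style(ts: str) -> str:
    """Round to the nearest second, as MySQL >= 5.6.4 does on insert."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    if dt.microsecond >= 500_000:
        dt += timedelta(seconds=1)
    return dt.replace(microsecond=0).strftime("%Y-%m-%d %H:%M:%S")

def mariadb_100_style(ts: str) -> str:
    """Truncate the fractional part, as MariaDB 10.0 did."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    return dt.replace(microsecond=0).strftime("%Y-%m-%d %H:%M:%S")

# The same replicated statement produces different rows on each side:
print(mysql_56_style("2015-10-05 10:00:00.501"))    # 2015-10-05 10:00:01
print(mariadb_100_style("2015-10-05 10:00:00.501")) # 2015-10-05 10:00:00
```

With the datetime in the primary key, every such row silently diverges, and the first DELETE that touches one of them breaks replication.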
A quick test confirmed the above:
master> insert into test_datetime values ('2015-10-05 10:00:00.501');
Query OK, 1 row affected (0.00 sec)

master> select * from test_datetime;
+---------------------+
| dt                  |
+---------------------+
| 2015-10-05 10:00:01 |
+---------------------+
1 row in set (0.00 sec)

MariaDB slave [rick]> select * from test_datetime;
+---------------------+
| dt                  |
+---------------------+
| 2015-10-05 10:00:00 |
+---------------------+
1 row in set (0.00 sec)
I then realized that the problem, although it only surfaced in this unfortunate situation, was actually much bigger than that, as Java was the main technology used at the company and I was pretty sure that almost all inserted datetimes and timestamps contained the damn milliseconds!
Some googling suggested that, although MariaDB 10 had some new and neat features backported from 5.7, and almost all the features from 5.6, it was actually based on MySQL 5.5 - and fractional-second rounding was only introduced in MySQL 5.6.4.
Anyway, I filed a bug with MariaDB back in 2015 when I first discovered this, as it should at least be clearly documented that replicating from MySQL 5.6 into MariaDB 10 will lead to inconsistent slaves if fractional seconds are used. You can check the status of the bug yourself...
I did find a workaround for this bug, which basically consists of switching to ROW-based replication. With ROW-based replication, the fractional-second issue is no longer in play, as the binary log contains the actual row content from the master, instead of a statement to insert the row values.
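The switch itself is a one-liner on each master, though it only takes effect for new sessions, and to survive a restart it also belongs in my.cnf:

```sql
-- On each master: log actual row images instead of statements.
-- Existing connections keep their previous format until they reconnect.
SET GLOBAL binlog_format = 'ROW';
```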
My experience is based on MariaDB 10.0.21, but judging by the attention that the above bug has received, I believe this may still be valid today.
Of course, it goes without saying that after setting the replication format to ROW, I had to spend another two weeks running table checksums and syncs on all 7 TB of data in order to bring the superslave back to consistency. But it has been running flawlessly since.