Monday, October 31, 2016

MyRocks: migrating a large MySQL dataset from InnoDB to RocksDB to reduce footprint

I have been following Facebook's MyRocks project (and Mark Callaghan's blog) for a long time. The idea of an LSM based engine for MySQL is actually a great idea.

We all know that InnoDB sucks at INSERTs.  BTree in general sucks when it's about insertion speed, and the more rows you insert, the more it sucks at it. There are many blog posts on the web that shows the insert speed degradation in InnoDB when the amount of rows in the table grows. Things get much worse faster if your primary key is a random key, for example an UUID.

We hit this problem with our caching servers (yes, we do caching with MySQL!), and in order to be able to scale these servers up we moved since a couple years to the TokuDB engine with great success. TokuDB is based on fractal tree technology, and guarantees the same insert speed, no matter the number of rows you have in the table; furthermore, it has no problems with random INSERTs (inserting rows with random PK values) and it also achieves great compression when compared to InnoDB.

Enter RocksDB and the LSM (Log Structured Merge) technology: we are entering a new planet as far as data storage goes. The idea of an "append only" type of approach for writing database rows is something I never heard before. Databases usually are the most random writer ever! And although SSD will not suffer from write randomness, there are other factors that need to be taken into consideration. I'll not go into more details about how LSM or RocksDB work, or why LSM is good for your SSD and your workload, as there is plenty of information about it on Mark's blog (see above) or on the official page for RocksDB and for MyRocks (the RocksDB engine for MySQL) on GitHub; just check the Wiki there.

For quite some time I dreamed  to give MyRocks a try with our giant (at least by our standards) write-mostly tracking database, currently occupying about 2.1T of disk space and growing about 200G per month (and yes we DO have data retention in place).  This cluster has a very write intensive workload and the written data is  very seldom read, usually to extract statistical reports, or for application or incident post mortem analysis. At the time of this post, it contains approximately 830 tables.
So I was really excited to hear at the recent Percona Live AMS  Yoshinori Matsunobu announcing that MyRocks was ready for production use.  I explicitly asked him if partitioning was supported by MyRocks (as we use range partitioning to prune old data) and got a positive response. Yay!!

Even if production ready, MyRocks is still in early stages and therefore I needed to clone the git repository and compile the server from source code. I also had to compile the GCC compiler first, as it is a good idea to use one of the GCC versions that were tested by the Facebook team so that I can report issues that may happen later. Peter Z announced that MyRocks will be included in Percona Server soon, but if you want to give it a try now, this is the only way.

Next step was to obtain a consistent dump of the cluster, along with master position in order to add the MyRocks slave to the cluster after the data import. I knew this dump was going to take a long time, so instead of using mysqldump  I decided to use mydumper, the parallel dump-and-load tool that I have used many times in the past, especially to migrate out of Amazon RDS (anyone?). 

Also, I knew that some editing of the dump was required in order to accomplish the following:

- remove references to InnoDB and/or replace them with ROCKSDB
- introduce binary collation required by MyRocks  (insert  "COLLATE latin1_bin" or "COLLATE utf8_bin" in the table definitions)
- remove InnoDB compression where enabled (many of the larger tables are compressed)

So I downloaded, compiled and installed mydumper and after a whopping 44 hours, my dump was complete (and occupying about 2T of disk space uncompressed).

MyDumper creates two files for each table, one contains the table creation command and the other one contains the actual dump of the table's data.  I crafted a simple script that goes through all the table creation files and applies the changes above in them:

for f in bigdump/*schema*sql
  fn=$(basename $f)
  sed -e "s/InnoDB/ROCKSDB/g" -e "s/ DEFAULT CHARSET=latin1/ DEFAULT CHARSET=latin1 COLLATE=latin1_bin/g" -e "s/ DEFAULT CHARSET=utf8/ DEFAULT CHARSET=utf8 COLLATE=utf8_bin/g" -e "s/ ROW_FORMAT=COMPRESSED/ /g" < bigdump/$fn > bigdump/$fn.mangled
  mv bigdump/$fn tmp/$fn.orig  
  mv bigdump/$fn.mangled bigdump/$fn

So I was ready to import this 2T of stuff into MyRocks. Of course, this is also a very time consuming operation, when you have more than a few G's to import. So it is really important to use all the available optimizations in order to speed up the bulk load

Turns out that MyRocks contains quite a few tweaks to help with this; the most notable ones being rocksdb_skip_unique_check, which disables checking on unique constraints, and rocksdb_commit_in_the_middle which only commits every 1,000 rows or so. More info about options suitable for bulk loading data can be found on the Wiki page here.   Beware that these settings cannot be used in regular production mode, they can cause corruption of your data! Therefore it is best to create a dedicated config for bulk data loading purposes only, and use the standard "production" config after the import is completed.

Side note: before you attempt to load large datasets into MyRocks, make sure you set your open file limit high enough! I have learned it the hard way, that MyRocks keeps one file descriptor open for each SST file you have in your database, and the number of SST files can grow very high when you are importing lots of data. In my case, I have topped 34,000 open file descriptors when the load was completed, which are then kept open even during normal use. I had originally reported this issue and following that a better default was recommended to me for the variable target_file_size_base in rocksdb_default_cf_options. Changing the recommended value from 32m to 64m made a big change regarding the amount of file descriptors required, dropping the above number to 12,500.

After a couple failed attempt at importing the data (mostly due to me being too lazy to properly read the available documentation) I finally managed to complete the import with satisfying results, using the configuration below.  You will notice by reading the config that MyRocks cannot at this time coexist with InnoDB. I have been inquiring with Yoshinori about this, and the reason is that at this time, the crash recovery mechanism of the two engines will conflict. I can't wait for the day when we will be able to use RocksDB and InnoDB in the same server!


# config for bulk load


During the import I have been watching the disk space (footprint) occupied by MyRocks' data directory. MyRocks creates a .rocksdb subfolder in your datadir and stores all its files in there, so I ran a script to monitor both the # of file descriptor in use (with lsof) and the footprint (with du) every 60 seconds:

223G /storage/rockstat/data/.rocksdb
229G /storage/rockstat/data/.rocksdb
219G /storage/rockstat/data/.rocksdb
220G /storage/rockstat/data/.rocksdb
226G /storage/rockstat/data/.rocksdb
211G /storage/rockstat/data/.rocksdb

You can clearly see the compaction of data taking place while the data loads. Compaction pushes data to lower levels, packing and compressing it in the process and therefore the footprint decrease while the import progresses.  Cool!!

But what was really cool is finding out that MyRocks was able to pack 2.1T worth of data into a mere 611G  - reducing an already compressed dataset by more than 3 times! This is even more amazing if you consider that about a dozen of the larger tables in this particular dataset contain BLOBs which have  compressed data in them. In the steps above I had disabled compression at the engine level where it was configured in the source dataset, but of course compressed data in BLOBs will remain compressed and MyRocks did a very good job at compressing regardless.

Importing the dump took about 17  hours. During the import I ran into some stalls that I reported here. Once the import completed I shut the server down and switched the my.cnf file used for bulk load with the "production" one below. Please note that I had to keep rocksdb_block_cache_size low due to other stuff using memory on the test server at the same time;  you should raise this accordingly to the amount of memory available, keeping in mind that MyRocks, similarly to TokuDB, also uses filesystem cache extensively. I didn't find a recommendation on the MyRocks wiki, so I decided to set this parameter to 50% of the available server memory, which is the recommended value for Toku



Using the information saved by mydumper in the "metadata" file I set up replication from the actual InnoDB based slave to the MyRocks slave and let it catch up. I used MTS (Multi Threaded Slave) to speed up the catch up (parallel replication happens to work very well for this cluster's workload) and the slave caught up pretty quickly. After the catch up, I stopped the slave and disabled parallel replication, as it is not compatible with relay_log_recovery in case of a slave crash, and the latter option is needed by MyRocks to guarantee a crash safe slave.

I quickly found out that having a slave which uses an engine different than the master needs also some other tweaks in the configuration. Remember that I had to disable InnoDB in order to make MyRocks work properly.  So what happens when on the master some DDL is executed which explicitly references ENGINE=InnoDB?

Here's the answer:

               Last_SQL_Errno: 1286

               Last_SQL_Error: Worker 7 failed executing transaction '' at master log dbstat02.034383, end_log_pos 99530607; Error 'Unknown storage engine 'innodb'' on query. Default database: 'rick_rocks'. Query: 'create table ttable2 (id int not null auto_increment primary key) engine=innodb'

D'oh! Of course, this is not what I wanted. I quickly remembered about a SQL_MODE option that I always wondered what  could be used forNO_ENGINE_SUBSTITUTION

From the manual page:

With NO_ENGINE_SUBSTITUTION disabled, for CREATE TABLE the default engine is used and a warning occurs if the desired engine is unavailable. For ALTER TABLE, a warning occurs and the table is not altered.

Hey, sounds like exactly what I want!! We never use ALTER TABLE anyways as we do everything  via pt-online-schema-change, so this option should be all we need on the slave.

This option (NO_ENGINE_SUBSTITUTIONis enabled by default since MySQL 5.6.6, so I needed to revert the sql_mode to a blank string, by setting in my.cnf the following:


To my dismay, after restarting the MyRocks slave, the behaviour was unchanged.  A bit of digging in the MySQL source code and I discovered that the slave uses the master's SQL_MODE, and not its own! The mode gets written to the binary log for each and every transaction, and is used by the slave when applying events.
So, I went to the master and changed the SQL_MODE there. After restarting replication, now the behaviour is correct and a table created with Engine=INNODB is created with engine ROCKSDB on the slave.

Finally, here's a comparison of the disk usage between a classic "InnoDB" slave and our new "RocksDB" one,  over the last 48 hours:



Can you spot without using the caption which one of the two is our MyRocks slave?  The compaction algorithm can be clearly spotted in the graph.  The two big drops around 6am is when the older partitions are dropped every night to apply retention.

The graphs above show that the MyRocks slave grew from 47.75% to 48%, while the InnoDB slave grew from 77% to 78%.  What  they don't show is that the MyRocks storage is  a mere 1.4T where the InnoDB one is double that, at 2.8T!

So, bottom line is, MyRocks grew 0.25% of 1.4T = 3.5G while InnoDB grew 1% of 2.8T = 28G. This is a 7x improvement!

We will be monitoring this new slave for some time trying to compare CPU metrics and to observe general behaviour, like lag and general reliability.

But for now, I would say that MyRocks really ... rocks!!!  Well done FB Team!