Migrating MySQL to Postgres

Switching out your database layer is always a PITA, no matter what level you've abstracted away from it, but this week I ended up taking the leap instead of tackling a harder problem presented by MySQL, and it went pretty well.

Why is this a problem, they're both SQL-based relational databases, right?

You'd be forgiven for thinking the problem is as simple as a quick mysqldump --compatible=postgresql. Unfortunately, whilst the Postgres compatible dumps that mysqldump produces do get you some of the way there, they fall quite far short of what Postgres needs, and most likely, what you want, a few thinks you probably want to think about include:

  • Asserting your defaults, foreign keys and indexes carry over correctly
  • Converting types e.g. TINYINTs into BOOLEANs and needless VARCHAR(255)'s to TEXT types (Postgres can index TEXT just fine)
  • Handle multibyte encoding on any string fields you do want to keep length restrictions on (you probably want to double the length – I opted for TEXT fields rather than lots of VARCHAR(512)'s though)
  • Switch to timezone-aware DATETIME fields (you'll probably care later, if not now)
  • Switch to a document types for anything you had to serialize previously (e.g. JSON, powerful stuff)
  • You'll also want to validate the new structure is compatible with your application(s), that nothing was lost, and that perfomance hasn't been hit
  • ...and probably much much more

Getting started

pgloader to the rescue (edited: 2015-11)

Since writing this article, I came across Dimitri's pgloader at All Your Base Conf, which looks to make this problem a whole lot easier to solve:

  • Load data into PostgreSQL. Any Data.
  • Read data from MySQL, SQLite or dBase databases
  • Migrate from MySQL to PostgreSQL. In one command.
  • Transform your data. While streaming.

I am yet to try it, but it looks pretty darn sweet.

Migrating from MySQL with pgloader.

Lanyrd's MySQL to PostgreSQL Converter

tl;dr: pgloader is probably a way better idea than this

At this point you might be thinking this is sounding a lot more like headache. You'll be pleased to hear the great folk at Lanyrd open-sourced the script they used, MySQL to PostgreSQL Converter, which is pretty easy to read through being a simple Python script, and whilst it has a handful of shortcomings, and is seemingly unmaintained now they're done with it, it does a pretty good job, and with a few tweaks to my specific use case, got me most of the way there.

A few of the changes I made included:

  • Simplifying the VARCHAR handling to convert them all to TEXT types. If yours is a Rails app, this is what t.string (with no length) would have evaluated to in your migrations had you used Postgres in the first place anyway, and there's no advantage of VARCHAR over TEXT for indexing in Postgres, that's a MySQL thing
  • A fix to BOOLEAN handling where default values were lost
  • Trashing a fair bit of the index handling in favour of reapplying these with Rails migrations, it was easier

Application changes and MySQL gotchas

In short, if your application's database workload is relatively typical, and you're using a framework with an ORM or database abstraction – your changes should be few and far between.

The only notable things I came across in the Ruby on Rails application I was migrating was a handful of places where raw queries were written which included usage of the following:

Seeded randomisation

MySQL's seeded RAND is different to Postgres', basically, MySQL accepts a integer to seed RAND, whilst Postgres does it in a separate SETSEED command you call before RAND, and instead wants a float between 0 and 1.

MySQL's grouping used to be sloppy

A "feature" of MySQL up until recently allowed for the use of non-aggregated columns in SELECT, HAVING and ORDER BY when using GROUP BY – essentially a sloppy implementation of the SQL specification.

Postgres, of course, follows the SQL specification, so in short, you have to either aggregate things you want to use in these clauses e.g. with MAX (a common hack), or include them in the GROUP BY.

Note that as of MySQL 5.7.5, the default of ONLY_FULL_GROUP_BY was changed, and this behaviour is no longer default - so maybe you're good to go.

https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmodeonlyfullgroupby

Absense of order

In the SQL world, order is not an inherent property of a set of data. Thus, you get no guarantees from your RDBMS that your data will come back in a certain order – or even in a consistent order – unless you query your data with an ORDER BY clause.

http://dba.stackexchange.com/a/6053

I found a number of cases where ORDER BY clauses were omitted from queries being ran, the application relied on MySQL returning by insertion order, I believe, either way this behaviour seemed far more chaotic in Postgres and as above isn't guaranteed anywhere – I stuck some ORDER BY id's in.

Unicode NULLs

A problem I didn't expect to run into was caused by unicode nulls in my MySQL database screwing with my INSERT'S into Postgres, and even when I worked out what the problem was, it still took longer than I'd like to admit to work out how to nuke them.

Postgres does not allow null bytes ('\0') in string fields (CHAR, TEXT or VARCHAR), and if you try and store them, you'll receive the error:

ERROR:  22021: invalid byte sequence for encoding "UTF8": 0x00  

It appeared I had unicode nulls in a lot of places in the dataset I was working on, who knows where they came from, or where they were, but I chose to /simply/ replace them all, in every string type column, as as far as I could see they were semantically meaningless in the cases I came across.

To do that, after a lot of trial and error (I came across a lot of examples that didn't work), I whipped up a series of queries that looked something like this:

UPDATE comments SET content = REPLACE(content, CHAR(0x00), '') WHERE content LIKE CONCAT('%', 0x00, '%');  

If you're feeling super lazy (I was), you can write a query to generate those queries for you based on the information_schema table, which describes your other MySQL databases, mine looked something like this:

SELECT CONCAT(  
  "UPDATE `", TABLE_SCHEMA, "`.`", TABLE_NAME, "` ",
  "SET `", COLUMN_NAME, "` = REPLACE(`", COLUMN_NAME, "`, CHAR(0x00), '') ",
  "WHERE `", COLUMN_NAME, "` LIKE CONCAT('%', 0x00, '%');"
) AS q
FROM information_schema.COLUMNS  
WHERE TABLE_SCHEMA = '{{ your database name }}' AND DATA_TYPE IN ('char', 'text', 'varchar')  

If you then copy the results of that query, and run those as a query, your MySQL database should be ready to go.

Data transformation

Transforming some serialized Ruby hashes into JSON so that it could be aggregated was the whole reason I wanted to migrate to Postgres.

I achieved this with a ~7 hour Rails migration that I ran overnight, which obviously, would have been pretty costly to repeat in testing, so I turned the resulting data into batched raw SQL updates (with a WHERE on the updated_at column in case the row was updated since I took the dump) so that it could be loaded in minutes rather than hours.

However, if you need to do something similar, you may have luck with pgloader's support for transforming data, while streaming.

Validating the result

If your test suite is anything like the one I had, it certainly doesn't run against a real database, so how do you go about testing this?

I don't have a great answer to this, I started with browsing the structure and data and seeing if areas I suspected might break looked okay – mostly this surrounded defaults, types, indexes and keys. Whilst this helped iron out some bugs with Lanyrd's script, my confidence was still low, and given I was running this whole process end-to-end a lot between changes, wanted a faster way to validate the quality of the result.

Count comparison

Admitedly, a pretty rudimentary check, but better than nothing, I used the information_schema table to generate a bunch of queries to counts of rows of every table, and diff'd them to check they matched.

Schema comparison

ActiveRecord's schema.rb was a big help – as a manifest describing ActiveRecord's understanding of what the database looks like, the one produced in development from the new Postgres database can be compared to the schema.rb produced by the production MySQL database. Any changes that hadn't been intentionally made by me were pretty obvious, and using this, I was able to identify and resolve a number of issues I may well have missed otherwise, including:

  • BOOLEAN defaults being lost
  • Indexes coming out differently
  • Timestamps having nullable set different

Summary

I hope some of these tips come in useful if you're thinking about making the jump, best of luck!