Long-running transactions considered harmful

When working with ACID storages developers have a powerful tool at their disposal – transactions. Most RDBMS systems out there – think MySQL or PostgreSQL – are fundamentally ACID storages. Developers learn transactions as part of any basics course on RDBMSes. They seem pretty organic to underlying databases. One may get really surprised by the lack of such a feature in NoSQL storages. Or more precisely much more limited functionality of the feature.

Using transactions with ACID storages may seem like a good programming technique, a well-established pattern that should be always followed. Doing somehow related changes? Wrap the whole command set in a transaction. What if your command set will fail in the middle? You won’t have to collect any garbage left by already executed commands. You don’t have to deal with any possible inconsistencies. The world will be a better place 🙂

But as with most of the things in computer science actually usage of transactions is a tradeoff. And one usually has to learn that the hard way. Or at least I did.

The story

I once worked on a SaaS-system and we created a feature of demo accounts for potential customers. Basically we create an account with some sample data for a potential customer. With that set up they can evaluate the system as it stands before buying the thing.

To create a demo account we use a template company with template data. After cloning that template we also generate reports to fill in corresponding sections of the app. With around 1700 items of template data we generated 30 days of report data hence overall creating more than 50 thousand different records in the DB. It was PostgreSQL via AWS RDS.

That process takes around 30 minutes to complete. And one long-running transaction wrapped the whole thing That ensured it’s all or nothing in case something will go wrong. Everything was fine for a while – demos were created and used and all was good. Until once we faced a serious DB engine outage. Queries that seem simple got stuck for a long time. We had to investigate & found a solution quickly.

Epimyth

Without going into the nitty-gritty of the investigation & actually dealing with the situation the bottom line is as follows. Demo account creation was occasionally happening when there was another pretty intense DB workload. The demo account transaction was long, involved a lot of DB tables it produced a lot of different locks. Other concurrent DB processes had to wait for releasing of these locks before proceeding. That lead to the disaster we faced.

Should you avoid long-running transactions like a plaque? Of course not. Just remember that transaction is basically a tradeoff between consistency and system throughput/scalability. And yeah, transactions have different isolation levels. That may also be a way of fine-tuning DB usage in the context of a particular project.