Picking up where we left off, in the previous post we talked about the differences between RDBMSs and NoSQLs and why there is no “one size fit all” in this product line. Now it’s time to give the good news, having some constraints set we can come up with a pattern for our dress which might not fit all but some predefined age range (from 20 to 40 seems right). And if we confine our constraints accurate enough the product could be actually useful to a wide range of customers, no one can argue the value in that! And that’s what exactly I’m going to do in this post, defining a set of constraints in a way that we can actually base a data storage system on, providing its users with features of an RDBMS and a NoSQL at the same time.
The light bulb
While we can sit around and wait for some new technology to be invented, I came up with an idea to use today’s available technology to make things happen right now, and I named it Acidbase. But as I mentioned earlier there’s a catch to it which I’m going to get to later in this post.
So, if we want the features of both systems at the same time why not actually combine two of them (one of each) in one compound solution, literally!?
Here are some random facts:
- In data storage systems, the verb of writing predates reading (since if there’s nothing written then there won’t be anything read)
- We are after having transactions, especially when writing data into our storage (I mean if we are not, a single NoSQL suffice!)
- Transactions are supported by RDBMSs and not NoSQLs
- Once written, we use indices to speed up reading data
- NoSQLs are using distributed index and can be linearly scaled which makes them perfect choices for reading data
Having the facts above, what if we use an RDBMS for writing data and a NoSQL for reading them!? Then we can use RDBMS’ transactions while writing and distributed nature of NoSQLs while reading. All we need to do to achieve this is to sync the data between the two databases. This is where the catch comes in. There are two ways to sync the data between systems; synchronous and asynchronous.
Not taking into account the side effects of synchronous synchronization, we prefer it to asynchronous one and that’s because it assures us that the data is absolutely in the same state, at any point of time, in both of the systems (RDBMS and NoSQL). This is called data integrity which means data is not contradicting. But this feature is the exact same one that prevented RDBMSs to be linearly scalable in the first place. And this is why we want the async version, even though we will lose data integrity and from time to time we might face contradicting data. This is where the term eventually consistent fits in and it means that even though at the time of writing there might be contradicting data but eventually things will settle and over the time it will be consistent. This is what we have to sacrifice so we can achieve our desired scalability.
Now that we’ve got an idea of what we are going to do, let’s talk about how we are going to do it. First, we need to choose an RDBMS and a NoSQL. I went ahead and did that for you; PostgreSQL and Elasticsearch. To explain my choices, first of all, I was looking for open source solutions, which I think does not need more explanation why! Two major famous open source RDBMSs out there are MySQL/MariaDb and PostgreSQL. I love MySQL, I’ve used it for the most of my projects and it is reliable but compared to PostgreSQL it’s just a child’s play! List of features within PostgreSQL goes way beyond what you can hope to find in MySQL. It is solid stable and it is kept up to date pretty good.
To explain the Elasticsearch choice, what we are after in the NoSQL part of our proposition is indexing. And when it comes to indexing nothing beats Lucene. There are a lot of good examples of NoSQLs, no argument there but they don’t provide you with features like Lucene does. And once fixated on Lucene, there are two major solutions you can use; Apache Solr and Elasticsearch (both incorporate Lucene internally). And it seemed to me that Elasticsearch is the right way to choose at this point.
Having two databases set, all that is left is to sync whatever is written in the PostgreSQL into Elasticsearch. One way to do this is implementing it in the application layer of our software. But then it won’t be much of a solution, would it be!? And we have to redo each time we want to use it in a new project, that’s not a software engineer’s way. So it is settled, we want a software solution which we store data into and it will propagate the changes to some Elasticsearch internally and when we are looking after some data it will be the Elasticsearch’s indices in work. The following picture shows the different parts of the solution:
Note that from the application point of view the data written into PostgreSQL is immediately available to search in Elasticsearch even though it might not be always the case. The synchronization engine is responsible to take the data out of the PostgreSQL and applying it to the Elasticsearch which might be a time-consuming process. But the important fact is this lengthy process won’t affect the writing process and the transaction running within PostgreSQL.
In the depicted architecture, the application is not within the Acidbase scope and Elasticsearch is already scalable. All that remains are PostgreSQL and the synchronization engine. If the last two parts are scalable then we can assume we’ve got a system which all parts are linearly scalable. Let’s start with the parts that we are in charge of making, i.e. synchronization engine. This piece of program needs to be executed each time some data is changed within PostgreSQL. It should be rather easy to distribute its load using a queue system like RabbitMQ. The following diagram shows the new architecture incorporating a queue system:
In this new architecture, RabbitMQ is used to distribute the load of synchronization over a number of engines. It will take in data changes, as PostgreSQL notifications, and informs synchronization engines through its queue. This non-polling design is as optimum as it gets. Now RabbitMQ is the next concern to become a bottleneck, but RabbitMQ can be clustered and being a really fast optimize AMQP implementation there are no worries there.
Last but not least, PostgreSQL. If PostgreSQL could be successfully scaled we did not need to go take on so much trouble in the first place. Yet PostgreSQL has its own clustered version which could be used to improve its performance, but it will not be scaled linearly (as mentioned before). Unfortunately, we are stocked here and we can go no further. Yet we’ve come so far! Taking the read access load off the RDBMS can make a huge difference in its performance. Most accesses to RDBMS are read accesses and now that Elasticsearch will handle that, PostgreSQL will be free to take on the writing load’s responsibility much better. Personally, I find this pretty practical for a huge range of problems (we were not after one size fit all after all!). The total solution might not be linearly scalable (thank god I didn’t promise that!) but it will perform much better than a single RDBMS and unlike NoSQLs it supports transactions.
That’s all folks. Next, I will explain the database design and structure used for holding data in PostgreSQL and how it is translated into Elasticsearch as these two do not speak the same language. Next post is titled: Connecting all the dots.