This is the first of a three-part post introducing an open source project I started, named Acidbase. In this first part I’m going to explain my motivation and the problem Acidbase addresses: scalability without sacrificing ACID attributes, hence ACID + BASE.
Where does it all come from
Data storage is often the main bottleneck for software, since data tends to grow over time. For years, the number one strategy for tackling this problem was upgrading the hardware. But in the 21st century, data growth has accelerated so much that processing power cannot keep up. So now the hope rests on software to come up with solutions where hardware cannot.
Since the early days of computing, the main software for managing large, interrelated data sets has been the RDBMS. These systems have gone through many years of development and are now bulletproof tools for storing and retrieving data in a structured way. RDBMS stands for Relational Database Management System, which means such systems are in charge of managing data items that are related to each other. If you take the relations out of your data, you won’t need an RDBMS to work with it; much simpler solutions could provide the service you need.
One major responsibility of an RDBMS is retrieving data quickly, and it does so using indices. An index is a data structure that helps find a particular record in a huge pool of similar ones with much less effort than checking them one by one. In practice, you want to keep the index in main memory at all times to get the best out of it. But the size of an index is proportional to the size of the original data, and there is always the possibility that your index outgrows the memory you have (there is an upper bound on how much memory a single machine can hold). This is when you need a workaround that does not compromise speed, and in the past few years that workaround has taken the form of software known as NoSQL. NoSQL is an answer to the ever-growing data problem, and it mainly addresses it through sharding, which means splitting data into sections that are searched separately. Each shard can be kept on separate hardware, which removes the per-machine memory limit: if you cannot add more memory to the existing machines, you can add a new shard with its own hardware.
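To make the idea of sharding a bit more concrete, here is a minimal Python sketch. It is only an illustration and not how any particular NoSQL product works: the `ShardedStore` class, its method names and the hash-modulo routing are all assumptions made for this example, and each in-memory dictionary stands in for the index held by one machine.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that splits its data across several shards (hypothetical)."""

    def __init__(self, shard_count):
        # Each dict stands in for the in-memory index of one machine.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Use a stable hash so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        shard_number = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[shard_number]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        # Only one shard is consulted, so only its index has to fit in memory.
        return self._shard_for(key).get(key)


store = ShardedStore(shard_count=4)
for user_id in range(1000):
    store.put(f"user:{user_id}", {"id": user_id})

print(store.get("user:42"))
print([len(shard) for shard in store.shards])  # how the keys were spread out
```

Each lookup touches a single shard, so no machine ever needs the whole index in memory. Note that with plain modulo routing, changing the number of shards moves most keys to a different shard; real systems tend to use more careful schemes such as consistent hashing.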
In the next section, I’m going to compare the features of an RDBMS and a NoSQL: how they differ and when to use each of them.
RDBMS vs NoSQL
Each of the two comes with its own set of features, which are needed in different scenarios. In other words, it is incorrect to consider NoSQL a substitute for RDBMS. A more accurate statement would be: NoSQL came around to fill the gap where RDBMS could not perform. This means an RDBMS is still the number one choice for the problem it was originally designed for: storage management for structured data with consistency concerns. For a fair comparison, NoSQL should be compared with a cluster of RDBMS instances, which is a multi-instance deployment similar to a NoSQL one.
Both RDBMS and NoSQL are designed to store and retrieve data, and in that sense they are alike. In fact, I think there is only one main difference between the two: RDBMSs support transactions while NoSQL databases can be scaled linearly, and that’s it. I’m not saying they are exactly the same in every other respect, but these two properties are what one can offer and the other cannot. And that’s because the two properties pull against each other (a trade-off in the spirit of the CAP theorem). In other words, if NoSQL databases were to support transactions the way RDBMSs do, they could not scale linearly as well. So let’s take a quick look at what these two properties are and when they are an absolute necessity.
It’s all about ACIDity and concurrent access to data. ACID stands for Atomicity, Consistency, Isolation and Durability, and these guarantees are implemented using transactions. I’m not going to talk about what transactions are and how to use them. Let’s just leave it at this: RDBMSs support transactions and NoSQLs generally don’t. And it’s because of this lack of support that NoSQLs can outperform RDBMSs when it comes to handling big data. On the other hand, data-sensitive solutions (like enterprise software) need transactions to perform reliably.
To explain linear scalability, keep in mind that we are talking about multi-instance software, where you can add new nodes to your cluster in order to improve the performance of the whole system. Consider two states of your multi-instance software, one with N nodes and the other with N+1 nodes. If we define P(N) as the performance you get out of your system with N nodes, then P(N+1)-P(N) is how much your system improves when you add 1 node to a system that already has N nodes.
The problem with RDBMSs is that this improvement depends on N. In other words, how much performance you gain by adding a new node to a cluster depends on how many nodes it already has. Your system will improve less when you add the 11th node than it did when you added the second one. This is called nonlinear behavior, and it is not the case with NoSQLs.
In a NoSQL, ∀ M,N: P(M+1)-P(M) = P(N+1)-P(N), at least in theory. In other words, each new node added to the cluster improves the system’s overall performance by the same amount, regardless of the number of nodes it already has. This is called linear scalability, and it is one important feature that NoSQL databases are generally popular for.
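The difference between the two scaling behaviors can be illustrated with a few lines of Python. This is only a sketch under assumed numbers: both performance functions below are made-up models, not measurements of any real database, and the `per_node` and `overhead` parameters are hypothetical. The first model gives every node the same contribution; the second adds a coordination cost that grows with the number of nodes.

```python
def linear_performance(nodes, per_node=100.0):
    # Hypothetical ideal: every node contributes the same throughput.
    return per_node * nodes


def contended_performance(nodes, per_node=100.0, overhead=0.1):
    # Hypothetical cluster whose coordination cost grows with the node count.
    return per_node * nodes / (1.0 + overhead * (nodes - 1))


def marginal_gain(performance, nodes):
    # P(N+1) - P(N): what adding one node buys when N nodes already exist.
    return performance(nodes + 1) - performance(nodes)


# Compare adding the 2nd node (N=1) with adding the 11th node (N=10).
for n in (1, 10):
    print(
        f"N={n}:",
        "linear gain =", marginal_gain(linear_performance, n),
        "contended gain =", round(marginal_gain(contended_performance, n), 1),
    )
```

In the linear model the gain is the same for every N, which is exactly the equality above. In the contended model the 11th node adds far less than the second one did.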
My experience is mostly in enterprise software development, and as a result the ACID side of a database has always been important in my job. Even though I cannot eliminate the need for transactions, I am still fascinated by the progress in the NoSQL world. And I have already run into the performance limitations of an RDBMS, so I decided to take on the problem in my own way. Since I have neither the resources nor the skills to tackle the problem from a theoretical angle, I decided to use my ingenuity. So I came up with an idea to combine an RDBMS and a NoSQL, trying to get the best of both worlds, and as a result Acidbase was born.
Before wrapping this post up, let’s quickly recap what we covered here. Firstly, RDBMSs and NoSQLs are not actually rivals; each provides features that the other does not. Secondly, the answer to the question “Can I sacrifice ACID attributes (transactions) for BASE ones?” is the key to deciding which solution to use. I mean, if your requirements dictate the need for transactions, it doesn’t matter how badly you want linear scalability, it’s not gonna happen, dude. Of course everyone likes the performance improvement of NoSQLs, but it’s a matter of what you are going to lose, not of what you are going to gain.
In the next post I’m going to talk about whether and how it is possible to get the best of both worlds, and of course, it is titled: I want it all and I want it now.