Acidbase part three – Connecting all the dots

In the past two posts I described my motivation and my idea for combining an ACID RDBMS with a BASE NoSQL database into a solution that can benefit a specific range of problems. Now it’s time to bring it home by connecting the rest of the dots.

Mind your own business

There are two major parts in the proposed solution, an ACID part and a BASE part, and the golden question is what to expect from each of them. We’ve already established that we are going to use the BASE part for reading and the ACID part for writing. But that can’t literally be the whole story: if we never write anything to the BASE part, how could we ever read from it? And if we only ever write to the ACID part, what good is that data? Let’s be realistic and more specific about the responsibilities of each of the two parts.

Let’s start with the BASE part, which is easier to explain. I cannot think of any reason for the application to write to the BASE part directly. Of course, data will be written to the BASE part whenever a change is made in the ACID part, but that transfer happens internally, not through the application using the databases. The only connection the application makes to the BASE part is a read-only one, used for searching through the data, whether that is a full-text search or an analytical aggregation. In any case, what we are retrieving there is a group of records, not a single individual one.

More importantly, each record within the BASE part (which in Elasticsearch’s case is called a document) will most probably be a tree, and there is a good reason for that. NoSQL databases like Elasticsearch do a magnificent job of supporting nested documents and search them flawlessly. Joins over relations, on the other hand, are a major source of performance problems when searching an RDBMS. So by incorporating a NoSQL database we not only benefit from its near-linear scalability but also solve the other issue: the poor search performance of hierarchical data. That’s what is called killing two birds with one stone, I believe.

What we need to realize about the BASE part is that even though we will do our best to keep it up to date with the data stored in the ACID part, there can be no 100% guarantee that the data in the BASE part reflects the final state at all times. In fact, that’s the compromise we decided to make in order for the whole solution to work!

So if an application needs to make a conclusive decision based on the final state of the data, it cannot rely on the BASE part! In other words, only the data in the ACID part is authoritative, because that is the part supporting transactions.

Perhaps the previous statement discourages you and makes the whole solution seem shaky and worthless. Personally, I don’t think so. I still believe there is big value in such a solution, since the major part of a system’s load comes from reading, and taking that load off the RDBMS is going to make a huge difference. And even though the BASE part is not as reliable as the ACID part, experience tells me that when you are searching or aggregating data you are not after absolute precision; data from a few seconds ago is perfectly acceptable. Fully up-to-date data is only needed when you are about to change the data, since it is usually the basis for making decisions. There is no doubt such data can only come from the ACID part, so you see, the application layer is going to read from the ACID part after all!

Now that I have explained when and why an application needs to read data from the ACID part, let’s talk about the type of read access we will be dealing with in each part. We’ve already mentioned that in the BASE part we deal with data in bulk: we are either searching through it or aggregating it into averages, minimums, maximums, or other statistics. But what about the ACID part? We will read from the ACID part when we deal with records as individuals, in other words when we know exactly what we are looking for and can fetch it by its primary key. We will not be searching through the ACID part’s records for matches; the only scenario in which we read from it is when we know exactly what we are after. This cuts the cost of reading by a great deal, because an index (here, the primary key) is most effective when the cardinality of the index equals the cardinality of the table, which is exactly the case when the index is unique.
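
To make the contrast concrete, here is a minimal sketch of the only kind of query the application would send to the ACID part. The invoice table is my own hypothetical example, not Acidbase’s actual schema; the point is the lookup over a unique index:

-- Fetch one known record by its primary key; no searching involved.
SELECT *
FROM invoice
WHERE id = 42;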

So it is settled: searching through the data and aggregating it go through the BASE part, and fetching exact, individual records is done through the ACID part. Now let’s talk about structuring our data.

What’s wrong with generalization?

First of all, by generalization here I mean designing one database structure for all problems. The first thing that comes to mind is that it’s impossible to come up with one single design covering all the databases out there. Well, you would be wrong! Consider the following ERD:

[Figure: fully normalized ERD]

Using the given ERD you can store any kind of data with any sort of relation; you might just need to add a couple of tables to store other field types. It’s just like using a Map<String, Object> (e.g. in Java) instead of defining dedicated classes: after all, it can hold any imaginable data object. Well then, why don’t we use such a structure in our software? Back in the SQL world, because finding the data we are looking for becomes hellishly hard. Writing data into the proposed structure is not an issue at all, but searching it is. Just consider looking for all the invoices that contain some specific product! To accomplish this, you will either write a single query or multiple ones. Even though I’m a big fan of the single-query extraction strategy, over time things get complicated and eventually you will face a situation where they get out of hand. And when it comes to executing multiple queries to extract what you need, you either have to write stored procedures or implement the logic in your application layer, which carries the downside of extra data transfer. And every new extraction need means new code to write. Either way, it is not practical to use the given ERD, and it is all because of the problem of data retrieval.
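
To make the discussion tangible, here is a minimal SQL sketch of what such a generalized structure might look like, followed by the invoice query just mentioned. The table and column names are my own illustration, not the exact ERD from the figure:

-- Every record is a row in "entity", every attribute a row in "field",
-- and every link a row in "relation". More value tables could be added
-- to store other field types.
CREATE TABLE entity (
    id   BIGSERIAL PRIMARY KEY,
    type VARCHAR(64) NOT NULL            -- e.g. 'invoice', 'product'
);

CREATE TABLE field (
    id        BIGSERIAL PRIMARY KEY,
    entity_id BIGINT NOT NULL REFERENCES entity(id),
    name      VARCHAR(64) NOT NULL,      -- e.g. 'subject', 'price'
    value     TEXT
);

CREATE TABLE relation (
    id        BIGSERIAL PRIMARY KEY,
    parent_id BIGINT NOT NULL REFERENCES entity(id),
    child_id  BIGINT NOT NULL REFERENCES entity(id),
    type      VARCHAR(64) NOT NULL       -- e.g. 'invoice-product'
);

-- "All invoices containing a given product" already needs three joins
-- and criteria spread over three tables:
SELECT DISTINCT i.id
FROM entity i
JOIN relation r ON r.parent_id = i.id
JOIN entity p ON p.id = r.child_id
JOIN field f ON f.entity_id = p.id
WHERE i.type = 'invoice'
  AND p.type = 'product'
  AND f.name = 'name'
  AND f.value = 'stapler';

Even this simple question touches every table in the schema, and no single index can cover all the criteria at once, which is exactly the retrieval problem described above.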

It wouldn’t be fair not to talk about the benefits of the proposed structure as well. First of all, it is extremely dynamic: it can hold any kind of data, and by incorporating a type field it also supports arbitrary relations. That means changes to the software are welcomed as warmly as possible. One of the major issues with rigid database structures is that change comes expensive; you need to take so many precautions and go through so many channels to add a new feature to the software. With the proposed dynamic ERD, change comes easy, since every conceivable structure is already supported and you don’t need to alter the database schema to accommodate a new requirement.

With all the problems mentioned regarding the given ERD, that is still what I decided to pursue for my project, Acidbase. Why? I’m glad you asked. Before answering that question, to wrap this part up: since PostgreSQL supports fields of type JSON, I didn’t end up implementing the exact ERD given. Instead, I introduced a field on the entity that keeps all the fields of that entity in one JSON value.
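
As a rough sketch of that variant (again with illustrative names, not Acidbase’s exact schema), the separate field table collapses into one JSON column (JSONB here) on the entity:

-- The generic "field" table collapses into a single JSON column.
CREATE TABLE entity (
    id     BIGSERIAL PRIMARY KEY,
    type   VARCHAR(64) NOT NULL,
    fields JSONB NOT NULL DEFAULT '{}'
);

-- e.g. an invoice:
INSERT INTO entity (type, fields)
VALUES ('invoice', '{"subject": "office supplies", "invoice number": "A-1001"}');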

It’s a new dawn, it’s a new day and I’m feeling good

As mentioned in the previous section, the major problem with the dynamic, generalized design is data retrieval. Well, we have decided to use a BASE database for bulk reads, so problem solved! The only read access we make to the RDBMS is for individual entity access, which the given ERD handles easily, so everything is alright. All that is left is to design the BASE part of the bargain.

Through years of experience, I can tell you that an RDBMS faces its hardest performance challenge when a query joins a couple of tables and the criteria span more than one of them. This is mainly because it is not possible to define an index across multiple tables, and combining several single-table indices to find the matching records is never as fast as using a single index. A common workaround is flattening the data: you (as the application developer) introduce a redundant table that holds all the data of the joined tables and search that table instead. Putting aside the implementation difficulties of this workaround, its major problem is that it does not support hierarchical structures. Consider an invoice with multiple products, where the number of products varies from one invoice to another; how are you going to flatten that? One option is to serialize the products into a single field and search over it, but that is not very useful since you can no longer name individual fields in your criteria. This is why I chose Elasticsearch for the BASE part of the project: it supports nested documents. In Elasticsearch each record (called a document) can hold a nested document list of arbitrary size. Now we can flatten our data into Elasticsearch documents and define indices over them. That is the main plot regarding the BASE part of the project; now let’s design a mechanism for the indices.
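
As an illustration of what that looks like in practice, here is roughly how the invoice / product example could be mapped in Elasticsearch. This is not Acidbase’s actual mapping, and the exact syntax varies between Elasticsearch versions; the key element is the nested type:

{
    "mappings": {
        "properties": {
            "subject": { "type": "text" },
            "invoice number": { "type": "keyword" },
            "products": {
                "type": "nested",
                "properties": {
                    "name": { "type": "text" },
                    "price": { "type": "double" }
                }
            }
        }
    }
}

A nested query can then name individual fields of the products, which is exactly what serializing them into one flat field would have lost:

{
    "query": {
        "nested": {
            "path": "products",
            "query": { "match": { "products.name": "stapler" } }
        }
    }
}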

As I said earlier, a flattened hierarchy is an added value when used as an index structure. Indexing a single entity in a BASE database is valuable in the sense that your index becomes distributed and scalable, but when you index a flattened hierarchy of entities, you achieve something more. From the BASE point of view there is no difference between indexing a simple one-level entity and a multi-level hierarchical one, but from the ACID point of view there is a big difference between data living in a single table and data spread across multiple tables. That is the added value, and it is exactly what we are going to do.

Below is an example index definition for the invoice / product scenario:

{
    "fields": ["subject", "invoice number"],
    "products": {
        "fields": ["price", "name"]
    }
}

I’ve chosen JSON because it is simple to read and has all the elements needed to define a structure; it is also seamlessly supported by PostgreSQL. What the JSON above says is that our index covers a subject and an invoice number on the top-level entity, plus an array named products whose elements each have price and name fields. The array nature of products is implied, because the way I designed it, all relations are implemented as arrays.
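
For instance, an invoice indexed under this definition might end up in the BASE part as a document like the following (the values are made up for illustration):

{
    "subject": "office supplies",
    "invoice number": "A-1001",
    "products": [
        { "name": "stapler", "price": 12.5 },
        { "name": "A4 paper", "price": 3.2 }
    ]
}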

This proposition of indexing a hierarchy of entities introduces new challenges. For example: what happens if a product is updated while it is referenced by an invoice that has already been indexed? How do we find the index entries in the BASE part that need updating? Well, the answer is simple: just read the source code!

Conclusion

In the past three posts, I have tried to summarize the journey that led me to the need for, and the design of, Acidbase. There is much more to say, but if you are interested in knowing more you will need to study the code, because I have no better way to explain it! You can find the source code for the Acidbase project at https://gitlab.com/groups/Acidbase. Please feel free to comment on my thoughts or ask questions. I will also try to post another article on how to use Acidbase, particularly the sample GUI in the project. Have fun.
