After I started reading Elasticsearch's documentation thoroughly, I realized I had missed one really useful feature: batch processing!
Yes, I had missed the fact that Elasticsearch supports batch insertion / update, which, as we all know, usually reduces overhead. Luckily, my implementation could support batch processing of data with a small change. So after I tweaked the implementation and the corresponding configuration, I could tune the number of records to be indexed with each message received in the indexing engines. Now it's testing time!
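To give a feel for what batch indexing looks like on the wire, here's a minimal sketch of how a request body for Elasticsearch's `_bulk` endpoint is assembled: one action line followed by one source line per document, in newline-delimited JSON. The index name and documents are made up for illustration, and a real client library would normally handle this for you.

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON payload for Elasticsearch's _bulk endpoint:
    an action line, then the document source, for each record."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    # _bulk requires a trailing newline after the last line
    return "\n".join(lines) + "\n"

body = build_bulk_body("products", [{"id": 1, "name": "widget"},
                                    {"id": 2, "name": "gadget"}])
print(body)
```

The point is that a batch of N records travels as a single HTTP request instead of N separate ones, which is where the overhead savings come from.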
In order to find the effect of batch processing, I started testing the system as before, but to my surprise, the performance boost was so great that my previous rough estimation was not precise enough to base the analysis on. By that I mean the error margin was so high that it was even greater than the performance readings themselves. So I had no choice but to sharpen my tools and aim for better readings. And this took me about 5 days, since each failed attempt needed a test run before I could scratch it out, and each test run took hours.
Anyway, I decided to go the extra mile and make this test as accurate as possible. So I added some code to the indexing engine and the frontend projects to timestamp when they start working and when they stop. Then, after each phase, I would ask them to flush their readings while some other code gathered them all up. This is the most accurate way to measure how long each process takes, since there are no intermediate parties involved.
As for the test setup, all tests use the same dataset as before, and everything is kept the same among them other than the parameters we are trying to fine-tune: the number of index engines and the batch size. For this test, I tried six different engine counts (1 to 6) and six different batch sizes (1, 10, 100, 250, 500, and 1000). A batch size of 10 means that each request to Elasticsearch includes 10 entities to be indexed (at most, since there might be fewer available).
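For clarity, the parameter grid above is just the cross product of the two lists, so the full sweep works out to 36 runs:

```python
from itertools import product

engines = range(1, 7)                      # 1 to 6 index engines
batch_sizes = [1, 10, 100, 250, 500, 1000]

# every (engine count, batch size) combination tested
configs = list(product(engines, batch_sizes))
print(len(configs))  # → 36
```

Each of these 36 configurations fills one cell of the heat maps below.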
So enough talking; let's see the results:
The charts given here are heat maps and the darker each cell is, the more desirable that configuration would be. There are two actions, each shown with a different color, record insertion in red and record indexing in green. There’s also a chart in blue which shows their ratios.
Each chart contains information on three different indices: Products, Customers, and Orders (as described in the previous post). What I want to mention here is that the records for each of these indices were inserted separately, and the next one was started only when the previous one was completely done. This is because I didn't want any test to contaminate the others.
So what do all these numbers and colors mean? For one thing, they show the importance of batch processing in this scenario. When records are dealt with one at a time (1BS), it's a total disaster. But packing records into batches of as few as 10 boosts the whole system's performance a lot. Of course, adding extra index engines is also helpful in its own way. The final solution is definitely a combination of both.
But to be fair, we also need to talk about the reduction in insertion rate as a result of batch sizes and/or multiple index engines (the red chart). There are two factors to consider here. First, I've run all these tests on my single-CPU desktop computer, which means that when multiple indexing engines are running, they take processing power away from other parts like PostgreSQL, and insertion performance suffers. This could be improved by scattering the different parts of this system across separate hardware. But there's also a second reason why better indexing performance means worse insertion performance: the indexing engines need to read their data from the database. In other words, if an indexing engine is working faster, it's reading data from the database faster, putting more pressure on PostgreSQL to provide the data it needs. This drawback can only be resolved by using a clustered version of PostgreSQL and adding new nodes to it.
So there you have it, the results for Acidbase's performance. I think it's doing pretty well, considering that with batch sizes greater than 100 and a couple of index engines I managed to index records almost instantly. The exact batch size or number of index engines is something that should be analyzed case by case and cannot be generalized, but I think a batch size of 500 and 5 index engines is a safe choice for the default config. Also, remember that batch sizes only come into effect when there are enough records to form a batch of that size. Otherwise, the system wraps up a batch with however many records it has at that point in time.
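That last point, wrapping up a partial batch when there aren't enough records left, can be sketched like this (again an illustrative sketch, not the actual Acidbase implementation):

```python
def batches(records, batch_size):
    """Yield full batches of batch_size records, plus a final
    partial batch with however many records remain."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # wrap up whatever is left, even if short of batch_size
        yield batch

print([len(b) for b in batches(range(23), 10)])  # → [10, 10, 3]
```

So a configured batch size is really an upper bound; a trickle of incoming records still gets indexed promptly rather than waiting for a full batch.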