Data Analytics
Analytics databases in comparison
As data volumes grow, the demands on databases increase. As part of a research project at the Fernfachhochschule Bern, I compared Azure Synapse Analytics and Vertica. Which performs best?
Data Analytics
As data volumes grow, the demands on databases increase. As part of a research project at the Fernfachhochschule Bern, I compared Azure Synapse Analytics and Vertica. Which performs best?
In the Data & Analytics team, we work with databases on a daily basis. We advise customers and help them with their analytics solution. It is essential that we have a thorough knowledge of common database technologies. I am a learner in the team and investigate technologies or support various projects. I created a comparison for this in a semester-long project during my degree programme.
The aim of this comparison is to analyse the performance aspect between the technologies in more detail. This increases our understanding and provides scientific justification for our recommendations to customers.
The beginning of the project consisted of defining a scenario. As part of the work, I limited myself to a small star scheme. This corresponds to a typical case in our team, in a slightly smaller form.
The diagram shows the structure of the database. The fact table "sales" has 1,000,000 data records, while the smaller dimension tables had between 20 and 5,000 entries. The data was randomly generated using Vertica and loaded into both databases:
The experiments could now begin. I decided in favour of four queries with increasing complexity. I carried out the tests for two cases: Serial and parallel. The total execution time of the queries was measured, i.e. from sending the query to the arrival of the last packet. Query 3 follows as an example. The syntax differs between Azure Synapse Analytics and Vertica.
In order for the results to be meaningful, we had to create the same conditions for all databases. We used Azure to host the databases. This allowed us to compare the results with the costs incurred per day for the hardware in Azure. This result then forms the basis for the interpretations and the decision.
The costs were incurred in different ways. With Azure Synapse Analytics (PaaS), the costs were primarily incurred by the computers. The unit for performance in Azure Synapse is called the Data Warehouse Unit. This corresponds to a mixture of memory and working memory. The smallest setting of 100 DWUs was used in the test.
The data storage in Synapse is solved with so-called storage accounts. The advantage of this is that you only pay for the amount of data stored. This amounted to approx. 300 MB, resulting in costs < CHF 0.01.
For Vertica, Azure's ready-made resource was used, which supports you during setup. The biggest costs were incurred by the data storage, which ran on premium SSDs. The virtual machine was comparatively cheap. In contrast to Azure Synapse, however, I had direct access (IaaS).
Another point is the licence. For Azure Synapse, the licence is included in the price for the computing power. However, if larger projects are carried out with Vertica (from 1TB), a licence must be purchased, which incurs additional costs.
All the details on costs and hardware are described in the paper, but this blog post is not enough.
In order for the results to be meaningful, we had to create the same conditions for all databases. We used Azure to host the databases. This allowed us to compare the results with the costs incurred per day for the hardware in Azure. This result then forms the basis for the interpretations and the decision.
This is also done for Vertica, but with the costs for Vertica in the counter. For Synapse this gave a value of approx. one third and for Vertica two thirds. The orange columns are created when the blue column is multiplied by this factor.
Vertica is slightly better than Synapse in all categories except Query 1. In addition, the very short duration of Vertica in Query 2 is striking. There are several reasons for these results, which are discussed in more detail in the paper. These include, for example, different architectures or measurement errors due to the experiments.
The parallel queries could not be measured so easily due to the differences between the architectures. Instead, the ratio between serial and parallel queries was calculated for both databases and the improvement between them was compared.
In a formula, this looks like this:
There are specifications as to how many parallel queries can be processed simultaneously. This changes with the size of the database. With Synpase, the limit was 4 parallel queries. Vertica had no limit, but the performance did not change much above 10 threads.
Synapse slightly outperforms Vertica in queries one and two. However, Vertica is better in the last two queries. The results for both queries are close to each other. The costs were treated in the same way as for the serial results.
Both technologies have their strengths and weaknesses, which did not always come to light in the course of the work. In addition, there are aspects other than performance that are decisive when choosing a database.
Basically, however, we can see from our comparison that Vertica is a very good choice in terms of performance. In conclusion, however, I would like to point out that other characteristics than performance are also important when deciding on a database.
If you are interested in the work, you can contact me by e-mail: Lukas.Meier1@swisscom.com
Student PiBS Informatik
Find the job or career to suit you. A career where you can make a difference and continue your personal development.
What you do is who we are.