Databases have been a staple of the digital enterprise since the 1960s, and the relational database, built on the relational model introduced in the 1970s, reigned supreme for 40 years. In the early 2000s, however, when relational databases could no longer keep up with the exponential growth of data, NoSQL technologies appeared and ended that supremacy. That’s not to say relational databases are no longer relevant; they very much are, and to this day they are in use in almost every organisation. The way they are being used, however, is still right out of the 70s. So if you have a relational database and you are waiting too long for results, this blog dives into why that happens and how you can improve performance.
The Structured Query Language, or SQL, was introduced in the 1970s alongside the relational database model. Its simplicity and structure tied in well with how enterprises were organised, and SQL became synonymous with relational databases from that point on. Non-relational databases have been around since the 60s, well before SQL was ever introduced, but it was only towards the turn of the century that they started to gain popularity. With growing unstructured data and an expanding number of use cases, newly developed non-relational databases proved to be up to the task. As a result, in the early 2000s, the term NoSQL was introduced to encompass all non-relational database technologies.
The problem with SQL is SQL.
SQL is an old standard, but very flexible, which means that it has been able to keep up through the years. The NoSQL movement confirms this, since it stands for ‘Not only SQL’, meaning that SQL or relational databases still have their place in this data-driven world. The same flexibility that has kept it in the game, however, is also its biggest downfall. Allow me to explain.
The original databases ran on servers where computing and storage were tightly coupled. Simply put, one server both performed calculations on data and stored it. As demands changed, however, the benefits of decoupling storage and computation (e.g. scalability and cost) became clearer. A storage server now has less computational power and is less of a ‘computer’ in the literal sense of the word. NoSQL technologies and their query languages were built with this in mind, but SQL was not. While relational database technologies themselves have been updated to deal with these new conditions, the query language SQL remains ‘original’.
Why is that a bad thing? SQL allows users to do too much. Because it is extremely flexible, users can get data sets out of a database in a (very close to) ready state through heavy calculations in their queries (e.g. JOIN, AVG, GROUP BY), but at the cost of computational power. Many long-running enterprise databases are still serving these types of queries on the same old principle of coupled storage and computation. This clearly conflicts with the new way of organising servers, the growth of data sets, and the increasing number of data requests within organisations, and it has a detrimental effect on performance. So while SQL’s flexibility has definitely kept it in the game, it’s also the biggest issue when it comes to modern data architectures.
SQL the NoSQL way.
Dealing with this means breaking through a lot of habits that have been cultivated over the past decades. It means looking at the design of your architecture, your storage solution, and your queries. Below are three ways to modernise your database interactions, from small to big:
The first idea is simple: let’s take the title to heart; a database is not a ‘computer’ so don’t compute. Ideally, a query should just retrieve data from or store data in a database and everything else is done in a separate computing environment. Below is an example of what not to do:
SELECT country.name,
       SUM(CASE WHEN call.id IS NOT NULL THEN 1 ELSE 0 END) AS calls,
       AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time), 0)) AS avg_difference
FROM country
LEFT JOIN city ON city.country_id = country.id
LEFT JOIN customer ON city.id = customer.city_id
LEFT JOIN call ON call.customer_id = customer.id
GROUP BY country.id, country.name
HAVING AVG(ISNULL(DATEDIFF(SECOND, call.start_time, call.end_time), 0)) >
       (SELECT AVG(DATEDIFF(SECOND, call.start_time, call.end_time)) FROM call)
ORDER BY calls DESC, country.id ASC;
Even without knowing the data model of this database, it is clear that this query is way too complicated. Three LEFT JOINs, several conditional SUMs and AVGs, and a GROUP BY make the server where the database resides spend a lot of computational power on ‘retrieving’ your data. Moreover, if everyone in your organisation is running similar queries, even the most beefed-up server isn’t going to be able to keep up.
So how would you go about changing the query in the example? Stick to the basics:
SELECT <COLUMNS YOU NEED>
FROM <TABLE YOU NEED>
WHERE <FILTER IF NECESSARY>
The above query is as simple as they get: you select the columns you require for your result from the tables you need, and apply a filter if necessary. But what about the LEFT JOINs? If you need data from different tables, you just run multiple queries until you have all the data you need. The actual joining and processing is then done in code. The impact of running multiple queries is (in most cases) much smaller than computing everything in one go. If that’s not the case, the next solution will help.
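A minimal sketch of this pattern, using Python with an in-memory SQLite database as a stand-in for a real server (the `customer` and `call` tables and their columns are hypothetical, chosen to echo the earlier example): each query is a plain SELECT, and the ‘join’ and aggregation happen in application code.

```python
import sqlite3

# Hypothetical schema, for illustration only: two tables that would
# otherwise be combined with a LEFT JOIN inside the database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE call (id INTEGER PRIMARY KEY, customer_id INTEGER, duration INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO call VALUES (1, 1, 60), (2, 1, 120), (3, 2, 30);
""")

# Two simple SELECTs instead of one heavy JOIN query: the database
# only retrieves rows, it does not compute.
customers = conn.execute("SELECT id, name FROM customer").fetchall()
calls = conn.execute("SELECT customer_id, duration FROM call").fetchall()

# The joining and aggregation are done in code, in a separate
# computing environment, not on the database server.
totals = {customer_id: 0 for customer_id, _ in customers}
for customer_id, duration in calls:
    totals[customer_id] += duration

names = dict(customers)
result = {names[customer_id]: total for customer_id, total in totals.items()}
print(result)  # {'Alice': 180, 'Bob': 30}
```

The database does nothing but hand over rows; all computation has moved to where compute is cheap and scalable.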
It’s possible that the solution above won’t work for your use case. The data set may, for example, be too large to fit in memory, or the data may not be in the right format for the calculations you are trying to do. Once again, the function of a database is to store and provide data to applications, and it needs to do so as efficiently as possible. The data model therefore needs to be designed around how the data is used, not around human interpretation, as is often the case.
Changing the data model can be done in several ways. On the one hand, we can use the well-known practices of (de)normalization; on the other hand, we can use pre-calculation. For the first, there are many online guides on how to choose and achieve a certain normal form; the latter became very popular in the early days of the big data craze. Pre-calculation creates an in-between stage where data is stored with several calculations already done. The idea is to help applications that use the data format it more efficiently and to limit the number of queries they need to run. This pre-calculated data can be stored in the original database or as part of a new one.
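The in-between stage can be sketched as follows, again with an in-memory SQLite database and a hypothetical `call` table: the aggregation is run once, its result is stored as a `call_stats` table (an assumed name), and applications then read the cheap pre-computed rows instead of re-aggregating on every request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE call (id INTEGER PRIMARY KEY, customer_id INTEGER, duration INTEGER);
    INSERT INTO call VALUES (1, 1, 60), (2, 1, 120), (3, 2, 30);

    -- Pre-calculation: aggregate once, store the result as its own table.
    CREATE TABLE call_stats AS
        SELECT customer_id,
               COUNT(*)      AS calls,
               SUM(duration) AS total_duration
        FROM call
        GROUP BY customer_id;
""")

# Applications now run a trivial lookup against the pre-computed table.
row = conn.execute(
    "SELECT calls, total_duration FROM call_stats WHERE customer_id = 1"
).fetchone()
print(row)  # (2, 180)
```

The trade-off is exactly the one the text names: the aggregate is now duplicated data, so it must be refreshed when the underlying table changes.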
This solution works extremely well in a situation where a database is being used for older operational applications, but analytical solutions are being introduced. Data from the operational database can then be redirected and pre-calculated into a new database for those analytics, eliminating any interference in day-to-day operations. This new database is often referred to as a data warehouse.
This pre-calculation solution is also valuable if everyone in your organisation is running the same queries. Instead of each of them computing the same solution as discussed in the first idea, you just do it once and store it again as part of the database.
Note, however, that this solution introduces data duplication and therefore needs to be managed accordingly.
If you’ve optimized your queries and your data model but query performance is still lacking, it might just be that you are using the wrong database technology. This is the right time to look into (other) NoSQL database solutions. There are many NoSQL technologies out there, and each is highly performant in very specific use cases. There is no one-size-fits-all solution, and the selection of a NoSQL database technology will be specific to your use case(s). What to look for and how to decide which NoSQL database is perfect for you would lead us too far, but I have included an introduction to the selection process.
With all of that being said, it is important to note that several NoSQL query languages are based on SQL. While many have ditched the heavy computational aspects of the language (e.g. JOIN), some remnants may remain depending on the NoSQL technology, which means it is possible to make similar mistakes in those query languages. The solutions mentioned above work for those situations as well.
I wrote this blog because we see a lot of underperforming relational databases, and it seems that only two solutions are ever used: upgrade the entire machine to a costly, power-consuming server, or switch out the technology entirely. Both are high-cost, high-effort solutions that might not even lead to better query performance. With this blog I wanted to give you a couple of options that are much cheaper and much easier to implement before going all out on investments. From now on, if you run into an underperforming database, run through the three checks below:
- Keep queries simple; do calculations in code on compute units.
- Adapt the data model to the types of queries you run.
- If all else fails, change the database technology.