You wouldn't eat soup with a fork, would you? - Part I

Posted on 29-jul-2020 17:38:09

You wouldn’t eat soup with a fork, now would you? Why? Because it isn’t the right tool for the job. You might get some soup in your mouth from time to time, but in general it’s not an efficient way to get it down. So why are we still forking our data?

In this two-part blog, I want to talk about how data is often used and how it should be used instead. This first part focuses on polyglot persistence: what it means and where it came from. The second part will dive into making polyglot persistence practical. Let’s dive in!

On the evolution of databases

Before we get into semantics, we need to have a little history lesson 🧐. From the beginning of time, or rather the beginning of computer science, relational databases have been the go-to way of storing data. A relational database represents data as tables (or 'relations'), consisting of columns and rows. The freedom you had in how data was stored in your organization was limited to which vendor you would use and how you would organize the data in the tables (i.e. the data model). As a side note, another term you have probably heard for relational databases is SQL databases. SQL, or 'Structured Query Language', is the recognized standard language by which we communicate with relational databases, which is why the two terms are used interchangeably.
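To make 'tables, columns and rows' a bit more tangible, here is a minimal sketch using SQLite, a small relational database that ships with Python. The `customers` table, its columns and the sample rows are made up purely for illustration.

```python
import sqlite3


def demo_relational():
    """Create a tiny table, insert a few rows, and query them with SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # The data model: one table with three columns (hypothetical example).
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    # Each tuple becomes one row in the table.
    conn.executemany(
        "INSERT INTO customers (name, city) VALUES (?, ?)",
        [("Alice", "Ghent"), ("Bob", "Antwerp"), ("Carol", "Ghent")],
    )
    # SQL is the language we use to ask the database a question.
    rows = conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Ghent",)
    ).fetchall()
    conn.close()
    return rows


print(demo_relational())  # [('Alice',), ('Carol',)]
```

Whichever vendor you pick, the pattern stays roughly the same: define the tables up front, then read and write them through SQL.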

However, when the age of big data rolled around, we quickly realized relational (or SQL) databases were no longer cutting it. Simply put, data sets were growing faster than we could upgrade the single computer (or server) running our relational database. So instead of using one big server, we started using several smaller servers clustered together. The idea is that all those smaller servers act like one large server, but with no theoretical limit to how large the cluster can become (since you can keep on adding servers to it). The age of big data therefore also introduced the age of distributed computing and storage, i.e. storage and computing across a cluster of servers.
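How do several small servers act like one large one? One common idea is to let a hash of each record's key decide which server stores it. The sketch below is a simplified illustration of that idea, not how any particular database does it; the server names and record keys are hypothetical.

```python
import hashlib


def pick_server(key, servers):
    """Map a record key to one server, deterministically, via a hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a number, then wrap it
    # around the number of servers in the cluster.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def distribute(keys, servers):
    """Return a mapping of server name to the list of keys it would hold."""
    placement = {server: [] for server in servers}
    for key in keys:
        placement[pick_server(key, servers)].append(key)
    return placement


servers = ["server-a", "server-b", "server-c"]  # hypothetical cluster
keys = [f"record-{i}" for i in range(1000)]     # hypothetical data
placement = distribute(keys, servers)
for server, held in placement.items():
    print(server, len(held))
```

With this naive modulo scheme, adding a server changes where most keys end up, which is why real distributed stores tend to use more refined placement strategies.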

The problem for relational databases at the time was that they did not do well in a clustered environment. They lost crucial performance, which opened the door for an entire generation of non-relational databases born in this distributed environment. We refer to these as 'NoSQL' databases or technologies. The 'No' is usually read as 'not only' (confusing, I know), which means that SQL/relational databases are actually also part of the NoSQL group (even more confusing, I know).

Allow me to explain: the explosion of new database technologies also triggered a high degree of differentiation, making different databases more suitable for different use cases or data models. During this explosion of the NoSQL movement, it quickly became clear that there was still a place for relational (or SQL) databases, but for specific use cases rather than as a one-size-fits-all solution. That's why SQL is also part of NoSQL.

To SQL or NoSQL

Are you still here? OK! So to recap, we've seen that the old relational databases were too restrictive for the age of big data, so heaps of new database technologies spawned, each with highly specialized functionality. This means that when you develop a new application or architecture in an organization, you have a lot of freedom and choice in data storage, but picking the right or wrong option has a tremendous impact. How big of an impact? It all depends on the use case, size, complexity, and other variables of the data set. To visualize this impact, we are including the results of a simple test in the graph below. It shows the response time of the same query on the same data sets as the data set size grows. Note that the graph is on a logarithmic scale! For the techies, if you want to dive into the test use case, you can find more information here.
[Figure: response time versus data set size for the tested databases, logarithmic scale]

For this simple data set with 10 million entries, you can see a difference of over 300 seconds in response time between MongoDB (88.80 seconds) and Cassandra (CQL, 395.72 seconds). In other words, to retrieve the same data, Cassandra takes about 300 seconds, or 5 minutes (!), longer than MongoDB. That is a significant impact, especially if you need to run the query multiple times a day. MongoDB and Cassandra are both popular examples of NoSQL database technologies, but they are very different from each other. The results should therefore not be read as saying that Cassandra is bad. Cassandra is just not the ideal database technology for the test use case we used.
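For those who like to check the math, the gap can be worked out directly from the two response times quoted above. The variable names are just for illustration.

```python
# Response times quoted above for the 10 million entry data set, in seconds.
mongodb_s = 88.80
cassandra_s = 395.72

difference_s = cassandra_s - mongodb_s  # absolute gap in seconds
ratio = cassandra_s / mongodb_s         # how many times slower

print(f"difference: {difference_s:.2f} s")          # difference: 306.92 s
print(f"difference: {difference_s / 60:.1f} min")   # difference: 5.1 min
print(f"ratio: {ratio:.1f}x")                       # ratio: 4.5x
```

So in this particular test the query takes roughly four and a half times as long on Cassandra as on MongoDB.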

And > Or

So, when building our modern application or architecture, we need to map our use case onto our storage solution to make sure we get the best performance. However, today's applications and organizations often have more than one use case to cover. We could just decide on the storage solution that gives us the best performance overall, sacrificing performance in specific parts of our application. Or we could use different databases for the different use cases of our application or architecture, optimizing data performance in each and every use case. One application would then store its data in two or more database technologies, a situation often referred to as polyglot persistence.

In essence, polyglot persistence is the opposite of pushing everything into one system (e.g. the all-ruling ERP system 😉). It uses different storage solutions for the different use cases that exist in your application or your organization to improve performance. The major drawback? Complexity! You need to speak a different query language for each database, and you need to run and support all of those databases in your organization. How do you manage that complexity? That is our subject for the next part!
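To give a feel for what polyglot persistence looks like inside an application, here is a minimal sketch with two stand-in stores: a plain dictionary playing the role of a key-value store, and an in-memory SQLite database playing the role of a relational store. The class names, the `Shop` application and the split of sessions versus orders are all assumptions made for illustration.

```python
import sqlite3


class SessionStore:
    """Stand-in for a key-value store: lookups by a single key."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, payload):
        self._data[session_id] = payload

    def get(self, session_id):
        return self._data.get(session_id)


class OrderStore:
    """Stand-in for a relational store: structured rows queried with SQL."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

    def add(self, customer, amount):
        self._conn.execute("INSERT INTO orders VALUES (?, ?)", (customer, amount))

    def total_for(self, customer):
        row = self._conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
            (customer,),
        ).fetchone()
        return row[0]


class Shop:
    """One application, two storage technologies, each used where it fits."""

    def __init__(self):
        self.sessions = SessionStore()
        self.orders = OrderStore()

    def checkout(self, session_id, customer, amount):
        # Session state goes to the key-value store ...
        self.sessions.put(session_id, {"customer": customer, "state": "paid"})
        # ... while the order itself goes to the relational store.
        self.orders.add(customer, amount)


shop = Shop()
shop.checkout("s1", "alice", 20.0)
shop.checkout("s2", "alice", 22.5)
print(shop.sessions.get("s2"))         # {'customer': 'alice', 'state': 'paid'}
print(shop.orders.total_for("alice"))  # 42.5
```

Even in this toy version the complexity shows: the application has to talk to each store in its own way (method calls on a dictionary versus SQL statements) and keep both in step on every checkout.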

TL;DR: polyglot persistence is a situation where one application uses two or more database technologies. Born out of the NoSQL movement, it offers better performance in applications and data architectures, but at the cost of additional complexity.

Topics: Big Data, Data management, Data Science

Written by Thomas Vanhove

CEO and Co-founder Tengu