Basic concepts

Many times when starting with databases one may ask the question, where the whole database is stored. The simple answer would be, in the disk and that answer would be correct, the art behind it, has to do with the way the data is stored in the disk as there are multiple ways to do that. But before we jump into the way the data is stored in the disk let's clarify some definitions.

IO: input/output operation. We try to minimize this as much as possible since this slows down our query when multiple IO operations are being done. An IO cannot read a single row, it reads a page(what a page is, is explained below). For this reason we try to minimize it. That is the reason why star operations(SELECT * FROM Students) is so slow. As a query it does so many IO operations, that in some cases it may also take a couple of seconds to complete the query. One needs to be careful when talking about IO because an IO does not always read from the disk but it can also read from the cache instead.

Page: a page is just a fixed size memory location of a bunch of bytes(in PostgreSQL the size is 8kb and in mySQL it's 16kb). Inside a page one can put the rows. Now the question that pops up is, how many? And the answer for that is crystal clear, it depends. It depends on the size of the row, size of the page, size of the data types, etc.

This way of thinking and storing data in the disk, influences the way how a query in itself works. When you read from the disk, you don't read, for example, a single row in this case. You read a page. So we retrieve many rows and we choose.

Why is this important? Well this plays a huge part in the speed of a query. One needs to understand how many pages are being pulled, how much information needs to be retrieved in order to get the info that we want. After getting the page the DB in itself does the unmarshalling of the bytes to filter out the information that it needs. This marshalling/unmarshalling costs.

Heap: data structure where all the tables are being stored with all the pages one after another. Traversing it is very expensive, that's why one needs to use the indexes of the heap and tell which page needs to be fetched and read from(if that's possible).

Index: another data structure that is separated from the heap which has pointers to the heap. This is one of the most useful concepts when it comes to databases. An index can make a query be wayyy faster but it also costs memory so one needs to be cautious about what one uses as an index. We can index on one column or more, and the index tells exactly which page to fetch in the heap. The smaller the index, the faster the search because it saves up memory. Index in itself is also stored as pages and it also costs IO to pull from the index, but still we save so many IO operations when we pull from the index because it tells us where to do the other IO operation when pulling from the heap. Index is also stored as a B-Tree in the system. This makes traversing it quite fast in comparison to just saving the entries of the index sequentially (O(log n) instead of O(n)).

Now that we got that out of the way let's go on with the real “How” when it comes to storing a database in the system. There are 2 different ways of how we can store the database information in the disk. We can either store it as a row-oriented database or as a column-oriented database.

Row-oriented database

The tables are stored as rows in the disk as the name already suggests. But what does this mean? What does this say? This means that when we want to find a particular information about a specific row, we get all the columns that correspond to that row. The fact that we get the whole information that corresponds to the row may be a good or a bad thing, depending on the information that we're looking for.

An example where this may be useful would be in a query like SELECT Name FROM Students WHERE student_id = 182. Let's assume that we do not have indexing(since having an index would make this whole thing a whole lot easier) so we need to look up everything one by one. After finding the row with the ID = 182, we can easily retrieve the name of the student because we have received all of the information from the disk. In order for us to get the information we do not need to do an IO and jump to the disk, rather we can just read it from the RAM.

Another example where having a row-oriented database is a good choice, is when we do queries of form SELECT * FROM Students WHERE Id=122. In this case we still get everything we need pretty cheaply because of the fact that we already have loaded everything from the disk.

A bad example of a query that would be expensive when executed in row-oriented databases would be an aggregate query. In that case we would need to do O(n) IO operations where n is the number of rows, retrieving all the information from each row, extracting what we need and then do something with that information. So in this example we retrieve a ton of information, which we do not need, and we make a ton of IO operations(something that we want to avoid when doing queries). Okay so in practice it is not like these types of operations take minutes to take place since databases use all sorts of tricks to make them faster(e.g. Multi-threading) but still it is more expensive doing these types of operations on these databases in comparison to row-oriented DB.

So a row-oriented database is not that efficient when it comes to aggregation operations but when working with multiple columns it can be quite fast.

Column-oriented database

From the name it is pretty clear that in this case the tables are saved as columns in the disk. In this case there are less IOs required to get more values of a column but there are also trade-offs when it comes to working with multiple columns as it requires more IOS.

An example of a good query would be an aggregation query, say SELECT sum(grades) FROM Students. What we need to do, is to just retrieve all the information that is stored in that column and sum them all up. If we would do the same thing in a row-oriented database, we would need to retrieve information from each row, and add that up. The addition is not the problem, but rather the fact that we need to make so many IO requests can be detrimental for the performance.

So a column-oriented database is efficient when it comes to queries that aggregate results but inefficient when we do queries with multiple columns. It is also used widely in OLAP where we do lots of analysis of data which involves lots of aggregation of data.