Column-store, in-memory, and tabular are all terms about how the data is physically stored. You may also see people talking about shared-nothing and shared-disk, which are also about how data is stored but at a higher level.
Tabular format is where data is physically stored in rows. In a table with 10 columns and 10 rows, you will have 100 data items physically stored. Data compression can reduce the amount of space needed to store this data, but you still have 100 data items stored.
Column format is where each row is broken down into its columns, and only the unique values for each column are stored. So with the table with 10 columns and 10 rows, if all values are unique you still get 100 data items stored, but if all rows have the same value then you only have 10 data items stored. You also get the same compression capabilities as row format on top of this to save space.
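To make the counting concrete, here is a toy sketch of that idea in Python. It is a hypothetical model, not any vendor's actual on-disk layout: each column stores its distinct values once (a "dictionary"), plus a small integer code per row.

```python
# Toy sketch of column-format storage (hypothetical, not a real vendor layout):
# each column keeps its unique values once, plus one small code per row.

rows = [
    ("widget", "red"),
    ("widget", "blue"),
    ("widget", "red"),
    ("gadget", "red"),
]

def encode_column(values):
    """Dictionary-encode one column: list of unique values + one code per row."""
    dictionary = []
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary.append(v)
        codes.append(dictionary.index(v))
    return dictionary, codes

product_dict, product_codes = encode_column([r[0] for r in rows])
colour_dict, colour_codes = encode_column([r[1] for r in rows])

# 4 rows x 2 columns = 8 data items in row format, but only
# 2 + 2 = 4 distinct values need storing here.
print(product_dict)   # ['widget', 'gadget']
print(colour_dict)    # ['red', 'blue']
print(product_codes)  # [0, 0, 0, 1]
```

The fewer distinct values a column has, the bigger the saving, which is why the 10-rows-all-identical case above collapses to 10 stored items.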
On space savings alone you typically see a 3:1 saving for compressed tabular format data, but typically you get a 10:1 saving for column format data. This means that if it took 10 separate disk read requests to get your uncompressed tabular data, it would take only 3 read requests to get compressed tabular data and only 1 read request to get compressed column data. In other words, column format data typically gets your data faster even without any further clever bits.
However, column format data is (for almost all types of column format DBMSs) automatically indexed on every column as part of the data storage process. This is needed so that the DBMS can quickly reassemble rows for presentation to the user. This means that you no longer have to worry about which columns to index to improve performance - they are all indexed.
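Continuing the toy model from above (again hypothetical, not a real engine's implementation): the dictionary itself behaves like an index. A WHERE test is evaluated once against the dictionary, and matching rows are then found by comparing small integer codes rather than the values themselves.

```python
# Toy sketch (hypothetical layout): the per-column dictionary doubles as an
# index - test the predicate once against the dictionary, then scan the codes.

colour_dict = ["red", "blue"]
colour_codes = [0, 1, 0, 0]  # one code per row

def rows_where(dictionary, codes, value):
    """Return the row positions whose column value matches `value`."""
    if value not in dictionary:
        return []                   # value absent from the dictionary: no row matches
    code = dictionary.index(value)  # one comparison against the dictionary
    return [i for i, c in enumerate(codes) if c == code]

print(rows_where(colour_dict, colour_codes, "red"))    # [0, 2, 3]
print(rows_where(colour_dict, colour_codes, "green"))  # []
```

Note how a value that never appears in the column is rejected after checking only the dictionary, without touching a single row.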
For a simple SELECT * FROM table you do not see the benefits of column format storage, but as soon as you add a WHERE clause things start to hum. Add a SUM(column) and a GROUP BY and column format means you no longer get a coffee break between starting your query on a 50 million row table and getting the result.
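As an illustration of the query shape that benefits, here is a runnable example using Python's built-in sqlite3 module. The table and data are invented for the example, and SQLite itself is a row store, so this shows only the SQL, not the column-format speed-up.

```python
import sqlite3

# Hypothetical table and data, purely to show the query shape.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("North", 250.0), ("South", 75.0)],
)

# A filter plus an aggregate: a column-format engine reads only the
# columns named in the query, not every column of every row.
result = con.execute(
    "SELECT region, SUM(amount) FROM sales"
    " WHERE amount > 50 GROUP BY region ORDER BY region"
).fetchall()
print(result)  # [('North', 350.0), ('South', 75.0)]
```

On a column store, a query like this touches only the `region` and `amount` columns, which is a large part of why the WHERE and GROUP BY cases hum.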
At my place we are still in the early stages of exploiting column format, but typically get a simple SUM(column) with a GROUP BY query on a 25 million row table returning in 1 to 2 seconds. More complex stuff takes a bit longer, but our users are happy to wait 15 seconds for a detailed breakdown of our market penetration month by month over the past 2 years. When this data was in tabular format they had to plan other work to do while the query was running.
In-memory takes things a step further. All of the big three (SQL Server, Oracle, DB2) take the same approach, which is also shared by Hana, Redshift, Postgres and many other DBMSs, in that only column format tables are held in memory. Holding a table in memory is different to having a portion of it in the bufferpool, as the access mechanisms are completely different and have a shorter CPU instruction path.
A clever guy at SAP worked out in the mid noughties that if you make your data transfer size the same as your CPU cache size then transfer between memory and cache can be done in a single CPU instruction. It also takes fewer CPU instructions to query the data if everything that is needed for a given compare is held in the CPU cache. This led to the Hana database, and the other vendors have taken this idea on for their own products. (Don't ask me about patents and licensing; if anything was being ripped off we would know.)
The end result is that if you combine column-format with in-memory you typically get a response time between 1 and 2 orders of magnitude faster than you get from tabular format. The impact of this is that aggregation can almost always be done at query time while producing results fast enough to avoid the need to pre-aggregate into a cube.
Original author: https://github.com/SQL-FineBuild/Common/wiki/ (1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005).
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara