xsevensinzx - Tuesday, March 28, 2017 4:35 PM
That's a nice tale.
Yes, and pixie dust lets you fly, sure.
Let's refer to the masters of the game:
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Columnar Storage
Dremel stores data in its columnar storage, which means it separates a recordinto column values and stores each value on different storage volume, whereastraditional databases normally store the whole record on one volume.
Here is the answer on you "But that means you must know what you want to index first"
Dremel indexes *every* column by putting it into a separate columnstore index.
You say columnstore is not index?
then read the following statement:
Higher compression ratio. One study3 reports that columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3. Because each column would have similar values,especially if the cardinality of the column (variation of possible column values) is low, it’s easier to gain higher compression ratios than row-based storage
Do those key word used in there give you a hint?
Cardinality - is a characteristic of what?
"Grouping similar values into one holding references to the original positions of the values" - isn't it what they name "indexing"?
Columnar store is an index by its nature, by design.
By loading the data into columnar store you create index on every column involved in the process.
And to finish with those fairy tales about full scans, from the same sourse:
Columnar storage has the disadvantage of not working efficiently when updating existing records. In the case of Dremel, it simply doesn’t support any update operations.
What?
Why it's such a big deal to add another value to existing columnar store?
If we do full scans anyway - would extra 10 bytes make any noticeable difference?
Yes, update is a problem, because it would require an index to be rebuilt.
The Columnar Store index, and the Data Distribution indexes which define which Columnstores to scan for the values in your query.
This is why *ALL* big data storage engines are read-only.
Modifying data online would simply destroy all the heavy indexing they rely on.
therefore new data loads always happen asynchronously, off-line, on predefined schedule.
And only when the load is finished the "current" columnstore index is dropped and replaced with newly built one.
_____________
Code for TallyGenerator