• xsevensinzx - Tuesday, March 28, 2017 4:35 PM

    Sergiy - Tuesday, March 28, 2017 2:37 PM

    > NOBODY adds indexes to MY database tables without my knowledge. BigTable *is* an index. By loading your data into BigTable you are populating an index. Maybe without realising it. It's funny we even discuss it. Do you really think that BigQuery actually scans those petabytes of data every time you run a query?

    Umm. When you query Google BigQuery, for example, it does a full scan of all the columns specified in your query. This is the complete opposite of what you want with an index: you don't want to do full scans. I surely hope you are not trying to claim that the algorithms Google runs behind the scenes are the same thing as an index. Regardless of the filters applied in your query, it grabs all the data within the specific columns you specified and then, on the way up, aggregates and filters those results for you.
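    For illustration (table and column names are made up; this is a sketch of the billing model, not a quote from Google's docs):

        -- BigQuery reads the full contents of every column referenced,
        -- no matter how selective the WHERE clause is.
        SELECT user_id, SUM(amount) AS total
        FROM mydataset.transactions
        WHERE event_date = '2017-03-01'
        GROUP BY user_id;
        -- Columns scanned end to end: user_id, amount, event_date.
        -- Columns not referenced are never touched.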

    Full table scans are the complete opposite of what you want to achieve with an index in a traditional RDBMS. However, if you're trying to aggregate large volumes of data that are not easily indexable, or whose query patterns aren't known in advance (a true ad-hoc platform), BigQuery and various other platforms are the right tools for you.

    You can surely try to say, "oh, just add an index and we can easily query 10 billion records in a single table." But that means you must know what you want to index first. Some of us are using our data for ad-hoc analysis where nothing is really consistent. The traditional RDBMS is not a good platform for those types of workloads: the data is very large, hard to index, and even harder to do large table reads on without throwing more resources at the problem. Instead, we use platforms like BigQuery, Redshift and so forth that break those HUGE READS into smaller problems within lightning-fast columnstores.
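    A sketch of that trade-off in traditional RDBMS terms (all object names invented):

        -- A B-tree index only helps queries that filter on the columns
        -- you picked when you created it.
        CREATE INDEX IX_Sales_CustomerId
            ON dbo.SalesFact (CustomerId) INCLUDE (Amount);
        -- Helps:   SELECT SUM(Amount) FROM dbo.SalesFact WHERE CustomerId = 42;
        -- Useless: SELECT SUM(Amount) FROM dbo.SalesFact WHERE Region = 'EU';
        -- For the second, ad-hoc query you are back to a full scan.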

    Speaking of SQL Server specifically, it does not have the same power as PostgreSQL (which Redshift is based on). It does not allow us to cluster servers together and do proper distributed processing within a true MPP infrastructure (as, for example, Greenplum does). Querying large sets of unindexed data and aggregating across billions of records for multiple concurrent users is hard to support on a single machine. Memory and processing are shared on a single box, versus a shared-nothing environment where each query has its own dedicated resources.

    That's a nice tale.
    Yes, and pixie dust lets you fly, sure.

    Let's refer to the masters of the game:
    https://cloud.google.com/files/BigQueryTechnicalWP.pdf

    Columnar Storage
    Dremel stores data in its columnar storage, which means it separates a record into column values and stores each value on different storage volume, whereas traditional databases normally store the whole record on one volume.
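    In other words (a toy illustration, not from the paper):

        Row store:    [id1, name1, amt1] [id2, name2, amt2] ...   -- whole records together
        Column store: [id1, id2, ...] [name1, name2, ...] [amt1, amt2, ...]   -- one volume per column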


    Here is the answer to your "But that means you must know what you want to index first":
    Dremel indexes *every* column by putting it into a separate columnstore index.
    You say a columnstore is not an index?
    Then read the following statement:

    Higher compression ratio. One study [3] reports that columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3. Because each column would have similar values, especially if the cardinality of the column (variation of possible column values) is low, it’s easier to gain higher compression ratios than row-based storage.

    Do those key words used in there give you a hint?
    Cardinality - a characteristic of what?
    "Grouping similar values into one store, holding references to the original positions of the values" - isn't that what they call "indexing"?
    A columnar store is an index by its nature, by design.
    By loading the data into a columnar store you create an index on every column involved in the process.
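    In SQL Server terms the same thing looks like this (a minimal sketch, table name invented):

        CREATE TABLE dbo.SalesFact (
            SaleDate   date,
            CustomerId int,
            Amount     money
        );
        -- A clustered columnstore index takes no column list:
        -- EVERY column is stored as compressed, dictionary-encoded
        -- segments, i.e. every column gets "indexed".
        CREATE CLUSTERED COLUMNSTORE INDEX CCI_SalesFact
            ON dbo.SalesFact;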

    And to finish with those fairy tales about full scans, from the same source:

    Columnar storage has the disadvantage of not working efficiently when updating existing records. In the case of Dremel, it simply doesn’t support any update operations.


    What?
    Why is it such a big deal to add another value to an existing columnar store?
    If we do full scans anyway - would an extra 10 bytes make any noticeable difference?

    Yes, update is a problem, because it would require an index to be rebuilt.
    Not just the columnstore index itself, but also the data distribution indexes which define which columnstores to scan for the values in your query.

    This is why *ALL* big data storage engines are read-only.
    Modifying data online would simply destroy all the heavy indexing they rely on.
    Therefore new data loads always happen asynchronously, off-line, on a predefined schedule.
    And only when the load is finished is the "current" columnstore index dropped and replaced with a newly built one.
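    The same pattern, sketched in SQL Server syntax (object names invented):

        -- Build the replacement columnstore off-line from the freshly
        -- loaded data, dropping the old one in the same operation:
        CREATE CLUSTERED COLUMNSTORE INDEX CCI_SalesFact
            ON dbo.SalesFact
            WITH (DROP_EXISTING = ON);
        -- Or, with partitioning, load into a staging table and swap it
        -- in as a metadata-only operation:
        ALTER TABLE dbo.SalesFact_Staging
            SWITCH TO dbo.SalesFact PARTITION 42;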

    _____________
    Code for TallyGenerator