Large table, clustered columnstore index and NO perf. improvements.

  • Hello colleagues, we have SQL Server 2017 Standard (14.0.3238.1) with 136 GB vRAM and 20 vCPUs on Windows Server 2019 Standard.

    We have a table:

    CREATE TABLE dbo.IMS_APT (
        [ID_vard] [int] IDENTITY(1,1) NOT NULL,
        [id_p] [int] NOT NULL,
        [id] [int] NOT NULL,
        [uid] [int] NOT NULL,
        [gor] [varchar](255) NOT NULL,
        [apt] [varchar](255) NOT NULL,
        [fed_ok] [varchar](255) NOT NULL,
        [sum] [real] NOT NULL,
        [sumr] [real] NOT NULL,
        [sumeuro] [real] NOT NULL,
        [sumreuro] [real] NOT NULL,
        [sumnat] [real] NOT NULL,
        [sumrnat] [real] NOT NULL,
        [kg] [real] NOT NULL,
        [date_] [smalldatetime] NOT NULL,
        [apt_c] [int] NOT NULL,
        [reg] [varchar](255) NOT NULL,
        [ok] [varchar](255) NOT NULL
    );

    with ~900 million records. This is a read-only table; we don't make massive changes here, only load new sales data each month (~30 million new rows per month).

    And we have the following query; execution time is 50 seconds with a clustered columnstore index.

    With a regular clustered index I get 40 seconds. I assumed I could dramatically improve query performance with a columnstore index, but it doesn't look that way; probably I don't understand what needs to be changed. The real problem is that selecting just one period (month) took around 45 minutes, so this is a serious problem. What can I do to improve query performance?

    Thank you.

    SELECT TOP 1000000 dbo.IMS_APT.ID_vard,
    dbo.IMS_APT.id_p,
    dbo.IMS_APT.id,
    dbo.IMS_APT.uid,
    dbo.IMS_APT.date_,
    dbo.IMS_APT.apt_c AS sumr,
    dbo.IMS_APT.sumeuro,
    dbo.IMS_APT.sumreuro,
    100 * dbo.IMS_APT.sumnat AS sumnat,
    100 * dbo.IMS_APT.sumrnat AS sumrnat,
    dbo.IMS_APT.kg,
    dbo.IMS_APT.gor,
    dbo.IMS_APT.fed_ok,
    dbo.IMS_APT.apt_c,
    dbo.IMS_APT.reg
    FROM dbo.IMS_APT
    WHERE date_ = '2022-01-01'

    Query plan - https://www.brentozar.com/pastetheplan/?id=BJcY3bClj

  • Columnstore is good for aggregation queries, not for plain retrieval of data, but 45 minutes still looks very bad.

    In your case I would be interested in seeing how it behaves if you do an insert into a temp table instead of a plain retrieval.

    And can you also run SET STATISTICS IO ON before executing both statements (as-is and insert into a temp table), and show us the output?

    And... SQL Server Standard edition limits the MAXDOP of columnstore queries to 2, so this alone may be making it a lot slower than the normal clustered index, which may be getting more parallel tasks.
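
    A minimal sketch of the kind of test being suggested, with I/O statistics turned on and the result set written into a temp table so SSMS rendering time is excluded (the temp table name and the * column list are just illustrative):

    SET STATISTICS IO ON;

    -- Same filter as the OP's query; SELECT ... INTO a temp table removes client display time.
    SELECT TOP 1000000 *
    INTO #ims_apt_test
    FROM dbo.IMS_APT
    WHERE date_ = '2022-01-01';

    DROP TABLE #ims_apt_test;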

  • In your case I would be interested in seeing how it behaves if you do an insert into a temp table instead of a plain retrieval.

    And can you also run SET STATISTICS IO ON before executing both statements (as-is and insert into a temp table), and show us the output?

    It took 3 seconds 🙂

    Hmm, let me check the real process.

    Table 'IMS_APT'. Scan count 2, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 13517, lob physical reads 0, lob read-ahead reads 0.

    Table 'IMS_APT'. Segment reads 3, segment skipped 0.

    (1000000 rows affected)

    (8 rows affected)

    (1 row affected)

    Completion time: 2022-09-13T17:42:48.3382509+03:00

  • You want the unique clustered index on ( date_, ID_vard ), assuming that you do primarily query the table by date_, and not on just ID_vard.

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • You should look at partitioning the table by period.
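
    A hedged sketch of what monthly partitioning could look like (all names and boundary dates here are illustrative, and DROP_EXISTING requires the real name of the existing columnstore index):

    CREATE PARTITION FUNCTION pf_IMS_APT_monthly (smalldatetime)
    AS RANGE RIGHT FOR VALUES ('2021-01-01', '2021-02-01', '2021-03-01' /* ... one boundary per month ... */);

    CREATE PARTITION SCHEME ps_IMS_APT_monthly
    AS PARTITION pf_IMS_APT_monthly ALL TO ([PRIMARY]);

    -- Rebuild the clustered columnstore index on the partition scheme
    -- (the index name must match the existing columnstore index).
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_IMS_APT
    ON dbo.IMS_APT
    WITH (DROP_EXISTING = ON)
    ON ps_IMS_APT_monthly (date_);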

  • Jonathan AC Roberts wrote:

    You should look at partitioning the table by period.

    Maybe... maybe not.

    One of my main big tables (clustered columnstore) is partitioned as well, but each partition has 3 billion rows (i.e. about 3 times what the OP has in the full table), and I already have a whole bunch of partitions on that table.

    All queries on this table always filter an individual partition, and even with significant processing of the data each time, it only takes 15-20 minutes to generate the output (over 200 million rows) into another table.

    Following on the "maybe not": if the data is always loaded into the table fully each month, then filtering on DATE_ should get columnstore segment elimination, which on its own is rather good at filtering down the data. In a way it works as if the table were partitioned.
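
    One rough way to check how well date_ values line up with segments is to look at the per-segment metadata (a sketch using the usual catalog-view joins; min_data_id and max_data_id are encoded values rather than raw dates, but heavily overlapping ranges across segments usually mean poor segment elimination):

    SELECT p.partition_number,
           s.segment_id,
           s.row_count,
           s.min_data_id,   -- encoded minimum value stored in this segment
           s.max_data_id    -- encoded maximum value stored in this segment
    FROM sys.column_store_segments AS s
    JOIN sys.partitions AS p
        ON p.hobt_id = s.hobt_id
    JOIN sys.columns AS c
        ON c.object_id = p.object_id
       AND c.column_id = s.column_id
    WHERE p.object_id = OBJECT_ID(N'dbo.IMS_APT')
      AND c.name = N'date_'
    ORDER BY p.partition_number, s.segment_id;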

    One of the possible issues for the OP is how the data was/is loaded in the first place, as loading badly can leave rows sitting in the delta store or leave rowgroups trimmed (i.e. not filled to the maximum of roughly 1,048,576 rows).
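
    The rowgroup DMV makes this easy to check (a sketch; OPEN/CLOSED rowgroups are delta store, and trim_reason_desc explains undersized compressed rowgroups):

    SELECT rg.partition_number,
           rg.row_group_id,
           rg.state_desc,        -- OPEN / CLOSED = delta store, COMPRESSED = columnstore
           rg.total_rows,
           rg.deleted_rows,
           rg.trim_reason_desc   -- why a compressed rowgroup holds fewer than ~1,048,576 rows
    FROM sys.dm_db_column_store_row_group_physical_stats AS rg
    WHERE rg.object_id = OBJECT_ID(N'dbo.IMS_APT')
    ORDER BY rg.partition_number, rg.row_group_id;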

    For other queries, filtering on a columnstore is only as good as the spread of the data across segments, and that is not always something one can achieve in an optimal way.

    To the OP: why REAL datatypes? They are rather bad and very limited in precision.
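
    A quick illustration of that limited precision (the values here are just an example): real keeps only about 7 significant digits, so larger amounts silently lose their lower digits.

    DECLARE @r real          = 16777217;   -- 2^24 + 1 cannot be represented exactly in real
    DECLARE @d decimal(19,4) = 16777217;
    SELECT @r AS real_value,      -- comes back as 16777216 (displayed as 1.677722E+07)
           @d AS decimal_value;   -- keeps the exact value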

  • ScottPletcher wrote:

    You want the unique clustered index on ( date_, ID_vard ), assuming that you do primarily query the table by date_, and not on just ID_vard.

    Can you clarify how the OP would be able to have two clustered indexes on the table? And perhaps the case where the table previously had a normal clustered index that already had DATE_ as the first column?

  • frederico_fonseca wrote:

    ScottPletcher wrote:

    You want the unique clustered index on ( date_, ID_vard ), assuming that you do primarily query the table by date_, and not on just ID_vard.

    Can you clarify how the OP would be able to have two clustered indexes on the table? And perhaps the case where the table previously had a normal clustered index that already had DATE_ as the first column?

    Obviously there can never be two clustered indexes on a table. I'm strongly urging the OP to use a standard (rowstore) clustered index with page compression rather than a columnstore clustered index, IF the data is usually processed by date, which seems extremely likely for this table.
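
    A hedged sketch of that alternative (index names are illustrative, the DROP must use the real name of the existing columnstore index, and rebuilding a ~900 million row table will take considerable time and log space):

    DROP INDEX CCI_IMS_APT ON dbo.IMS_APT;

    CREATE UNIQUE CLUSTERED INDEX CIX_IMS_APT_date_IDvard
    ON dbo.IMS_APT (date_, ID_vard)
    WITH (DATA_COMPRESSION = PAGE, SORT_IN_TEMPDB = ON);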

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • Look at the first line in the query you're using...

    SELECT TOP 1000000 dbo.IMS_APT.ID_vard,

    What makes you think that even a columnstore index is going to change the amount of time it takes to display 1 Million rows on the screen in SSMS?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
