Daily aggregation of Azure Blob Storage by tier (created/tier-change/deleted)

  • Hello all,

    I’m looking for advice on how to derive a daily snapshot table from a large fact table in SQL Server that tracks Azure Blob Storage metadata.

    In production this table can have tens of millions of rows, and its structure cannot be changed. To make the problem reproducible, I’ve created the simplified version below.

    DROP TABLE IF EXISTS fileStorageData;
    GO

    CREATE TABLE fileStorageData (
    FileId INT,
    StorageTier INT,
    UtcCreateDate INT, -- YYYYMMDD
    UtcTierChangedDate INT, -- YYYYMMDD, or -1 if never changed
    UtcDeletedDate INT, -- YYYYMMDD, or -1 if not deleted
    SizeBytes BIGINT
    );

    CREATE UNIQUE INDEX fileStorageData_U01
    ON fileStorageData (
    FileId,
    StorageTier,
    UtcCreateDate,
    UtcTierChangedDate,
    UtcDeletedDate
    );

    INSERT INTO fileStorageData
    VALUES
    (1, 1, 20210128, -1, -1, 4784),
    (1, 2, 20210128, 20210201, -1, 4784),
    (1, 5, 20210128, 20210601, -1, 4784),
    (1, 2, 20210128, -1, 20250101, 4784),

    (15, 1, 20210128, -1, -1, 9862),
    (15, 2, 20210128, 20230201, -1, 9862),
    (15, 5, 20210128, 20240601, -1, 9862),
    (15, 2, 20210128, -1, 20250101, 9862);

    SELECT *
    FROM fileStorageData;

    Column semantics

    • Each FileId can produce multiple rows as it moves through storage tiers.
    • UtcCreateDate is the creation date of the file.
    • UtcTierChangedDate is the date from which the given StorageTier becomes effective.

      • -1 means the tier has never changed (initial tier).

    • UtcDeletedDate is the deletion date.

      • -1 means the file still exists.

    • Dates are daily grain only (UTC, no time).
    • SizeBytes does not change per file.

    Requirement

    I need to produce a derived dataset (e.g. exported to Parquet) that shows, for every UTC date:

    • Storage tier
    • Total number of files in that tier on that date
    • Total size (bytes) of those files

    Example output shape:

    DateKey    StorageTier   FileCount   TotalBytes
    --------- ------------ ---------- -----------
    20210408 1 (Hot) 1000 1234567
    20210408 2 (Cold) 2000 3456789
    20210408 5 (Archive) 300 4567890

    Business rules

    • The result must be daily (no parameters, no “as-of” date variable).
    • A file contributes to exactly one tier per day.
    • If a file never changed tier (UtcTierChangedDate = -1), it remains in its original tier.
    • If tier changes exist, the latest tier whose effective date is = the day applies.
    • If a file has a deletion date, it must not contribute on or after that date.
    • The solution must scale to very large row counts.
    • The base table structure cannot be altered.

    I do have a standard Date dimension table available if needed.

    What is the correct SQL pattern to derive this daily snapshot table from a fact table shaped like the above, while correctly handling:

    • tier effective dates,
    • deletions,
    • and large data volumes?

    Any guidance or examples would be greatly appreciated.

    Thank you!

     

     

  • Something like this should do it:

    DECLARE @date date;
    SET @date = GETDATE();

    SELECT CONVERT(varchar(8), @date, 112) AS DateKey, StorageTier,
    COUNT(DISTINCT FileId) AS FileCount, SUM(SizeBytes) AS TotalBytes
    FROM dbo.fileStorageData
    WHERE (UtcCreateDate = @date OR UtcTierChangedDate = @date) AND (UtcDeletedDate < @date)
    GROUP BY StorageTier
    ORDER BY StorageTier

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Thank you for the response. I am not sure it will work for various reasons:

    1. It implies I need to run your query number of times as I have data for e.g. 9 years times 365 days.
    2. WHERE (UtcCreateDate = @date OR UtcTierChangedDate = @date) AND (UtcDeletedDate < @date) - if my @date is greater than UtcCreateDate and both tier changed date and delete date are -1 i.e. the file never changed tier nor deleted since stored, this WHERE clause will miss it.

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply