Partitioning Data on First Character

  • Evening Guys,

    Let's say I have a table with a couple of billion rows of data. That I want to partition. There is no date column that I can use but I think a reasonable split of data would be by the first character of a varchar. Though I am not certain.

    Is there a way using the table statistics I can prove the range for a given first character. As I am trying to avoid doing a count(*) group by subtring (field 1, 1)

    This is experimental.

    Whilst I want partition elimination I don't want to make one partition a hotspot (like adding a date column would)

    Also, has anyone partitioned on a sequence? Is this even possible? Mapping each number in the sequence to its own partition?

    Cheers
    Alex

  • I think the problem with using the statistics is that a range (represented by a row in DBCC SHOW STATISTICS or sys.DM_DB_STATS_HISTOGRAM) can easily cross letter boundaries.  I did some analysis on a table with only 1.5 million rows and found it made 201 ranges, but some starting letters didn't have their own range: (b,c,f,j,q,v,w) even though I know there are values that start with those letters.  Since statistics may only be sampled data, it wouldn't be 100% accurate even if it did have a separate range for each letter.  The histogram data only shows you the high value in the range, so it's also not clear what portion of a range that crosses letters would be for each letter.
    as an FYI, here's what I did:
    create table #stathist (range_hi_key varchar(40), range_rows int, eq_rows int, distinct_range_rows int, avg_range_rows int)
    insert into #stathist
    EXEC sp_Executesql N'DBCC SHOW_STATISTICS (''dbo.Person'', IX_Person_Name) WITH HISTOGRAM;'

    select RIGHT(range_hi_key,1) AS letter, SUM(range_rows) AS row_cnt
    from #stathist
    group by RIGHT(range_hi_key,1)
    order by RIGHT(range_hi_key,1)

    compared to:
    select RIGHT(LastName,1) AS letter, count(*) AS row_cnt
    from dbo.Person
    group by RIGHT(LastName,1)
    order by RIGHT(LastName,1)

  • Why are you partitioning the data?

    Partitioning is mostly for data management, fast loading, fast deleting, index rebuilds at a partition level, compression at a partition level.
    It does not generally make queries against the table faster and, if the queries don't filter on the partition (as would be the case here) often make the queries slower.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I am partitioning for two reasons:

    firstly so that entire partitions can be phased out, that is put on slower storage (non-ssd) until that partition is deleted. Though saying that the partition might not get deleted but moved. That has not been determined yet. Hence this is mostly an experiment.

    The reason I am partitioning on the first character is the rows are most logically split by age using a letter A-Z. Why this is the case is before my time. But A to Z is what is being used. The Closer to A the older the row.

    But this letter is not an isolated char its the first letter of a varchar. Hence my question.

    Of course I could derive and persist a column for this.

  • Do you have any requirement for the availability of the data while it is being moved?  I believe the data would be unavailable while being moved from the faster SSD storage to the slower disk storage.  Partition switching wouldn't help here because that requires both tables to be in the same filegroup
    https://technet.microsoft.com/en-us/library/ms191160(v=sql.105).aspx

    Does the maintenance of the data require that first letter to change to "age" the data?  Considering you have a couple billion rows and only 26 letters I'm wondering if there's another criteria besides just the first letter?  Updating that letter would cause the row to move between partitions but you may need to do it in small batches.
    Here's probably the most comprehensive article I've seen on partitioning and maintenance.  It was originally written for 2005 but still holds much value today:
    https://technet.microsoft.com/en-us/library/ms345146(v=sql.90).aspx

  • alex.sqldba - Tuesday, September 19, 2017 10:51 AM

    Evening Guys,

    Let's say I have a table with a couple of billion rows of data. That I want to partition. There is no date column that I can use but I think a reasonable split of data would be by the first character of a varchar. Though I am not certain.

    Is there a way using the table statistics I can prove the range for a given first character. As I am trying to avoid doing a count(*) group by subtring (field 1, 1)

    This is experimental.

    Whilst I want partition elimination I don't want to make one partition a hotspot (like adding a date column would)

    Also, has anyone partitioned on a sequence? Is this even possible? Mapping each number in the sequence to its own partition?

    Cheers
    Alex

    Quick question, is the table a heap?
    😎

  • No. Clustered on a bigint surrogate.

  • alex.sqldba - Wednesday, September 20, 2017 3:53 AM

    I am partitioning for two reasons:

    firstly so that entire partitions can be phased out, that is put on slower storage (non-ssd) until that partition is deleted. Though saying that the partition might not get deleted but moved. That has not been determined yet. Hence this is mostly an experiment.

    The reason I am partitioning on the first character is the rows are most logically split by age using a letter A-Z. Why this is the case is before my time. But A to Z is what is being used. The Closer to A the older the row.

    But this letter is not an isolated char its the first letter of a varchar. Hence my question.

    Of course I could derive and persist a column for this.

    I guess my question would be, does that character change in any given row? If so, then partitioning on that is a really bad idea because of the data movement between partitions that will cause.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • No. That character is static and stable.

    Ultimately it is 1 character per year and we're currently on J, which will turn into K on 1 January. But K may not turned into L a year after that.

    The reason for this is legacy where by the year didn't / doesn't always follow the calendar... Sometimes it was fiscal and other times it was another made up date that they marked as a milestone.

    So the letter denotes a block of data that can refer to a years worth of data at a minimum, and so far up to 21 months of data. Theoretically it could refer to a small amount than a year though that hasn't happened yet nor do they foresee it.

  • I think you'll have to add a derived column to partition on, and, if you want performance improvements, your queries will have to filter on that derived column, not the varchar that it's derived from.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • alex.sqldba - Wednesday, September 20, 2017 10:30 AM

    No. That character is static and stable.

    Ultimately it is 1 character per year and we're currently on J, which will turn into K on 1 January. But K may not turned into L a year after that.

    The reason for this is legacy where by the year didn't / doesn't always follow the calendar... Sometimes it was fiscal and other times it was another made up date that they marked as a milestone.

    So the letter denotes a block of data that can refer to a years worth of data at a minimum, and so far up to 21 months of data. Theoretically it could refer to a small amount than a year though that hasn't happened yet nor do they foresee it.

    Now you have me interested in what happens when you're at "Z" and the year roles over. 

    My recommendation would be to take the opportunity to fix things now by adding a DATE or DATETIME column to represent the things you're talking about even within the legacy data.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • GilaMonster - Wednesday, September 20, 2017 10:58 AM

    I think you'll have to add a derived column to partition on, and, if you want performance improvements, your queries will have to filter on that derived column, not the varchar that it's derived from.

    And that little phrase goes practically unnoticed and is so important when implementing partitioned objects !

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I presume the original designer never expected to ever reach Z, or didn't care. Or perhaps by the time one gets to Z you have archived/deleted A. The back story I found out last night was that this was never a SQL Server database to begin with. It started its life as a proprietary format written in VB3, before going through a few iterations of other databases (Filemaker was mentioned) before finally ending up living on an old SQL 6.5 Instance. Which later got detached and re-attached to SQL7 and we received it after it was converted to 2005. So it's been around.

    And, we are totally fixing it when it gets absorbed into our normal CRM over the next few months. It was more a thought experiment than anything else.

  • 1. You could just have the whole field that you are using as the first character as the partition key. You can then setup your partition scheme to be A, B, C, etc. I am not sure how much data is in your varchar field, if it's a large field then this approach would cause your partition aligned indexes to be larger as that varchar field would be included.

    2. You could create a computed column and use that as your partition key, but this might require you to change the the queries you are using to get the partition elimination in your plan.

    3. Old school: use a partitioned view. It sounds like the number of partitions would be somewhat limited so this might work. This might be a solution if you are only actively loading / updating data to one of the tables in the partitioned view. You get a similar type of partition elimination in your queries but in this case it would be table elimination. You could then have a separate filegroup for each table to meet the requirement you have to tier your storage and remove the table(s) from the view to retire older data.

    #3 can be a little cumbersome if you are actively loading data in all the tables and can temp you to use dynamic queries to update multiple tables ( something we try to avoid ). But you also have some flexibility to add indexes only to specific tables in the partitioned view.

    Stew Davis

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply