Algorithm to combine 2 integer into one value and then disintegrate back into two

  • Hi,
    I may not be the first one to ask this, but I'm looking to find a way to combine two integers 20180131 and 20160731 into one integer value and then break that single value back into 2 values.
    i have a huge table ( >500 mil) that i want to partition using these 2 columns PERIODYTD ( eg. 20180131) and PERIOD(20160731) . If i concatenate these 2 values into something like 2018013120160731, the new field will have to be a BIGINT and will unnecessarily eat up space.

    I could generate a hash value using checksum and then partition on that but that value will be human unreadable and i won't be able to find  PERIODYTD and PERIOD values from it without going and looking a row into the table.

    Regards,
    R

  • The only question worth asking here at the moment is WHY?   You are already storing dates as integers and depriving yourself of all kinds of cpu time wasted converting back and forth into date values.   What possible benefit do you get by combining two entities into one?   Do you have any idea how much more cpu you will waste just separating them?   Also, combining two integers into a bigint is a wash with respect to storage space, as ints are 4 bytes and bigints are 8.   So that's why I'm asking what the objective here is.   What possible benefit is there here?    Seems to me that you are unlikely to get anything out of this except pain.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I guess I didn't explain my case properly. The two date values in integer will not go away from table , I just want to add a new column which I can use as partitioning key.Obviously I don't want the new column to be BIGINT which is why I've posted this question to find a way to accommodate this value in INT.

    As far as CPU time is concerned, there is no converting and re-converting required . Every insert in this fact table runs on a single combination of periodytd and period with a  size of about 1.5 million which is what the size of partition I'm planning to keep. Before every insert DML , the program will know the value of period and periodytd and hence can calculate the value of partition key just once before running the DML.

    The objective here is to break a huge table into smaller ,meaningful and manageable chunks which would only make sense in my case if I create partitions using two of the available column values instead of one.
    The only reason why I'm interested in creating this key via an algorithm/formula (and not hash value) is because I want this value to be human readable or at least separable using a custom script.There is no way this combinations and separation will run on every row. 
    I could keep this combination of values in a separate table and use its identifier as the partition key, but it looks to me as a very bad way of achieving this.

  • So what's stopping you from using two columns instead of just one, and not wasting an additional column at all ?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • _rohit_ - Wednesday, October 3, 2018 8:27 PM

    Hi,
    I may not be the first one to ask this, but I'm looking to find a way to combine two integers 20180131 and 20160731 into one integer value and then break that single value back into 2 values.
    i have a huge table ( >500 mil) that i want to partition using these 2 columns PERIODYTD ( eg. 20180131) and PERIOD(20160731) . If i concatenate these 2 values into something like 2018013120160731, the new field will have to be a BIGINT and will unnecessarily eat up space.

    I could generate a hash value using checksum and then partition on that but that value will be human unreadable and i won't be able to find  PERIODYTD and PERIOD values from it without going and looking a row into the table.

    Regards,
    R

    If both numbers are 'dates' in the recent past, both number can be made a lot smaller. Depending on the oldest date in the specific column or in both columns a date can be choosen te be the 'zero' date. For each date you want to code the number of days since the zero date can be used. For the second date, the number of days since the first date can be used. Depending on the first date and the maximum length of the period, your numbers should be a lot smaller.

    The number of days since 2000-01-01 is 6851 days. (today is number 6852). So if all dates are after 2000-01-01 the size of the number is limited. Same for the length of the period.

    DATEDIFF can be used to determine the 'number' of days since a specific day.
    select DATEDIFF(dd,'2000-01-01',getdate())

    This only suggests how to combine the two 'dates' into a single integer (or string). This does not take into account the problem you want to address. 

    Ben

  • sgmunson - Thursday, October 4, 2018 7:49 AM

    So what's stopping you from using two columns instead of just one, and not wasting an additional column at all ?

    Are you suggesting me to define a partition function on 2 columns ? Can you do that ?

  • ben.brugman - Thursday, October 4, 2018 7:58 AM

    _rohit_ - Wednesday, October 3, 2018 8:27 PM

    Hi,
    I may not be the first one to ask this, but I'm looking to find a way to combine two integers 20180131 and 20160731 into one integer value and then break that single value back into 2 values.
    i have a huge table ( >500 mil) that i want to partition using these 2 columns PERIODYTD ( eg. 20180131) and PERIOD(20160731) . If i concatenate these 2 values into something like 2018013120160731, the new field will have to be a BIGINT and will unnecessarily eat up space.

    I could generate a hash value using checksum and then partition on that but that value will be human unreadable and i won't be able to find  PERIODYTD and PERIOD values from it without going and looking a row into the table.

    Regards,
    R

    If both numbers are 'dates' in the recent past, both number can be made a lot smaller. Depending on the oldest date in the specific column or in both columns a date can be choosen te be the 'zero' date. For each date you want to code the number of days since the zero date can be used. For the second date, the number of days since the first date can be used. Depending on the first date and the maximum length of the period, your numbers should be a lot smaller.

    The number of days since 2000-01-01 is 6851 days. (today is number 6852). So if all dates are after 2000-01-01 the size of the number is limited. Same for the length of the period.

    DATEDIFF can be used to determine the 'number' of days since a specific day.
    select DATEDIFF(dd,'2000-01-01',getdate())

    This only suggests how to combine the two 'dates' into a single integer (or string). This does not take into account the problem you want to address. 

    Ben

    Thanks ben, i i thought of something similar later today after i posted this question.
    I'm thinking of using day of year in place of month and day and stripping first two digits of year. So 20180131 would become 18031where 18 is for 2018 and 031 is day of year.
    So 20180131 and 20170131 would be 1803117031 which can fit into INT.

  • _rohit_ - Thursday, October 4, 2018 8:03 AM

    sgmunson - Thursday, October 4, 2018 7:49 AM

    So what's stopping you from using two columns instead of just one, and not wasting an additional column at all ?

    Are you suggesting me to define a pertition duntion on 2 columns ? Can you do that ?

    Apparently, the only way to do it is to combine both columns.  Wasn't aware of that before.   Now that I am, how many partitions are you planning to have?  What kind of division of your data are you looking to achieve?   Maybe you'll only need one of them...   How many of your 500 million+ rows are you hoping will be in any given partition?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • _rohit_ - Thursday, October 4, 2018 8:09 AM

    So 20180131 and 20170131 would be 1803117031 which can fit into INT.

    I still would go with the number of days since 2000-01-01 or even a later date. And depending on the period length the number of days of the period, this makes the system independend of date presentations systems and more compact.
    If the dates are all beyond 2010-01-01 I would go for that date. 10 000 is enough for 27 years, 32000 is enough for 87 years. And this is a relative small integer.

    Number of days in an integer is a fairly clean method. Using a date representation in a integer is a less clean method, because using the number of days is more compact and can always be converted both ways. Using using the number of days within a year can not always converted in both directions. The number 400 can not be converted to a day within a year.
    So using a integer to represent the number of days since a specific date is compacter and more robust than using a 'representation' and presenting this in a date. As in your example what does 18456 give as an result when converted back. 

    Ben
    (Again no comment how to use this for partitioning).

  • ben.brugman - Thursday, October 4, 2018 8:33 AM

    _rohit_ - Thursday, October 4, 2018 8:09 AM

    So 20180131 and 20170131 would be 1803117031 which can fit into INT.

    I still would go with the number of days since 2000-01-01 or even a later date. And depending on the period length the number of days of the period, this makes the system independend of date presentations systems and more compact.
    If the dates are all beyond 2010-01-01 I would go for that date. 10 000 is enough for 27 years, 32000 is enough for 87 years. And this is a relative small integer.

    Number of days in an integer is a fairly clean method. Using a date representation in a integer is a less clean method, because using the number of days is more compact and can always be converted both ways. Using using the number of days within a year can not always converted in both directions. The number 400 can not be converted to a day within a year.
    So using a integer to represent the number of days since a specific date is compacter and more robust than using a 'representation' and presenting this in a date. As in your example what does 18456 give as an result when converted back. 

    Ben
    (Again no comment how to use this for partitioning).

    So, basically, you are recommending a date data type...   It should bother you....in some way...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Yes, you can combine them into one integer and get them back.  The int value will most likely be negative, I don't know if that matters for you.

    The algorithm is:
    1) convert each date to the number of days since 1900-01-01 (or 1950-01-01 or other base date; the further you push forward the base date, the longer you can go, and the more positive key values you will have)
    2) convert those values to varbinary(2)
    3) concatenate the two varbinary(2) values into a single varbinary(4) value
    4) convert that varbinary(4) to an int, which will be the date key value
    5) to get original values back, reverse the process:
    convert the int value to varbinary(4)
    spit the varbinary(4) into two varbinary(2) values
    convert the varbinary(2) values to ints
    add those int values to the base date to get the original dates back.
    This should work OK for all dates thru 20781231, as far as I can tell. 
    If you make the base date 1950 rather than 1900, that would take you well into the 22nd century.

    Here's sample code that shows the process, using 1950 as the base date:

    SELECT *
      --date1, date2, date1_days, date2_days,
      --date_key, date1_recovered, date2_recovered
    FROM ( VALUES
      (CAST('20180131' AS date),CAST('20160731' AS date)),
      ('20821231','20821231')
    ) AS dates(date1, date2)
    CROSS APPLY (
      SELECT DATEDIFF(DAY, '19500101', date1) AS date1_days,
       DATEDIFF(DAY, '19500101', date2) AS date2_days
    ) AS alias1
    CROSS APPLY (
      SELECT CAST(date1_days AS varbinary(2)) AS date1_binary,
       CAST(date2_days AS varbinary(2)) AS date2_binary
    ) AS alias2
    CROSS APPLY (
      SELECT CAST(date1_binary + date2_binary AS int) AS date_key
    ) AS alias3
    CROSS APPLY (
      SELECT DATEADD(DAY, CAST(SUBSTRING(CAST(date_key AS varbinary(4)), 1, 2) AS int), '19500101')
       AS date1_recovered,
       DATEADD(DAY, CAST(SUBSTRING(CAST(date_key AS varbinary(4)), 3, 2) AS int), '19500101')
       AS date2_recovered
    ) AS alias4

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • as a alternative method 

    constraints
    - PERIODYTD date is always greater or equal to PERIOD
    - difference between the 2 dates is less or equal 3647 -- just shy of 10 years (9 years, 12 month, 27 days)
    -- PERIODYTD must be >= 1900-01-01  and less than 2487-12-17 -- these values can change but interval will be same


    declare @ytd int = 20180131
    declare @per int = 20160731
    declare @Dt3 int
    set @Dt3 = 2180113

    select t.ytd
      , t.per
      , conv2.*
      , rev2.*
    from (select @ytd as ytd
        , @per as per
    ) t

    outer apply (select datediff(day, '1900-01-01', convert(date, convert(varchar(8), @ytd, 112))) as ytd_p1
          , datediff(day, convert(date, convert(varchar(8), @per, 112)), convert(date, convert(varchar(8), @Ytd, 112))) as per_p1
    ) conv1
    outer apply (select conv1.ytd_p1 * 10000 + conv1.per_p1 as final -- value to use as partition column
    ) conv2
    -- convert back
    outer apply (select conv2.final % 10000 as per_rev1
          , (conv2.final - (conv2.final % 10000)) / 10000 as ytd_rev1
    ) rev1
    outer apply (select convert(int, convert(varchar(8), dateadd(day, rev1.ytd_rev1, '1900-01-1'), 112)) as ytd_rev_final
          , convert(int, convert(varchar(8), dateadd(day, -1 * rev1.per_rev1, dateadd(day, rev1.ytd_rev1, '1900-01-1')), 112)) as per_ref_final
    ) rev2

    I have to say I do not see what you trying to accomplish by partitioning like this. Partitioning on the single date should be enough.
    But maybe you would like to enlighten us on the reasoning behind your decision.

  • I guess I don't understand the desire to partition by two columns, never mind the PeriodYTD column.  It would appear that the PERIOD column is based on month ending dates (which I see as another problem that we'll leave for another discussion).  So, what is the PERIOD column?  Does it contain the end date for every month?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sgmunson - Thursday, October 4, 2018 9:46 AM

    So, basically, you are recommending a date data type...   It should bother you....in some way...

    Can you explain why this should bother me ?

    I do understand that; if possible use the existing types. But if there is a requirement for compact datastorage of two dates in a single column, then using an integer with a small and sufficient range for 'datenumbers' is a way to go. (Why the TP wants this, that is something I have not considered).

    Ben

  • frederico_fonseca - Thursday, October 4, 2018 5:35 PM

    as a alternative method 

    constraints
    - PERIODYTD date is always greater or equal to PERIOD
    - difference between the 2 dates is less or equal 3647 -- just shy of 10 years (9 years, 12 month, 27 days)
    -- PERIODYTD must be >= 1900-01-01  and less than 2487-12-17 -- these values can change but interval will be same


    declare @ytd int = 20180131
    declare @per int = 20160731
    declare @Dt3 int
    set @Dt3 = 2180113

    select t.ytd
      , t.per
      , conv2.*
      , rev2.*
    from (select @ytd as ytd
        , @per as per
    ) t

    outer apply (select datediff(day, '1900-01-01', convert(date, convert(varchar(8), @ytd, 112))) as ytd_p1
          , datediff(day, convert(date, convert(varchar(8), @per, 112)), convert(date, convert(varchar(8), @Ytd, 112))) as per_p1
    ) conv1
    outer apply (select conv1.ytd_p1 * 10000 + conv1.per_p1 as final -- value to use as partition column
    ) conv2
    -- convert back
    outer apply (select conv2.final % 10000 as per_rev1
          , (conv2.final - (conv2.final % 10000)) / 10000 as ytd_rev1
    ) rev1
    outer apply (select convert(int, convert(varchar(8), dateadd(day, rev1.ytd_rev1, '1900-01-1'), 112)) as ytd_rev_final
          , convert(int, convert(varchar(8), dateadd(day, -1 * rev1.per_rev1, dateadd(day, rev1.ytd_rev1, '1900-01-1')), 112)) as per_ref_final
    ) rev2

    I have to say I do not see what you trying to accomplish by partitioning like this. Partitioning on the single date should be enough.
    But maybe you would like to enlighten us on the reasoning behind your decision.

    I don't think the integer concat approach really works too well when trying to recover the dates.  For example, try the above code with:

    declare @ytd int = 20220131 --future dates, but diff btwn dates < 7 yrs
    declare @per int = 20280731
     --future dates, but diff btwn dates < 7 yrs

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 15 posts - 1 through 15 (of 39 total)

You must be logged in to reply to this topic. Login to reply