How to generate new Int ID for multiple columns

  • I am working on staging data for loading a tabular model and I have to merge transactions and the parent account number into my Fact table. But because the Transactions and accounts could have matching IDs in other source systems I want to renumber them. In addition to improving the performance of the model with a smaller Int key instead of a key made up of a mix of 30 some characters.

    So if in my source table I have

    TXID AccountID

    3fsd3 12

    fdgd 12

    23da 12

    34fa 13

    324c 13

    vcvdfg 44

    How do I get it to be

    TXID AccountID NewTXID NewAcctID

    3fsd3 12 1 1

    fdgd 12 2 1

    23da 12 3 1

    34fa 13 4 2

    324c 13 5 2

    vcvdfg 44 6 3

    I have been tinkering with the Row_Number() Over (order By TransactionID) -T-SQL and it works fine for creating the TXID. But I can't seem to figure out the Partition part of the T-SQL to only create a NewACctID only when the account ID changes.

    The only way I have been able to do that is to create two tables with a incremental ID column, load Transactions into one, accounts into the other, and then merge them to get the IDs. But I am hoping there is way to do it in the TQL.

  • I believe you'd be looking for the DENSE_RANK or RANK functions to do what you're trying to do with the Account ID's, keep in mind that will break if you ever get files on different days that have different accounts in them.

    Are you actually noticing performance issues that make you think switching to an int is necessary?

  • You can try this:

    declare @t table (TXID varchar(10), AccountID int)

    insert @t

    (TXID, AccountID)

    values

    ('3fsd3', 12), ('fdgd', 12), ('23da', 12), ('34fa', 13), ('324c', 13), ('vcvdfg', 44)

    select t.TXID,

    t.AccountID,

    row_number() over (order by t.AccountID, t.TXID),

    dense_rank() over (order by t.AccountID)

    from @t t

    Don Simpson



    I'm not sure about Heisenberg.

  • I am new to tabular modeling and from what I have read is that queries shouldn't take longer than a few seconds. My transaction table is just short of a 200 million rows and the Account table is about 6 million. The keys are varchar and quite long and from what I have read in the Performance tuning guide, if I can use get the system to value encoding instead of hash, I'll get better performance. They should also take up less space.

  • The Dense_rank is perfect. It allowed me to run it against all my columns. I even replaced my Row_number with it.

    SELECT t.TransactionId, dense_rank() over (order by T.TransactionID) as TXKey,

    t.FacilityAccountID, dense_rank() over (order by t.FacilityAccountID) as FAKey,

    T.attendingProvider, dense_rank() over (order by t.attendingProvider) as AttPrvKey,

    T.ReferringProvider, dense_rank() over (order by t.ReferringProvider) as RefPrvKey,

    t.PostDate

    from Transactions T

    Is there a reason to use Row_number instead of dense_rank on the first column. It appears dense_rank did the same thing.

  • Phillip.Putzback (4/29/2016)


    Is there a reason to use Row_number instead of dense_rank on the first column. It appears dense_rank did the same thing.

    When TXID is unique, they will do the same thing. I suspect that row_number probably has a little less overhead.

    Don Simpson



    I'm not sure about Heisenberg.

  • Is there a reason to use Row_number instead of dense_rank on the first column. It appears dense_rank did the same thing.

    ROW_NUMBER does not take ties into consideration, RANK and DENSE_RANK do.

    To better understand the difference between ROW_NUMBER, RANK and DENSE_RANK consider the following queries (note my comments):

    -- Insert unique values into table

    DECLARE @numbers TABLE(SomeNumber int);

    INSERT @numbers VALUES (1), (2), (3), (4);

    -- ROW_NUMBER, RANK and DENSE_RANK are all the same

    SELECT

    SomeNumber,

    [ROW_NUMBER] = ROW_NUMBER() OVER (ORDER BY SomeNumber),

    [RANK] = RANK() OVER (ORDER BY SomeNumber),

    [DENSE_RANK] = DENSE_RANK() OVER (ORDER BY SomeNumber)

    FROM @numbers;

    -- Now add some duplicate values

    INSERT @numbers VALUES (2), (2), (3), (3);

    -- Now let's see how each function produces different values

    SELECT

    SomeNumber,

    [ROW_NUMBER] = ROW_NUMBER() OVER (ORDER BY SomeNumber),

    [RANK] = RANK() OVER (ORDER BY SomeNumber),

    [DENSE_RANK] = DENSE_RANK() OVER (ORDER BY SomeNumber)

    FROM @numbers;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (5/4/2016)


    ROW_NUMBER does not take ties into consideration, RANK and DENSE_RANK do.

    I would phrase that differently. They all take ties into consideration: RANK and DENSE_RANK treat ties as being indistinguishable and assign them all the same value whereas ROW_NUMBER treats ties as distinct and orders them in a non-deterministic manner before assigning distinct values.

    The way you phrased it could potentially be interpreted as ROW_NUMBER discards ties, which is most definitely NOT the case.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (5/4/2016)


    Alan.B (5/4/2016)


    ROW_NUMBER does not take ties into consideration, RANK and DENSE_RANK do.

    I would phrase that differently. They all take ties into consideration: RANK and DENSE_RANK treat ties as being indistinguishable and assign them all the same value whereas ROW_NUMBER treats ties as distinct and orders them in a non-deterministic manner before assigning distinct values.

    The way you phrased it could potentially be interpreted as ROW_NUMBER discards ties, which is most definitely NOT the case.

    Drew

    True that.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks for all the tips. Since the tables are so wide that I am importing into a stage table I am having to break down columns that I want keys on into there own tables and creating the keys off of an Identity column since using dense_rank was killing the server when trying to run a table with a dozen columns that I want keys for.

  • Did you consider sequence tables? I've not used them myself but would they have done what you wanted?

    https://msdn.microsoft.com/en-us/library/ff878091.aspx

    Jez

  • I'll have to hunt down some other examples because I really don't get what the sequence object is used for in their examples.

    Thanks.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply