Join operation over String non-index fields

  • Hi all,

    I have to perform some joins between a Fact table (with millions of records) and some contextual tables. My problem is that some tables have an auto incremental field as the PK, but this key is not useful for join operations. The business key is stored in a string field; moreover, this field is not defined as unique and is a String field. When the value of the field is unknown you get a "U". In the following image an example of this situation can be seen:

    My first approach was to create new tables with the business key as the index and an ETL to load this table, but the managers of the project have discard this option and I need a clever solution to overcome this drawback.

    Any comments or suggestions will be appreciated.

    Kind Regards,

    Paul

    Paul Hernández
  • Hi Paul

    Apart from the fact that a simple join between the two tables on konsumfeld

    SELECT ...

    FROM Tableb b

    LEFT JOIN Tablea a ON a.konsumfeld = b.konsumfeld

    would eliminate rows from tableb where konsumfeld = 'U', I can't see what the problem is. Why would you want to make another copy of the table in order to index konsumfeld?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Thanks so much for your quick answer.

    My problem is I have to perform 3 Joins with 2 tables with the same situation and 1 table with a numeric key. I don't have control over the source tables, but the business keys always hold numeric values and they are defined as Strings. For a fact table with 482.866 records the query takes about 6 minutes to execute. I don't know how powerful the server is but is this a reasonable amount of time? This fact table contains only delta data (records for 1 day) but imagine the historical table that has more than 165 million of rows.

    Paul Hernández
  • You're welcome Paul.

    paul.pauldj54 (7/30/2012)


    ...the business keys always hold numeric values and they are defined as Strings...

    A column containing values with leading zero's can only be a string.

    ...For a fact table with 482.866 records the query takes about 6 minutes to execute. I don't know how powerful the server is but is this a reasonable amount of time? ...

    I'd say this is slow. Can you post the actual execution plan of the query? Save it as a .sqlplan file and post it here as an attachment.

    Is the business key column the same datatype in all of the tables?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Attached you can find the execution plan (I've never used this feature, it seems to be very cool).

    If you look at the first figure (from my 1st post) there is a column called "KUNDE_KEY" which is a business key, and a numerical field.

    You are absolutely right, leading-zero values must be strings, I haven't notice 😀

    Now I'll dig deep into execution plans :alien:

    Thanks for your time,

    Paul

    Paul Hernández
  • Hi Paul, can you post the Actual plan please, rather than the estimated plan? There are often differences between the two - for instance, the estimated plan you've just posted indicates that your output from this query is 56 million rows.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Attached the Actual Excecution Plan,

    For other readers, I found this link very useful to understan execution plans from SQL SERVER: http://www.simple-talk.com/sql/performance/execution-plan-basics/

    Kind Regards,

    Paul

    Paul Hernández
  • Hi Paul, thanks for the plan.

    Points:

    1. No clustered indexes on any of the 4 tables, a suitable clustered index on DeltaFAKT_DETAIL would probably help the expensive SORT in the plan. I'd start with the following columns in that clustered index, in the order shown:

    FD.FILIALE_KEY, -- join column

    FD.BRANCHE, -- join column

    FD.KONSUMFELD, -- join column

    FD.KUNDE_KEY,

    FD.EK_DATUM,

    FD.ABT,

    FD.UN_3

    I'd also change the GROUP BY to match this order:

    GROUP BY

    FD.FILIALE_KEY, --

    FD.BRANCHE, --

    FD.KONSUMFELD, --

    FD.KUNDE_KEY,

    FD.EK_DATUM,

    FD.ABT,

    FD.UN_3,

    FI.FILIALE,

    FI.FILIALE_KTEXT,

    FI.FILIALE_VERKAUFSREGION,

    FI.FILIALE_SAP_KEY,

    BR.BRANCHE,

    BR.BRANCHE_BEZ,

    BR.BRANCHE_KBEZ,

    KON.KONSUMFELD,

    KON.KONSUMFELD_BEZ,

    KON.KONSUMFELD_KBEZ;

    Indexes on the other tables would promote more efficient joins, but their rowcounts are so low that the improvement might not be noticeable.

    2. There's a 1:1 join between FILIALE and DeltaFAKT_DETAIL; however the 482,866 row product of this join, once joined with BRANCHE, has 10x the number of rows. Is this correct? Is there a join predicate missing here?

    3. Likewise with the join to table KONSUMFELD; the rowcount goes up by a factor of 10.

    4. The final output set of 537,859 rows is sufficiently similar to the rowcount of DeltaFAKT_DETAIL to raise an eyebrow.

    I'd begin by checking that the join predicates are correct.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Chris, you're simply amazing!!!

    I've learned a lot from you today.

    I'll study carefully your last reply, implement the changes and then write back to the forum to share my results.

    Kind Regards,

    Paul

    Paul Hernández
  • That's very kind Paul but I'm just a jobbing TSQL developer! Thanks anyway.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris:

    I've found one of the problems. I was joining data from BRANCHE and KONSUMFELD using the columns branche and konsumfeld respectively. These columns are not unique, that's the reason for the increment in the row numbers of the resultset. After some tests I realize that there is another column, named UN_3, which should be use together with branche (the same case for konsumfeld) to obtain a unique key.

    This database was created by a 3rd party and is still in a development phase and I don't have enough information and metadata. At the moment I have to discover everything by myself (and my best friend, the SQL Management Studio), and my colleagues from sqlservercentral.

    I think also that the indexes are disabled because they can cause low performance during loadings.

    Thanks for your help

    Kind Regards

    Paul Hernández
  • Excellent investigative work, Paul.

    It's not uncommon for indexes to be disabled during loads - but best practice is to enable them again afterwards 😉

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply