Difference between equal directly and use SUBSTRING

  • Hello,

    I have a script that was running really long, and I have found why it was running so long, but I don't understand why.

    In the WHERE clause I have the following code:

    AND SUBSTRING(CDHDR_NEW.OBJECTID, 1, LENGTH_EBELN) = MM_EKKO_STAGING.EBELN

    This code is logically correct, even when actually in this particular case "LENGTH_EBELN" field is always the same number = 10, so it can be changes to 10. We do not change it because in the future might happen that this is not correct, and there might be higher variability of numbers, not only 10.

    This logic is that first symbols from OBJECTID should be the same as EBELN, where OBJECTID can have up to 20 symbols. EBELN can have length maximum of 10 symbols. In this case is always 10, but in some cases that might not be true.

    When I use the code as it is, the statement is so slow, and when I say slow, it ran for almost 17 straight days.

    Yesterday, without any logic, we just decided to add into the code this line, so both conditions were active in the WHERE Clause, and for our surprise, results came into almost 3 minutes. We are still analyzing if the results are correct, but seems that the results is correct is the same in both cases, with the only difference that one ran for 17 days, and the other 3 minutes.

    AND CDHDR_NEW.OBJECTID = MM_EKKO_STAGING.EBELN

    Somewhere I found that substring is not necessary, when you compare 2 values from tables, and one of the values has just 10 symbols, and the other has more for example 20 symbols, because SQL Server automatically compares only 10 symbols, I have no idea if that's correct, but using this lane in the code, time spent on the query was really ingressive.

    How to understand what happened here, are those 2 lanes together somehow working better, or it is just as simple as I explained above..? If it is necessary, I can put the whole code, but based on the nature of the topic, I don't think it's necessary.

    Thank you,

    Best Regards,

    Martin Chuchukov

  • Using a function on a column generally stops it being SARGABLE. (ie It will not be able to use any indexes.)

    You should use:

    AND CDHDR_NEW.OBJECTID LIKE MM_EKKO_STAGING.EBELN + '%'
  • Martass wrote:

    Somewhere I found that substring is not necessary, when you compare 2 values from tables, and one of the values has just 10 symbols, and the other has more for example 20 symbols, because SQL Server automatically compares only 10 symbols, I have no idea if that's correct, but using this lane in the code, time spent on the query was really ingressive.

    this is a rather incorrect assumption and can be easily proven to be wrong.

    SQL Server will ALWAYs compare the whole contents with exception of trailing spaces which are ignored (except when doing a LIKE comparison where they are considered - see https://learn.microsoft.com/en-US/sql/t-sql/language-elements/string-comparison-assignment?view=sql-server-ver16)

    drop table if exists #t1
    drop table if exists #t2

    select *
    into #t1
    from (values ('this is a frog')
    , ('this is not a frog')
    , ('this is also a frog')
    , ('this is also not a frog')
    ) t(contents)

    select *
    into #t2
    from (values ('this is a')
    , ('this is')
    , ('this is also')
    ) t(search)

    -- this one returns zero rows - according to OP all the rows starting with "this is" should be retrieved as the contents of the second row on #t2 matches all the rows on #t1
    select *
    from #t1 t1
    inner join #t2 t2
    on t1.contents = t2.search

    -- using a substring, where the size of the search string is used will return all rows alongside the pattern that caused the search.
    -- this case is NOT SARGEABLE meaning a full table/index scan will take place

    select *
    from #t1 t1
    inner join #t2 t2
    on substring(t1.contents, 1, len(t2.search)) = t2.search

    -- using a LIKE operator will return all rows that match the pattern
    -- this case is SARGEABLE meaning a index range scan can take place (SQL May decide to do a full scan based on statistics)
    -- the lower search string is excluded to fully exemplify the search

    select *
    from #t1 t1
    inner join #t2 t2
    on t1.contents like t2.search + '%'
    and t2.search <> 'this is'

     

  • Ken McKelvey wrote:

    Using a function on a column generally stops it being SARGABLE. (ie It will not be able to use any indexes.)

    You should use:

    AND CDHDR_NEW.OBJECTID LIKE MM_EKKO_STAGING.EBELN + '%'

    Our DBA told us that we need to ignore and don't use LIKE as much as possible, because is causing a lot of problems, and it's "expensive". This is the main reason why we are using the SUBSTRING, even when in our DBA eye is expensive too, but "a little bit" better than LIKE.

    Also, the main question if I just use:  CDHDR_NEW.OBJECTID = MM_EKKO_STAGING.EBELN, will it compare the first 10, or it is not going to find any results? Right now I am not able to do the test, but later I will test it myself as well.

    In the SUBSTRING query index was used, but there was before the join some "sort" which took forever.

    Thank you

    Martin

     

  • substring vs like is only "better" if the like contains a wildcard at the start of the string.. e.g. like '%abc%'

    on the case above both substring and like will do a full scan - and the like then needs to further search each row for a pattern.

    if the string being compared only has a end wildcard then IT IT better than using a substring as it can filter the data on the index access stage.

    so on the second case substring will do a full scan, like will do a range lookup (this potentially reducing IO and logical reads) - the string compare bit will be nearly the same for both substring and like on this case.

    so your DBA needs to either come clear to you on what type of LIKE cases you can use, or they need to learn the difference between the two types.

  • frederico_fonseca wrote:

    this is a rather incorrect assumption and can be easily proven to be wrong.

    SQL Server will ALWAYs compare the whole contents with exception of trailing spaces which are ignored (except when doing a LIKE comparison where they are considered - see https://learn.microsoft.com/en-US/sql/t-sql/language-elements/string-comparison-assignment?view=sql-server-ver16)

    Hello, after deeper analysis, I have found why on both conditions the given result is the same, and that's because the other input conditions under where clause are filtering only OBJECTID's that are 100% same as EBELN from the other table. So my assumption was wrong, because of a "lucky correct" structure setup of the tables.

    frederico_fonseca wrote:

    so your DBA needs to either come clear to you on what type of LIKE cases you can use, or they need to learn the difference between the two types.

    Well, that was never specified, only "general recommendation" without any specifications, as you specified, but yes, I believe that what are you saying will be the same if I ask them for specifications.

    ---

    So the only solution to make it faster, and yet still be sure it's correct seems to be that we will recreate the table with adding 2 new columns, where we will generate additional column with OBJECTID (with length same as EBELN), where the comparison can go 1:1 without LIKE or SUBSTRING. And also we do the same substring with the year, so to make it faster, we will get rid of the substring by creating new column which will be used for it. The script once created will be a script that will be used at least once a week, more likely once a day, and the script will never change, so it's worth it to invest a little bit of time to add additional columns and make my life easier.

    • This reply was modified 1 month ago by  Martass.
  • Martass wrote:

    Ken McKelvey wrote:

    Using a function on a column generally stops it being SARGABLE. (ie It will not be able to use any indexes.)

    You should use:

    AND CDHDR_NEW.OBJECTID LIKE MM_EKKO_STAGING.EBELN + '%'

    Our DBA told us that we need to ignore and don't use LIKE as much as possible, because is causing a lot of problems, and it's "expensive". 

    Using LIKE with the wildcard at the end of the string is fine as the optimizer will be able to use an index for the first bit of the string it knows about. Wildcards at the start of a string will obviously not be able to use an index and resort to a table scan. SUBSTRING will also do a table scan. You can see this in the query plan.

     

  • From the conversation it sounds like the OP has a string field that is actually holding multuiple pieces of information.  The issue here is actually one of poor database design.  One field= 1 piece of data.  It is much easier to concatenate the fields together on output for a string comparison that it is to shred the string (using substr or LIKE)  to find matches

    If you have no control over the incoming data, you could always try adding computed columns to split the string into its component elements.

    This will also have the benefit that these fields can be indexed

     

  • This is SAP data. EBELN is the purchasing document number which in my experience is always 10 characters long, assuming the transfer to SQL Server doesn't strip the leading zeroes. The CDHDR.OBJECTID is 180 character field in a table that tracks changes to data, and it's a generic field that applies to a multitude of objects. For EBELN it will always match the 10 characters, so there's no need for a substring or like.

  • aaron.reese wrote:

    From the conversation it sounds like the OP has a string field that is actually holding multuiple pieces of information.  The issue here is actually one of poor database design.  One field= 1 piece of data.  It is much easier to concatenate the fields together on output for a string comparison that it is to shred the string (using substr or LIKE)  to find matches If you have no control over the incoming data, you could always try adding computed columns to split the string into its component elements. This will also have the benefit that these fields can be indexed

    Hi, the logic behind the code is not mine, and the person come up with it says that's how is supposed to be.

    I have analyzed it myself and because those tables are standard tables from SAP, seems that the values in them are exactly the same as on the other tables, so in my opinion like/substr is not needed. If i understood how the tables CDHDR and CDPOS are filled in SAP, is that any kind of change done is SAP (business data change I believe) is saved on those tables, where what's saved is not somehow changed, but only save as new record and that's it. So with other words, it is not possible to have record which cannot be found using simple comparison (of course we need to compare the relevant table and field), and there is no need of a comparison using like/substring.

    I am in discussion with the person responsible for the logic of the WHERE/JOIN clause. As workaround if cannot agree, we are going to add additional field into the table where OBJECTID will be copied with length same as EBELN (10 symbols), even when in my opinion this is absolutely unnecessary activity.

    Ross McMicken wrote:

    This is SAP data. EBELN is the purchasing document number which in my experience is always 10 characters long, assuming the transfer to SQL Server doesn't strip the leading zeroes. The CDHDR.OBJECTID is 180 character field in a table that tracks changes to data, and it's a generic field that applies to a multitude of objects. For EBELN it will always match the 10 characters, so there's no need for a substring or like.

    Yes, it's SAP data, and it is extracted 1:1 into the SQL, so there is everything exactly the same as it is in SAP to the last symbol. Only transparent tables are extracted. I am responsible for the creation of the SQL, not so much for the logic, another person comes with the logic, and I am only coding it, because I don't have much knowledge about SAP, how the tables are linked, and so on. Thank you for the clarification, after analysis on the tables in the SQL Server, I came to the same conclusion as you stated - EBELN always matches 10 characters, and I have the feeling that each record into CDHDR and CDPOS is match to some TABLE.FIELD, and there is no need of any kind of like/substring transformation before comparing it.

    Thank you for the clarification, and I hope soon we will come up with solution, because seems that here is more of a knowledge/logic problem to be fixed instead of technical problem.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply