Finding Lowercase Characters

  • Comments posted to this topic are about the item Finding Lowercase Characters

  • Hi there. Interesting question. Here are some notes:

     

    1. At the start of the question it mentions that the database is using the Latin1_General_100_CI_AS_CS collation. Those last two letters are transposed. The actual collation name is: Latin1_General_100_CI_AS_SC.
    2. Regarding the following statement in the explanation: "The reason Latin1_General_BIN was used in this scenario is due to the fact that binary collations are case-sensitive and enforce a binary sort order."

      This is a very common misconception, but binary collations are in fact not case-sensitive due to not allowing for combining characters, full width characters, and a few other scenarios. For full details, please see: No, Binary Collations are not Case-Sensitive

    3. Regarding the following statement in the explanation: "The following collations would also be viable:..."

      In cases where a binary collation does make sense to use for comparisons, then it's entirely irrelevant which binary collation is used as they're all simply comparing bytes (even the BIN2 collations that compare code units are still doing the same binary comparison across all BIN2 collations).

    4. In this particular scenario, using a binary collation is not ideal as it really only works with the US English alphabet. The preferred solution for finding lower-case letters is the following:
      WHERE LEFT(FirstName, 1) = LOWER(LEFT(FirstName, 1)) COLLATE Latin1_General_100_CS_AS_SC;

    Take care,

    Solomon....

     

     

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Hi Solomon,

    Your solution does also find those FirstNames that does not really have a lowercase version, i.e. where FirstName starts with a digit or a dash or a dot...

    I would probably try something like:

    where left(su.name,1) = lower(left(su.name,1)) collate SQL_Latin1_General_CP1_CS_AS
    and left(su.name,1) <> upper(left(su.name,1)) collate SQL_Latin1_General_CP1_CS_AS

    My database (Azure SQL database) is using SQL_Latin1_General_CP1_CI_AS, so should I be using SQL_Latin1_General_CP1_CS_AS?

    Best regards, Henrik

     

     

  • Before seeing the answers your's is what I was going for

  • Interesting question.

    You beat me to it, Solomon. Your explanation sums it up nicely...

    ____________________________________________
    Space, the final frontier? not any more...
    All limits henceforth are self-imposed.
    “libera tute vulgaris ex”

  • Solomon Rutzky wrote:

    Hi there. Interesting question. Here are some notes:

     

      <li style="list-style-type: none;">

    1. At the start of the question it mentions that the database is using the Latin1_General_100_CI_AS_CS collation. Those last two letters are transposed. The actual collation name is: Latin1_General_100_CI_AS_SC.

     

      <li style="list-style-type: none;">

    1. Regarding the following statement in the explanation: "The reason Latin1_General_BIN was used in this scenario is due to the fact that binary collations are case-sensitive and enforce a binary sort order."

      This is a very common misconception, but binary collations are in fact not case-sensitive due to not allowing for combining characters, full width characters, and a few other scenarios. For full details, please see: No, Binary Collations are not Case-Sensitive

     

      <li style="list-style-type: none;">

    1. Regarding the following statement in the explanation: "The following collations would also be viable:..."

      In cases where a binary collation does make sense to use for comparisons, then it's entirely irrelevant which binary collation is used as they're all simply comparing bytes (even the BIN2 collations that compare code units are still doing the same binary comparison across all BIN2 collations).

     

      <li style="list-style-type: none;">

    1. In this particular scenario, using a binary collation is not ideal as it really only works with the US English alphabet. The preferred solution for finding lower-case letters is the following:
      WHERE LEFT(FirstName, 1) = LOWER(LEFT(FirstName, 1)) COLLATE Latin1_General_100_CS_AS_SC;

     

    Take care,

    Solomon....

    Hey Solomon,

    Thank you for the feedback! That very appropriately named article sheds some light on that misconception. Definitely saving this for future reference.

    I'll see what I can do about getting that typo fixed and the explanation updated in the interim.

  • Just curious how many of us use case insensitive fields?

    412-977-3526 call/text

  • Robert Sterbal wrote:

    Just curious how many of us use case insensitive fields?

    I generally have no need for case sensitive data.  However, if I did, I'd likely NOT make the whole server or even a given database case sensitive... just the columns that needed it.  Yes, I know that has some other ramifications especially when it comes to TempDB but I'd rather put up with that little nuance than have the whole server or even database be case sensitive.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Henrik Staun Poulsen wrote:

    Your solution does also find those FirstNames that does not really have a lowercase version, i.e. where FirstName starts with a digit or a dash or a dot...

    I would probably try something like:

    where left(su.name,1) = lower(left(su.name,1)) collate SQL_Latin1_General_CP1_CS_AS
    and left(su.name,1) <> upper(left(su.name,1)) collate SQL_Latin1_General_CP1_CS_AS

    Hello Henrik. Excellent point, thanks for catching that. Yes, I think that I would also approach it that way (as long as it's ok to exclude characters from alphabets that do not have the concept of upper-case/lower-case).

    The specific collation to use would be the case-sensitive version of the collation already being used by that column. In this case, @btylerwhite did use the correct collation of Latin1_General_100_CS_AS_SC.

    Also keep in mind that when working with non-Latin alphabets, it helps to use the most recent version of the desired collation (i.e. a 100 version instead of a 90 or non-versioned) as the newer versions have more upper-case to lower-case mappings which are used by the UPPER and LOWER functions. In some cases, using the Japanese_XJIS_140_* series is best as those are the most recent and have the most mappings (e.g. Japanese_XJIS_140_CS_AS ). Even if you aren't working with Japanese characters, the upper-case/lower-case mappings are correct for all cultures except for Turkish and one other that uses the same dotted upper-case "I" and dotless lower-case "i".

     

    Henrik Staun Poulsen wrote:

    My database (Azure SQL database) is using SQL_Latin1_General_CP1_CI_AS, so should I be using SQL_Latin1_General_CP1_CS_AS? 

    Whether or not to use a case-sensitive, or even binary, collation for the instance and/or database levels depends on you and your application. There is no correct answer. Case-insensitive collations are easier to interact with as you can use PascalCase for your object names (i.e. the correct approach 😉 ) yet not worry about the casing when writing ad hoc queries / debugging. However, that also allows for some sloppiness if the casing used in the code does not match the actual casing of the objects. Binary and case-sensitive collations enforce consistent casing between code and objects but are less forgiving when you might be in a hurry (i.e. debugging) and allow for a different form of sloppiness: objects having similar names that only differ in casing (e.g. Employee vs employee ).

    The bigger issue, for me, would be the use of the older SQL Server collations (i.e. names starting with SQL_ ) as they have fewer mappings (not important for US English, but certainly can be for other languages), fewer characters with defined sort weights, and they allow for performance degradation when VARCHAR columns are indexed but then compared to NVARCHAR literals or variables (for full details on that, please see:  Impact on Indexes When Mixing VARCHAR and NVARCHAR Types ). Hence, when working on SQL Server 2012 or newer, I would recommend using the Latin1_General_100_ series, preferably one ending in _SC when not using a _BIN* collation.

     

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • hi Solomon,

    I'm using Azure SQL Database, and it's default collation is SQL_Latin1_General_CP1_CI_AS. It is not possible to choose another collation for the database. That also goes for TempDB, and hence I have to be very careful about having other collations in my varchar columns, as your blog post point out.

    Thank you for the nice words,

    Henrik

     

  • Henrik - you can change its collation - just not as easy as doing it on Prem

    see https://www.mssqltips.com/sqlservertip/5530/how-to-change-an-azure-sql-database-collation/

  • As the database is using a case-sensitive collation, I don't see why the 2nd option wouldn't work.

  • Hi Frederico Fonseca,

    Yes, that might change the collation of my database, but not that of TEMPDB. So any query using a #temp table would need to be checked (and possibly revised) for any collation conflicts.

    I once had a test server with 3 different collations in Master, Msdb and mydatabase. That was "fun". It ended up with a re-install of SQL Server.

    Best regards, Henrik

     

  • Henrik Staun Poulsen wrote:

    Hi Frederico Fonseca,

    Yes, that might change the collation of my database, but not that of TEMPDB. So any query using a #temp table would need to be checked (and possibly revised) for any collation conflicts.

    I once had a test server with 3 different collations in Master, Msdb and mydatabase. That was "fun". It ended up with a re-install of SQL Server.

    Best regards, Henrik

    yes I am aware of those issues with On Prem - in Azure I don't know if it works the same as I don't have any setup - if you do maybe you could test it.

    regardless if any code using temp tables is created using collate default database it will work fine.

    in any case I was just pointing out that you can indeed change the user database collation which you said could not be done.

  • hi Frederico,

    <<I was just pointing out that you can indeed change the user database collation

    Yes, Agree. Sorry, English is not my first language.

    Best regards, Henrik

     

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply