Datatype performance - in theory there should be a difference, but in practice is there?

  • There's a great article here (http://sqlfool.com/content/PerformanceConsiderationsOfDataTypes.pdf) about WHY you should pick the appropriate scale of a datatype, but in this day and age of fast(er) disks, fast(er) CPUs, compression (and 'The Cloud'), does it really make any PRACTICAL difference? I acknowledge that in theory it should - that's not the question, so please don't go on about the number of pages that get read / written.

    I guess there are really two questions here: the first relates to choosing the correct data type, the second to choosing the correct scale of the datatype.

    a)

    When developing a data warehouse solution, why not just convert all fields to varchar(max)? (Reminder: I'm asking about the PRACTICAL performance difference.)

    b)

    When developing a data warehouse solution where data is sourced from existing systems, why not declare all character strings as varchar(8000), all integer values as bigint and all datetime values as datetime2(7)? Floats and numerics have their own issues!

    pcd

  • pcd_au (8/1/2016)


    There's a great article here (http://sqlfool.com/content/PerformanceConsiderationsOfDataTypes.pdf) about WHY you should pick the appropriate scale of a datatype, but in this day and age of fast(er) disks, fast(er) CPUs, compression (and 'The Cloud'), does it really make any PRACTICAL difference? I acknowledge that in theory it should - that's not the question, so please don't go on about the number of pages that get read / written.

    I guess there are really two questions here: the first relates to choosing the correct data type, the second relates to choosing the correct scale of the datatype.

    a)

    When developing a data warehouse solution, why not just convert all fields to varchar(max)? (Reminder: I'm asking about the PRACTICAL performance difference.)

    b)

    When developing a data warehouse solution where data is sourced from existing systems, why not declare all character strings as varchar(8000), all integer values as bigint and all datetime values as datetime2(7)? Floats and numerics have their own issues!

    pcd

    Quick thoughts:

    a) Converting everything to varchar(max) makes no sense, as any data-type-specific logic (date logic, for example) will demand a conversion back to the appropriate data type - see the sketch at the end of this post. It would also increase the size of the data, which in turn increases both memory I/O and disk I/O load. I cannot understand why anyone would do that.

    b) Increasing the size of the data (as mentioned in a)) will increase the load on the server with no benefit to the process. I have seen a 3-4 times reduction in size and large performance improvements from using the correct data types.

    😎
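
    A minimal sketch of the conversion problem in point a), assuming a hypothetical staging table (the names are illustrative, not from the thread):

    -- Hypothetical staging table where a date arrives as varchar(max)
    create table dbo.StageSales
    (
        SaleID   int          not null,
        SaleDate varchar(max) null      -- should be a date column
    );

    insert into dbo.StageSales (SaleID, SaleDate)
    values (1, '2016-08-01'), (2, 'not a date');

    -- Any date logic now needs an explicit conversion; the bad row forces
    -- TRY_CONVERT (or an outright failure with CAST/CONVERT), and wrapping
    -- the column in a function like this makes the predicate non-sargable
    select SaleID
    from dbo.StageSales
    where try_convert(date, SaleDate) >= '2016-01-01';

    -- varchar(max) columns also cannot be index key columns, so even a
    -- correctly formatted string column of this type cannot be seeked on
    drop table dbo.StageSales;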

  • Also, choosing the appropriate data type (and scale) helps prevent absurd data from getting into your database - you know, dates that aren't proper dates, phone numbers with text in them and so on (see the quick illustration below).

    John
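
    A quick illustration of that point, using a throwaway pair of tables: a typed date column rejects an impossible date outright, while a loosely typed column accepts it silently.

    -- A date column simply refuses absurd dates...
    create table dbo.TypedDates (EventDate date not null);
    insert into dbo.TypedDates values ('2016-74-01');    -- fails: conversion error

    -- ...while a varchar column happily accepts them
    create table dbo.UntypedDates (EventDate varchar(8000) not null);
    insert into dbo.UntypedDates values ('2016-74-01');  -- succeeds

    drop table dbo.TypedDates, dbo.UntypedDates;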

  • Doesn't SQL Server have dynamic data types, like varchar2, which normally consume only the space actually used by the data in the table? Even though we declare varchar2(50) and insert 'test', it should not take the space required for varchar2(50); it should consume only the space for the characters 'test'.

    Thank You.

    Regards,
    Raghavender Chavva

  • Yes, although there is a two-byte overhead for the variability, so storing "Test" would cost you 6 bytes (see the sketch below). But if, for example, you were storing just state codes such as "AL" or "NY", you wouldn't use varchar(250): first because of that two-byte cost, and second because you don't want values like "Alaska" and "New York" being entered.

    John
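
    A small sketch of the storage point, using DATALENGTH; note that the two-byte variable-length overhead lives in the row header, so DATALENGTH itself won't show it.

    -- varchar stores only the characters entered (plus 2 bytes of per-column
    -- row overhead that DATALENGTH does not include)
    select
        datalength(cast('Test' as varchar(250))) as varchar_bytes, -- 4
        datalength(cast('Test' as char(250)))    as char_bytes;    -- 250 (blank-padded)

    -- For fixed two-character codes, char(2) avoids the 2-byte overhead and,
    -- under the default ANSI_WARNINGS setting, rejects longer values such as
    -- 'Alaska' with a truncation error instead of silently storing them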

  • John Mitchell-245523 (8/2/2016)


    Also, choosing the appropriate data type (and scale) helps prevent absurd data from getting into your database - you know, dates that aren't proper dates, phone numbers with text in them and so on.

    John

    Given that this is a data warehouse application, all the data is sourced from existing databases, so in theory it should be correct. That said, I did work on an Oracle system once that tried to claim a transaction happened in the 74th month! If the source has text in a phone number field, then I want to know about it, as it becomes part of the quality report (a sketch of such a check follows below).

    pcd
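
    A minimal sketch of the kind of quality-report check being described; the table and column names are hypothetical:

    -- Flag source rows whose phone number contains anything other than
    -- digits, spaces and common punctuation, for the data quality report
    select CustomerID, Phone
    from staging.Customer                 -- hypothetical source table
    where Phone like '%[^0-9 ()+-]%';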

  • Eirikur Eiriksson (8/2/2016)


    Quick thoughts:

    a) Converting everything to varchar(max) makes no sense, as any data-type-specific logic (date logic, for example) will demand a conversion back to the appropriate data type. It would also increase the size of the data, which in turn increases both memory I/O and disk I/O load. I cannot understand why anyone would do that.

    b) Increasing the size of the data (as mentioned in a)) will increase the load on the server with no benefit to the process. I have seen a 3-4 times reduction in size and large performance improvements from using the correct data types.

    😎

    The observation about logic causing table scans is what I am alluding to in relation to page compression, faster CPUs, disks etc. But my question is 'Does it make a real, practical, noticeable difference?'. People go on about stored procedures being cached in memory and not having to be recompiled; frankly, if the query is going to take 3 minutes to run and 4 nanoseconds to compile, who cares about the compilation?

    I'm not interested in the '3-4 times reduction in size', but I am interested in the 'large performance improvements by using the correct data types' - do you have any more info on this bit?

    pcd

  • Given that this is a data warehouse application, all the data is sourced from existing databases, so in theory it should be correct. That said, I did work on an Oracle system once that tried to claim a transaction happened in the 74th month! If the source has text in a phone number field, then I want to know about it, as it becomes part of the quality report.

    Ah yes, famous last words! That should work, until the time that someone tries to fix an issue by doing a manual update that inadvertently violates the business rules. Enforcing data integrity in the data layer is the best way to stop that from happening... and it's free!

    John

  • pcd_au (8/2/2016)


    The observation about logic causing table scans is what I am alluding to in relation to page compression, faster CPUs, disks etc. But my question is 'Does it make a real, practical, noticeable difference?'. People go on about stored procedures being cached in memory and not having to be recompiled; frankly, if the query is going to take 3 minutes to run and 4 nanoseconds to compile, who cares about the compilation?

    I'm not interested in the '3-4 times reduction in size', but I am interested in the 'large performance improvements by using the correct data types' - do you have any more info on this bit?

    pcd

    You will have to understand how SQL Server works, which workloads are expensive and which factors bear more weight than others; data sizes and I/O do have a very sizeable impact on performance. Narrower data sets, narrower indices etc. improve performance out of proportion to the reduction in data size, especially when working on large analytical sets like DWs (there's a small illustration at the end of this post). I guess your lack of interest in the data size subject portrays an ignorance of the overall performance subject.

    😎
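
    A small, self-contained illustration of the narrow-index point (not Eirikur's own benchmark): the padding column stands in for a wide warehouse row, and STATISTICS IO shows how many fewer pages a narrow index has to read for the same aggregate.

    -- Wide-ish table: Padding stands in for all the other columns in a real row
    create table dbo.WideSales
    (
        SaleID  int           not null primary key,
        Amount  decimal(10,2) not null,
        Padding char(500)     not null default ''
    );

    insert into dbo.WideSales (SaleID, Amount)
    select top (100000) row_number() over (order by (select null)), 9.99
    from sys.all_columns a cross join sys.all_columns b;

    set statistics io on;

    -- Scanning the clustered index has to read every wide page
    select sum(Amount) from dbo.WideSales;

    -- A narrow index on just the needed column reads a small fraction of the pages
    create nonclustered index IX_WideSales_Amount on dbo.WideSales (Amount);
    select sum(Amount) from dbo.WideSales;

    set statistics io off;
    drop table dbo.WideSales;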

  • Eirikur Eiriksson (8/2/2016)

    You will have to understand how SQL Server works, which workloads are expensive and which factors bear more weight than others; data sizes and I/O do have a very sizeable impact on performance. Narrower data sets, narrower indices etc. improve performance out of proportion to the reduction in data size, especially when working on large analytical sets like DWs. I guess your lack of interest in the data size subject portrays an ignorance of the overall performance subject.

    😎

    Interesting observation about my 'lack of interest in the data size subject', given you know nothing whatsoever about my background. I'll leave it at 'I started working with SQL Server in 1989, and have used it just about every working day since then.' It's precisely my lack of ignorance about the subject that has prompted this question. Applications I develop tend not to have performance issues because I do understand the internal mechanisms of SQL Server, data modelling, indexing and performance tuning. It's the client that is making these requests, and I'm looking for evidence that these things are important. Hence I'm reaching out to people such as yourself who have experience in developing sub-optimal systems to get some metrics that I can take back to the client on why we should NOT restrict ourselves to a small subset of data types.

    Your input is much appreciated.

    pcd

  • I think an example can show one aspect of the problem. Take a look at the code below. I'm not modifying the data, I'm just changing the data type from nvarchar(50) to nvarchar(max). Run the code and check the memory grant that each of the queries got - there is a noticeable difference. I've seen a few cases in the past where queries had to wait for a memory grant because the tables had been created with varchar(max) just to be on the safe side, when in reality the strings were less than 100 bytes. This example shows only the query's memory aspect, but you should also take into consideration that using the wrong data types can cause many more problems.

    use AdventureWorks
    go

    --Check the query's memory grant (from the properties window or the pop-up window)
    select top 500 *
    from Sales.SalesOrderDetail
    order by CarrierTrackingNumber desc
    go

    --Modify the column from nvarchar(50) to nvarchar(max). Notice that I'm not modifying any values
    alter table Sales.SalesOrderDetail alter column CarrierTrackingNumber nvarchar(max)
    go

    --Check the query's memory grant again (from the properties window or the pop-up window)
    select top 500 *
    from Sales.SalesOrderDetail
    order by CarrierTrackingNumber desc
    go

    --Return the column to its original size
    alter table Sales.SalesOrderDetail alter column CarrierTrackingNumber nvarchar(50)
    go
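
    If the client wants numbers rather than screenshots, the grant can also be read from the engine; a sketch, assuming VIEW SERVER STATE permission:

    -- Run from a second session while the SELECT above is executing
    -- (on SQL Server 2016+ the last/max grant also appears in sys.dm_exec_query_stats)
    select session_id, requested_memory_kb, granted_memory_kb, used_memory_kb
    from sys.dm_exec_query_memory_grants;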

    --------------------------------------------------------------
    To know how to ask questions and increase the chances of getting answers:
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • John Mitchell-245523 (8/2/2016)


    Ah yes, famous last words! That should work, until the time that someone tries to fix an issue by doing a manual update that inadvertently violates the business rules. Enforcing data integrity in the data layer is the best way to stop that from happening... and it's free!

    John

    John, you get no argument from me on where integrity should be enforced, but as I mentioned there is no UI in this case - it's all sourced from existing systems. This is just a consumer of other people's bad data and designs.

    I get fed up when I see tables designed with a surrogate 'ID' column declared as a Primary Key, and no unique index on the alternate/natural keys (see the sketch at the end of this post).

    I get fed up when I see tables designed with no referential integrity enforced and then have to spend my time and my clients' money unwinding something that should never have happened in the first place.

    I have lost count of the number of battles I have lost where some (usually permanent) member of the team insists on removing RI because the process runs much quicker - when the process takes 6 hours and we have an 8-hour window.

    I could go on, but I'd better not!
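
    A minimal sketch of the bare minimum being described above - surrogate key, a unique constraint on the natural key, and referential integrity actually enforced; the names are illustrative:

    create table dbo.Customer
    (
        CustomerID   int identity(1,1) not null
            constraint PK_Customer primary key,           -- surrogate key
        CustomerCode varchar(20)       not null
            constraint UQ_Customer_CustomerCode unique    -- natural key still enforced
    );

    create table dbo.SalesOrder
    (
        SalesOrderID int identity(1,1) not null
            constraint PK_SalesOrder primary key,
        CustomerID   int               not null
            constraint FK_SalesOrder_Customer
            references dbo.Customer (CustomerID)          -- RI declared, not assumed
    );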

  • Adi Cohn-120898 (8/2/2016)


    Love this example. I have already convinced the client that varchar(max) is not a good idea and explained LOB allocations etc., so that one's covered.

    pcd

  • Notice that it will use more memory even if you use another length and not just nvarchar(max). The reason is that SQL Server has no statistics about the actual size of the data in variable-size columns, so it makes an assumption about the average size of those columns and then uses that assumption for the memory request. If your client decides to use varchar(8000) rather than varchar(max), he'll still be using much more memory than needed for some queries (see the rough arithmetic below).

    Adi
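
    A rough sketch of the arithmetic behind that point, based on the commonly cited assumption (not stated in the post itself) that the optimizer sizes sort memory as if variable-length columns are half full:

    -- Same data, different declared widths: the grant is driven by the declaration
    declare @rows bigint = 1000000, @actual_bytes int = 20;

    select
        declared_width,
        @rows * (declared_width / 2) / 1048576.0 as assumed_sort_data_mb, -- ~50% guess
        @rows * @actual_bytes        / 1048576.0 as actual_data_mb
    from (values (100), (8000)) as v(declared_width);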

    --------------------------------------------------------------
    To know how to ask questions and increase the chances of getting answers:
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Quick thought.

    Test for performance instead of trying to guess.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
