Sort Alphanumeric Values in SQL Server

By: Jeffrey Yao | Comments (5) | Related: > TSQL

Problem

In my work I sometimes need to import csv files and one column of the csv file is an alphanumeric string, with values like the following:

Code
Abc3.1xy2m
Abc10.12xy1a
Abc1.2xy31b
Abc1.10xy31c
Abc1.0xy1b
Abc10.2xy2a
Abc3.1xy11m

Once the data is loaded into a table, the values will be presented in a report, where the column [Code] needs to be sorted as follows. Each numeric part of the file has significance, so they data needs to sort as follows:

Code
Abc1.0xy1b
Abc1.2xy31b
Abc1.10xy31c
Abc3.1xy2m
Abc3.1xy11m
Abc10.2xy2a
Abc10.12xy1a

Using an ORDER BY in SQL Server will give the following results which is not the same as above:

Code
Abc1.0xy1b
Abc1.10xy31c
Abc1.2xy31b
Abc10.12xy1a
Abc10.2xy2a
Abc3.1xy11m
Abc3.1xy2m

How can this be done with T-SQL?

Solution

When sorting alphanumeric strings, we will extract all numbers and have two types of strings:

Strings composed of non-digits, let's call this S-string, we may have multiple S-strings.
Strings composed of digits 0 to 9, i.e. such string can be converted to numbers, and let's call this N-string, and we may have multiple N-strings.

Sorting Strings in SQL Server with T-SQL

For example, "Abc1.2xy31b" and "Abc1.10xy31c" can be decomposed to the following strings (S1 means 1^st S-string and N3 means the 3^rd N-string)

Original	S1	N1	S2	N2	S3	N3	S4
Abc1.2xy31b	Abc	1	.	2	xy	31	b
Abc1.10xy31c	Abc	1	.	10	xy	31	c

When we compare these two strings, in the N2 column, we know that as a number, 2 is smaller than 10, so for these two strings, "Abc1.2xy31b" should be smaller than "Abc1.10xy31c". But if the strings are compared directly, because string "2" is bigger than string "10", "Abc1.2xy31b" should be bigger than "Abc1.10xy31c".

Let's see this in T-SQL.

use tempdb
go

drop table if exists #t; -- applicable to sql server 2016+gogo

create table #t (a varchar(100));
go

insert into #t (a) values ('Abc1.2xy31b'), ('Abc1.10xy31c')
go 
select * from #t order by a asc

If we run the query, we will get the following, i.e. "Abc1.2xy31b" is at the bottom of the returned result set via the "order by a asc", meaning "Abc1.2xy31b" is considered "bigger" in value compared to "Abc1.10xy31c".

This is NOT what we want, we want the following result set instead when we use "order by a asc", because the second set of numeric values are "2" and "10" and "2" should come before "10".

Sorting Rules Defined

Our sorting rule will be such that when in comparison, all alphanumeric strings will be decomposed into S strings and N strings in their original sequence, and if S strings are the same, then the N strings need to be compared and we will compare the numeric values of N strings.

Now let's look at a simple comparison of the following two strings:

10-abc
2-abc

According to our rule, if we sort the two string in ascending order, we should have this, but this is not possible in T-SQL in its natural way.

2-abc
10-abc

On the other hand, if we change the string as follows, i.e. pad 0 to the left of the numeric string to make the numeric strings be the same length:

0010-abc
000002-abc

If the strings are like this, then when we use the order by clause in T-SQL to get the desired results:

0002-abc
0010-abc

The demo T-SQL code is as follows:

-- this will return result that we DO NOT want 
select a from (values ('10.abc'), ('2.abc')) T(a) order by a asc; 
  
-- but the following will return the result EXPECTED 
select a from (values ('0010.abc'), ('0002.abc')) T(a) order by a asc;

comparison result btw padding and non-padding

SQL Server T-SQL User Defined Function to Order Numbers within a String

So the algorithm to make "10.abc" and "2.abc" strings to be in the order expected via T-SQL can be designed as follows.

For each N string inside the source string, we will pad 0 to the left of the N string to make the N string length to be bigger than the largest N string length inside the source string.

For example, if we have an alphanumeric string like the following:

Abc_1.2.34_xy56789-mn

The longest N string of this source string is "56789", if we pad 0 to each of the N string to make each N string to have a length of 6, we will get the following new source string:

Abc_000001.000002.000034_xy056789-mn

This is like standardizing all strings for comparison, and once all strings are "standardized", we can sort the strings to get the expected result.

So the key here is to come up with a solution to standardize all source strings. I created a user defined function (UDF), it will accept three parameters:

1^st is the source string,
2^nd is length of the numeric string after padding
3^rd parameter is the letter to be padded. This can be omitted as we can use '0' as the default, but I left it there so we can have more flexibility, for example, we can use '9' as default, and this will cause the sorting in descending order.

Here is the UDF, it will search through the source string @src, and pad specified letter @letter to the left of N string to make the final N string to be of length = @plen.

use tempdb -- change to your own db or master
go

drop function if exists dbo.udf_ExpandDigits;
go

-- function: standardize alphanumeric string by left padding 0 to the numeric string inside @src
-- we assume source string to be less than 1K in length (you can change it if needed)
-- example
-- select dbo.udf_ExpandDigits('a1-bc23-def456-ghij', 5, '0') will return
-- a00001-bc00023-def00456-ghij
 
create function dbo.udf_ExpandDigits(@src varchar(1024), @plen int, @letter char(1))
returns varchar(max)
as 
begin
   if @plen >= 100
      return @src;
   declare @p int, @p2 int, @num varchar(100);
   declare @ret_val varchar(max)='';
   if (PATINDEX('%[0-9]%', @src) =0 )
      set @ret_val = @src;
   else
   begin
      set @p = patindex('%[0-9]%', @src);
 
 
      while(@p > 0)
      begin
         set @p2=patindex('%[^0-9]%', substring(@src, @p, 1000)) 
         if (@p2 > 0)
         begin
            set @num=substring(@src, @p, @p2-1);
            set @ret_val += left(@src, @p-1) + case when len(@num) < @plen then right(replicate(@letter, @plen) + @num, @plen) else @num end; ;
            set @src = substring(@src, @p+@p2-1, len(@src));
            set @p = patindex('%[0-9]%', @src);
         end
         else
         begin
            set @num = substring(@src, @p, len(@src));
            set @ret_val += left(@src, @p-1)+ case when len(@num) < @plen then right(replicate(@letter, @plen) + @num, @plen) else @num end;
            set @src ='';
            break;
         end
 
      end -- while (@p > 0)
      if len(@src) > 0
         set @ret_val += @src;
   end -- else
   return @ret_val;
end -- function

go 
-- example:
-- select dbo.udf_ExpandDigits('a1-bc23-def456-ghij', 5, '0')

The key details of the algorithm is described as below:

Search the position of the first numeric letter and if not found, just return the original @src string
If a numeric value (i.e. 0 – 9) is found, mark its position as @p
Starting from @p, find the next non-numeric letter, if found, mark its position as @p2 and then retrieve the number between @p and @p2-1, pad the letter @letter to the left of this extracted number so the whole number's length is @plen. Reset @src string to start from the end of the number and then repeat step 3
If not found, i.e. from @p to the end of the @src string, it is a whole number, then extract this number, pad the letter @letter to the left of this extracted number so the whole number's length is @plen, and set @src to empty string and exit the loop of step 3.

Test the User Defined Function to Order Numbers within a String

We will use the data below and do the sorting test. Here is the code to create the sample data:

use tempdb -- script run in a SQL Server 2016
go

drop table if exists #t;
go
 
create table #t (id int identity, code varchar(50));
go

-- populate the sample data
insert into #t (code) values
  ('Abc3.1xy2m')
, ('Abc10.12xy1a')
, ('Abc1.2xy31b')
, ('Abc1.10xy31c')
, ('Abc1.0xy1b')
, ('Abc10.2xy2a')
, ('Abc3.1xy11m');
go

If we do a query with order by [code] asc:

select id, [code] 
from #t 
order by [code] asc;

We get the following result, which is not what we want:

However if we do the following (computed column [Formatted_Code] is added for viewing converted data, but can be omitted):

select id, [Code], Formatted_Code=dbo.udf_ExpandDigits([code], 3, '0') 
from #t 
order by Formatted_Code asc

We can see the column [Code] is ordered the way we want.

We can modify the query to the following (i.e. getting rid of the [Formatted_Code] column):

select id, [Code] 
from #t 
order by dbo.udf_ExpandDigits([code], 3, '0')  asc;

and we get the following which is the result we want:

Summary

In this tip, we discussed how to order numeric values within alphanumeric strings using T-SQL, this technique is obviously not very efficient because the function needs to process the string value of each row in an internal loop, so when a table has millions of rows, this method will be very time-consuming.

But this alphanumeric string ordering is critical in some DBA daily operations, such as when we have multiple scripts stored in multiple folders, and the script names or folder names contain alphanumeric strings to indicate the execution sequence of each script. So ordering these scripts by full names (i.e. path + file name) for deployment is the first step for a successful deployment.

Next Steps

To improve the function mentioned in the tip, it is better to use a Regular Expression to find and extract numbers in a string. However, SQL Server T-SQL currently does not support Regular Expression, so we may implement a CLR function to utilize the powerful RegEx capability in T-SQL and this will surely improve the performance of our current T-SQL version function.

Please read the following related articles and share your own ideas about alphanumeric string sorting in T-SQL.

About the author

Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Thursday, August 25, 2022 - 11:09:09 AM - Steve	Back To Top (90415)
Thanks for the solution, when you add with schemabinding to the udf this can be used as persisted computed column so you have no performance impact during queries

Thursday, September 12, 2019 - 11:41:39 AM - jeff_yao	Back To Top (82385)
Jimbo99, thanks for your comments. In real world, this kind of alphanumeric strings are very common. Sometimes, as the receiver at the end of the consuming chain, it is impossible to ask such alphanumeric string owners to change anything. The only thing we can do is to solve the issue within our power. I just hope t-sql someday can be as powerful as Oracle PL/SQL which contains RegEx support.