In the context of my developer machine, I had log files I wanted to parse through. I setup a log library to output the results on a test server to a sql table instead of text files. However, this meant that my "log viewers" that handled regex parsing weren't in the picture at this point. I wanted to parse out some columns from a section of message text, and thought about CLR as a possible tool to help this.
Ideally, I wanted to feed the results for analysis easily into power bi, and avoid the need to create code to import and parse out fields. Since I knew the regex values I wanted, I thought this would be a good chance to try out some CLR functionality for the first time with SQL Server 2016 + CLR Regex parsing.
I ran across SQL# and installed. The install was very simple, just downloaded a SQL script and ran it, adding a final "reconfigure" statement to ensure everything was good to go.
SQLSharp (SQL#)
I used the free version which provided great regex parsing functionality.
Simple to use
Constructing the following query parsed the results easily, with no extract coding/import process required.
declare @StuckGum as table ( | |
message varchar(1000) not null | |
); | |
insert into @StuckGum | |
( message ) | |
-- Data generated by http://bit.ly/29GsjgG thanks for such a great resource for cool randomized data | |
values | |
( 'A media personality named Hou Pu-Sho needs a 4:37: team to abduct and extract Arsem Tchinova, the head of newLink Unlimited. Further, security is already on alert because of a previous botched run.' ) | |
, ( 'A charming corporate agent named Dara Yenkotov needs a 2:41: team to abduct and extract Take Kuri, the head of Paronki-Puchkora Multinational. Further, security is already on alert because of a previous botched run.' ) | |
, ( 'A guarded media personality named Tobias McRaven needs a 8:37: team to abduct and extract Nouchi Sumi, the head of priSoft Consolidated. However, he cannot provide any usual form of compensation.' ) | |
, ( 'A data broker named Ientik Sova needs a team 8:38: to assassinate a rival named Nake Tomi. However, the team finds itself double-crossed.' ) | |
, ( 'A hawkish data broker named Denisse Bechtel needs a 9:33: team to break into the network of OI Multinational. However, the team finds itself double-crossed.' ) | |
, ( 'A guarded data broker named Kada Ekomi needs a 2:183: team to abduct and extract Lincoln Watrous, a media officer of Dee-Kui Amalgamated. In addition, the team is randomly targeted by a mischievous hacker.' ) | |
, ( 'A government official named Ghislaine Tronstad needs a team 5:163: to break into the network of IBH Amalgamated and assassinate an advanced artificial intelligence. However, the job is a trap.' ) | |
, ( 'A foreign government agent named Merrill Arleth needs a 8:92: team to deliver a black box to Zhieu Corporation. Further, the team has only thirty minutes to complete the job.' ) | |
, ( 'A cryptic hacker named Necole Tower needs a team 6:46: to break into the network of Veillon-Naki Multinational and steal military data. Further, the security surrounding the target is unusually high.' ) | |
, ( 'A corporate agent named Lena Mova needs a team 2:181: to assassinate a rival named McKinley Vangelos. However, the job is a trap.' ); | |
-- Text I'm trying to match is the #:###: | |
-- Like statement doesn't work due to number being 1-2 at beginning followed by colon with 1-3 numbers followed by colon. | |
-- This requires the ability to match a varying length of numbers, which like can't do. | |
select | |
ParsedNumber = SQL#.RegEx_CaptureGroup(sg.message, '.*(\d{1,2}:\d{1,3}):.*', 1, null, 1, -1, '') | |
,sg.message | |
from | |
@StuckGum as sg; |
Performance
This was just an isolated 1000 record test, so nothing exhaustive. I compared it to a table function that parsing strings (could probably be optimized more). For the purpose of running a simple log parsing search on 1000 rows it did pretty good!
For better work on parsing of strings, there are detailed postings out there by Aaron Bertrand, Jeff Moden, and others. My scope was specifically focused on the benefit for a dba/developer doing adhoc-type work with Regex parsing, not splitting delimited strings. The focus of most of the articles I found was more on parsing delimited string. However, I'm linking to them so if you are researching, you can be pointed towards so much more in-depth research on a related topic.
Thoughts
The scope of my review is not covering the proper security setup for CLR with production, CLR performance at high scale, or anything that detailed. This was primarily focused on a first look at it. As much as I love creative SQL solutions, there are certain things that fit better in code, not SQL. (heresy?) I believe Regex/advanced string parsing can often be better handled in the application, powershell, or other code with access to regex libraries.
In the case of string parsing for complex patterns that are difficult to match with LIKE pattern matching, this might be a good resource to help someone write a few SQL statements to parse out some log files, adhoc ETL text manipulation, or other text querying on their machine without having to add additional work on importing and setup.