How can I check file extension efficiently?

  • The data I am processing is normal in reality but a bit messy in SQL world.

    I have a table which contains a field to be parsed out possible filename and then its extension, sounds easy, right? no! here is some of the records for the field:

    e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof

    c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg

    c:\users\asdf\music\the cranberrieso need to argue\02 i can't be with you.mp3

    e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html

    Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party; prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping

    01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70; 02/05/72; 02/05/75; 02/06/64; 02/06/66; 02/06/72; 02/07/68; 02/07/75; 02/09/66; 02/09/68; 03/03/75; 03/03/79; 03/04/68; 03/04/74; 03/05/82; 03/07/54; 03/07/75; 03/08/73; 03/08/86; 03/10/65; 03/11/57; 03/12/79; 03/14/42; 04/02/65; 04/03/85; 04/04/60; 04/04/64; 04/04/69; 04/05/38; 04/05/46; 04/05/48; 04/06/60; 04/07/65; 04/10/53; 04/10/72; 04/11/63; 04/11/78; 04/12/64; 04/12/71; 04/18/81; 04/28/37; 05.03.1957; 05/01/66; 05/01/74; 05/02/81; 05/04/1955; 05/04/55; 05/04/75; 05/05/30; 05/06/50; 05/06/78; 05/08/64; 05/08/75; 05/10/66; 05/12/65; 05/15/68; 06/06/63; 06/06/64; 06/07/77; 06/10/66; 06/12/1923; 07/01/36; 07/04/86; 07/06/54; 07/07/56; 07/07/66; 07/09/67; 07/12/62; 07/17/70; 08/02/38; 08/02/69; 08/08/80; 08/10/74; 08/11/65; 09 AUG 1966; 09/02/01; 09/02/51; 09/02/64; 09/02/65; 09/03/86; 09/05/78; 09/05/86; 09/09/85; 09/12/84; 09/19/88; 1 Apr. 1978; 1 Aug. 2007;

    Please note each line above indicates a single row for that field.

    You can see not all the rows are filenames, my task is:

    1. filtered out only records contain filename

    2. for those records, further find out what fall in a black list that most likely will be updated from time to time, currently the list contains 3/4 letters extensions:

    *.wmv

    *.MPG*

    *.mkv

    *.mpeg

    *.ogg

    *.m4v

    *.3gp

    *.AVI*

    *.jpeg

    *.tiff

    *.asf

    *.wav

    *.ogv

    *.tif

    *.flv

    *.MP3

    *.jpg

    *.wma

    *.oga

    The table is generated hourly, weekly analysis is required and for a week it contains roughly 2-3 millions records.

    I'm seeking your enlightening to see what would be the best approach to this, any suggestion is appreciated

  • what I come out is something like the following script and I am really not satisfied with it, let alone it processes 3 letter extension only and I need to also cover 4 letter extension.

    select left(reverse(left(reverse(breach_content),charindex('.',reverse(breach_content))-1)), 3), *

    from reporting_dailydlpdetail_20150730

    where charindex('.',reverse(breach_content))-1 <> -1

    and

    left(reverse(left(reverse(breach_content),charindex('.',reverse(breach_content))-1)), 3)

    in

    ('bmp',

    'aac',

    'mov',

    'swf',

    'wmv',

    'MPG*',

    'mkv',

    'ogg',

    'm4v',

    '3gp',

    'AVI*',

    'asf',

    'wav',

    'ogv',

    'tif',

    'flv',

    'MP3',

    'jpg',

    'wma',

    'oga')

  • Maybe this will give you some ideas/alternatives:

    SELECT

    breach_content,

    RIGHT(breach_content, CASE WHEN SUBSTRING(breach_content, LEN(breach_content) - 3, 1) = '.' THEN 3 ELSE 4 END) AS extension

    FROM (

    SELECT 'e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof' AS breach_content UNION ALL

    SELECT 'c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg' UNION ALL

    SELECT 'c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3' UNION ALL

    SELECT 'e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html' UNION ALL

    SELECT 'Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;' UNION ALL

    SELECT 'prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping' UNION ALL

    SELECT '01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;'

    ) AS test_data

    WHERE breach_content LIKE '%[a-z]:%\%.%' AND

    '.' IN (SUBSTRING(breach_content, LEN(breach_content) - 3, 1), SUBSTRING(breach_content, LEN(breach_content) - 4, 1))

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (7/31/2015)


    Maybe this will give you some ideas/alternatives:

    SELECT

    breach_content,

    RIGHT(breach_content, CASE WHEN SUBSTRING(breach_content, LEN(breach_content) - 3, 1) = '.' THEN 3 ELSE 4 END) AS extension

    FROM (

    SELECT 'e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof' AS breach_content UNION ALL

    SELECT 'c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg' UNION ALL

    SELECT 'c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3' UNION ALL

    SELECT 'e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html' UNION ALL

    SELECT 'Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;' UNION ALL

    SELECT 'prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping' UNION ALL

    SELECT '01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;'

    ) AS test_data

    WHERE breach_content LIKE '%[a-z]:%\%.%' AND

    '.' IN (SUBSTRING(breach_content, LEN(breach_content) - 3, 1), SUBSTRING(breach_content, LEN(breach_content) - 4, 1))

    Much more neat than mine, thank you very much Scott

  • Can you explain '%[a-z]:%\%.%'? My first time to use it, thanks. Basically the sample data is sample only, the real data could be randomly and I have no idea what that could be, but still need to filter to exclude those non-file

  • halifaxdal (7/31/2015)


    Can you explain '%[a-z]:%\%.%'? My first time to use it, thanks. Basically the sample data is sample only, the real data could be randomly and I have no idea what that could be, but still need to filter to exclude those non-file

    Sure.

    '%[a-z]:%\%.%'

    % = any character(s), or none at all. So LIKE '%' would match anything.

    [a-z] = a character from a to z, that is, any letter

    : = exactly a colon

    % = again, any character(s), or none at all

    \ = exactly a backslash

    % = "

    . = exactly a dot

    % = "

    So, to put it all together, to be included in the result, the string must have:

    a letter &

    followed immediately by : &

    followed by \ after any number of chars &

    followed by . after any number of chars

    Thus, these strings would match:

    c:\emp\abc.1

    d:\abc.template

    user:\file1.ext --tricky, but r: is a letter directly followed by a colon, with ANY char(s) before it

    If you want to prevent this last match, you'll need two separate conditions:

    LIKE '[a-z]:%\%.%' --meaning the string must start with a letter and a colon

    OR LIKE '%[^a-z][a-z]:%\%.%' --EDITed for clarity:meaning anywhere in the string but the char immediately before "letter:" must NOT be a-z

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (7/31/2015)


    halifaxdal (7/31/2015)


    Can you explain '%[a-z]:%\%.%'? My first time to use it, thanks. Basically the sample data is sample only, the real data could be randomly and I have no idea what that could be, but still need to filter to exclude those non-file

    Sure.

    '%[a-z]:%\%.%'

    % = any character(s), or none at all. So LIKE '%' would match anything.

    [a-z] = a character from a to z, that is, any letter

    : = exactly a colon

    % = again, any character(s), or none at all

    \ = exactly a backslash

    % = "

    . = exactly a dot

    % = "

    So, to put it all together, to be included in the result, the string must have:

    a letter &

    followed immediately by : &

    followed by \ after any number of chars &

    followed by . after any number of chars

    Thus, these strings would match:

    c:\emp\abc.1

    d:\abc.template

    user:\file1.ext --tricky, but r: is a letter directly followed by a colon, with ANY char(s) before it

    If you want to prevent this last match, you'll need two separate conditions:

    LIKE '[a-z]:%\%.%' --meaning the string must start with a letter and a colon,

    OR LIKE '%[^a-z][a-z]:%\%.%' --meaning anywhere in the char immediately before the letter: must NOT be a-z

    Thanks Scott, I understand now.

  • halifaxdal (7/31/2015)


    Thanks Scott, I understand now.

    Understand even more...

    https://msdn.microsoft.com/en-us/library/ms179859.aspx

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/31/2015)


    halifaxdal (7/31/2015)


    Thanks Scott, I understand now.

    Understand even more...

    https://msdn.microsoft.com/en-us/library/ms179859.aspx

    Thanks

  • Here is a solution that will handle any length of extension. I added a *.ssmsproj entry to test a longer extension.

    SELECT *, LEFT(RIGHT(td.breach_content, r.filename_index-1), r.filename_index - r.extension_index - 1), RIGHT(td.breach_content, r.extension_index-1)

    FROM (

    VALUES('e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof')

    , ('c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg')

    , ('c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3')

    , ('e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html')

    , ('Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;')

    , ('prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping')

    , ('01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;')

    , ('f:\SSMS\test_project.ssmsproj')

    ) AS td(breach_content)

    CROSS APPLY (VALUES(CHARINDEX('.', REVERSE(breach_content)), CHARINDEX('\', REVERSE(breach_content)))) AS r(extension_index, filename_index)

    WHERE r.extension_index > 0

    ANDr.filename_index > r.extension_index

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Thanks Drew, your script failed as it also picks up records like:

    DyKE; GaY; K!{%^r; K'r; K([[}-r; K>;[|?[r; K[)r; K[-[#,[r; K[:R; K[;$[r; K[R; K[r; K^r; K{#r; K{&[R; K{&[r; K{:}>r; K{;r; K{>r; K|r; K}";R; K}$R; K}$r; K}%){"r; K}([R; K}*[r; K}<[R; K}>r; PEE; S%M; S&[m; S&}$M; S=M; S["['[(<M; S[#=m; S[%M; S[(m; S[){-@},[m; S[-{'m; S[.M; S[<[m; S[M; S[[m; S[m; S[{[m; S[|[M; S`[m; S{$[m; S{%[M; S{%m; S{;[M; S{@M; S}#[m; S}&&{'[M; S}([M; S}.m; k#[r; k)!R; k*[`[r; k-R; k.[r; k/R; k/[r; k:[r; k;[.[&r; k[,`[\})[R; k[:r; k[=R; k[>[R; k[@r; k[R; k[^{,[![r; k[r; k_r; k{*;r; k{,[R; k{-[;[)R; k{=[r; k{?R; k} [R; k}"[r; k}#r; k}$R; k}$[r; k}${.[r; k}<[\[R; k}?}(r; k}@/[r; k}@[R; pEE; s"[,{>M; s#m; s%m; s)[:{),[?m; s)[m; s-{?[?-{;M; s;.[m; s;M; s>m; s@[*[M; s[!}'m; s["(}*M; s[.M; s[<=[m; s[=[M; s[>M; s[M; s[m; s[}[m; s[~m; s\[m; s^m; s`M; s`[m; s{#[m; s{$[M; s{)}([m; s{.[m; s{>{*[M; s}'m; s}'} [<[&}?[M; s}([m; s})&[M; s}/}$M; s}>[m; s}@M; s}@[M

    PEE; S&[M; S-m; S<[m; SM; S[&[m; S[([m; S[({'M; S[)M; S[=m; S[M; S[m; S[~[:[?M; S[~{(M; S[~|[m; S_{'m; Sm; S{$[m; S{'[m; S{,[M; S{.@}-[m; S{:!{-M; S{<m; S}"[M; S}-[m; S}>M; S}>[=[\[ m; S}>[M; nUe; nuE; pEE; s#[m; s#},M; s#};.;["#m; s$[m; s$m; s&[m; s'[,[m; s(M; s,[M; s,[m; s;[M; s;[{m; s@}<}![M; sM; s[ [m; s[![m; s[#[M; s[.m; s[/[m; s[<[M; s[?M; s[@}>,})[=m; s[M; s[m; s[~[m; s]m; s`}%M; sm; s{*[=m; s{,}.m; s{=["[m; s{@`[![m; s{@{?[M; s}*[.m

    9mm; GAy; Gun; K [R; K!|`>[r; K#[r; K%[R; K%[r; K([,)}:{@R; K<R; K[!R; K['R; K[*r; K[R; K[][_[R; K[_}@{@R; K[`< [r; K[`[<[R; K[r; K[{{$-[r; K{"R; K{*[R; K{*r; K{,r; K{:[)][r; K{=R; K{>R; K}%[R; K},R; K}={-[r; NuE; PEE; S!M; S"[m; S$m; S%<[m; S%[m; S%}?m; S(M; S);M; S)[M; S=}%M; S=}:{"[M; S?@M; SM; S[%[M; S[*,[m; S[=[(M; S[?['m; S[M; S[^[m; S[m; Sm; S{"[:[m; S{#@}-M; S{&[,=[m; S{,[m; S{<M; S} }>m; S}$[|[-m; S}([M; S}--[m; S};M; S};[m; S}=[M; S}>M; gaY; k [r; k![r; k%[r; k'[)[R; k)[_{&[%)R; k*[r; k=r; k?R; k@\R; k[%r; k[.[R; k[;{@} ,)}'}=-[~}!R; k[=[:r; k[@[*R; k[R; k[\[r; k[r; k\[r; k{%[R; k{(R; k{([R; k{)[r; k{;}#}#R; k{>[~&{#r; k{>r; k{@$} [r; k}![R; k}![r; k},R; k}.R; k}='{)/[R; k}@R; k}@}, [R; nUE; pEE; peE; s =M; s!m; sM; s[$)[m; s['[<m; s[*[m; s[/[M; s[:[{[|[M; s[?{&m; s[M; s[m; sm; s{#[M; s{&M; s} M; s}*m; s}-[M; s}-[m; s}=[M

  • halifaxdal (7/31/2015)


    The data I am processing is normal in reality but a bit messy in SQL world.

    I have a table which contains a field to be parsed out possible filename and then its extension, sounds easy, right? no! here is some of the records for the field:

    ...

    ...

    The table is generated hourly, weekly analysis is required and for a week it contains roughly 2-3 millions records.

    I'm seeking your enlightening to see what would be the best approach to this, any suggestion is appreciated

    The usage of LIKE comparison operator has already been suggested above. What I'm demonstrating below is an alternate approach of how to join your table with a set of blacklisted words and phrases.

    First, I'll setup a table based on my assumption of how it basically works.

    create table BreachLog

    (

    breach_content varchar(8000) not null

    );

    insert into BreachLog values

    ('e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof'),

    ('c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg'),

    ('c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3'),

    ('e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html'),

    ('Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;'),

    ('prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping'),

    ('01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;');

    Next, I'll load a set of black listed words into a reference table. These 'words' are actually LIKE pattern matching expressions. You can tweak the members of this table however you need.

    create table BlackList

    (

    black_word varchar(80) not null primary key

    );

    insert into BlackList values

    ('%.wmv'),('%.MPG%'),('%.mkv'),('%.mpeg'),('%.ogg'),('%.m4v'),

    ('%.3gp'),('%.AVI%'),('%.jpeg'),('%.tiff'),('%.asf'),('%.wav'),

    ('%.ogv'),('%.tif'),('%.flv'),('%.MP3'),('%.jpg'),('%.wma'),('%.oga');

    Now you can simply inner join BreachLog with BlackList using the LIKE operator, and the members of the BlackList table can be maintained going forward using an application.

    SELECT breach_content, black_word

    FROM BreachLog

    JOIN BlackList on BreachLog.breach_content LIKE BlackList.black_word;

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • halifaxdal (8/4/2015)


    Thanks Drew, your script failed as it also picks up records like:

    I disagree. My script succeeded marvelously. It found records that you never would have found with the other approach, which is exactly what I claimed it would do. You can't expect code you get from the Internet to work perfectly without adapting it to your own specific situation.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply