• halifaxdal (7/31/2015)


    The data I am processing is normal in reality but a bit messy in SQL world.

    I have a table which contains a field to be parsed out possible filename and then its extension, sounds easy, right? no! here is some of the records for the field:

    ...

    ...

    The table is generated hourly, weekly analysis is required and for a week it contains roughly 2-3 millions records.

    I'm seeking your enlightening to see what would be the best approach to this, any suggestion is appreciated

    The usage of LIKE comparison operator has already been suggested above. What I'm demonstrating below is an alternate approach of how to join your table with a set of blacklisted words and phrases.

    First, I'll setup a table based on my assumption of how it basically works.

    create table BreachLog

    (

    breach_content varchar(8000) not null

    );

    insert into BreachLog values

    ('e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof'),

    ('c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg'),

    ('c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3'),

    ('e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html'),

    ('Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;'),

    ('prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping'),

    ('01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;');

    Next, I'll load a set of black listed words into a reference table. These 'words' are actually LIKE pattern matching expressions. You can tweak the members of this table however you need.

    create table BlackList

    (

    black_word varchar(80) not null primary key

    );

    insert into BlackList values

    ('%.wmv'),('%.MPG%'),('%.mkv'),('%.mpeg'),('%.ogg'),('%.m4v'),

    ('%.3gp'),('%.AVI%'),('%.jpeg'),('%.tiff'),('%.asf'),('%.wav'),

    ('%.ogv'),('%.tif'),('%.flv'),('%.MP3'),('%.jpg'),('%.wma'),('%.oga');

    Now you can simply inner join BreachLog with BlackList using the LIKE operator, and the members of the BlackList table can be maintained going forward using an application.

    SELECT breach_content, black_word

    FROM BreachLog

    JOIN BlackList on BreachLog.breach_content LIKE BlackList.black_word;

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho