Random Word Generation for Data Scrubbing

  • SQLBlimp

    SSCertifiable

    Points: 6322

    Comments posted to this topic are about the item Random Word Generation for Data Scrubbing

  • Eirikur Eiriksson

    SSC Guru

    Points: 182359

    Thanks for this article John, interesting approach.
    😎

    One thought though, you might want to convert the scalar UDF's to inline functions, big difference in performance when handling large sets. I posted a test harness for similar functionality few years back and the results showed almost 10 times difference.

  • Ulysses_Infinity

    SSCrazy

    Points: 2586

    If considering doing something similar in production you might want to ensure your 'words' source doesn't include things like the F and C bombs 🙂

  • Jonathan AC Roberts

    SSCoach

    Points: 16923

    redgate have a tool for generating test data that looks quite good. I don't know how much it costs but it would save a lot of effort.
    https://www.red-gate.com/hub/product-learning/sql-data-generator/how-to-generate-various-forms-of-realistic-data-for-testing-development-and-prototypes

  • William Soranno

    SSCommitted

    Points: 1578

    Nice article. I have been working on a similar concept for our environments.
    I did notice a disparity between you sql for the table and your paragraph describing it.
    In the sql, you have the PK as the column TheWord. In the paragraph, it says the integer identity is the PK.

    Bill Soranno
    MCP, MCTS, MCITP DBA
    Database Administrator
    Winona State University
    Maxwell 143

    "Quality, like Success, is a Journey, not a Destination" - William Soranno '92

  • Jeff Moden

    SSC Guru

    Points: 994844

    Great job, John.  Can you share the link for the dictionary download that you used?  I ask because most word lists seem to be rather unabridged when it comes to highly offensive words and know that you're rather diligent and might have found one that doesn't contain such words.

    I also agree on the idea of using Inline Table Value Functions (iTVF) as an Inline Scalar Function (iSF) rather than using regular Scalar Functions for performance especially if you have large tables to anonymize.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems
    Create a Tally Function (fnTally)

  • SQLBlimp

    SSCertifiable

    Points: 6322

    Hi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www-01.sil.org/linguistics/wordlists/english/

    Thanks
    John.

    :Wow: :blush: (Red faced)

  • Eirikur Eiriksson

    SSC Guru

    Points: 182359

    SQLBlimp - Tuesday, May 1, 2018 7:42 AM

    Hi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)

    The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
    😎

  • SQLBlimp

    SSCertifiable

    Points: 6322

    Eirikur Eiriksson - Tuesday, May 1, 2018 7:52 AM

    SQLBlimp - Tuesday, May 1, 2018 7:42 AM

    Hi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)

    The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
    😎

    Hi - I noticed and corrected.  Still freaked out by the nasty list.

    Thanks
    John  :blush:

  • SQLBlimp

    SSCertifiable

    Points: 6322

    Eirikur Eiriksson - Tuesday, May 1, 2018 1:30 AM

    Thanks for this article John, interesting approach.
    😎

    One thought though, you might want to convert the scalar UDF's to inline functions, big difference in performance when handling large sets. I posted a test harness for similar functionality few years back and the results showed almost 10 times difference.

    Thanks!  I will look into that today, and perhaps post revised functionality into this thread.

    Thanks
    John.

  • Mike DiRenzo

    SSCrazy

    Points: 2013

    What a great article - hands down.  Thank you.  I too am working on something similar.  I was using word-counts and Loreum Ipsum replacements.  I think this approach is far better. I need to ensure that the dictionary is sanitized.  I did get sort of lost in all the replies surrounding this and at the moment I am still not sure where to get a sanitized list.

  • Scott-144766

    SSCarpal Tunnel

    Points: 4218

    The "redacted" zip file requires further work. For example, words beginning with "fellat" or "whor", not to mention the N-bomb.

    --
    Scott

  • Jeff Moden

    SSC Guru

    Points: 994844

    SQLBlimp - Tuesday, May 1, 2018 7:54 AM

    Eirikur Eiriksson - Tuesday, May 1, 2018 7:52 AM

    SQLBlimp - Tuesday, May 1, 2018 7:42 AM

    Hi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)

    The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
    😎

    Hi - I noticed and corrected.  Still freaked out by the nasty list.

    Thanks
    John  :blush:

    It's the same list that Princeton uses so don't feel bad.  What would be really cool is if we all got together and submitted lists of words that need to be removed so that it would help everyone.  Of course, we wouldn't just past them on the forum.  Perhaps attaching readily consumable text files with the label of NSFW.txt would help.  They could be either the words or words using % wildcards to make it easy do delete from an original list.

    My first word would be "XML" :D:D:D

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson

    SSC Guru

    Points: 182359

    Jeff Moden - Tuesday, May 1, 2018 9:48 AM

    SQLBlimp - Tuesday, May 1, 2018 7:54 AM

    Eirikur Eiriksson - Tuesday, May 1, 2018 7:52 AM

    SQLBlimp - Tuesday, May 1, 2018 7:42 AM

    Hi. I screwed up! Emailing Steve Jones to replace the zip file with the sanitized version. The original list was from http://www01.sil.org/linguistics/wordlists/englishThanksJohn. (Red faced)

    The link http://www01.sil.org/linguistics/wordlists/english does not work but http://www.sil.org/linguistics/wordlists/english does, redirects to http://www-01.sil.org/linguistics/wordlists/english
    😎

    Hi - I noticed and corrected.  Still freaked out by the nasty list.

    Thanks
    John  :blush:

    It's the same list that Princeton uses so don't feel bad.  What would be really cool is if we all got together and submitted lists of words that need to be removed so that it would help everyone.  Of course, we wouldn't just past them on the forum.  Perhaps attaching readily consumable text files with the label of NSFW.txt would help.  They could be either the words or words using % wildcards to make it easy do delete from an original list.

    My first word would be "XML" :D:D:D

    ...followed by "JSON" 😀
    😎

Viewing 14 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply