Brush Up on Your ETL Skills

  • Comments posted to this topic are about the item Brush Up on Your ETL Skills

  • IP address, cookies, gender, zip codes, even religious views to name a few.

    An estimated 87% of American adults could be accurately and uniquely identified using just three data points (date of birth, gender, and five-digit zip code) combined with publicly available census data. It is a sobering statistic that highlights why robust pseudonymization measures are needed, particularly in light of large-scale data breaches such as the Equifax security incident.
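
    To make the re-identification point concrete, here is a minimal sketch that measures how many rows in a dataset are unique on those three quasi-identifiers. The column names and sample rows are made up for illustration; this is not the method behind the 87% figure.

```python
from collections import Counter

# Assumed column names, chosen for illustration only.
QUASI_IDENTIFIERS = ("date_of_birth", "gender", "zip_code")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Return the fraction of rows whose quasi-identifier combination is unique."""
    if not rows:
        return 0.0
    combos = [tuple(row[k] for k in keys) for row in rows]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(rows)


# Hypothetical sample data: the first two people share all three values.
people = [
    {"date_of_birth": "1980-01-15", "gender": "F", "zip_code": "02139"},
    {"date_of_birth": "1980-01-15", "gender": "F", "zip_code": "02139"},
    {"date_of_birth": "1975-06-02", "gender": "M", "zip_code": "02139"},
    {"date_of_birth": "1990-11-30", "gender": "F", "zip_code": "94105"},
]
print(unique_fraction(people))
```

    A high fraction suggests that dropping names alone would not pseudonymize the data.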

    In my world, it's a lot of ETL. Luckily, as I'm working mostly with data scientists, both the machine learning and the ETL can co-exist in Python. Even before then, I was eager to use Python over T-SQL and SSIS for the mere fact that you can set up distributed processing very easily in Python, with ETL pipelines where each data stream is processed in a shared-nothing environment while the streams still work together as a whole. The only issue is that some of these scripting languages are not the fastest tools in the box compared to other options, but it generally works out in the end because they can scale horizontally where others can only scale up.
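
    A rough sketch of that shared-nothing idea using only the standard library. The function names and the trivial transform are hypothetical; a real pipeline would read from actual sources and probably spread across machines rather than local processes.

```python
from concurrent.futures import ProcessPoolExecutor


def extract(partition):
    """Stand-in for reading one data stream; here the partition is already a list."""
    return list(partition)


def transform(records):
    """Example transform: trim whitespace and upper-case each value."""
    return [r.strip().upper() for r in records]


def process_partition(partition):
    """Run extract and transform for one partition with no shared state."""
    return transform(extract(partition))


def run_pipeline(partitions, executor_factory=ProcessPoolExecutor, max_workers=2):
    """Process each partition independently, then combine results in input order."""
    with executor_factory(max_workers=max_workers) as pool:
        results = pool.map(process_partition, partitions)
        return [row for chunk in results for row in chunk]
```

    When using the default process pool from a script, call `run_pipeline` under an `if __name__ == "__main__":` guard. Each worker only sees its own partition, so adding workers is the horizontal scaling step.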

  • GDPR Article 17 "Right of Erasure" may require you to have a mechanism to delete all forum posts and private messages for a particular user.
    I don't think "Right of Erasure" would cover articles written by someone, in the unlikely event of an author requesting erasure.
    "Right of Erasure" does not trump the legal requirement to keep financial records for the legally mandated time period.
    If Redgate haven't done so already, it would be wise to get advice from legal.
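
    As a minimal sketch of such a mechanism, assuming a hypothetical schema (`forum_posts`, `private_messages`, `invoices`) and using SQLite purely so the example runs anywhere:

```python
import sqlite3


def erase_user_content(conn, user_id):
    """Delete a user's posts and sent private messages in one transaction.

    The hypothetical `invoices` table is deliberately left untouched, since
    financial records may have to be retained for a mandated period.
    """
    with conn:  # commits on success, rolls back on error
        posts = conn.execute(
            "DELETE FROM forum_posts WHERE user_id = ?", (user_id,)
        ).rowcount
        messages = conn.execute(
            "DELETE FROM private_messages WHERE sender_id = ?", (user_id,)
        ).rowcount
    return {"forum_posts": posts, "private_messages": messages}


# Hypothetical schema and sample rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forum_posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    CREATE TABLE private_messages (id INTEGER PRIMARY KEY, sender_id INTEGER, body TEXT);
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO forum_posts (user_id, body) VALUES (1, 'hello'), (1, 'again'), (2, 'hi');
    INSERT INTO private_messages (sender_id, body) VALUES (1, 'psst');
    INSERT INTO invoices (user_id, amount) VALUES (1, 9.99);
""")
print(erase_user_content(conn, 1))
```

    Whether messages received by the user, or replies quoting them, should also go is exactly the kind of question for legal.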

    A subject access request for subscribers would cover anything in their profile but as the site provides the mechanism to see this it is effectively self service.  If there is nothing beyond what people can self-serve then it may be as simple as having an explicit GDPR page that states how a requester can retrieve their own data.

    Article 20 "Right to data portability" is an interesting one.  It doesn't limit its scope, but I think historically it reinforces consumers' rights to swap energy suppliers, broadband/mobile providers and now banking providers.  In the context of SQLServerCentral it could be a mechanism for subscribers to download their own profile.
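
    A profile download could be as small as the sketch below. The field names are hypothetical and JSON is just one machine-readable format that would fit; nothing here reflects how the site actually stores profiles.

```python
import json

# Hypothetical list of profile fields considered safe to hand back to the subscriber.
PORTABLE_FIELDS = ("username", "email", "display_name", "signature")


def export_profile(profile, fields=PORTABLE_FIELDS):
    """Serialize the chosen profile fields to a JSON string the subscriber can download."""
    portable = {name: profile[name] for name in fields if name in profile}
    return json.dumps(portable, indent=2, sort_keys=True)


# Hypothetical profile row; internal fields such as the password hash are not exported.
profile = {
    "username": "jsmith",
    "email": "j@example.com",
    "password_hash": "not-exported",
    "signature": "Don't fear failure, fear regret.",
}
print(export_profile(profile))
```

    Using an explicit allow-list of fields keeps internal columns out of the export by default.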

    Another interesting wrinkle is what you do when not all your data is in SQL Server.  Does something like Presto (the engine AWS Athena is built on) provide an answer to this and, serendipitously, to a general business problem?

    As general advice to people facing GDPR I would say take a good hard look at any company file shares, email inboxes, Dropbox/OneDrive type accounts, workstations, SharePoint etc.  In the SQL Server world we have a structured data store with a defined retention strategy, purge, archive and backup.  On company file shares and mailboxes there is God knows what, God knows where and in God knows what format.
    If your HR department takes a scan of your passport when you first join the company then they need to have defined processes in place to purge those images when they are no longer in use.  Unless you have some form of auditing software such as http://www.groundlabs.com, which has the capability to perform OCR on images, it is going to be very hard to identify what your exposure and risk is.
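
    For plain-text files at least, a first pass over a share can be sketched in a few lines. This is only an illustration under simple assumptions: it looks for email-like strings in a hypothetical set of text suffixes, does no OCR, and will miss scanned images such as the passport example entirely.

```python
import re
from pathlib import Path

# Deliberately simple pattern; real PII detection needs far more than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TEXT_SUFFIXES = {".txt", ".csv", ".log"}  # assumed set of file types to inspect


def find_files_with_emails(root):
    """Return {path: match_count} for text files under root containing email-like strings."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            count = len(EMAIL_PATTERN.findall(text))
            if count:
                hits[str(path)] = count
    return hits
```

    Pointing `find_files_with_emails` at a share root gives a rough list of files worth a closer look, nothing more.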

  • You say "ETL specialist", I say "Data Janitor".  😀

  • We use SSIS for all of our ETL.  I would prefer that SSIS be used more for EL and not the (T)ransform: set up the SSIS package to Extract and Load the data into a 'work' table, then use SQL to transform the data.  In everything I've done so far in my career I haven't found any 'Transform' that I couldn't do in SQL.
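
    The same pattern in miniature, as a sketch: Python and SQLite stand in for SSIS and SQL Server here, and the `work_orders`/`orders` tables and cleaning rules are invented for the example.

```python
import sqlite3


def extract_and_load(conn, rows):
    """EL: land the raw rows in a work table without changing them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS work_orders (order_id TEXT, amount TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO work_orders VALUES (?, ?, ?)", rows)


def transform_in_sql(conn):
    """T: type and clean the data with one set-based SQL statement."""
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute("""
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               CAST(amount AS REAL)      AS amount,
               UPPER(TRIM(country))      AS country
        FROM work_orders
        WHERE TRIM(amount) <> ''          -- drop rows with no amount
    """)


conn = sqlite3.connect(":memory:")
# Hypothetical raw rows: everything arrives as text, one row has a blank amount.
extract_and_load(conn, [("1", "10.50", " gb "), ("2", "", "us"), ("3", "7", "Us")])
transform_in_sql(conn)
print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

    Keeping the work table untouched means the transform can be rerun or changed without extracting again.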

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • David.Poole - Wednesday, February 21, 2018 1:48 AM

    GDPR Article 17 "Right of Erasure" may require you to have a mechanism to delete all forum posts and private messages for a particular user.
    I don't think "Right of Erasure" would cover articles written by someone, in the unlikely event of an author requesting erasure.
    "Right of Erasure" does not trump the legal requirement to keep financial records for the legally mandated time period.
    If Redgate haven't done so already it is wise to get advice from legal.

    Article 20 "Right to data portability" is an interesting one.  It doesn't limit its scope, but I think historically it reinforces consumers' rights to swap energy suppliers, broadband/mobile providers and now banking providers.  In the context of SQLServerCentral it could be a mechanism for subscribers to download their own profile.

    Maybe. Our business is providing answers to people. It's possible an entity could get a right of erasure, but I doubt it for the things we share. We'd want a legal decision to the contrary.

    For Article 20, the profile is a good example where we might need to provide that for someone, though I doubt we'd get a request. We keep fairly little information here that isn't public.

  • chrisn-585491 - Wednesday, February 21, 2018 6:08 AM

    You say "ETL specialist", I say "Data Janitor".  😀

    Yep...

  • "Right of Erasure" is interesting (maybe not)! One of my best friends died suddenly and very, very unexpectedly a few years ago. However, he lived on online for a long time as his family struggled to get him removed from social media like Friends Reunited and LinkedIn. They were less bothered about professional forums similar to this one...
