RE: Brush Up on Your ETL Skills

One Orange Chip

Points: 25560

February 20, 2018 at 9:24 pm

#1980580

IP addresses, cookies, gender, zip codes, even religious views, to name a few.

Using publicly available census data, 87% of American adults could be accurately and uniquely identified from just three data points: date of birth, gender, and a five-digit zip code. That sobering statistic highlights why robust pseudonymization measures are needed, particularly in light of large-scale data breaches such as the Equifax security incident.
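To make the "three data points" idea concrete, here is a minimal sketch of how you might measure it on your own table: count how many rows are the only one with their combination of quasi-identifiers. The function name `unique_fraction`, the column names, and the four sample rows are all made up for illustration; this is not the method behind the 87% figure.

```python
from collections import Counter

def unique_fraction(records, keys=("dob", "gender", "zip")):
    """Share of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    # Count how often each combination of quasi-identifiers occurs.
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    # A record is re-identifiable here if nobody else shares its combination.
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Hypothetical rows, invented purely for this example.
sample = [
    {"dob": "1980-01-01", "gender": "F", "zip": "02139"},
    {"dob": "1980-01-01", "gender": "F", "zip": "02139"},
    {"dob": "1975-06-30", "gender": "M", "zip": "02139"},
    {"dob": "1990-12-12", "gender": "F", "zip": "10001"},
]
```

Running it with fewer keys, for example `unique_fraction(sample, keys=("gender",))`, shows how quickly uniqueness drops once a column is removed or generalized.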

In my world, it's a lot of ETL. Luckily, as I'm working mostly with data scientists, both the machine learning and the ETL can coexist in Python. Even before that, I was eager to use Python over T-SQL and SSIS simply because you can set up distributed processing in Python very easily, with ETL pipelines where each data stream is processed in a shared-nothing environment while the streams still work together as a whole. The only issue is that some of these scripting languages are not the fastest tools in the box compared to other options, but it generally works out in the end because they can scale horizontally where the others can only scale up.
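As a rough sketch of what I mean by shared-nothing, assuming the data is already split into partitions: each partition is transformed by a function that touches nothing outside its own arguments, and only the small partial results are combined at the end. The names `transform_partition`, `run_pipeline`, and `run_parallel`, and the zip-prefix counting itself, are hypothetical and just for illustration; a real pipeline would likely spread the work across machines rather than local processes.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def transform_partition(rows):
    """Clean one partition on its own; no state is shared with other partitions."""
    counts = Counter()
    for row in rows:
        zip_code = row.get("zip", "").strip()
        # Keep only well-formed five-digit zips, bucketed by three-digit prefix.
        if len(zip_code) == 5 and zip_code.isdigit():
            counts[zip_code[:3]] += 1
    return counts

def run_pipeline(partitions, mapper=map):
    """Apply the transform to every partition, then merge the partial results."""
    total = Counter()
    for partial in mapper(transform_partition, partitions):
        total.update(partial)
    return dict(total)

def run_parallel(partitions, workers=2):
    """Same pipeline, but each partition is handled in a separate process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return run_pipeline(partitions, mapper=pool.map)
```

Because `transform_partition` depends only on its input, swapping the built-in `map` for a process pool (or a cluster scheduler) changes where the work runs, not what it computes. When calling `run_parallel` from a script, put the call under `if __name__ == "__main__":` so worker processes can start cleanly.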