Bash for ETL pre-processing

  • David.Poole

    SSC Guru

    Points: 75031

    Comments posted to this topic are about the item Bash for ETL pre-processing

  • fschreuder

    SSC Veteran

    Points: 287

    Very helpful and informative. Thank you!

    I need to invest more time studying the program options as some of my ignorant wishes are just a flag away 😉

Personally I use UnxUtils, found on SourceForge [1], as a quick head-start: unzip it to your toolbelt folder, add it to %PATH% and you're ready to go.

The included versions of the tools are rather old (some over 10 years 😀), yet their functionality is mostly fine.

Awk is not part of it, but GAwk is available in binary form, again on SourceForge [2]. At the time of writing the binary is at version 3.1.6 while the source [3] is at 4.1.

As Sed needs some investment time- and energy-wise (as does Awk), it might be advisable to get a more recent version here [4] than the one included in UnxUtils; the Windows binary appears to be in line with the latest available version.

    Cheers & have fun!

    /edit: fix typo

    [1]:http://unxutils.sourceforge.net/

    [2]:http://gnuwin32.sourceforge.net/packages/gawk.htm

    [3]:http://git.savannah.gnu.org/cgit/gawk.git

    [4]:http://gnuwin32.sourceforge.net/packages/sed.htm

  • Jeff Moden

    SSC Guru

    Points: 993753

Heh... let's talk about "open minded"... Now all we have to do is get people over the fear of using xp_CmdShell and over the love affair with SSIS, and you might be on to something.

On the word "heterogeneous" insofar as data goes, I find that the word is seriously overused, frequently improperly used, and frequently used as cover for being a whole lot lazy. And trust me on this... you wouldn't want me to use "heterogeneous" methods for managing your checking or credit card accounts. You need correctly structured data even if it comes from different sources. Great tools like Bash can help with things like finding and removing wayward headers and footers, but the rest of the data had better be well structured.
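The header-and-footer cleanup mentioned here is a one-liner in the shell world; a minimal sketch with sed (the file name and record layout are hypothetical):

```shell
# A hypothetical feed with a wayward header and footer wrapped around the data.
printf 'REPORT 2016-08-15\n1001,25.00\n1002,30.00\nEND OF REPORT\n' > accounts.csv

# Drop the first line (header) and the last line (footer) before loading.
sed '1d;$d' accounts.csv > accounts_body.csv

cat accounts_body.csv
# → 1001,25.00
#   1002,30.00
```

On systems with GNU coreutils, `tail -n +2 accounts.csv | head -n -1` does the same job; the sed form is the more portable of the two.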

I've seen the train-wreck known as "heterogeneous" data in the form of documents received from banks in lieu of structured data. Their excuse is that the documents are what get sent to the customers. Well, guess what? Those documents are generated by computer... from a database!!!! Stop being lazy sots and send us the properly structured data you used to generate the bloody letters, instead of forcing us to scan, OCR, and nitpick the docs to figure out where the hell the account or loan number is and what action was taken. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • David.Poole

    SSC Guru

    Points: 75031

    What I see in corporate data defies rational analysis.

• Data that should be on a strict need-to-know basis available for general consumption
• Data that should be generally available locked up tighter than Fort Knox
• Make-do-and-mend reverse-engineering processes to obtain base data, often with a team to keep the process running
• Revenue-generating data applications treated as pariahs while vanity projects are treated like the anointed ones
• A stated desire for agility and flexibility while soul-destroying bureaucracy is protected
• Estimates for work that are, at best, figures plucked from the air by people completely divorced from the process

    I'm beginning to think that common sense as a resource is becoming rarer.

  • Ed Wagner

    SSC Guru

    Points: 286950

Nice article, David. As Jeff pointed out, if people have an irrational fear of xp_cmdshell, what are they going to think about Bash? I don't have this fear, and I find your article very intriguing. It also brought back memories (some good, some not) of Unix days.

    David.Poole (8/15/2016)


    I'm beginning to think that common sense as a resource is becoming rarer.

    Agreed. I'm reminded of Lowell's signature about common sense actually being considered a superpower. 😉

  • mike.brooks

    SSC Rookie

    Points: 38

    We use bash at my current client for pre-processing files. We receive files from various sources and we use shell scripts to normalize aspects of the files (delimiters, line endings, etc.). We have also used the linux split command to break up very large, one-off files into numerous smaller files for easier processing into our tight ETL windows. I would definitely use this approach again on future projects.
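A minimal sketch of that kind of pre-processing (file names and layouts are hypothetical, and the delimiter swap is naive: it assumes fields contain no embedded pipes or commas):

```shell
# A hypothetical pipe-delimited feed with Windows (CRLF) line endings.
printf 'acct|amount\r\n1001|25.00\r\n1002|30.00\r\n' > vendor_feed.txt

# Normalize: strip carriage returns, swap the pipe delimiter for a comma.
tr -d '\r' < vendor_feed.txt | tr '|' ',' > normalized.csv

# Break a very large one-off file into fixed-size chunks so each fits the
# ETL window (one line per chunk here, purely for illustration);
# pieces are named part_aa, part_ab, ...
split -l 1 normalized.csv part_
```

In practice the `split -l` count would be something like 100000 lines per chunk, tuned to whatever the load window can absorb.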

  • Jeff Moden

    SSC Guru

    Points: 993753

    David.Poole (8/15/2016)


    What I see in corporate data defies rational analysis.

• Data that should be on a strict need-to-know basis available for general consumption
• Data that should be generally available locked up tighter than Fort Knox
• Make-do-and-mend reverse-engineering processes to obtain base data, often with a team to keep the process running
• Revenue-generating data applications treated as pariahs while vanity projects are treated like the anointed ones
• A stated desire for agility and flexibility while soul-destroying bureaucracy is protected
• Estimates for work that are, at best, figures plucked from the air by people completely divorced from the process

    I'm beginning to think that common sense as a resource is becoming rarer.

    And they call that lack of common sense, "heterogeneous" data. 😉

    --Jeff Moden



  • DEK46656

    SSCommitted

    Points: 1971

    Yeah, the good old days…

In the past I did a lot of work on Solaris/UNIX systems to extract data and prepare it for database loading. I was dealing with cellular system ROP (Report Only Printing, i.e. logs) data that would normally go to a printer but was instead being captured on the Solaris “management” system. Each record carried some common header-type data, but the rest of the format changed depending on what was being reported. The record separator was multiple blank lines, after which the first fields of the new record occurred. I had to treat newlines like field separators, and the values in key locations within a record determined its type.

My two primary tools were KSH and AWK, though I did use GREP, HEAD, TAIL, and (sometimes) SED. Piping ran all through the code, redirecting to separate files, not just STDIN, STDOUT, and STDERR. Recently, while reading “Enterprise Integration Patterns”, I discovered that the whole approach I had taken is a recognized design pattern called “pipes and filters”.

Of course, to really get good at AWK, you needed to get good at “regular expressions”: a complete language unto itself that technically isn’t programming, but should have been.
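The blank-line record separator described above maps directly onto awk's "paragraph mode": setting RS to the empty string makes each blank-line-separated block one record, and FS="\n" makes each line within the block a field. A minimal sketch (the ROP-style record layout here is hypothetical):

```shell
# awk paragraph mode: RS="" treats blank-line-separated blocks as records;
# FS="\n" makes each line within a block a field. The first line of each
# block acts as the record type, the remaining lines as its fields.
printf 'ALARM\ncell=42\nsev=major\n\nSTATUS\ncell=17\nsev=minor\n' |
awk 'BEGIN { RS=""; FS="\n"; OFS="," } { print $1, $2, $3 }'
# → ALARM,cell=42,sev=major
#   STATUS,cell=17,sev=minor
```

From there, branching on `$1` inside the awk program gives per-record-type handling, which is exactly the "key locations determine the record type" problem above.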

    Beer's Law: Absolutum obsoletum
    "if it works it's out-of-date"

  • Eric M Russell

    SSC Guru

    Points: 124948

When it comes to receiving data feeds from external clients or third-party providers, it's best when an auto-certification process breaks the file transfer if the file isn't formatted properly. That way the burden is on them to fix it. It's conceptually similar to a restaurant manager who sniffs the fish as it arrives on the loading dock and refuses the shipment if it isn't fresh.
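A minimal sketch of such a certification gate in the shell (file name, layout, and field count are all hypothetical): the feed is certified only if every row has the expected number of delimited fields, and the exit status can then drive the transfer job.

```shell
# A tiny hypothetical feed; the last row is deliberately malformed.
printf 'acct,amount,code\n1001,25.00,A\n1002,30.00\n' > feed.csv

# Certification gate: every row must have exactly 3 comma-separated fields.
# awk exits non-zero on the first bad row, refusing the file.
if awk -F',' 'NF != 3 { exit 1 }' feed.csv; then
    echo "feed.csv certified"
else
    echo "feed.csv rejected: malformed row found" >&2
fi
```

Real feeds would need richer checks (data types, ranges, embedded delimiters), but even this field-count gate pushes the burden back onto the sender before anything reaches the ETL pipeline.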

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Ed Wagner

    SSC Guru

    Points: 286950

    Eric M Russell (8/15/2016)


When it comes to receiving data feeds from external clients or third-party providers, it's best when an auto-certification process breaks the file transfer if the file isn't formatted properly. That way the burden is on them to fix it. It's conceptually similar to a restaurant manager who sniffs the fish as it arrives on the loading dock and refuses the shipment if it isn't fresh.

Heh - Funny you say this, Eric. We have that exact arrangement in a number of scenarios. Also, if one record in the file is invalid, the whole file is suspect. In short, we won't be fixing "just that one or two rows" for anyone. These agreements have paid off every time and saved countless hours hunting down problems when a supplier changes their format.

  • Scott Abrants

    SSCommitted

    Points: 1503

Good article, David.

I like your use of snippets from real-world examples and how you applied what you wrote about to help solve them.

    Now I agree with Jeff as well. I think managers are being sold some fluff when it comes to the whole "there is no need for structured data anymore" and "having a schema is more of a pain to maintain and provides little value".

    Thanks for your work David.

  • Eric M Russell

    SSC Guru

    Points: 124948

    Ed Wagner (8/15/2016)


    Eric M Russell (8/15/2016)


When it comes to receiving data feeds from external clients or third-party providers, it's best when an auto-certification process breaks the file transfer if the file isn't formatted properly. That way the burden is on them to fix it. It's conceptually similar to a restaurant manager who sniffs the fish as it arrives on the loading dock and refuses the shipment if it isn't fresh.

Heh - Funny you say this, Eric. We have that exact arrangement in a number of scenarios. Also, if one record in the file is invalid, the whole file is suspect. In short, we won't be fixing "just that one or two rows" for anyone. These agreements have paid off every time and saved countless hours hunting down problems when a supplier changes their format.

    Ideally, there is an arrangement where you provide the client an application that facilitates the data certification, zip compression, and secure file transfer. If the data isn't formatted or populated correctly, then it never even gets transmitted across the wire to your network. It does, however, produce immediate feedback in the form of a report identifying the columns or rows that require corrective action.

Of course, you provide this innovative tool to the client because you care about making their life easier. 🙂

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • David.Poole

    SSC Guru

    Points: 75031

I have become convinced that a robust data model is more important in schema-on-read DBs, not less.

Having worked for two information aggregators, I'm used to data flying at me in different formats and layouts. For the most part it conforms to a schema of some description that can be mapped to an ANSI-compliant RDBMS.

    At a fundamental level and irrespective of technology we are modelling real world objects that have defined features and attributes.

    I can see the value in rapid prototyping with relaxed disciplines to act as a business POC. Once the innovation has proven its business viability then model it properly.

Beyond free-format surveys and cases where I might be tempted by an EAV model, I'm not sure what problem document DBs are supposed to solve.

  • Scott Abrants

    SSCommitted

    Points: 1503

Great points, and they come from real-world experience, not from some NoSQL marketing hype engine.

    Thank you for your insight!
