Data movement is a fundamental piece of a data engineer’s duties, and recently I’ve been thinking about the art of data movement. What are some of the most important pieces that a data engineer needs to think about when confronted with data ingestion? There is of course data exporting as well, and in that case, similar thought processes should be undertaken. If you’re exporting data for consumption, then you also need to be thinking about how someone will consume that data.
An important stance as a data engineer with any experience knows is that we’re always questioning whether data is as it appears. Really, this is a good perspective to have for anyone in the world! Are things as they appear? What information do I have to validate that it is factual? If my presumption is x, validate that that is in fact true! But there’s an additional step – what if it seems true at first, but is not true or sometimes varies with time. So, even when information is what it seems at first, how we can we put validations and warnings in place for when information starts to stray or isn’t delivered in the way we expect in the future. There is a lot to consider.
In this article, I want to step through the process of items to consider when you’re considering ingesting data. Then in Part 2, we’ll move on to technical considerations for ingestion as well as potential validations and warnings to consider putting in place that will let you know when data starts to veer. As a data engineer who has spent a lot of time working with data validation and ingestion, the other important piece to ingestion is to not have it take a lot of my time. I heavily lean towards “set it and notify” strategy that notifies only when you’re self-correcting (a variation on set it and forget it!!). But we absolutely need to be aware of when data is veering off from expectation and how to handle that, even if handling that is just awareness and/or letting our client know.
Then in Part 3, I’ll talk about creating a dynamic loading system (built off of the original idea from "Dynamic ETL with SSIS" so that future data loads require less of our time, thus allowing us to focus on the what instead of the process of ingestion.
Here are the first items I think about when I’m confronted with data ingestion and we’ll talk about all of these:
- Where is the data coming from?
- What is its purpose?
- What is its frequency?
- What is its format?
- What are the size expectations?
- How quickly does it need to be available?
- Does the data build on itself?
- How long will we expect to receive this data and how long will it continue to be useful?
- How long should the files be retained after load or in the system we’ve loaded it into?
- How are we aware when there are issues with data/file drops?
- Who do we reach out to when there are data issues?
- How can I ingest this data in a non-manual, set it and self-correct fashion? (Part 2)
- How can I load data dynamically for the future so that loading data is easy to implement, modify, test and verify? (Part 3)
The Data is Coming, the Data is Coming!!
Many of us are familiar with the process. A project manager approaches us about new data that we need to ingest and they are looking to you to bring that data in. Simple enough, right? Before any data is ingested, it’s imperative to know what the need is for this data. Hopefully your PM has had some upfront discussions about what you’re getting and why. If not, it’s imperative at this point to pepper them with questions to get more information. Volunteer to be on any calls as needed to gather more information.
Ideally, before data is received, there’s some understanding about what this data is and how it will be used and integrated in. In today’s age of data, there is unlimited data to be extracted. But we need to be asking ourselves first, what value will this data give my organization/client? And at what cost? As a data engineer, I’m always wanting to understand the bigger picture. We are not just here to take data from point A to point B. We’re also here to understand the bigger problems and solutions that exist for our organizations and customers.
Here are some exploratory questions, but you may have more as you work through the data ingestion process:
- Where is the data coming from?
- How clean is it?
- How will it be useful to our application/organization, etc?
- Will it be integrated with existing data to answer additional questions?
- Who will want to know about this data (users, exports, internal business people, etc) and how will they interface with it?
- What questions will this data answer?
- What information do we want to track over time about this data?
Make sure all these questions are documented. It’s not uncommon for many of the answers to change over time. Data can become more or less useful with time. What can we learn from this data?
You Say Data Format, I say Data Dictionary
I must be old school because it seems fewer people say “Data Dictionary” anymore. Maybe it’s because no one really uses dictionaries anymore? Who knows? Any way you say it, after you’ve figured out what you’re getting, it’s time to think about the format that you’re getting it in. This is often wrongly the first step that people take. It’s hard not to want to know what’s in the box! Open it, open it! And you should definitely do that if you’re given any kind of sample – but then verify.
Ideally what we want here is the organization that is sending it to us to say – “here is what you’re getting” in a formal way. If you’re dealing with a third-party data company, this can come in a detailed 200 plus page document or in some more dirty data extractions, there can be no data information at all. If you are getting very little information about your data, it’s time to put your cautious data shoes on and tread lightly. If you only have a sample and you’re meant to work off that, put together your own data expectations document and send it back to the project manager and if possible, the client/group/organization that you’re getting the data from to verify.
Assumptions can be a killer and it’s always best to lay out what your assumptions are to give everyone an opportunity to adjust them if needed. This also gives you something to fall back on when or if those assumptions are incorrect. With dirty data that doesn’t have clear definitions, the documentation of that data may need to be changed and modified with expectations as you become more aware of the data you’re getting.
But wait – there’s more! It’s not just the format of the data, there’s a lot more than that. Here are some additional questions that need to be answered in conjunction with data format:
- How often will I receive the files?
- Where will they be dropped?
- Will they be deleted from the drop location and if so, how often?
- What time of day will they be dropped?
- What are the file naming conventions?
- What will be the average size of files?
- Are there file size fluctuations that I should be aware of?
- How quickly does it need to be available?
- Does the data build on itself (ie does the order of how files are applied matter)?
- What are the data types and data field sizes for each field? Do they map to what may be pre-existing in our system?
- What is the business definition behind each field?
- Are all fields required or are some of them extraneous?
- Is the file delimited or fixed length?
If you’re a data engineer who has been working with data ingestion, as you ask these questions and get answers, you’re inevitably going to start formalizing an ingestion strategy in your mind. I liken this to how if someone asks you to drive to your local grocery store, in your mind, you’ll be traversing all the roads it would take to get there in your mind. When we have experience building ingestion pipelines, we automatically start thinking about how we’re going to build the roads for that data. This is a good thing, but it’s also important to not start to get ahead of yourself. Our expectations on a beautiful peaceful stream of data running through our meadow can quickly become a waterfall of bad data that floods our meadow. A 50MB file that somehow then turns into gigs of data flowing through is a danger if you never expected that large a volume of data ingestion.
The only thing I would try and steer in a conversation about the data here is dates. If it’s not already in place, make sure your file names come in with a date format either in the folder names they are saved to and/or with a date in the file name. File dates change as files get moved around so it’s important to have something representative either in the folder structure or naming of the file and preferably both. It’s often the case that a file dated on one day, holds data updates from the day before. So understanding the relationship between dates on files and what that means for data is important. File naming also tells other people who are looking at the files, what that file is. The more detailed the better.
Also, ideally the data itself will have dates in it. Especially if it’s a file process where data builds on itself. If a file drop doesn’t happen, and another file comes in for the next day, it’s a lot easier to be self-correcting when you have dates to work with. It’s not the end of the world if you don’t have dates because they can potentially be grabbed in other ways, but ideally you want your data source to be telling you when that data was created and/or updated so it can be reflected in your system if needed. If you have any say in how the data is received to you, make sure that there are representative dates ideally with times.
As you work through the questions above, the next question should always be “And what if that’s not true?”. So, for example, the client says – the files will be dropped on the SFTP site by 10am in the morning PST. Now you ask - what should we do if they are not? Is there a window or time limit where notifications about missing files need to happen? Work through the worst-case scenarios at this point while in the background always thinking – how do I make this problem less obtrusive and/or self-correcting? For example, a daily file doesn’t come in – can I continue to run the job associated with the import? Can I run the next day’s import without that file? Again, we’re aiming for self-correcting here. If we know what action needs to be taken when something expected does not occur, we can put another action in place to deal with it.
We need to have faith and trust in the data in our systems and be strong gate keepers. The way we do that is by asking a lot of questions up front and documenting all our assumptions. This can all be done before we’ve even received any kind of sample file. By asking all these questions up front, we have everyone involved thinking about the use, format and expectations of the data which makes the next steps of the technological pieces more stable. You do not want to have to come back to make changes in your process because of incorrect assumptions about the data. This can lead to technological designs that don’t fit well with the data expectations. In the next article, I’ll talk about technical considerations for data ingestion.