In part one of this series, we discussed how to prepare for the ingestion of data. In this second part of the Art of Data Ingestion, we'll discuss technical considerations for ingesting data. Once you've worked through the questions of what data it is you will be receiving, it's time to think about the technical horizon for the data.
My preferred data to ingest, if I had my dream world of data ingestion (queue the birds and dancing data strolling through a nice meadow), would be from a third-party system that specializes in data extraction. For businesses that sell data, the data is their product. They care what it looks like and how it’s formatted. They want you to use it, so they'll give you a giant document with definitions and layouts and you can get started right away. It can be very nice to work with a third-party company whose whole business is making sure you’re getting accurate and timely data in the format that they guarantee. Their salaries depend on it.
But as someone who worked with the data from one of these companies for years, don’t think that everything is going to come up roses. Issues arise even from the best and most organized of companies. What they will appreciate are the clients that notify them of issues right away because those clients validate the data BEFORE ingestion. What they don’t want is a company suing them because of data that just went in willy-nilly and caused all kinds of havoc in their systems. You do NOT want your company to be that one.
Laying the scene
The data you receive can be an important file that is extremely integral to your business. It could also be a file that your business is not quite sure what it will do with yet. If it’s the latter, more investigation is needed. That doesn’t mean that you won’t start ingesting it. What is of utmost importance is to bring that data in for some basic and exploratory analysis.
If you’re in an exploratory phase, you may want to segregate that data in another environment where it won’t have to play with production data. Or if it’s associated with production data and needs that context, perhaps you can create a separate environment where they integrate. When you have less knowledge about the data, you tread very lightly in your expectations and watch it carefully, documenting observations and consistencies over time.
This can be a good time to go back to your client or project manager to discuss what it is you're seeing and make sure that it's aligned with your expectations. Exploratory analysis involves time and hopefully you've arranged some time because data can have much more context when given over time. Ideally, you don't want to add data to your system until you know what it's value will be. That being said, data ingestion is never really a one and done process. Understandings and requirements can change over time. The goal is to have your ingestion process be as agile as you can make it and continue to observe the trends and information that the data is telling you.
Sometimes, there is too much production data to easily have a separate data environment that mimics your production environment exactly. In that case, hopefully, you have a pared-down staging version of your production databases. A last-case scenario is to bring the small subset into production so analysis can be done there.
It’s important at this point that exploratory analysis is a part of any estimation on data ingestion. Expectations need to meet requirements and this can only happen when the data is validated against expectations. This can happen with sample files, but as we’ll learn, even over time data can veer from expectation. Make sure to have a good subset of sample files to work off of. Set rules need to be established with the expectation that they will be constantly verified over time - luckily you've done that after reading part one.
Here come the Files!
As we get started, we need to ensure we have tools, naming, a queue design, and indexing. We will discuss each of these below.
For data exploration, it can be nice when the data is coming in flat. But each data format has its upsides and downsides. JSON and XML can represent relationships and hierarchical data structures, which can be great at representing the data structures, but tricky to break down at times. Thus, it's import to get the data format design straight up front. Azure Data Factory, Data Bricks, and Azure Synapse can handle a wide array of different files. Microsoft cloud tools have become very robust and easy to use. You'll need to think about what the technology is that you'll use to import your data. That tool will most likely depend on what's available to you and your organization. Ideally you would use the very robust cloud tools that Microsoft has available, but perhaps you have to use SSIS because you're not in the cloud yet. Deciding your ingestion tool can be complicated if you have lots of options or it can be simple. Get familiar with your tool and mock up a process for bringing the files in to validate the data and the tool and be ready to explain why you've chosen that tool.
Naming & Pathing
Microsoft has some recommendations on file path formatting and those recommendations are to create subject/area/object groupings and under that go by Year/Month/Day. This is becoming more the norm because of partitioning. Even if you're not partitioning the data logically, it can be a good idea to follow the norm physically because it allows other developers and data engineers to understand your file structure and find files more quickly. There are no hard and fast rules on file storage and movement. It can be an organizational or even a personal preference. I will say that if everything is going into one folder and never leaving, it can be hard to find files once it starts to fill up. If you are using a utility to browse, it can be slow to load all the files that are viewable to you. These naming conventions (such as raw, curated, cleansed or gold, silver and bronze) are becoming more the norm as we move towards a shared knowledge of data movement. The folder naming structure can help us define and more easily understand the process. Microsoft's data lake process can be a good one to emulate.
If files get manipulated it’s important to keep the original as well as manipulations done to the file so that you can do comparisons. In a very simplified way, I like the idea of moving from an “Unprocessed” folder to a “Processed” folder using whatever naming convention works best for your team. That way, you can more easily see what hasn’t been processed and what has. But you absolutely HAVE to make sure that the file is only moved after it’s truly been validated as processed in whatever way this means to you.
I have had the experience that moving files across folders can be difficult for testing. You run a test pipeline, files are moved and then you want to re-run it. You can spend a lot of time manually moving files back and forth. One way around this is to allow over-write both of the file and with the SQL metadata table you use for tracking file processing. You can then do a check in the database – has this file already been loaded? If not, load it, if so, ignore it. You may want to put code or additional attributes into your process for tracking when you're testing.
If you are getting files incrementally, and you have a problem with a production file and want to test it in development, having the ability to load that file while bypassing all the files you missed since the last backup will make problem solving more efficient. Or being able to automatically load all files in one batch that are missing if the incremental load is important. Getting this solution right is a cross between organization, clarity and supporting your testing efforts so that it’s not onerous to test.
Data Queue Design
I’m a strong proponent of setting up a table with metadata about the file process for processing within the database to track files that are expected, the dates they were processed, the duration and if errored, what stage the process was in along the way. This is essentially a table that tracks the process in your database. Besides keeping you informed about what you've loaded, it can be used to tell your process what to load (what it's expecting) and this then allows you to notify someone when a file you're expecting is not received.
Think of your ingestion as a manufacturing line – if there’s a problem you want to get right to it, fix it and then evaluate what happened and prevent the problem from happening again. If the error happened in step 3, you want your metadata table to tell you where the problems was and give you the most up to date information. This doesn't just have to apply to data ingestion. If the data ingestion piece is the first step in a process, create your metadata table more like a process queue where it tracks step status along the way. This allows you answer any type of question you might have about the process. What's the longest step? How much data did we receive in December? How many files weren't propped for download in time for processing over the last year? The more detail the better. The database is the best place to analyze data, and it’s the best place to analyze your data ingestion. I want my database to expect the files coming to it, tell me when they aren’t there or have issues (in a nice way – not by erroring out) and to give me an idea of file sizing, process duration, and successful completion.
I DO NOT want to be fishing around through SSIS packages or ADF pipelines trying to find out what exactly happened. So, I’m a strong proponent of KIDD (keep it in the database dummy – ok, this is not a thing, I just made it up). The tools that Microsoft has created to work with data movement are great and so much easier to use now in the cloud then ever before. But it doesn’t need to be complicated. Databases have been tracking data for many decades now and giving us information to solve business problems. Data ingestion and data movement are no different and I believe their components and metadata about them are best also stored in the database. So one thing I keep in mind as I use these tools, is that I want them to function and do their part, but I want the information they working off of and the information they are creating to be stored where I can easily access it.
There are a lot of articles on indexing. It's a deep subject. Generally, it’s best to follow Microsoft’s recommendations to not have indexes on tables that are staging tables that will hold data because it can slow down loading. But if you’re merging data from two tables or doing any kind of join on a table, add an index on those fields you’re joining either as part of the table or added after the table has been inserted into. I also always have a surrogate primary key. If I’m not loading a lot of data, I’ll sometimes keep the indexes permanent – just remember to update statistics after load.
ELT vs ETL
ELT vs ETL is like snowboarding vs skiing. They’re both great and each can be better based on the conditions. If you understand the benefits and downsides of both and when it's best to use one over the other, then you’re set for whatever conditions come your way. When you’re dealing with big data or streaming data and you only need a subset of the data, it often does not make sense to implement ELT. In that case you’re bringing in more data than you need and data space costs money. However, you want to be absolutely sure you don’t need it, because the worst case is when you determine that you do and now you need to load a whole boat load of historical files. As a fail safe, you can save that data off to compressed Parquet files and create an external table within Azure Synapse where the data can more easily be extracted. Then it can be queried as required or extracted as needed more easily later.
If I need all the data that’s coming to me or even if it’s a reasonable size, I prefer ELT. I will load it in entirety into a staging database and transform there. Why? Because SQL code is what we know, control and read. I want all my processing to happen in one place where there is high visibility, knowledge and access. It can be a lot easier to run, test, and modify a stored procedure than swim into a complicated SSIS package or ADF pipeline. There are now many tools that offer to transform data, but these tools require time and knowledge to get up to speed. SQL has been around for a long time and will continue to be. Developers, testers, and sometimes project managers feel comfortable querying a database. The benefit of keeping your transformations within SQL is the speed at which you can troubleshoot and review your data. When you're the only person who understands a tool, it's going to be much more difficult for you to go on vacation, and I like vacation.
About ten years ago, I switched an ETL process to ELT and it was a game changer for the testers and for me. It significantly decreased the time I spent on data ingestion and I no longer groaned when a request for a new data import file came in. The process involved just adding the metadata about that file to the database and voila my file got picked up by the SSIS package and loaded into the database. I created a shell structure to use a SSIS package to just load flat data, then dynamically transformed that flat data into it's fields and processed it within the database. I never touched the SSIS package again. Personally, I don’t want to spend a lot of time on data movement. I want to spend time helping clients get value from their data. Luckily, the trend is moving towards less time on data movement! The goal in our data movement process is good organization and code transparency. In Part 3, I’ll go over a simple way you can implement a process for dynamically loading flat files into your database so that when you get a new file to import your time is minimal.
Say Hello to my little Email Friend
Yes, I know, we all love to hate it. But email is the work horse of our work communication. I am a strong proponent of only getting emails when I need to take action. A daily email telling me all is good will be immediately given a rule to go into a folder that I might check once or perhaps fewer times a week. Constant emails are a great way to make noise and be ignored. The best scenario is to get an email when there's a problem that I need to either be aware of or take action on. So think deeply about the problems that could be incurred and what that email communication looks like. The more detail, the better. If you need to take action, then if you're able to, the email should state the action that needs to be taken.
So, you’ve got your data ingestion queue set up and your ELT pipeline. You’ve created your stored procedures to process the data and update the queuing mechanism along the way so status of the process can easily be tracked. You’re well on your way. But wait, on day 10, the daily file is missing. No big deal. You knew this might happen (see Part 1 where we discuss actions for the unexpected outcomes). Your system sends a nice email (heck you can even cc the person whose job it is to make sure you have the file) and tells you that the file is missing. And you knew this would happen, so you also have next steps – usually either wait and stop future processes (not ideal) or no worries, we continue with the next file and have the missed one fall in place when it's received. When the data comes, it will be processed as if it came in earlier and all data will be good. But it is important to be aware of the issue. Personally, I love an email that tells me about a problem that I need to be aware of but is self correcting. "The file is missing, just wanted to let you know! We'll skip over it and reload once it's been received!".
Remember that third-party data provider I told you about earlier? They were really great at giving me daily files on time and as expected. But one thing they weren’t great at was being accurate about whether they were modifying a record or adding it. They would send me a modification designation, but I didn’t even have that record. Sometimes you'll come across issues that you just weren't prepared for - fun little surprises. Data ingestion is a dynamic thing, you may need to add modifications and put new processes in place when unexpected data is received. My solution was to modify my processing and do a compare and if it came in as a modification, but didn’t exist, I switched it to an Add (and of course tracked that change historically so I could see the change).
This is just one example of how we handle assumptions that turn out to not be true. Even large data providers miss dropping files. Make sure you’re handling it so that your data ingestion pipeline isn’t waiting on that file AND be aware of it so action can be taken to get that file to you ASAP. Knowledge is important, intervention is not ideal.
Validate, Validate and More Validate
At my last company I had my pipelines set up dynamically. The process would notify through email if there were issues with missing files and so on. But the part that really saved us so many times, were our production validations. These were essentially business rule validations written in SQL that we would run against the data on a nightly basis. If there were issues, an email would be sent, if not – no email would be sent. This allowed us to see when data was coming in either through the front-end, through processing, or from our ELT loads that weren't meeting our business rule expectations. However, it also had the unknown benefit of alerting us when there were code defects that were causing data issues. Our data validations were our daily pulse on our data quality.
But the key was that every error I would get from these HAD to be actionable – even if that meant (most of the time) writing up a defect to get it fixed. I was even known to come back in and comment something out if no action was being taken on that business rule so that it would be removed from the daily email. I don’t want to hear over and over again about something that actually doesn’t require action. These validations caught major issues before our clients did in production and in development. If you just read through our production validation script, you would actually learn a lot about our business rules and what was expected of the data, so it was a great tool to familiarize someone with the database relationship to application. Every time we had a defect that was data related, we would add to it so it was a working validation that was modified over time. Defects love to show up sometimes unannounced again and again. If you’re data ingestion is expected to look a certain way, be a certain data type, mean a certain thing to someone – create a validation for that. You’ll thank yourself later.
Why not just create constraints in the database to check for inconsistencies? You definitely should do this if you're able. However, it's almost impossible to create constraints for every business rule. We had multiple databases that were interrelated and so for many scenarios we just couldn't create those types of rules across databases. Sometimes (and more frequently now with large databases) if rules weren't obeyed from the beginning, data that doesn't follow the rules is sitting there and action can't be taken to fix it because you need that historical information. A production validation should pop in your mind any time someone says - if this data is set to this, then this other data should be this. As you add more and more validations, the cleaner your data will be and you'll also be constantly reinforcing your business rules which are usually the life blood of any application.
So to reiterate some important parts to the technical ingestion:
- Select your data tool based on what your needs are and what the tool offers
- Choose a logical file naming structure that includes a Year/Month/Day format if possible. Your file name should also have a date structure as well. These organizational processes allow you to quickly find the file you're looking for. It's like your file cabinet!
- Define your folder naming and file movement process as the file moves from one location to the next (and make sure to think about files that error or your testing process where you may want to load a file over and over)
- Set up a database structure that will hold the metadata about your process so that you can track what you are expecting, what you've received and whether it's completed successfully. Making sure to track logical file location through the process.
- Figure out your loading indexing strategy and test your loading times with different index strategies. This can also be a great time to determine process timing so it doesn't conflict with other processes.
- Create a table in your database that you use as a queue for your processing, this is your loading command center where it will inform you of what you've completed and what you expect to happen.
- Create an email response system to alert you when there is important information to know or act on. Define what those errors would be.
- Create a validation system for the data after load and in conjunction with your application to make sure your data expectations and business rules are aligned.
In Part 3 of this series we’ll talk about dynamic data ingestion using an ADF pipeline. This is one potential solution in my “Set it and notify” strategy. It will set up the queuing tables and allow me to add the expected fields that I will receive from files. My pipeline will download the file in raw format (no delineation between fields) and process them into their designated fields based on the table where I’ve added my data dictionary IN the database – no need to modify pipelines or packages.