The majority of what I do is in the cloud, and most of what is exposed to me as a data source lives in some form of cloud bucket.
What I have started to do is to set up folders that mimic the buckets I am supposed to use. For example my project will have two main folders:
- Src (or something similar)
- Tests
Within Tests there are usually subfolders with code artefacts that express the tests. There is also a subfolder called data, and within tests/data/ there are further subfolders, one of which is Buckets.
Buckets has a subfolder named for each actual bucket used by the data pipeline. Within those folders is a replica of the folder structure and files that the pipeline expects to find in the real bucket.
When a test run is instantiated I use a mocking framework to create a local mock of each bucket, which then uploads the files in the bucket folders exactly as the data pipeline would in the real world. For testing a pipeline on AWS the mocking framework is called moto, and the AWS SDK library is boto3.
There are obvious size constraints imposed by both GitHub and the mocking framework, and not all AWS features are supported, but for the most part I can test locally.
By adopting a naming convention I can keep my tests simple. Let us suppose I have four files (the folder names here are illustrative):
- SourceData/01_sales_data.csv - the input consumed by the pipeline
- Sql/01_sales_data.sql - the transformation under test
- ExpectedData/01_sales_data.csv - the curated expected output
- ActualData/01_sales_data.csv - generated by the test
If my test is written to read a source file, run the equivalent SQL script file and compare the expected and actual data, then to add tests I just need to drop the required files where they should be. This means that people whose skills do not yet include writing test code can still contribute to the body of tests.
If testing cannot be done on my workstation, or relies on a carefully curated shared set of data, then that test data can be stored in specific cloud buckets for whatever heavyweight testing is required.