Modern Development

Question

Steve Jones - SSC Editor

SSC Guru

Points: 736078
October 11, 2023 at 12:00 am

#4296437

Comments posted to this topic are about the item Modern Development

David.Poole SSC Guru Points: 75994 · Answer 1

The majority of what I do is in the cloud. A substantial amount of what is exposed to me as data sources sits in some form of cloud bucket.

What I have started to do is to set up folders that mimic the buckets I am supposed to use. For example, my project will have two main folders:

Src (or something similar)
Tests

Within Tests there are usually subfolders containing the code artefacts that express the tests, plus a subfolder called data. Within Tests/data/ there will be these subfolders:

Buckets
ConfigData
SQLScripts
ExpectedData
ActualData

"Buckets" will have subfolders named after the actual buckets used by the data pipeline. Each of those folders holds a replica of the folder structure and files that the real bucket will contain.

When a test run starts, I use a mocking framework to create a local mock of each bucket and then upload the files from the bucket folders to it, so the data pipeline sees the same layout it would see in the real world. For testing a pipeline on AWS, the mocking framework is called moto and the AWS SDK library for Python is called boto (boto3 in its current version).
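A minimal sketch of that setup, assuming Python with boto3 and moto 5 or later (where the context manager is `mock_aws`; earlier moto releases expose `mock_s3` instead). The function names and the `tests/data/Buckets` default path are illustrative and not taken from my actual project.

```python
from pathlib import Path


def upload_bucket_folders(s3_client, buckets_root):
    """Create one bucket per subfolder of buckets_root and upload its files.

    Returns the list of (bucket, key) pairs that were uploaded.
    """
    uploaded = []
    for bucket_dir in sorted(Path(buckets_root).iterdir()):
        if not bucket_dir.is_dir():
            continue
        s3_client.create_bucket(Bucket=bucket_dir.name)
        for path in sorted(bucket_dir.rglob("*")):
            if path.is_file():
                # The object key mirrors the path below the bucket folder.
                key = path.relative_to(bucket_dir).as_posix()
                s3_client.upload_file(str(path), bucket_dir.name, key)
                uploaded.append((bucket_dir.name, key))
    return uploaded


def run_against_mock(buckets_root="tests/data/Buckets"):
    """Assumed usage; requires boto3 and moto to be installed."""
    import boto3
    from moto import mock_aws  # moto >= 5; older versions: mock_s3

    with mock_aws():
        # us-east-1 avoids needing a CreateBucketConfiguration argument.
        client = boto3.client("s3", region_name="us-east-1")
        uploaded = upload_bucket_folders(client, buckets_root)
        # The pipeline code under test would run here, while the mock is active.
        return uploaded
```

Passing the client in as a parameter keeps the upload logic independent of moto, so the same function could also seed a real test bucket.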

There are obvious size constraints posed by both GitHub and the mocking framework, and not all AWS features are supported. For the most part, though, I can test locally.

By adopting a naming convention I can keep my tests simple. Let us suppose I have four files:

Buckets/mybucket/my_prefix/01_sales_data.parquet
SQLScripts/01_sales_data.sql
ExpectedData/01_sales_data.csv
ActualData/01_sales_data.csv - generated by the test

If my test is written to read a source file, run the equivalent SQL script and compare the expected and actual data, then to add a test I just need to drop the required files where they belong. This allows people whose skills do not yet include writing test code to contribute to the body of tests.
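The convention can be sketched roughly as follows, using only the Python standard library. The helper names are hypothetical, and `run_sql` stands in for whatever executes a script and writes its result set to a CSV file; locating the matching source file under Buckets is left out because the prefix depends on the bucket layout.

```python
import csv
from pathlib import Path


def discover_cases(data_root):
    """Pair each SQL script with its expected-data file by shared file name.

    Returns (name, sql_path, expected_path, actual_path) tuples. Scripts
    without a matching expected file are skipped.
    """
    data_root = Path(data_root)
    cases = []
    for sql_path in sorted((data_root / "SQLScripts").glob("*.sql")):
        name = sql_path.stem
        expected = data_root / "ExpectedData" / f"{name}.csv"
        actual = data_root / "ActualData" / f"{name}.csv"
        if expected.exists():
            cases.append((name, sql_path, expected, actual))
    return cases


def read_rows(csv_path):
    """Load a CSV file as a list of rows for comparison."""
    with open(csv_path, newline="") as handle:
        return list(csv.reader(handle))


def run_case(case, run_sql):
    """Run one case and report whether actual data equals expected data.

    run_sql(sql_path, actual_path) is an assumed callable that executes the
    script and writes the result to actual_path as CSV.
    """
    _name, sql_path, expected, actual = case
    actual.parent.mkdir(parents=True, exist_ok=True)
    run_sql(sql_path, actual)
    return read_rows(expected) == read_rows(actual)
```

With pytest, the list from `discover_cases` could feed `pytest.mark.parametrize`, so each dropped-in set of files shows up as its own test without anyone touching the test code.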

If a test cannot be run on my workstation, or it relies on a carefully curated shared set of data, then that test data can be stored in specific cloud buckets for whatever heavyweight testing is required.

Rod at work SSC-Dedicated Points: 34003 · Answer 2

I really like your suggestion of putting the data used for creating a test database into SQL scripts with INSERT statements! That does make it possible to put it into source control. I'll have to try it when next I get an opportunity.

Kindest Regards, Rod