Test Your Data Engineering Software
Some testing is obvious. Other types of testing, not so much.
Let’s talk about types of testing applied to data engineering software.
I can hear some of you thinking, “software, Andy?”
Yes.
Data engineering development is software development.
Pipeline Succeeds
First, “succeeds” testing need not apply only to pipelines. It can just as readily apply to a notebook, package, or script. Any artifact that executes and returns an execution status will suffice.
The idea of “succeeds” testing is straightforward. The pipeline (notebook, package, script, etc.) executes and the execution status returned is, well, “success.”
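In Python, a “succeeds” test can be little more than a single assertion. Here is a minimal sketch, assuming a hypothetical run_pipeline helper that executes an artifact and returns its execution status:

```python
# Minimal "succeeds" test sketch. run_pipeline() is a hypothetical helper
# that executes the artifact (pipeline, notebook, package, or script)
# and returns an object exposing its execution status.
def test_pipeline_succeeds():
    result = run_pipeline("load_customers")  # hypothetical helper
    # The only assertion: the execution reported success.
    assert result.status == "Succeeded", f"Execution status: {result.status}"
```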
Early in development, having an artifact execution succeed is a noble goal, but it’s certainly not the final goal - and successful execution may give a data engineer the wrong impression.
Successful data engineering executions require more examination.
Pipeline Succeeds
“Wait a minute, Andy. Isn’t the previous heading also ‘Pipeline Succeeds?’”
Yes. Yes it is.
I repeat the heading because pipeline success is not good enough. The pipeline must succeed and accomplish its purpose. Not failing is actually not better than failing.
Why?
Because if the pipeline execution fails, you know something is amiss.
What happens when the pipeline execution succeeds and something is amiss?
You don’t know it.
That is, when the pipeline execution succeeds, you don’t know whether the pipeline achieved its design goal unless you test for the expected results.
An Example of Results Testing
One way to test the results of a pipeline execution is to count source rows, execute the pipeline, and then count the target rows. If the source row count matches the target row count, odds¹ are the pipeline loaded the target with all the source rows.
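Here’s what that might look like in Python - a sketch only, with get_row_count() and run_pipeline() standing in for whatever your platform actually provides:

```python
# Sketch of a row-count results test. get_row_count() and run_pipeline()
# are hypothetical helpers; table names are illustrative.
def test_row_counts_match():
    source_count = get_row_count("source.Customers")
    run_pipeline("load_customers")
    target_count = get_row_count("target.Customers")
    assert source_count == target_count, (
        f"Expected {source_count} target rows, found {target_count}"
    )
```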
Row counting becomes trickier if the pipeline transforms the data - especially if one effect of the transformations is a different number of records out for each record in. Some transformations eliminate rows that don’t meet certain criteria, thereby decreasing the number of target rows; other transformations create new rows, thereby increasing the number of target rows.
I can hear some of you thinking, “How in the world does one detect the correct number of decreased rows, Andy?”
“It depends.”
Value Hashes
One approach to testing data engineering that (potentially) changes the number of output rows compared to input rows is to (carefully) design hashes that detect whether the correct rows are either making it to the target or being eliminated.
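Here’s a sketch of the idea in Python. Hashing (illustrative) business-key columns lets you identify which specific rows arrived or were eliminated, rather than merely counting them:

```python
import hashlib

# Sketch: hash selected columns of each row so individual rows can be
# matched between source and target. Column names are illustrative.
def row_hash(row: dict, columns: list[str]) -> str:
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def missing_rows(source_rows, target_rows, columns):
    source_hashes = {row_hash(r, columns) for r in source_rows}
    target_hashes = {row_hash(r, columns) for r in target_rows}
    return source_hashes - target_hashes  # hashes of rows that never arrived

# Usage: missing_rows(source, target, ["CustomerID", "OrderDate"])
```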
Use Cases
Crafting test data is another way to test complex transformations. A good TRA (Transformation Rules Analyst) will document transformation requirements with as much granularity as possible. I really enjoy working with gifted TRAs!
Well-documented transformation requirements lend themselves to crafting test rows that change one and only one transformation rule between rows. The goal is to produce test rows that each fall into one of four categories: pass, don’t pass, don’t know, don’t care. (For more information on these states, please see my post titled Some Logic: The Four States of Two-State Logic.)
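Here’s an illustrative Python sketch of test rows crafted this way. The rule (“age must be at least 18”) and the column names are made up for the example:

```python
# Sketch: each test row varies one and only one aspect of a single
# (illustrative) transformation rule and carries its expected state.
def classify(row):
    """Classify a row against the rule into one of four states."""
    if "age" not in row:
        return "don't care"   # the rule does not apply to this row
    if row["age"] is None:
        return "don't know"   # the rule cannot be evaluated
    return "pass" if row["age"] >= 18 else "don't pass"

test_rows = [
    ({"age": 18}, "pass"),               # meets the rule exactly (boundary)
    ({"age": 17}, "don't pass"),         # misses the rule by one
    ({"age": None}, "don't know"),       # value missing, outcome unknowable
    ({"name": "legacy"}, "don't care"),  # column absent, rule irrelevant
]

for row, expected in test_rows:
    assert classify(row) == expected, f"{row}: expected {expected}"
```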
Consider the preceding paragraphs a partial introduction to what can be a complex topic. Complex testing often requires testing the tests (meta-testing)! In some scenarios, test development can become a sub-project of the data project.
Conclusion
Software testing is a software development best practice.
And data engineering is software development.
I welcome your thoughts!
¹ In my experience, the Pareto Principle (“the 80-20 rule”) holds for much in data engineering. I mentioned this in a post titled Enterprise Data Warehouse Maintenance Costs More Than Development.
We have tests that verify errors (in data or in pipeline code) are correctly trapped.
We also have detection and alarms if our pipelines don't run when they are supposed to or run for abnormal durations (too long or too short).
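As an illustration, a staleness-and-duration check along those lines might look like this Python sketch, with get_last_run() standing in for whatever your platform’s monitoring API provides and thresholds chosen per pipeline:

```python
from datetime import datetime, timedelta

# Sketch: detect pipelines that did not run when scheduled or ran for an
# abnormal duration. get_last_run() is a hypothetical helper returning
# the last run's start time and duration; thresholds are illustrative.
def check_pipeline_health(name, max_staleness, min_duration, max_duration):
    start, duration = get_last_run(name)  # hypothetical helper
    alerts = []
    age = datetime.utcnow() - start
    if age > max_staleness:
        alerts.append(f"{name}: last run started {age} ago")
    if not (min_duration <= duration <= max_duration):
        alerts.append(f"{name}: ran for {duration}, outside the expected range")
    return alerts

# Example: a daily pipeline that normally runs for 10-90 minutes.
alerts = check_pipeline_health(
    "load_customers",
    max_staleness=timedelta(hours=25),
    min_duration=timedelta(minutes=10),
    max_duration=timedelta(minutes=90),
)
```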
As part of a continuous improvement ethos, when problems occur that are not yet covered by tests, we always ask whether testing could have prevented the problem, whether testing is currently possible, and whether a different approach would make it testable.