Zero trust data pipeline framework

We have all felt the pain of lost or incorrect data in our batch data pipelines. To address this, I designed a framework for verifying data correctness at every stage of the pipeline. I call it the Zero Trust Data Pipeline (ZTDP), inspired by Zero Trust Networks (ZTN). Unlike ZTN, which focuses on data security, ZTDP focuses solely on data correctness: the confidence that data flowing through the pipeline remains correct, backed by a provable method to demonstrate that correctness.

In this blog post, I want to demonstrate this concept using practical code and a sample data pipeline. For the example, I will use click-rate data.

The framework is divided into three layers:

Data Layer:
The layer in which data exists in its raw format.

Traceability Layer:
In this layer, the data is aggregated, and each subset of the aggregation is assigned a unique hash. The number of rows contributing to each aggregate is tracked continuously.
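As a minimal sketch of what this layer might record, the snippet below groups rows by a key and stores a row count and a content hash for each group. The function name `trace_aggregation`, the `ad_id` key, and the sample click rows are assumptions made for illustration; they are not part of the framework's definition.

```python
import hashlib
import json
from collections import defaultdict


def trace_aggregation(rows, key):
    """Group rows by `key`; return a row count and content hash per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    trace = {}
    for group_key, members in groups.items():
        # Serialize each row canonically, then sort, so the hash does not
        # depend on the order in which rows arrived.
        canonical = sorted(json.dumps(r, sort_keys=True) for r in members)
        digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
        trace[group_key] = {"rows": len(members), "hash": digest}
    return trace


# Hypothetical click-rate rows, for illustration only.
clicks = [
    {"ad_id": "a1", "user": "u1", "clicked": 1},
    {"ad_id": "a1", "user": "u2", "clicked": 0},
    {"ad_id": "a2", "user": "u3", "clicked": 1},
]

trace = trace_aggregation(clicks, key="ad_id")
for ad_id, info in trace.items():
    print(ad_id, info["rows"], info["hash"][:12])
```

Because the hash is computed over sorted, canonically serialized rows, re-running the same aggregation on the same input should reproduce the same hash, while a missing or altered row changes it.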

One application is anomaly detection:

| Layer | Rows In | Rows Out | % Retained | Status | Action |
|---|---|---|---|---|---|
| Layer 1 | 100 | 100 | 100% | ✅ PASS | `hash=abc123` |
| Layer 2 | 100 | 10 | 10% | ❌ ANOMALY | 🚨 INVESTIGATE |
| Layer 3 | 10 | 10 | 100% | ✅ PASS | `hash=xyz789` |

A few other use cases:

| # | Use Case | Layer 1 | Layer 2 | % Retained | 🚨 Alert |
|---|---|---|---|---|---|
| 1 | Aggregation Drop | 100 rows | 10 rows | 10% | `90% data lost!` |
| 2 | Join Explosion | 1K rows | 10K rows | 1,000% | `Unexpected duplication!` |
| 3 | Filter Drift | 500 rows | 450 rows | 90% | `10% silently dropped daily` |
| 4 | ML Feature Drop | 10K rows | 8K rows | 80% | `2K features discarded!` |
| 5 | Deduplication Fail | 2K rows | 1.9K rows | 95% | `100 duplicates slipped through` |
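One way to automate these checks is to have each pipeline step declare the retention range it expects and flag anything outside it. The sketch below is illustrative only: the `StepCheck` class, the step names, and the retention bands are assumptions, and the row counts are borrowed from the examples above.

```python
from dataclasses import dataclass


@dataclass
class StepCheck:
    name: str
    rows_in: int
    rows_out: int
    min_retained: float  # lowest acceptable rows_out / rows_in
    max_retained: float  # highest acceptable rows_out / rows_in

    @property
    def retained(self) -> float:
        # Treat an empty input as 0% retained rather than dividing by zero.
        return self.rows_out / self.rows_in if self.rows_in else 0.0

    @property
    def status(self) -> str:
        if self.retained < self.min_retained:
            return "ANOMALY: unexpected drop"
        if self.retained > self.max_retained:
            return "ANOMALY: unexpected growth"
        return "PASS"


# Bands below are illustrative assumptions, not values from a real pipeline.
checks = [
    StepCheck("passthrough", 100, 100, 1.0, 1.0),
    StepCheck("join", 1_000, 10_000, 0.9, 1.1),
    StepCheck("filter", 500, 450, 0.95, 1.0),
]

for check in checks:
    print(f"{check.name}: {check.retained:.0%} retained -> {check.status}")
```

Checking an upper bound as well as a lower bound matters here: a join explosion adds rows rather than losing them, so a check that only looks for drops would miss it.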

Diagnosis Layer:
This layer ensures that the subset metrics can be tracked both within and across layers to measure data correctness throughout the pipeline.

For example, when a layer shows a 90% data drop, clicking it brings up histograms and a business-rule breakdown:

| Layer | Rows In | Rows Out | % Retained | Diagnosis |
|---|---|---|---|---|
| Layer 1 | 100 | 100 | 100% | `Normal` |
| Layer 2 | 100 | 10 | 10% | 🚨 90% DROP |

Clicking Layer 2 opens the automatic analysis:

| Amount Range | Rows | Status |
|---|---|---|
| $0-$50 | ██████████ 45 | Dropped (<$75) |
| $50-$100 | ████ 25 | Dropped |
| $100+ | ███ 20 | PASSED ✓ |

| Business Rule | Rows Failed |
|---|---|
| `amount < $75` | 70 |
| `invalid category` | 20 |
| Clean data | 10 ✓ |
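A rough sketch of how such a breakdown could be produced: attribute each row to the first business rule that rejects it, and bucket rows by amount for the histogram. The rule predicates, the set of valid categories, the bucket edges, and the sample rows are all hypothetical. The sample is built so the rule counts match the 70 / 20 / 10 split above.

```python
from collections import Counter

# Hypothetical business rules, checked in order; names mirror the table above.
VALID_CATEGORIES = {"ad", "organic"}  # assumed for illustration
RULES = [
    ("amount < $75", lambda r: r["amount"] < 75),
    ("invalid category", lambda r: r["category"] not in VALID_CATEGORIES),
]


def diagnose(rows):
    """Attribute every row to the first rule that rejects it, else 'clean'."""
    outcome = Counter()
    for row in rows:
        reason = next((name for name, rejects in RULES if rejects(row)), "clean")
        outcome[reason] += 1
    return outcome


def amount_histogram(rows, edges=(0, 50, 100)):
    """Count rows per amount bucket, keyed by the bucket's lower edge.

    Assumes non-negative amounts; the last bucket is open-ended.
    """
    counts = Counter()
    for row in rows:
        lower = max(e for e in edges if e <= row["amount"])
        counts[lower] += 1
    return counts


# Hypothetical Layer 2 input: 100 rows, of which 10 survive both rules.
rows = (
    [{"amount": 30, "category": "ad"}] * 45
    + [{"amount": 60, "category": "ad"}] * 25
    + [{"amount": 120, "category": "unknown"}] * 20
    + [{"amount": 120, "category": "ad"}] * 10
)

print(dict(diagnose(rows)))
print(dict(amount_histogram(rows)))
```

Attributing each row to exactly one reason keeps the counts additive, so the failed rows plus the clean rows always sum to the layer's input row count.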

This framework provides a deep view of how data moves through the entire pipeline. It encourages us to stop assuming that our data pipeline always works perfectly and instead maintain layered traceability to detect and understand errors. That is the main idea behind this post: introducing the concept of a “zero-trust” data pipeline.

One challenge with this approach is the additional investment in parallel compute and storage needed to capture and store the metadata. However, when weighing that cost against data accuracy, I believe correctness and traceability should take priority.