Glossary

Data pipeline

Part of our topic guide on Data Engineering.

A data pipeline is the set of components — code, infrastructure, and scheduling — that moves data from operational source systems into an analytics-ready destination, cleaning and reshaping it along the way. Anywhere an organisation answers questions with data, there is a pipeline behind the answer; if there isn't, someone is moving data by hand.

In practice a pipeline encodes a contract: this data, from these sources, will land in this destination, on this schedule, in this shape, with these quality guarantees. When teams talk about data engineering as a discipline, pipelines are the unit of work engineers spend most of their time on.
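One way to make that contract concrete is to write it down as data and check each load against it. The sketch below is illustrative only: `PipelineContract`, its fields, and the sample values are hypothetical names chosen for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PipelineContract:
    """Hypothetical description of what a pipeline promises its consumers."""
    sources: tuple[str, ...]
    destination: str
    max_staleness: timedelta      # the agreed freshness window
    columns: dict[str, type]      # expected shape: column name -> Python type
    min_rows: int = 1             # a deliberately simple quality guarantee

    def violations(self, rows, loaded_at, now):
        """Return human-readable contract violations; an empty list means OK."""
        problems = []
        if now - loaded_at > self.max_staleness:
            problems.append("stale: last load is outside the freshness window")
        if len(rows) < self.min_rows:
            problems.append(f"too few rows: {len(rows)} < {self.min_rows}")
        for i, row in enumerate(rows):
            if set(row) != set(self.columns):
                problems.append(f"row {i}: schema drift, got columns {sorted(row)}")
                continue
            for name, expected in self.columns.items():
                if not isinstance(row[name], expected):
                    problems.append(f"row {i}: {name} is not {expected.__name__}")
        return problems


# Assumed example values: one CRM source feeding a customers table every six hours.
contract = PipelineContract(
    sources=("crm",),
    destination="analytics.customers",
    max_staleness=timedelta(hours=6),
    columns={"id": int, "email": str},
)
now = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
good_rows = [{"id": 1, "email": "a@example.com"}]
ok = contract.violations(good_rows, loaded_at=now - timedelta(hours=1), now=now)
```

Real estates usually express the same idea through schema tests and freshness checks in their tooling; the point here is only that each clause of the contract can be checked by code.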

What a data pipeline does

A working pipeline takes responsibility for five things:

  • Extract raw data from source systems — CRM, billing, product database, third-party APIs, event streams.
  • Move that data into a staging area in the analytics warehouse or lakehouse without losing rows or scrambling types.
  • Transform the staged data into well-named, well-modelled tables that downstream consumers can use directly.
  • Schedule or stream the work so that the destination is always within an agreed freshness window.
  • Observe itself — surface broken sources, missing rows, schema drift, and runtime failures before downstream dashboards lie.

The first three of those steps used to run in ETL order (extract, transform, then load), with transforms happening on a separate compute layer before the data reached the warehouse. Modern data stacks usually run them in ELT order, the order shown in the list above: extract, load into the warehouse, then transform in place. The shift happened because cloud warehouses are now fast enough to handle the transform step themselves.
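The ELT order can be sketched in a few lines. This is a minimal illustration, with an in-memory SQLite database standing in for the warehouse; the table names (`stg_invoices`, `customer_revenue`) and the sample rows are assumptions made up for the example.

```python
import sqlite3

# Extract: stand-in for rows pulled from a source system (hypothetical billing data).
raw_invoices = [
    ("inv-1", "  Acme Ltd ", "120.50"),
    ("inv-2", "Globex", "80.00"),
    ("inv-3", "  Acme Ltd ", "19.50"),
]


def run_elt(conn, rows):
    # Load: land the raw rows untouched in a staging table, typed as received.
    conn.execute("DROP TABLE IF EXISTS stg_invoices")
    conn.execute(
        "CREATE TABLE stg_invoices (invoice_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO stg_invoices VALUES (?, ?, ?)", rows)

    # Transform: clean and reshape inside the warehouse, using its own SQL engine.
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT TRIM(customer) AS customer_name,
               SUM(CAST(amount AS REAL)) AS total_revenue
        FROM stg_invoices
        GROUP BY TRIM(customer)
        """
    )
    conn.commit()
    return conn.execute(
        "SELECT customer_name, total_revenue FROM customer_revenue "
        "ORDER BY customer_name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
result = run_elt(conn, raw_invoices)
print(result)
```

In ETL order the trimming and casting would instead happen in application code before the insert, so the warehouse would never see the raw values.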

Batch vs streaming pipelines

The most important architectural decision in designing a pipeline is whether to run in batches on a schedule or to stream continuously:

  • A batch pipeline runs on a fixed cadence — hourly, daily, weekly — and reprocesses a defined slice of data each run. It is the right default for analytics use cases where freshness within a few hours is acceptable and idempotency matters more than latency.
  • A streaming pipeline processes each event as it arrives. It is the right choice when stale data is actively wrong — fraud scoring, operational alerting, real-time personalisation — but the engineering cost (event ordering, exactly-once semantics, backfills) is higher.

Most production estates run a mix. A common pattern is a batch pipeline for the bulk of analytics tables, with a streaming pipeline alongside for the handful of metrics that genuinely need second-level freshness.
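Idempotency in a batch pipeline usually means that re-running the same slice leaves the destination unchanged. A minimal sketch of one common approach, replacing a whole date partition inside a single transaction, is shown below; SQLite stands in for the warehouse, and `daily_events` and `load_daily_slice` are hypothetical names.

```python
import sqlite3


def load_daily_slice(conn, day, events):
    """Replace one day's partition so re-running that day never duplicates rows."""
    with conn:  # one transaction: the delete and inserts commit or roll back together
        conn.execute("DELETE FROM daily_events WHERE event_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_events (event_date, user_id, amount) VALUES (?, ?, ?)",
            [(day, user_id, amount) for user_id, amount in events],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_events (event_date TEXT, user_id TEXT, amount REAL)")

events = [("u1", 9.99), ("u2", 4.50)]
load_daily_slice(conn, "2024-01-01", events)
load_daily_slice(conn, "2024-01-01", events)  # a retry or backfill of the same slice

count = conn.execute("SELECT COUNT(*) FROM daily_events").fetchone()[0]
print(count)  # 2, not 4
```

A plain append would have produced four rows after the retry. Streaming pipelines need to reach the same guarantee per event rather than per slice, which is part of why they cost more to build.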

Pipeline orchestration

Pipelines don't run themselves. An orchestrator is the layer that knows when each step should fire, what its dependencies are, what to do on failure, and where to send alerts when something breaks. Common orchestrators include Airflow, Dagster, Prefect, dbt Cloud, and the orchestration features built into cloud warehouses themselves.

The job of the orchestrator is to make pipelines feel like production services — versioned, observable, recoverable, deployable through CI/CD — rather than a folder of cron-triggered scripts.
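The core of what an orchestrator does (dependency order, retries, alerting) can be sketched with the standard library. This is a toy illustration under simplifying assumptions, not how Airflow, Dagster, or Prefect are implemented; `run_pipeline`, the task names, and the retry policy are invented for the example.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2, alert=print):
    """Run tasks in dependency order; retry failures, then alert and stop.

    tasks:        task name -> zero-argument callable
    dependencies: task name -> set of task names that must finish first
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception as exc:
                if attempt > max_retries:
                    alert(f"{name} failed after {attempt} attempts: {exc}")
                    return completed  # downstream tasks are skipped
    return completed


# Hypothetical three-step pipeline whose load step fails once before succeeding.
attempts = {"load": 0}


def flaky_load():
    attempts["load"] += 1
    if attempts["load"] < 2:
        raise RuntimeError("warehouse timeout")


tasks = {"extract": lambda: None, "load": flaky_load, "transform": lambda: None}
dependencies = {"extract": set(), "load": {"extract"}, "transform": {"load"}}
order = run_pipeline(tasks, dependencies)
print(order)
```

Production orchestrators add what this sketch leaves out: schedules, persisted run history, parallel execution, backfills, and deployment through CI/CD.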

Where pipelines fit in the modern data stack

Pipelines are the moving parts in the broader data engineering discipline. They sit between operational source systems and the BI / ML / AI tools that consume curated tables. Done well, they are invisible to downstream users: dashboards just refresh, models just train, and AI applications just get the inputs they need. See our longer guide on what data engineering is for how the role of pipeline-builder fits alongside warehouse design, data quality, and observability.

Building pipeline capability

In our experience training apprentice data engineers, pipeline work is where new engineers learn fastest — the discipline forces them to think about idempotency, schema management, observability, and the contract with downstream consumers all at once. It is also where the gap between someone who has written tutorial pipelines and someone who has shipped real production ones is most visible.

If you're building data engineering capability in-house, our Level 5 Data Engineering apprenticeship puts apprentices on real pipelines from week one — supervised but not insulated.