The Intersection of Data Engineering and AI
Why the pipes matter just as much as the model
Everyone wants to talk about the model. The accuracy, the benchmarks, the context window, GPT-4 vs Claude vs whatever launched this week. And I get it - models are exciting. They’re the visible, tangible part of AI.
But here’s something I’ve learned after years of building AI systems at enterprise scale: the model is almost never the bottleneck. The data is.
The unglamorous truth about AI in production
When companies come to me with AI problems, the conversation usually starts with “our model isn’t performing well enough.” Nine times out of ten, by the end of our first session, we’ve figured out it’s actually a data problem.
- The training data is stale and doesn’t reflect current business reality
- The features going into the model are computed inconsistently between training and serving
- The data pipeline has silent failures that nobody catches until production breaks
- The ground truth labels are noisy because they were created by an unreliable process
- There’s no monitoring, so model drift goes undetected for months
A great model on bad data is just a very expensive way to be wrong with confidence.
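That last failure mode, undetected drift, is also one of the cheapest to guard against. A minimal sketch: compare the live feature distribution against the training baseline with a Population Stability Index check on a schedule. The function and thresholds below are illustrative rules of thumb, not a standard API.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training baseline and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 investigate now."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip empty buckets so log(0) doesn't blow up
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)          # stand-in for training features
print(psi(baseline, rng.normal(0, 1, 10_000)))   # same distribution: near 0
print(psi(baseline, rng.normal(1, 1, 10_000)))   # shifted: well above 0.25
```

Run this nightly per feature and alert on the threshold, and "drift goes undetected for months" becomes "drift pages someone tomorrow morning."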
What data engineering actually means in the AI era
Feature stores and training-serving consistency
One of the most common (and costly) mistakes in ML is computing features differently at training time versus inference time. A feature store solves this - it’s a centralized place to define, compute, and serve features consistently.
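Even without a full feature store, the core discipline is simple: each feature is defined exactly once, and both the training job and the serving endpoint import that one definition. A hand-rolled sketch (the function and field names are hypothetical, not a feature-store API):

```python
from datetime import datetime, timezone

# One definition, imported by BOTH the training job and the serving API.
# The moment this logic is copy-pasted into two places, it will drift.
def order_features(orders: list[dict], as_of: datetime) -> dict:
    """Features for one customer, computed as of a point in time."""
    past = [o for o in orders if o["ts"] <= as_of]          # no future leakage
    last_30d = [o for o in past if (as_of - o["ts"]).days < 30]
    return {
        "order_count_30d": len(last_30d),
        "total_spend_30d": sum(o["amount"] for o in last_30d),
        "days_since_last_order": (
            (as_of - max(o["ts"] for o in past)).days if past else None
        ),
    }

feats = order_features(
    [{"ts": datetime(2024, 6, 20, tzinfo=timezone.utc), "amount": 50.0}],
    as_of=datetime(2024, 6, 30, tzinfo=timezone.utc),
)
# {'order_count_30d': 1, 'total_spend_30d': 50.0, 'days_since_last_order': 10}
```

The `as_of` parameter is the important design choice: training replays history with the label's timestamp, serving passes `datetime.now(timezone.utc)`, and both paths run identical code.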
The RAG pipeline as a data engineering problem
If you’re building with LLMs right now, you’re probably building a RAG system. And guess what? A RAG pipeline is just a very fancy data pipeline. You need to ingest documents, chunk them intelligently, embed them, store them in a vector database, and retrieve the right context at query time. That whole system lives in the data layer, not the model layer.
Stripped of framework gloss, the ingest side is almost indistinguishable from any other ETL job:
    for doc in source.stream_changes(since=last_run):        # ingest
        chunks = chunker.split(doc, max_tokens=512)          # transform
        vectors = embedder.embed([c.text for c in chunks])   # enrich
        vectordb.upsert([                                    # load
            {"id": f"{doc.id}:{c.idx}", "values": v, "metadata": c.meta}
            for c, v in zip(chunks, vectors)
        ])
        checkpoint.save(doc.updated_at)
It has sources, schemas, checkpoints, idempotency, monitoring, and late-arriving data. If that reads like a dbt job that also happens to call an embedding API — that’s exactly the point.
What I’ve seen firsthand
At Lululemon, we built a recommendation model that drove a 12% uplift in sales conversions. The model itself was not particularly exotic. What made it work was the pipeline - reliable feature engineering, consistent serving infrastructure, good monitoring. Much of the lift came from feature freshness, which meant treating it as a streaming problem, not a batch one. The data engineering was the product.
If you’re building an AI product and your data layer feels shaky, that’s the thing to fix first. I work with teams on exactly this - architecture reviews, pipeline design, the full picture.