The Intersection of Data Engineering and AI
Why the pipes matter just as much as the model
Everyone wants to talk about the model. The accuracy, the benchmarks, the context window, GPT-4 vs Claude vs whatever launched this week. And I get it - models are exciting. They’re the visible, tangible part of AI.
But here’s something I’ve learned after years of building AI systems at enterprise scale: the model is almost never the bottleneck. The data is.
The unglamorous truth about AI in production
When companies come to me with AI problems, the conversation usually starts with “our model isn’t performing well enough.” Nine times out of ten, by the end of our first session, we’ve figured out it’s actually a data problem.
- The training data is stale and doesn’t reflect current business reality
- The features going into the model are computed inconsistently between training and serving
- The data pipeline has silent failures that nobody catches until production breaks
- The ground truth labels are noisy because they were created by an unreliable process
- There’s no monitoring, so model drift goes undetected for months
A great model on bad data is just a very expensive way to be wrong with confidence.
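That last failure mode, undetected drift, is also one of the cheapest to guard against. A minimal sketch: compare the live feature distribution against the training baseline with a Population Stability Index check on a schedule. The function and thresholds below are illustrative rules of thumb, not a standard API.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training baseline and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 investigate now."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip empty buckets so log(0) doesn't blow up
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)          # stand-in for training features
print(psi(baseline, rng.normal(0, 1, 10_000)))   # same distribution: near 0
print(psi(baseline, rng.normal(1, 1, 10_000)))   # shifted: well above 0.25
```

Run this nightly per feature and alert on the threshold, and "drift goes undetected for months" becomes "drift pages someone tomorrow morning."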
What data engineering actually means in the AI era
Feature stores and training-serving consistency
One of the most common (and costly) mistakes in ML is computing features differently at training time versus inference time. A feature store solves this - it’s a centralized place to define, compute, and serve features consistently.
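Even without a full feature store, the core discipline is simple: each feature is defined exactly once, and both the training job and the serving endpoint import that one definition. A hand-rolled sketch (the function and field names are hypothetical, not a feature-store API):

```python
from datetime import datetime, timezone

# One definition, imported by BOTH the training job and the serving API.
# The moment this logic is copy-pasted into two places, it will drift.
def order_features(orders: list[dict], as_of: datetime) -> dict:
    """Features for one customer, computed as of a point in time."""
    past = [o for o in orders if o["ts"] <= as_of]          # no future leakage
    last_30d = [o for o in past if (as_of - o["ts"]).days < 30]
    return {
        "order_count_30d": len(last_30d),
        "total_spend_30d": sum(o["amount"] for o in last_30d),
        "days_since_last_order": (
            (as_of - max(o["ts"] for o in past)).days if past else None
        ),
    }

feats = order_features(
    [{"ts": datetime(2024, 6, 20, tzinfo=timezone.utc), "amount": 50.0}],
    as_of=datetime(2024, 6, 30, tzinfo=timezone.utc),
)
# {'order_count_30d': 1, 'total_spend_30d': 50.0, 'days_since_last_order': 10}
```

The `as_of` parameter is the important design choice: training replays history with the label's timestamp, serving passes `datetime.now(timezone.utc)`, and both paths run identical code.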
The RAG pipeline as a data engineering problem
If you’re building with LLMs right now, you’re probably building a RAG system. And guess what? A RAG pipeline is just a very fancy data pipeline. You need to ingest documents, chunk them intelligently, embed them, store them in a vector database, and retrieve the right context at query time. That whole system lives in the data layer, not the model layer.
Stripped of framework gloss, the ingest side is almost indistinguishable from any other ETL job:
    for doc in source.stream_changes(since=last_run):        # ingest
        chunks = chunker.split(doc, max_tokens=512)          # transform
        vectors = embedder.embed([c.text for c in chunks])   # enrich
        vectordb.upsert([                                    # load
            {"id": f"{doc.id}:{c.idx}", "values": v, "metadata": c.meta}
            for c, v in zip(chunks, vectors)
        ])
        checkpoint.save(doc.updated_at)
It has sources, schemas, checkpoints, idempotency, monitoring, and late-arriving data. If that reads like a dbt job that also happens to call an embedding API — that’s exactly the point.
What I’ve seen firsthand
At Lululemon, we built a recommendation model that drove a 12% uplift in sales conversions. The model itself was not particularly exotic. What made it work was the pipeline - reliable feature engineering, consistent serving infrastructure, good monitoring. Much of the lift came from feature freshness, which meant treating it as a streaming problem, not a batch one. The data engineering was the product.
If you’re building an AI product and your data layer feels shaky, that’s the thing to fix first. I work with teams on exactly this - architecture reviews, pipeline design, the full picture.