Real-Time Data Streaming: Why Batch Is No Longer Enough
A deep dive into when streaming matters, when it does not, and what it takes to build pipelines that move at the speed of your business
For most of the history of data engineering, batch processing was the default. You collect data throughout the day, run a job at night, and wake up to yesterday’s insights. That model powered analytics for decades. It still works for a lot of things.
But the expectations of modern products have shifted. When a user abandons a cart, the personalization engine should know immediately - not eight hours later. When a fraud signal appears in a transaction, the decision system needs to act in milliseconds - not in the next batch run.
What real-time streaming actually means
Real-time streaming doesn’t mean instant. It means low latency. Data flows continuously from source to destination - events are processed as they happen, not collected and processed later in bulk.
Batch processing is like reading yesterday’s newspaper. Streaming is like watching the news as it happens. Both are useful. The question is which one your business actually needs.
When streaming genuinely changes the business
Fraud detection and risk systems
Fraudulent transactions need to be caught before they complete, not in a report the next morning. Real-time streaming lets you build decision systems that evaluate every transaction against a model trained on live behavioral data.
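One concrete building block of such a system is a stateful, per-entity check over a sliding window. The sketch below is a toy illustration of that idea - a transaction-velocity rule - not a real fraud model; the class name, thresholds, and in-memory state are all hypothetical, and a production system would keep this state in a low-latency store.

```python
from collections import defaultdict, deque

class VelocityChecker:
    """Flags a card when it exceeds max_txns within a sliding window_s seconds.

    A toy, in-memory illustration of the kind of stateful per-entity
    check a streaming fraud pipeline evaluates on every event.
    """
    def __init__(self, max_txns=3, window_s=60):
        self.max_txns = max_txns
        self.window_s = window_s
        self.history = defaultdict(deque)  # card_id -> recent timestamps

    def is_suspicious(self, card_id, ts):
        q = self.history[card_id]
        q.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while q and ts - q[0] > self.window_s:
            q.popleft()
        return len(q) > self.max_txns

checker = VelocityChecker(max_txns=3, window_s=60)
flags = [checker.is_suspicious("card-1", t) for t in (0, 10, 20, 30, 40)]
# The fourth and fifth transactions within 60s exceed the limit of 3.
```

The point is that the check runs inline with the event, against state that is seconds old - something a nightly batch job structurally cannot do.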
Personalization at scale
The recommendation model I worked on at Lululemon gave us a 12% uplift in sales conversions. A meaningful part of that came from freshness - the model could act on what a user had done in the current session, not just on their historical behavior. That recency required streaming infrastructure. In a companion piece, I wrote about why the data layer, not the model, is almost always the bottleneck.
The shape of a streaming consumer is not exotic — it’s a loop with careful failure handling:
# Pseudocode for a resilient Kafka-style consumer.
for event in consumer.poll(topic="cart_events", group_id="personalizer"):
    try:
        features = enrich(event)       # join with user profile, etc.
        score = model.score(features)  # low-latency inference
        publish("personalization_scores", {"user": event.user_id, "score": score})
        consumer.commit(event.offset)  # only after successful publish
    except TransientError:
        continue                       # will be redelivered
    except PoisonPillError as e:
        dead_letter.send(event, reason=str(e))
        consumer.commit(event.offset)  # skip forward, don't block the stream
Notice what is not glamorous about this: commit placement, dead-letter queues, and what exactly “at-least-once” means for a downstream consumer that might act twice on the same event. That operational surface is the real cost of streaming.
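Because the commit happens after the publish, a crash in between means the event is redelivered and published twice - that is at-least-once in practice. The usual answer is to make the downstream handler idempotent. Here is a minimal sketch of that pattern, assuming a hypothetical event shape with an `id` field; in production the seen-ID set would live in an external store with expiry, not process memory.

```python
class IdempotentApplier:
    """Applies each event's side effect at most once by tracking event IDs.

    A minimal sketch of how a downstream consumer tolerates the
    redelivery that at-least-once semantics implies.
    """
    def __init__(self):
        self.seen = set()
        self.balance = 0

    def apply(self, event):
        if event["id"] in self.seen:
            return False  # duplicate delivery; effect already applied
        self.seen.add(event["id"])
        self.balance += event["amount"]
        return True

applier = IdempotentApplier()
events = [
    {"id": "e1", "amount": 50},
    {"id": "e1", "amount": 50},  # redelivered after a crash-and-retry
    {"id": "e2", "amount": 25},
]
results = [applier.apply(e) for e in events]
# The redelivered e1 is skipped; balance reflects each event exactly once.
```

Idempotency pushes the "exactly-once" guarantee to where it is cheapest to provide: the effect, not the delivery.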
When batch is still the right answer
- Financial reporting where daily or monthly aggregates are the unit of analysis
- Model training pipelines where freshness is measured in days, not seconds
- Data warehouse loads where the downstream consumers only need daily snapshots
- Small-scale products where the engineering overhead of streaming outweighs the business benefit
The architecture considerations nobody talks about enough
The hardest part of streaming isn’t the technology - it’s the operational model. Streaming pipelines require monitoring, backpressure handling, dead-letter queues for failed events, and careful thought about exactly-once versus at-least-once processing semantics.
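Backpressure in particular is easy to hand-wave and painful to ignore. The toy below shows the underlying bounded-buffer principle with an in-process queue - a hypothetical stand-in, since real brokers and stream processors handle this at a different layer - where a producer that outruns the consumer hits an explicit, observable overflow instead of silently growing memory.

```python
import queue

# A bounded buffer: 5 slots, no blocking producer.
buf = queue.Queue(maxsize=5)
dropped = 0

for i in range(8):
    try:
        buf.put_nowait(i)  # producer does not wait; overflow is explicit
    except queue.Full:
        dropped += 1  # or block, sample, or shed load - a policy decision

# Only the first 5 events fit; the rest surface as drops you can monitor.
```

Whether you block, drop, or sample when the buffer fills is a business decision, not a framework default - which is exactly why it belongs in the design conversation, not in an incident review.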
Designing a data architecture or trying to figure out whether streaming is right for your use case? That’s exactly the kind of problem I work through in architecture sessions.