How I'd Break Into Data Engineering in 2025 If I Were Starting Over
A realistic roadmap from someone who has hired, mentored, and built at scale
I get asked some version of this question at least twice a week: “I want to get into data engineering - where do I start?” I’ve mentored over 400 engineers at various stages of this journey. Here is the honest answer, not the aspirational one.
The data engineering field has changed significantly even in the last two years. The tools that dominated the hiring conversations in 2020 are not the ones that dominate them in 2025. If I were starting over today, here is what I would do differently.
Start with SQL, not Spark
The most common mistake I see from aspiring data engineers is jumping straight to distributed computing frameworks before they can write a complex SQL query in their sleep. SQL is the language of data. Everything else builds on top of it.
Get very good at window functions, CTEs, query optimization, and data modeling. These skills transfer across every tool in the ecosystem. The engineer who can write a clean, efficient SQL query that handles edge cases gracefully is more valuable than one who can spin up a Spark cluster but writes inefficient jobs.
If you can’t read this and say out loud what it returns, you are not ready to interview for a DE role yet:
-- Rank orders per customer and flag the first order in a given month.
with ranked as (
    select
        customer_id,
        order_id,
        order_ts,
        row_number() over (
            partition by customer_id, date_trunc('month', order_ts)
            order by order_ts
        ) as rn_in_month,
        lag(order_ts) over (partition by customer_id order by order_ts) as prev_order_ts
    from orders
)

select
    customer_id,
    order_id,
    order_ts,
    rn_in_month = 1 as is_first_order_this_month,
    order_ts - prev_order_ts as gap_since_last_order
from ranked
where order_ts >= current_date - interval '90 days';
Window functions, partitions, lag/lead, time truncation, boolean projections. Every interview problem is some recombination of this.
The 2025 stack I would learn
- Python for data manipulation (pandas, polars), orchestration scripting, and API integration
- SQL deeply - not just CRUD, but complex analytics, query planning, and performance tuning
- dbt for transformation logic - it’s now a standard tool and the interview questions are real
- One cloud platform deeply - Azure or AWS, not both, not shallowly
- Airflow or Dagster for orchestration - understanding how pipelines are scheduled and monitored
- Basic understanding of streaming concepts (Kafka, event-driven architecture) - you don’t need to be an expert, but you need the vocabulary
- Git and basic CI/CD - data pipelines should be version-controlled and deployed like software
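These skills reinforce each other. The window-function pattern from the SQL example above, for instance, translates almost line for line into pandas. Here is a hedged sketch with made-up order data (the DataFrame contents are purely illustrative): sort, rank within customer and month, and lag the previous order timestamp.

```python
import pandas as pd

# Toy data standing in for an orders table pulled from a warehouse.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "order_id": [10, 11, 12, 20],
    "order_ts": pd.to_datetime(
        ["2025-01-05", "2025-01-20", "2025-02-03", "2025-01-09"]
    ),
})

ranked = orders.sort_values("order_ts").copy()

# Equivalent of: row_number() over (partition by customer_id, month order by order_ts)
month = ranked["order_ts"].dt.to_period("M")
ranked["rn_in_month"] = ranked.groupby(["customer_id", month]).cumcount() + 1

# Equivalent of: lag(order_ts) over (partition by customer_id order by order_ts)
ranked["prev_order_ts"] = ranked.groupby("customer_id")["order_ts"].shift(1)

ranked["is_first_order_this_month"] = ranked["rn_in_month"] == 1
ranked["gap_since_last_order"] = ranked["order_ts"] - ranked["prev_order_ts"]
```

If you can move fluently between the SQL version and this one, you understand the concept rather than the syntax, and that is what interviews actually probe for.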
The equivalent dbt model - a single .sql file versioned in Git - is what a modern team actually ships:
-- models/marts/customer_order_behavior.sql
{{ config(materialized='incremental', unique_key='order_id') }}
with orders as (

    select * from {{ ref('stg_orders') }}

    {% if is_incremental() %}
    where order_ts > (select max(order_ts) from {{ this }})
    {% endif %}

)

select
    customer_id,
    order_id,
    order_ts,
    row_number() over (
        partition by customer_id, date_trunc('month', order_ts)
        order by order_ts
    ) as rn_in_month
from orders
If you can explain what {{ ref(...) }}, is_incremental(), and materialized='incremental' each do, you already stand out in a junior DE interview.
What I would not do
I would not spend six months working through a comprehensive course before building anything real. The people who get hired fastest are the ones who built something - a pipeline that pulls data from a public API, transforms it with dbt, loads it to a warehouse, and has a simple dashboard on top. That project teaches more than any course. (The other thing that separates strong juniors from the pack is how they ask questions before they build.)
No one has ever hired a data engineer because of their completion certificate. They hired them because of the GitHub repository.
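A weekend-sized version of that project fits in a page of Python. This is a deliberately minimal sketch: the extract step is stubbed with fake weather records where a real project would call a public API, and SQLite stands in for a warehouse so the example is self-contained.

```python
import sqlite3

def extract() -> list[dict]:
    # Stubbed API response. In a real project, replace this with a call
    # to a public API (e.g. a weather or transit endpoint) via requests.
    return [
        {"city": "Berlin", "temp_c": 4.2, "observed_at": "2025-01-05T09:00:00"},
        {"city": "Berlin", "temp_c": 5.1, "observed_at": "2025-01-05T12:00:00"},
    ]

def transform(rows: list[dict]) -> list[tuple]:
    # Light cleaning: round temperatures, keep only the fields we model.
    return [(r["city"], round(r["temp_c"]), r["observed_at"]) for r in rows]

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    # SQLite as a warehouse stand-in; swap for Postgres/Snowflake/BigQuery later.
    conn.execute(
        "create table if not exists weather (city text, temp_c int, observed_at text)"
    )
    conn.executemany("insert into weather values (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
count = conn.execute("select count(*) from weather").fetchone()[0]
```

The point is not the code itself but the shape: a pipeline with distinct extract, transform, and load stages that you can version, test, and talk through in an interview. Swapping the stub for a real API and SQLite for a warehouse is exactly the kind of incremental hardening the project should demonstrate.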
The AI angle you can’t ignore
Data engineering in 2025 means understanding at minimum: how to build and maintain a RAG pipeline, what vector databases are and when to use them, and how LLM inference workloads affect your data infrastructure and your bill. You don’t need to be an ML engineer. But you need to be the person who can build the data layer that makes AI applications work.
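The retrieval step at the heart of a RAG pipeline is less mysterious than it sounds: embed documents as vectors, embed the query, and rank by similarity. Here is a toy sketch with hand-made three-dimensional "embeddings" (a real pipeline would get vectors from an embedding model and store them in a vector database, not a dict).

```python
import math

# Toy embeddings keyed by document name; purely illustrative values.
docs = {
    "pipeline scheduling": [0.9, 0.1, 0.0],
    "window functions": [0.1, 0.9, 0.2],
    "vector databases": [0.2, 0.1, 0.95],
}

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], k: int = 1) -> list[str]:
    # The core of RAG retrieval: rank stored documents by similarity
    # to the query vector and return the top k.
    ranked = sorted(docs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

top = retrieve([0.15, 0.05, 0.9])  # a query vector "close to" vector databases
```

A vector database is essentially this loop made fast and durable at scale (approximate nearest-neighbor indexes instead of a full sort), plus the ingestion pipeline that keeps the embeddings fresh - which is the data engineering part.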
On certifications
Cloud certifications can open doors, especially early in a career. I’d get one associate-level cloud certification (Azure, AWS, or GCP) and a dbt Fundamentals cert. Beyond that, the return diminishes. Certifications signal that you can learn - projects signal that you can build.
If you’re actively trying to break into data engineering and want a realistic review of where you are and what to focus on next, I do career mentorship sessions specifically for this transition.