The Rise of the Model-Driven Data Engineer

By Siavoush Mohammadi

From Writing SQL to Designing Models: How Data Engineering Is Changing
Writing SQL used to be the job. A lot of SQL. Transformations for every source, business logic scattered across hundreds of models, hand-written joins that broke whenever upstream schemas changed. Now it's YAML. Declare what entities should exist, and the system generates transformations. Same results, one-tenth the code, zero maintenance headaches.
Data engineering is transforming - from writing code to declaring intent. Not because new tools make writing code optional, but because we've learned the hard way that hand-written transformations don't scale. Every data engineer has watched pipelines break at 3 AM because some upstream column changed. We've all debugged SQL only to realize the real problem was fragmented business logic that drifted across twelve different models.
A Different Way to Build Data Systems
In 2020, building a Kafka ingestion pipeline meant writing Python connection code, SQL transformations, Airflow orchestration, more SQL for entities, and metric definitions for BI tools. Three days writing code, two days debugging, and a week later realizing documentation already didn't match what was built. Every. Single. Time.
Today, we declare what we want in YAML: incoming data structure, business entities, relationships, and metrics. Systems generate pipelines, transformations, and data structures. What took weeks now takes days.
Infrastructure as Code already transformed operations from manual configuration to declarative Terraform. Kubernetes shifted deployment from imperative scripts to declarative manifests. OpenAPI moved API development from writing and documenting endpoints separately to declaring them once.
Data engineering is undergoing the same evolution, but with broader scope. Modern data platforms can be declarative everywhere - from ingestion contracts through business entity definitions to metrics for analysis.
We're watching data engineers become model-driven practitioners - people who think in entities, relationships, and contracts rather than code, functions, and scripts. To understand why, we need to unpack what "declarative" actually means in data engineering.
What "Declarative" Actually Means
The word "declarative" gets thrown around, but here's what it actually means: describing what you want instead of specifying how to build it. Simple concept, massive implications.
Consider ingestion. Imperative means writing Python that connects to Kafka, reads messages, parses JSON, handles errors, transforms fields, and loads results. You specify every step:
```python
# Imperative approach: hand-written ingestion code
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'user_events',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

for message in consumer:
    try:
        data = message.value
        user_id = data['payload']['userId']
        event_type = data['eventType']
        timestamp = parse_timestamp(data['timestamp'])
        # ... 50+ more lines of parsing, validation, error handling
    except Exception as e:
        handle_error(e, message)
```
Declarative means a data contract in YAML defining what you expect. Systems generate all connection, parsing, transformation, and error handling code:
```yaml
# Declarative approach: data contract
version: "1.0"
valid_from: "2025-01-01"
endpoints:
  source:
    provider: kafka
    entity: user_events
schema:
  primary_keys:
    - user_id
    - event_id
  columns:
    - source_path: payload.userId
      target_name: user_id
      type: INTEGER
      mode: REQUIRED
      description: "User account identifier"
    - source_path: eventType
      target_name: event_type
      type: STRING
      mode: REQUIRED
      description: "Event type (USER_CREATED, USER_UPDATED, etc.)"
    - source_path: timestamp
      target_name: event_ts
      type: TIMESTAMP_MILLIS
      mode: REQUIRED
      description: "Event timestamp in milliseconds"
```
The same split runs through every layer. Business logic is where it becomes most visible: instead of SQL that joins source tables, applies rules, and handles slowly changing dimensions, you declare what entities should exist — their attributes, relationships, and rules — and transformations generate from those declarations.
Metrics follow suit. Rather than maintaining calculation logic separately in your BI tool, your API layer, and your documentation, you define a metric once with its calculation, grain, and dimensions. One definition serves every consumption pattern.
Not everything that looks declarative actually is. The difference matters because the benefits of true declarative architecture only materialize when you declare intent, not when you template code.
Take dbt, which brought software engineering practices to analytics. When you write a dbt model, you write SQL - a SELECT statement referencing other models with {{ ref() }} syntax. Better than ad-hoc scripts, yes. Dependency graph explicit, version controlled, maintainable. But you're still writing transformations step by step: how to join tables, calculate fields, filter records. Improved imperative code, not declarative architecture.
Contrast with declaring a Subscription entity with attributes like start_date, status, acquisition_type, and relationships to Account and Base plan. You're not writing joins or implementing slowly changing dimension logic. You declare what should exist; systems determine how to build it.
SQL with nice references versus entity declarations that generate SQL - these are architecturally different. One improves how you write imperative code. The other makes code an artifact generated from semantic definitions.
Why does separation matter? Separating "what you want" from "how to get it" enables consistency, portability, and comprehensibility that better-organized imperative code cannot achieve. When you declare a metric as "monthly recurring revenue, calculated as sum of active subscription values, dimensioned by plan type and region," that declaration generates SQL for your warehouse, API responses for applications, and documentation for business users. One declaration serves all three needs. Maintaining separate SQL for each means maintaining three implementations that will inevitably diverge.
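To make the one-declaration idea concrete, here's a minimal sketch of a metric declaration driving two of those outputs: warehouse SQL and documentation. The declaration format and the `to_sql`/`to_docs` helpers are illustrative assumptions, not any specific tool's API.

```python
# A single metric declaration (illustrative format)
MRR = {
    "name": "monthly_recurring_revenue",
    "calculation": "SUM(monthly_value)",
    "source": "subscription_latest",
    "filter": "status = 'ACTIVE'",
    "dimensions": ["plan_type", "region"],
    "grain": "month",
}

def to_sql(metric: dict) -> str:
    """Generate warehouse SQL from the metric declaration."""
    dims = ", ".join(metric["dimensions"])
    return (
        f"SELECT {dims}, {metric['calculation']} AS {metric['name']}\n"
        f"FROM {metric['source']}\n"
        f"WHERE {metric['filter']}\n"
        f"GROUP BY {dims}"
    )

def to_docs(metric: dict) -> str:
    """Generate human-readable documentation from the same declaration."""
    return (
        f"{metric['name']}: {metric['calculation']} over {metric['source']} "
        f"where {metric['filter']}, by {' and '.join(metric['dimensions'])}, "
        f"at {metric['grain']} grain"
    )

print(to_sql(MRR))
print(to_docs(MRR))
```

Because both outputs derive from the same dict, changing the filter or dimensions once changes everything downstream - the property the prose above describes.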
Declarative approaches make semantic definitions the source of truth and let code be generated from them. That principle applies at every layer of a data platform, which brings us to what fully model-driven architecture looks like.
Model-First Architecture: The Three Layers
A fully declarative data platform separates concerns into three semantic layers: Data As System sees it (DAS), Data As Business sees it (DAB), and Data As Requirements needs it (DAR). If these layers sound familiar, they should - they map to staging/core/marts in dbt terminology, or bronze/silver/gold in the medallion architecture. The naming is deliberate: DAS/DAB/DAR foregrounds the purpose of each layer (what the data represents) rather than its position in a sequence or quality grade. Each serves a different purpose and can be made declarative.

DAS: Where Raw Data Lands
DAS represents source systems without interpretation through two stages: landing zone and staging. Landing receives raw data exactly as produced - often JSON strings with no schema enforcement. Staging unpacks to tabular format while remaining faithful to source structure.
In declarative DAS, data contracts define unpacking. A 50-line YAML contract specifies source structure, target schema, and transformation rules. Systems generate 100+ lines of SQL to extract nested JSON fields, cast types, handle arrays, and create historized and latest views. When sources change, you update the contract and regenerate SQL. Contracts are truth; SQL is artifact.
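As a rough sketch of how contract-driven generation might work, here's a toy generator that turns a contract's column mappings (parsed from YAML into a dict) into staging SQL. The `JSON_VALUE` syntax varies by warehouse, and the function name is an assumption for illustration.

```python
# Column mappings as they would appear after parsing the YAML contract
contract = {
    "entity": "user_events",
    "columns": [
        {"source_path": "payload.userId", "target_name": "user_id", "type": "INTEGER"},
        {"source_path": "eventType", "target_name": "event_type", "type": "STRING"},
        {"source_path": "timestamp", "target_name": "event_ts", "type": "TIMESTAMP"},
    ],
}

def generate_staging_sql(contract: dict, landing_table: str) -> str:
    """Emit a SELECT that extracts and casts each contracted field."""
    selects = [
        f"  CAST(JSON_VALUE(raw, '$.{c['source_path']}') AS {c['type']}) AS {c['target_name']}"
        for c in contract["columns"]
    ]
    return "SELECT\n" + ",\n".join(selects) + f"\nFROM {landing_table}"

print(generate_staging_sql(contract, "landing.user_events"))
```

When a source field moves or a type changes, editing the contract and regenerating is the whole fix - the SQL is never edited by hand.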
Pipelines become self-healing. Ingestion is forgiving (accept all data as JSON); unpacking is strict (contract-driven). Data is never lost when schemas change. If a source sends incorrect data and later sends corrections, the "latest" views automatically reflect the corrected state.
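The historized/latest split behind that self-healing behavior can be sketched as a generated window-function view: the historized table keeps every version of a record, and the latest view picks the most recent per key, so late corrections surface automatically. `QUALIFY` is warehouse-specific (Snowflake and BigQuery support it), and the table and column names here are illustrative.

```python
def latest_view_sql(hist_table: str, keys: list, ts_col: str) -> str:
    """Generate a 'latest' view over a historized table: one row per key,
    the row with the newest timestamp wins."""
    key_list = ", ".join(keys)
    return (
        f"CREATE OR REPLACE VIEW {hist_table}_latest AS\n"
        f"SELECT * FROM {hist_table}\n"
        f"QUALIFY ROW_NUMBER() OVER (\n"
        f"  PARTITION BY {key_list} ORDER BY {ts_col} DESC\n"
        f") = 1"
    )

print(latest_view_sql("staging.user_events_hist", ["user_id", "event_id"], "event_ts"))
```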
DAB: Business Semantic Model
DAB is where business entities live. Instead of writing SQL that joins customer, subscription, and payment tables - handling active status logic, managing slowly changing dimensions, embedding business knowledge in transformation code - you declare what a Subscription entity should be using Daana's Model Description Language (DMDL):
```yaml
# Declarative entity definition in DMDL
entities:
  - id: "SUBSCRIPTION"
    name: "SUBSCRIPTION"
    definition: "A customer subscription"
    description: "Represents an active or historical subscription to a service plan"
    attributes:
      - id: "SUBSCRIPTION_ID"
        name: "SUBSCRIPTION_ID"
        definition: "Unique subscription identifier"
        type: "STRING"
        effective_timestamp: false
      - id: "ACQUISITION_TYPE"
        name: "ACQUISITION_TYPE"
        definition: "How the subscription was acquired"
        description: "Origin channel: ORGANIC, CAMPAIGN, REFERRAL, TRIAL_CONVERSION"
        type: "STRING"
        effective_timestamp: true
      - id: "STATUS"
        name: "STATUS"
        definition: "Current subscription status"
        description: "Status derived from dates: ACTIVE, CANCELLED, EXPIRED"
        type: "STRING"
        effective_timestamp: true
      - id: "SUBSCRIPTION_START_DATE"
        name: "SUBSCRIPTION_START_DATE"
        definition: "When subscription activated"
        type: "START_TIMESTAMP"
      - id: "SUBSCRIPTION_END_DATE"
        name: "SUBSCRIPTION_END_DATE"
        definition: "When subscription expires or was cancelled"
        type: "END_TIMESTAMP"
      - id: "MONTHLY_VALUE"
        name: "MONTHLY_VALUE"
        definition: "Monthly subscription value with currency"
        effective_timestamp: true
        group:
          - id: "MONTHLY_VALUE"
            name: "MONTHLY_VALUE"
            definition: "The monetary amount"
            type: "NUMBER"
          - id: "MONTHLY_VALUE_CURRENCY"
            name: "MONTHLY_VALUE_CURRENCY"
            definition: "Currency code (USD, EUR, SEK)"
            type: "UNIT"
relationships:
  - id: "BELONGS_TO_ACCOUNT"
    name: "BELONGS_TO_ACCOUNT"
    definition: "Subscription belongs to an account"
    source_entity_id: "SUBSCRIPTION"
    target_entity_id: "ACCOUNT"
  - id: "HAS_BASE_PLAN"
    name: "HAS_BASE_PLAN"
    definition: "Subscription is based on a plan"
    source_entity_id: "SUBSCRIPTION"
    target_entity_id: "BASE_PLAN"
```
From these declarations and a corresponding one for data mappings, systems generate transformation logic. They join source tables based on defined relationships, implement slowly changing dimension logic, and create both SUBSCRIPTION_HIST (complete history) and SUBSCRIPTION_LATEST (current state) views. Generated transformations follow established patterns consistently.
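A rough sketch of how declared relationships could drive join generation. The `<target>_id` foreign-key convention is an assumption here, standing in for what a real generator would read from the data-mapping declarations.

```python
# Relationships as parsed from the entity model (illustrative shape)
relationships = [
    {"id": "BELONGS_TO_ACCOUNT", "source": "SUBSCRIPTION", "target": "ACCOUNT"},
    {"id": "HAS_BASE_PLAN", "source": "SUBSCRIPTION", "target": "BASE_PLAN"},
]

def generate_joins(entity: str, relationships: list) -> str:
    """Emit one LEFT JOIN per declared relationship, assuming each target
    is joined on a <target>_id key (a simplifying assumption)."""
    lines = [f"FROM {entity.lower()}_hist s"]
    for rel in relationships:
        if rel["source"] != entity:
            continue
        t = rel["target"].lower()
        lines.append(f"LEFT JOIN {t}_latest {t[0]} ON s.{t}_id = {t[0]}.{t}_id")
    return "\n".join(lines)

print(generate_joins("SUBSCRIPTION", relationships))
```

Add a relationship to the model and the generator emits the join; remove one and it disappears - the transformation tracks the model, not the other way around.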
You design the system you want rather than implement how to build it. Semantic definitions drive everything downstream.
DAR: Consumption Patterns
DAR optimizes for actual usage - metrics, dimensions, aggregations, and serving patterns. Also declarative.
Instead of writing SQL to calculate monthly recurring revenue in your BI tool, similar logic in your API, then documenting separately, you declare the metric once: "MRR equals sum of monthly_value for active subscriptions, dimensioned by plan_type and region, grain of monthly." One declaration generates warehouse queries, drives API responses, and serves as documentation.
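Spelled out as a declaration, that metric might look like the following. The format is hypothetical, echoing the DMDL style above rather than any specific tool's syntax.

```yaml
# Hypothetical metric declaration in the DAR layer
metrics:
  - id: "MRR"
    name: "MONTHLY_RECURRING_REVENUE"
    definition: "Sum of monthly_value for active subscriptions"
    calculation: "SUM(MONTHLY_VALUE)"
    entity: "SUBSCRIPTION"
    filter: "STATUS = 'ACTIVE'"
    grain: "month"
    dimensions:
      - "PLAN_TYPE"
      - "REGION"
```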
Why Separation Matters
Three-layer architecture provides stability. When a source system renames customerID to account_id, only DAS changes - you update the contract mapping. DAB's Account entity remains unchanged because it's defined in business concepts, not source fields. DAR consumers are unaffected.
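In contract terms, that rename is a one-line change. Following the contract format from earlier (the surrounding fields are illustrative), only `source_path` moves:

```yaml
# Before the source rename
- source_path: customerID
  target_name: account_id
  type: STRING

# After the source rename: target_name - and everything downstream
# in DAB and DAR - stays exactly the same
- source_path: account_id
  target_name: account_id
  type: STRING
```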
Layer separation makes systems legible. Anyone examining your platform understands: these are source systems (DAS), these are business concepts they feed (DAB), these are consumption patterns (DAR). Each layer is documented through its declarative definitions - structured enough for both humans and AI to navigate.
Model-first architecture captures what should be consistent (entity definitions, transformation patterns, quality rules) while preserving flexibility where it matters (business logic, metrics, consumption patterns).
The Work Changes
Data engineers in declarative architectures spend less time writing SQL transformations, Python pipelines, and Airflow DAGs - more time on architecture and product design. Not easier work. Different work, requiring different skills.
New Skills Required
In model-driven environments, data engineers design entity models that accurately represent business concepts. Understanding not just technical data structure, but business semantics. What is a "subscription"? When does it begin - at signup, payment, or access grant? What makes it "active"? These are business domain questions with technical implications.
Defining semantic relationships becomes central work. A subscription belongs to an account, contains events, references a plan, tracks payment through transactions. Relationships need names, cardinalities, and clear definitions. Get them right and generated transformations correctly join data. Get them wrong and you're debugging entity mismatches - harder than debugging your own SQL because you need to understand how the generator works.
Creating data contracts becomes interface design. You define contracts between your platform and source systems - expected structure, required types, valid values. It's API design for data flows, requiring thought about versioning, backward compatibility, and error handling at the contract level.
The work requires architectural thinking about layered systems with clear boundaries. Changes in DAS shouldn't cascade to DAB. DAB entities should be stable despite evolving sources. DAR optimizations shouldn't leak business logic. Separation requires discipline and design.
Not Easier, Just Different
Model-driven approaches raise the bar. You can't hack together SQL that works. You must consider the semantic model you're building, whether entity definitions are correct, whether relationships are properly understood. Abstraction layers help you scale but require understanding how they work.
Debugging shifts too. Instead of stepping through SQL, you examine entity definitions, contract-source alignment, generation edge cases. You need both business semantics AND technical implementation patterns.
Model-driven data engineers develop dual fluency: thinking in business entities and relationships while understanding how declarations translate to implementations. They explain to product managers why Subscription needs particular attributes, and to platform teams why generated transformations need specific join patterns.
What Stays the Same
Deep technical knowledge remains crucial - perhaps more than before. Understanding data types, when normalization helps versus hurts, performance implications of join patterns, incremental processing logic. Application differs: instead of directly writing transformations, you design models that generate good ones.
You still write code, but less and differently. Custom business logic for non-standard patterns. Edge case handling. Integration scripts. But repetitive, pattern-following transformation code - the kind that accumulates bugs when individually authored by multiple people - gets generated from models.
Work becomes more about architecture and semantics, less about implementation mechanics. You design the data platform as a cohesive system rather than assembling pipelines one at a time.
Where Declarative Appears Today
In practice, declarative thinking shows up unevenly across the data stack. Here's what we see on the ground.
Ingestion Layer
Data contracts are the most mature piece. Teams use YAML or JSON schemas to specify what they expect from source systems. The key distinction: is your contract documentation that someone manually keeps in sync with custom SQL, or does it directly generate unpacking logic? The first improves documentation. The second unifies documentation and implementation - changing the contract changes the code, forcing accuracy. That second version is what's genuinely new.
Business Layer
This is where things get messy. Most tooling today - dbt included - gives you better-organized imperative code: macros, packages, reusable patterns. You're still writing transformations, just with better tools. dbt's semantic layer moves closer by letting you define metrics and entities with relationships, but the transformation layer below still requires hand-written SQL models.
Fully declarative business layers - define entities and relationships, get generated transformations - remain uncommon in off-the-shelf tools. They exist in custom platforms and specialized tools where teams have committed to the approach. But they demonstrate what's possible: entity definitions as metadata, transformations generated, standard views maintained automatically.
Metrics Layer
This is the furthest along. Tools like dbt's semantic layer, Cube, Malloy, and headless BI approaches let you declare metrics once with calculation logic, grain, and dimensions. One definition drives BI tools, APIs, and embedded analytics. Genuinely declarative: metric definition is source of truth, implementations are generated.
The Spectrum
"Declarative" isn't binary. Most organizations operate in the middle: declarative contracts for ingestion, hand-written SQL with good practices for transformations, declarative metrics for consumption. That pragmatic mix works for many teams.
The question worth asking: are you writing SQL with better references and macros, or declaring entities and relationships that generate SQL? Different paradigms, different scaling characteristics.
Benefits, Tradeoffs, and When to Stay Imperative
Declarative approaches deliver real benefits, but they're not universally superior. Understanding when each makes sense means examining what you gain and give up.
What You Gain and Give Up
| Aspect | Declarative Approach | Imperative Approach |
|---|---|---|
| Consistency | Generated code follows same patterns for every entity - same historization, incremental processing, error handling | Each implementation subtly different, accumulating technical debt across engineers |
| Quality | Patterns encoded once, applied everywhere. Hundredth entity gets same error handling as the first | Later implementations cut corners under time pressure |
| Documentation | Model definitions ARE the implementation. Update model, transformations regenerate | Every engineer has wasted hours reconciling docs with reality |
| Testing | Validate models: "Does this entity definition represent the business concept correctly?" | Test code syntax and logic. Less insight into whether you're building the right thing |
| AI Effectiveness | LLMs read structured entity definitions with clear semantic meaning - though they still require human review | AI struggles to extract intent from custom SQL spread across repositories |
| Productivity | Optimize the pattern once and every pipeline benefits | Linear scaling: more sources means proportionally more people |
| Upfront Investment | Weeks to months: framework setup, code generation tooling, team training | Days: start writing SQL immediately |
| Learning Curve | Must learn to think in entity models and semantic relationships | Familiar SQL/Python patterns with incremental learning |
| Edge Cases | Common patterns elegant; edge cases awkward. Too many escape hatches undermine consistency | Full flexibility to write exactly what's needed |
| Flexibility | Framework constraints help when they align with good practices. Frustrating when they don't | Full control over every detail |
| Tooling Maturity | Emerging ecosystem. Early adopters pay pioneering tax | Mature, battle-tested tools: dbt, Airflow, Fivetran |
| Debugging | Must understand both model definition AND generation logic | Debug your own code directly - straightforward cause and effect |
When Imperative Still Wins
Declarative architecture makes sense when patterns exist and repetition is high. But clear cases favor imperative approaches:
One-off analyses: Exploratory queries answering specific business questions don't need entity definitions and generated transformations. Write SQL, get the answer, move on.
Genuinely novel algorithms: Implementing new recommendation algorithms or complex statistical models requires flexibility to write custom logic. Forcing this into a declarative framework adds overhead without benefit.
Performance-critical optimization: Sometimes you need hand-tuned SQL exploiting specific warehouse features. Generated code follows general patterns; custom code optimizes for specific cases.
Early-stage exploration: Before patterns emerge, forcing declarative structure is premature. Build a few pipelines imperatively, discover patterns, then consider declarative approaches for scaling.
Very small teams with simple needs: Three sources, ten entities, one data engineer - declarative architecture overhead likely exceeds benefits. Better-organized imperative code could serve you well.
Declarative and imperative coexist. Use declarative architecture for repetitive, pattern-following work comprising most of a data platform. Preserve imperative flexibility for the edges, the novel, the optimized. Mature platforms provide both with clear guidance on when to use each.
Context matters. Benefits that make declarative architecture compelling for growing companies with dozens of data sources may not outweigh costs for startups with three tables. Team size, complexity, rate of change, engineering maturity - these guide decisions better than universal recommendations.
What This Means for Platforms and Teams
Once an organization commits to declarative architecture, the effects ripple beyond individual engineering work into how platforms and teams are structured.
Platforms need model registries storing entity definitions as queryable metadata - not documentation sites, but active systems driving code generation. They need generation frameworks that reliably translate models into correct transformations while providing room for custom logic where standard patterns don't fit. And they need validation at the model level, catching errors before code generation rather than after deployment.
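As a sketch of what model-level validation might look like, here's a toy check a registry could run before generation. The specific checks and names are illustrative assumptions, not a real platform's API.

```python
def validate_entity(entity: dict, known_entities: set) -> list:
    """Return a list of validation errors; an empty list means the model
    is sound enough to hand to the code generator."""
    errors = []
    if not entity.get("attributes"):
        errors.append(f"{entity['id']}: entity has no attributes")
    for rel in entity.get("relationships", []):
        # Relationships must point at entities the registry knows about
        if rel["target_entity_id"] not in known_entities:
            errors.append(f"{rel['id']}: unknown target {rel['target_entity_id']}")
    return errors

subscription = {
    "id": "SUBSCRIPTION",
    "attributes": [{"id": "SUBSCRIPTION_ID"}],
    "relationships": [{"id": "BELONGS_TO_ACCOUNT", "target_entity_id": "ACCOUNT"}],
}

# ACCOUNT is registered, so this entity passes; an unregistered target would fail
print(validate_entity(subscription, {"SUBSCRIPTION", "ACCOUNT", "BASE_PLAN"}))
```

Catching a dangling relationship here - before any SQL exists - is the "errors before code generation rather than after deployment" property described above.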
Declarative approaches also reframe self-service. Domain teams don't get access to write arbitrary SQL against production tables. Instead, they declare domain models that platforms implement according to standards. This addresses a central tension: domain teams have business context but often lack data engineering expertise, while platform teams have expertise but lack domain context. Declarative architecture lets each contribute what they know best.
This creates a natural model for federated ownership - the practical implementation path for data mesh principles. Domain teams own their models: defining entities, maintaining contracts with source systems, evolving definitions as their domain changes. Central platform teams own patterns, standards, and generation frameworks. Declarative definitions provide both autonomy and consistency.
Technical debt doesn't disappear - it shifts. Instead of unmaintainable SQL across repositories, you risk rigid abstractions that can't accommodate edge cases. Instead of undocumented transformations, you risk overly generic models lacking domain specificity. Different problems requiring different mitigation strategies. Mature platforms support both declarative models for common patterns and imperative code for exceptions, with clear guidance on when to use each.
Data Engineering Is Growing Up
We're moving from "write SQL to solve each problem" to "design semantic models that generate solutions." Software engineering went through the same thing: assembly to high-level languages, manual configuration to infrastructure as code, monolithic apps to declarative orchestration. Every transition traded direct control for abstractions that let systems scale.
Data engineering is making that transition now. Imperative approaches won't become obsolete - assembly language still exists for specific use cases - but the default for most work shifts to higher abstraction.
We're in a transition period where hybrid approaches dominate, and that's appropriate. Organizations can't flip a switch from fully imperative to fully declarative. The path: identify where patterns and repetition are high, apply declarative approaches there, preserve imperative flexibility for genuine edge cases.
Practical Guidance
For teams considering this shift: start with layers where patterns are clearest. Ingestion contracts often provide high return - transformation from landing zone to staging follows consistent patterns across sources. Core business entities in the DAB layer offer similar benefits once you've established your domain model. Metrics definitions pay off when serving multiple consumption patterns.
Not everything fits generated patterns, and forcing it creates more problems than it solves. Effectiveness matters, not purity.
Invest in learning model-driven thinking, not just specific tools. Understanding how to design good entity models, specify clear semantic relationships, and when to use declarative versus imperative approaches matters more than mastering any particular framework. Tools evolve; principles endure.
Be thoughtful about what to abstract. Declarative approaches multiply effort by encoding patterns, but premature abstraction before patterns emerge adds overhead without benefit. Build a few implementations, discover commonalities, then abstract.
What's Next
Data engineers of the next decade will be fluent in both paradigms - comfortable designing declarative models and writing imperative code, knowing when each serves better. They'll think in semantic layers and entity relationships rather than just tables and joins. They'll spend more time on architecture and domain modeling, less on transformation mechanics.
This isn't about making data engineering "easier" - it's about applying skills at a higher abstraction level. Understanding business semantics, designing coherent architectures, building systems that stay comprehensible as they scale. Different challenges than writing correct SQL, but not simpler ones.
The organizations that get this right won't treat declarative approaches as religion. They'll treat them as engineering tools - reach for them where patterns repeat, set them aside where they don't. And they'll invest in data engineers who can work across both paradigms rather than specializing in one.
Model-driven data engineers aren't replacing code. They're making code a generated artifact of something more durable: the semantic understanding of what the business actually needs. That's the shift.
