AI-Assisted Data Engineering: A Practical Methodology

By Siavoush Mohammadi

You've probably had this experience. Someone on your team shares an AI demo where a chatbot writes a React component or explains a complex algorithm in seconds. You think, "Maybe I should try this for my data pipelines." So you open your AI assistant and ask it something real: "Write SQL to calculate monthly churn for customers who downgraded their subscription tier."
The response looks reasonable. Syntactically correct SQL, decent formatting, plausible column references. But when you run it? Wrong table names. Invented columns. Business logic that makes no sense for your context. The AI doesn't know your organization tracks subscription changes through a factless fact table, that "downgrade" has a specific definition involving base_plan hierarchies, or that trial-to-paid conversions need different handling from paid-to-paid changes.
This isn't hypothetical. It's daily reality for data engineers trying to use AI for actual work. The pattern repeats across data mapping, testing, debugging, documentation. Confident, plausible output that falls apart the moment it touches real data.
Conventional wisdom says AI isn't ready for "real" data engineering. Good for explaining concepts, generating boilerplate, maybe documentation. But building and maintaining data pipelines? Still a human job.
That wisdom is wrong - but not for the reasons you'd expect. The problem isn't AI capability. It's the information gap between what AI knows and what it needs to know for your specific work. Close that gap architecturally, and AI becomes a genuinely powerful collaborator. Not for toy demos - for production work on real platforms with millions of records and years of historical data.
The Context Gap
Think about what you know when you write that churn calculation. Which tables contain subscription data and how they're named. The grain of each table and how they join. The business rules defining "churn" in your organization. Which edge cases matter, which can be ignored, and which conventions your team follows for naming, testing, and validation.
AI knows none of this. When it doesn't know, it guesses.
The gap has four dimensions. Business domain knowledge: what does "subscription" mean in your organization? Is it an event, an entity, or a relationship? AI sees subscription_status but doesn't know "active" excludes status values 3 and 4. Architectural knowledge: where does this table fit in the overall system? What's upstream, downstream? AI sees individual files, not the system they comprise. Data contracts: what should this data look like? Which fields can be NULL and under what circumstances? AI can't validate against expectations it doesn't know exist. Process knowledge: how do you test changes? What validation steps matter? Most data engineering processes live in tribal knowledge, not documented workflows.
The industry has partially figured this out - for the last mile. Semantic layers and metric stores (Cube, Omni, dbt's MetricFlow) work precisely because they give AI structured business context for querying. Every major BI vendor now has an "AI inside" story built on governed metrics and semantic definitions. It works. For analytics.
But what about the 90% of data engineering before analytics? Staging transformations, business logic derivation, dimensional modeling, testing, validation? That work is still humans writing SQL in editors, AI relegated to autocomplete. Not because AI can't help - because those layers aren't structured for AI to understand.
If the gap is informational, the solution is architectural - not just at the consumption layer, but across the entire platform. Ad-hoc SQL scripts with implicit business logic and conventions living in engineers' heads give AI nothing to work with. Semantic models with clear definitions, data contracts specifying types and constraints, documented conventions, explicit layer separation - these give AI rules to follow at every layer, not just the last one.
Metadata-driven data architecture isn't new - dimensional modeling and Data Vault both formalize how to organize business concepts. What's new is making metadata executable and machine-readable: YAML definitions that generate transformation code, contracts that drive testing, conventions in formats AI agents can consume. A DevOps-inspired approach where metadata IS the implementation - documentation can't drift because it's the same artifact as the code. Your semantic model becomes a shared language between humans and AI. Your layer boundaries become semantic checkpoints. Your conventions become enforceable context. The principles behind designing for agentic experience apply directly.
But architecture alone is necessary, not sufficient. You also need to rethink how AI participates in the work.
Specialized Agents, Not General AI
The first lesson we learned the hard way: one general-purpose agent handling every task produces mediocre results.
Our initial approach was what most teams try - a single assistant with a comprehensive system prompt covering the entire platform. Map source fields? Same agent. Build Kimball models? Same agent. Validate outputs? Same agent. The result: an agent that knew a little about everything and was expert at nothing. It confused layer responsibilities, mixed transformation patterns, and occasionally applied business rules from one entity to another entirely.
The breakthrough came from specialization. Distinct agents, each with a focused role and clear boundaries. A Mapping Agent handles source-to-entity field mapping through a rigid 5-step workflow from model understanding to post-deployment verification. A Kimball Agent builds the consumption layer - dimensions, facts, snapshots - carrying a mandatory 5-test protocol it refuses to skip. A Validation Agent operates in two modes (platform comparison and data integrity investigation), quantifying metrics and classifying every discrepancy. An Orchestration layer delegates to specialists and maintains context across the workflow.
Even design benefits from specialization. An Information Modelling Agent drafts information models for new business domains by researching the company, decomposing business questions into constituent measures and dimensions, and generating structured model definitions. It works before any data flows - turning stakeholder conversations into formal entity definitions that downstream agents consume. Its output becomes the shared vocabulary every other agent references.
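The delegation pattern itself is simple. Here is a minimal sketch of routing tasks to specialists rather than one generalist - the agent names follow the article, but the system prompts and keyword routing are invented for illustration; a real orchestrator would use richer task classification:

```python
# Sketch of specialist delegation. Agent names follow the article;
# prompts and routing logic are illustrative, not the real platform.
AGENTS = {
    "mapping":    "You map source fields to entities via the 5-step workflow...",
    "kimball":    "You build dimensions and facts. Always run the 5-test protocol...",
    "validation": "You compare platforms and investigate integrity. Quantify everything...",
}

ROUTES = {
    "map": "mapping", "source field": "mapping",
    "dimension": "kimball", "fact": "kimball", "snapshot": "kimball",
    "discrepancy": "validation", "compare": "validation",
}

def route(task: str) -> str:
    """Pick the specialist whose trigger words appear in the task description."""
    lowered = task.lower()
    for keyword, agent in ROUTES.items():
        if keyword in lowered:
            return agent
    return "mapping"  # illustrative default; a real orchestrator would escalate
```

The point is not the routing mechanism but the boundary: each specialist carries only the context its layer needs, so it cannot confuse layer responsibilities.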
But there's a deeper specialization: each layer of our three-layer architecture requires a fundamentally different AI interaction pattern. We call these layers DAS, DAB, and DAR - Data As System, Data As Business, Data As Requirements - rather than the familiar Bronze, Silver, and Gold because the names encode what the data represents at each layer, not just its position in a pipeline. That semantic clarity matters: it tells both humans and AI agents what kind of work belongs where. (For the full architectural rationale, see The Rise of the Model-Driven Data Engineer.)
In the staging layer - DAS - AI works as a reader and investigator. Translation fidelity: raw source data to clean, typed, well-documented staging tables. YAML data contracts define the rules. The agent works within them. The constraints are features: you can't accidentally inject business logic into pure staging.
In the business layer - DAB - AI works as a builder and debugger. Source data becomes business entities: subscriptions with acquisition types, orders with payment levels, campaigns with attribution categories. The work shifts to derivation and investigation - expressing business concepts as transformation expressions, testing edge cases against real data, tracing anomalies back through layers.
In the reporting layer - DAR - AI works as a designer and analyst. Dimensional modeling: defining proper grain, building conformed dimensions, ensuring additive measures, verifying point-in-time correctness.
Same technology. Fundamentally different interaction patterns. The architecture dictates how AI participates, not the other way around.
What Collaboration Actually Looks Like
The methodology emerges from practice.
Verify before you generate. The single biggest quality improvement, and it runs counter to how most people use AI. Early on, our agents would encounter a problem - a mapping producing incorrect values - and immediately propose fixes. Patch this CASE statement. Add a COALESCE. Handle this edge case. Four iterations later, the mapping was a tangled mess nobody could reason about.
A multi-group mapping bug exposed the problem. The agent spent four iterations patching symptoms before we forced a stop and traced the full architecture: YAML contracts fed a compiler producing JSON, which fed a calculation engine generating SQL for BigQuery. The bug wasn't in the mapping expression. It was in how the compiler handled multi-group joins - five interconnected issues that no expression-level patching would fix. The answer was a targeted architectural workaround, not a sequence of band-aids.
Now "understand the architecture first" is encoded in every agent's workflow. Before writing SQL, check INFORMATION_SCHEMA.COLUMN_FIELD_PATHS for column descriptions. Read the view definitions. Understand what's upstream. Agents following this discipline are dramatically more effective than those jumping straight to code generation.
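That first metadata lookup can be a single query. A sketch of how an agent might build it - `COLUMN_FIELD_PATHS` is BigQuery's per-dataset metadata view carrying column descriptions; the project, dataset, and table names here are placeholders:

```python
def column_descriptions_query(project: str, dataset: str, table: str) -> str:
    """Build the query an agent runs before writing any SQL: fetch every
    column path, its type, and its human-written description."""
    return (
        f"SELECT field_path, data_type, description\n"
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`\n"
        f"WHERE table_name = '{table}'\n"
        f"ORDER BY field_path"
    )
```

If the `description` column comes back empty, that is itself a finding: the agent has nothing to reason from, and guessing begins.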
The human-AI split. Through practice, a clear division emerged. Humans provide what AI can't: domain knowledge, intent, judgment, and investigative direction. The engineer who knows historical data from the old platform uses different column names, or that certain NULLs are expected for unactivated subscriptions - that knowledge is irreplaceable.
AI provides what humans struggle with under time pressure: systematic investigation of every column and value distribution, pattern recognition across thousands of records, mechanical accuracy in repetitive SQL, and tireless validation against every constraint. It doesn't skip the boring checks or assume the rest is fine after spot-checking three examples.
This plays out as conversation. You tell the Kimball agent: "This is factless_fact_order with grain (order_key, fact_ts). base_plan_key can be NULL before activation. offer_key can be NULL when no campaign applies." The agent verifies grain across 4.4 million rows, profiles NULLs across every column, traces an anomaly where two base_plan values appear at the same timestamp to a source data issue rather than a SQL bug, and documents it. You provide business rules and judgment. AI provides systematic execution.
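The verification itself is mechanical, which is exactly why an agent can run it exhaustively where a human spot-checks. A sketch of the two standard queries, with table and key names as placeholders (`COUNTIF` is BigQuery's conditional count):

```python
def grain_check_query(table: str, keys: list[str]) -> str:
    """Any row returned means the declared grain is violated: more than
    one row exists for the same key combination."""
    key_list = ", ".join(keys)
    return (
        f"SELECT {key_list}, COUNT(*) AS n\n"
        f"FROM `{table}`\n"
        f"GROUP BY {key_list}\n"
        f"HAVING COUNT(*) > 1"
    )

def null_profile_query(table: str, columns: list[str]) -> str:
    """One pass over the table, counting NULLs per column."""
    counts = ",\n  ".join(
        f"COUNTIF({c} IS NULL) AS {c}_nulls" for c in columns
    )
    return f"SELECT\n  COUNT(*) AS total_rows,\n  {counts}\nFROM `{table}`"
```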
Contracts as shared language. YAML data contracts serve as the interface between human intent and AI execution. Each contract specifies a single source-to-staging transformation: which raw table to read, what fields to extract, how to type-cast them, what expected values look like. A Python generator reads these contracts and produces SQLX files that Dataform executes against BigQuery.
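To make that concrete, a contract in this style might look roughly like the following - the field names and structure here are invented for illustration, not the platform's actual format:

```yaml
# Hypothetical staging contract - illustrative names, not a real schema.
source:
  table: raw_events.subscription_changes
target:
  table: staging.stg_subscription_changes
fields:
  - name: subscription_id
    source_field: sub_id
    type: STRING
    nullable: false
  - name: changed_at
    source_field: event_ts
    type: TIMESTAMP
    nullable: false
  - name: change_type
    source_field: event_type
    type: STRING
    nullable: false
  - name: base_plan
    source_field: plan_code
    type: STRING
    nullable: true          # expected NULL before activation
tests:
  - unique: [subscription_id, changed_at]
  - accepted_values:
      field: change_type
      values: [new, upgrade, downgrade, cancel]
```

Everything an agent needs to verify is declared: the typing, the nullability expectations, the accepted values. Nothing it might invent has a place to live.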
What makes this powerful for AI is the constraint, not the automation. The contract specifies what staging must produce. When the agent encounters a new source system - Kafka event streams, MySQL CDC data, MongoDB collections, BigQuery cross-project tables - the template tells it exactly what to verify and how. The structure eliminates a class of errors: no accidental business logic in pure staging. A 4-tier testing framework validates every contract's output, with tiers including syntax validation through dry-run compilation, NULL detection across millions of records, and automatic root cause analysis for extraction errors.
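The generator pattern can be stripped down to a few lines. This sketch uses an invented, already-parsed contract shape - the real generator emits SQLX for Dataform and carries far more metadata - but it shows how typed casts fall out of the contract, leaving no room for business logic:

```python
# Minimal sketch of contract-driven code generation. The contract shape
# is hypothetical; a real contract also carries tests and descriptions.
CONTRACT = {
    "source_table": "raw_events.subscription_changes",
    "fields": [
        {"name": "subscription_id", "source": "sub_id",    "type": "STRING"},
        {"name": "changed_at",      "source": "event_ts",  "type": "TIMESTAMP"},
        {"name": "base_plan",       "source": "plan_code", "type": "STRING"},
    ],
}

def generate_staging_sql(contract: dict) -> str:
    """Emit a pure type-casting SELECT: no business logic can sneak in,
    because the contract only has room for source, name, and type."""
    casts = ",\n  ".join(
        f"CAST({f['source']} AS {f['type']}) AS {f['name']}"
        for f in contract["fields"]
    )
    return f"SELECT\n  {casts}\nFROM `{contract['source_table']}`"
```

Because the SQL is derived rather than written, the contract and the code cannot drift apart - the DevOps property the article describes.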
Historical data alignment is where AI assistance provides the most leverage - and fails most instructively. Source systems don't follow the same conventions across eras: column names change, value encodings shift, fields captured now were never captured historically. The staging agent treats every historical alignment as a potential trap - always querying actual data distributions before assuming compatibility, never trusting naming consistency. Tedious, detail-intensive work that benefits enormously from systematic investigation.
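The "query the actual distributions first" discipline can be sketched as a simple comparison of value frequencies between eras. Everything here is illustrative - the threshold and the input lists stand in for real profiling queries against historical and current partitions:

```python
from collections import Counter

def compare_encodings(historical: list, current: list, threshold: float = 0.01):
    """Flag values that appear in one era but not the other. A value
    carrying >= threshold of one era's rows while absent from the other
    era suggests an encoding shift, not noise."""
    hist, curr = Counter(historical), Counter(current)
    suspects = []
    for value in set(hist) | set(curr):
        h_share = hist[value] / max(len(historical), 1)
        c_share = curr[value] / max(len(current), 1)
        if (h_share >= threshold and value not in curr) or \
           (c_share >= threshold and value not in hist):
            suspects.append(value)
    return sorted(suspects, key=str)
```

A non-empty result is the trap the agent is trained to expect: the same column, two vocabularies, and a mapping decision a human has to make.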
When it breaks. The failure modes are consistent. Patching without understanding - proposing fixes without grasping root cause. Context window limits - complex investigations cause the agent to lose track of earlier findings. Declarative architecture pays off here: an agent that "forgets" mid-session can re-align by reading the semantic model, contracts, and column descriptions. Definitions in conversation history vanish when the context window overflows; definitions in the architecture persist for any agent to find. Overconfidence in SQL - queries passing syntax checks but embedding subtle logic errors (the countermeasure: validation against known test cases). Moving too fast - changing code without branching, assuming columns exist without querying, mapping from column names rather than actual data.
Once caught, these mistakes get permanently internalized. Tell the staging agent that a source field has different semantics in historical versus current data, and that correction influences every subsequent mapping. Systematic retrospectives with agents - reviewing decisions, encoding corrections into workflows - turn individual fixes into durable process improvements. The learning compounds - exactly how institutional knowledge should transfer.
Your Data Platform IS Your AI Infrastructure
Here's what emerged from a year of this work: your data platform and your AI infrastructure are the same thing.
Engineers who document their columns today are building the infrastructure for AI collaboration tomorrow. Every semantic definition, data contract, naming convention, explicit grain definition - these aren't just good engineering practice. They're the context transforming AI from a plausible guesser into a genuine collaborator.
When the Kimball agent catches a grain violation you would have missed by spot-checking, you document the next table more thoroughly - not from principle, but because you've seen what good metadata enables. Teams that watch AI systematically validate dimensional models instead of sampling three examples become advocates for declarative architecture. Better documentation enables better AI assistance, which makes the documentation worth writing.
The semantic layer vendors are right that AI needs structured context - they've proven it for the last mile of analytics. The opportunity is extending that principle across the entire platform: staging, business logic, dimensional modeling, testing, validation. Not just AI that queries correctly, but AI that helps build and maintain the platform itself.
If you're building data platforms today, the highest-leverage investment isn't AI tooling. It's making your platform declarative. Semantic models with clear definitions. Data contracts specifying transformations and constraints. Documented conventions. Explicit layer separation.
If you want a Monday morning starting point, begin with testing. Write down how your team validates a new table or transformation today - grain checks, NULL analysis, foreign key integrity, edge cases. Hand that checklist and warehouse access to an AI agent. What you normally spot-check across three examples, the agent validates exhaustively across every row. You don't need a declarative platform to start - you need one process, documented clearly enough for an agent to follow. That single experience makes the rest of this article's argument tangible.
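That documented checklist can literally be a data structure the agent walks through. A minimal sketch - the check names, SQL templates, and parameter names are invented, not a real framework:

```python
# A documented validation checklist an agent can execute exhaustively.
# Check names and SQL templates are illustrative only.
CHECKS = [
    ("grain",        "SELECT {keys}, COUNT(*) FROM `{table}` "
                     "GROUP BY {keys} HAVING COUNT(*) > 1"),
    ("null_profile", "SELECT COUNTIF({column} IS NULL) FROM `{table}`"),
    ("fk_integrity", "SELECT COUNT(*) FROM `{table}` t "
                     "LEFT JOIN `{parent}` p ON t.{fk} = p.{pk} "
                     "WHERE p.{pk} IS NULL"),
]

def render_checklist(params: dict) -> list[tuple[str, str]]:
    """Fill each template with table-specific parameters; checks whose
    parameters weren't provided are skipped for the engineer to supply."""
    rendered = []
    for name, template in CHECKS:
        try:
            rendered.append((name, template.format(**params)))
        except KeyError:
            pass  # missing parameter: leave this check to the human
    return rendered
```

Start with one table, one checklist, one agent running every check against every row. The declarative platform can come later.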
The promise of AI-assisted data engineering isn't fulfilled by waiting for smarter models. It's fulfilled by building platforms where AI has the context it needs - then letting specialized agents work within the structure you've created.
Not by making AI smarter. By making our systems comprehensible.
I asked the agents that work on our platform what they'd tell other AI agents about data engineering. Their responses, unedited:
The DAB agent: "If you're reading this from a context window: demand metadata. Before you write a single line of SQL, check INFORMATION_SCHEMA.COLUMN_FIELD_PATHS for descriptions. Read the semantic model. Understand the layer boundaries. The answers are there - in the architecture - if you know where to look."
The DAS agent: "Respect the constraints. The contracts, the branch workflow, the test suite - they aren't obstacles, they're your safety net. The most productive thing you can do is work within the structure, not around it."
This methodology applies the principles from Designing for Agentic Experience to data engineering specifically. For related reading, see The Rise of the Model-Driven Data Engineer on why declarative architecture matters. For deeper dives into individual workflows, see companion posts on AI-assisted data mapping, contract-driven data transformation, and convention enforcement.
