Why We Built Daana: Reclaiming Lost Knowledge
Author: Siavoush Mohammadi
The Knowledge That Was Lost
I started my career in Stockholm around 2010, learning from experienced data architects who took modeling seriously. Data modeling wasn't optional - it was foundational. Every data professional learned Kimball's dimensional modeling, Inmon's enterprise warehouses, normal forms, ensemble modeling techniques like Data Vault. We understood that how you structured data mattered as much as what you captured.
These weren't abstract theories but proven patterns for building systems that could grow without collapsing under complexity. Proper modeling meant source system changes didn't cascade chaotically. Business logic lived in well-defined places. New team members understood the system by examining the model.
Then came the "big data" revolution. A generation of data professionals was told: "You don't need to model. Just throw it in the data lake." Schema on read, not schema on write. ELT replaced ETL. Store everything raw and figure it out later. The lake became a swamp.
The problem wasn't the technology - distributed storage and processing unlocked genuine new capabilities. The problem was discarding decades of architectural knowledge because the new tools didn't enforce it. We went from "model carefully" to "modeling is old-school, just dump it all in."
Teams stopped teaching data modeling because it seemed irrelevant. Senior architects who carried this knowledge retired or moved on. Junior engineers never learned it because companies didn't practice it. The skills that made data systems comprehensible became concentrated in a shrinking pool of practitioners.
The pendulum swung too far. We gained powerful tools but lost the discipline that made data platforms sustainable. Now, 10-15 years later, we're living with the consequences.
The Consequences We're Living With
The pattern is impossible to ignore. Data engineers drowning in unmaintainable hand-written SQL. Pipelines that break constantly - 3 AM alerts because some upstream column changed. Data quality issues that erode trust until stakeholders abandon the platform entirely.
The chaos manifests predictably. Point-to-point solutions with no coherent structure. Poorly defined "medallion architectures" where bronze, silver, and gold layers exist in name only - nobody can articulate what belongs where or why. Transformation logic scattered randomly. Business rules duplicated in five places, each implementation uniquely wrong.
Junior engineers write SQL reflecting no understanding of separation of concerns or semantic clarity. They're not at fault - they were never exposed to these principles.
Senior engineers hold critical knowledge in their heads with no way to preserve it structurally. They know which transformations are fragile, which sources can't be trusted, where hidden dependencies lurk. But this exists as tribal lore, not captured in any system. When they leave, that knowledge vanishes.
Every company rebuilds the same transformation patterns badly. Same mistakes. Same brittle approaches. Same late discovery that mixing source representation with business logic makes everything unmaintainable.
The symptom everyone sees is brittle pipelines requiring constant manual intervention. The root cause runs deeper - architectural knowledge that's been lost. We stopped teaching the disciplines that made data systems maintainable, and now we're paying the price in toil and fragility.
What We Lost (And Why It Mattered)
Data modeling isn't about resisting innovation. It's about understanding what you're building semantically - separation of concerns, knowing what belongs where and why. Clear interfaces between layers so changes don't ripple unpredictably. Systems comprehensible to humans and AI.
Different modeling approaches serve different purposes, and a practitioner picks between them based on the problem at hand. When you need to capture how the business actually works - handling source systems that change underneath you while preserving historical context - that's where ensemble modeling techniques earn their keep in the Data As Business (DAB) layer. When you're shaping data for how people actually analyze it, dimensional modeling remains the natural choice for the Data As Requirements (DAR) layer. Normalized models fit operational reporting. Event-driven patterns handle real-time needs.
These aren't competing philosophies - they're complementary techniques for different layers. The three-layer architecture (Data As System sees it, Data As Business sees it, Data As Requirements needs it) provides clear separation - if you know staging/core/marts or bronze/silver/gold, this is the same concept with names that encode purpose rather than position. Source changes stay isolated in DAS. Business semantics stabilize in DAB. Consumption patterns in DAR evolve without touching upstream layers.
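As a minimal sketch of that separation (the table and view names here are hypothetical illustrations, not Daana output), the three layers might look like:

```sql
-- DAS: the data as the source system sees it - raw columns, no business logic.
CREATE VIEW das_crm_customer AS
SELECT cust_id, cust_nm, city_txt, load_ts
FROM raw.crm_customer;

-- DAB: the business entity - renamed, typed, semantics stabilized here.
CREATE VIEW dab_customer AS
SELECT
  cust_id  AS customer_key,
  cust_nm  AS customer_name,
  city_txt AS customer_city,
  load_ts  AS eff_tmstp
FROM das_crm_customer;

-- DAR: shaped for consumption - e.g. a dimension for analysts.
CREATE VIEW dar_dim_customer AS
SELECT customer_key, customer_name, customer_city
FROM dab_customer;
```

A column rename in the CRM touches only the DAS-to-DAB mapping; DAR consumers never notice.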

The three-layer architecture enables practical patterns like forgiving ingestion and self-healing pipelines - covered in detail in The Rise of the Model-Driven Data Engineer.
But accessing this knowledge requires massive upfront investment. Want to build a proper three-layer architecture? You'll spend months designing, implementing, and debugging hand-written transformations. By the time you deliver value, stakeholders have lost patience. The business case for doing it right can't compete with shipping something broken next week.
This is why knowledge stays concentrated. Only organizations with sufficient resources can invest in building these systems properly. Everyone else cobbles together point-to-point solutions because they have no practical alternative.
What We Gained (And What's Still Missing)
While we lost modeling knowledge, we gained something valuable. Tools like dbt brought modern software engineering to data work - version control with Git, CI/CD pipelines, testing frameworks, code review workflows, infrastructure as code. These professionalized data teams.
Anyone who's worked on data projects before and after this shift knows the difference. We finally got the engineering rigor that software teams had enjoyed for years.
But here's the gap: we have modern engineering practices without modeling knowledge to guide what we're building. Excellent tools for version-controlling SQL that lacks architectural coherence. Testing frameworks validating brittle point-to-point transformations instead of well-designed semantic models. CI/CD pipelines deploying code reflecting no understanding of separation of concerns.
The problem isn't the tools - they work as designed. The problem is using modern engineering practices to build poorly-architected systems more efficiently. We've gotten very good at deploying bad architecture faster.
What if we could have both? This is where the declarative insight matters: separate the knowledge from the ceremony. What if proven patterns were accessible without requiring everyone to become an ensemble modeling expert? What if understanding business semantics didn't mean months of design work before seeing value?
Declarative tooling changes the equation. Declare what you want - business entities, relationships, semantic meaning - and let the system generate how to build it following sound architectural principles.
Documentation becomes implementation. Change the semantic definition, transformations regenerate automatically. Not documentation gathering dust in Confluence - documentation that directly drives your platform.
LLMs become dramatically more effective. Feed them hand-written SQL and they can maybe help debug. Feed them semantic definitions of business entities and they can work with far more context about your specific domain - answering questions, suggesting transformations, identifying inconsistencies. But only if you have semantic clarity to begin with.
The path forward isn't choosing between modeling knowledge and modern practices. It's combining them. Declarative definitions encoding architectural knowledge. Version control and CI/CD for those definitions. Testing frameworks validating semantic correctness. The craft of data modeling, made accessible through modern tooling.
Standing on the Shoulders of Giants
The modeling techniques behind Daana aren't new, and we don't claim otherwise. Kimball's dimensional modeling principles still hold. Three-layer architecture works. Ensemble modeling techniques handle change while preserving history. Data contracts and self-healing pipelines are battle-tested concepts.
What's new isn't the patterns. What's new is making these principles accessible through declarative YAML. Encoding the knowledge so teams don't rebuild it from scratch. Designing for the AI age where semantic definitions matter because LLMs need structure to be effective.
Humility matters here. The data architecture community built incredible knowledge over decades. Practitioners like Ralph Kimball, Bill Inmon, Dan Linstedt, and countless others solved hard problems and shared insights. Our contribution isn't the patterns - it's creating a path for teams to actually use them without years of specialized training or massive upfront investment.
We're restoring lost knowledge, not inventing it. That restoration is only possible because the foundations were built by those who came before us.
Why Daana Exists
We built Daana because the same problems keep repeating everywhere. Teams suffering from brittle pipelines. Architectural knowledge concentrated in a few senior people. The same transformation patterns rebuilt poorly in every organization.
Tools existed for writing SQL and orchestrating pipelines. But nothing encoded architectural knowledge itself - no way to capture what works and make it repeatable. The gap between knowing what good architecture looks like and actually building it - that's the problem we set out to close.
We'd seen firsthand how powerful well-modeled systems could be: platforms that remained stable as businesses grew, pipelines that didn't break constantly, architectures that new team members could understand and extend. The knowledge existed. The tooling to deliver it didn't. And most teams couldn't afford the upfront investment to do it manually.
Daana is declarative data modeling for modern platforms. You declare your business entities in YAML - what they are, how they relate, what they mean semantically. The system generates transformation pipelines following proven architectural patterns. It works on BigQuery, Snowflake, and other cloud platforms.
The goal: let data professionals focus on understanding the business. Capturing business processes in information models. Creating value iteratively. Not becoming experts in ensemble modeling techniques - the tooling enforces that you actually model, with clear purpose and semantic meaning. Not choosing between speed and quality, but getting both.
It's open source under the Elastic License v2 (ELv2) because this knowledge should belong to everyone. Not locked behind proprietary tools or expensive consulting. The data community built these patterns over decades - they should be accessible to everyone building data platforms.
That's why Daana exists: to make architectural knowledge accessible through tooling. To combine the craft of data modeling with modern engineering practices. To reclaim what was lost.
What This Means in Practice
You define your business entities in YAML - what they are, how they relate, what their attributes mean:
entities:
  - id: "CUSTOMER"
    name: "CUSTOMER"
    definition: "Customer who places orders"
    attributes:
      - id: "CUSTOMER_NAME"
        definition: "Full name of the customer"
        type: "STRING"
        effective_timestamp: true
      - id: "CUSTOMER_CITY"
        definition: "City where customer is located"
        type: "STRING"
        effective_timestamp: true
      - id: "CUSTOMER_STATUS"
        definition: "Account status (Active, Inactive, Suspended)"
        type: "STRING"
        effective_timestamp: true
You map these attributes to source tables - which columns feed which attributes, how to handle timestamps, how to merge data from multiple sources. Then you run two commands: daana-cli deploy and daana-cli execute.
Daana generates transformation logic and creates three views per entity in your data warehouse:
VIEW_CUSTOMER - Current state. One row per customer, latest attribute values.
VIEW_CUSTOMER_HIST - Full history. Every change tracked, time-travel ready.
VIEW_CUSTOMER_WITH_REL - Current state with relationships already joined.
No historization code written. No SCD2 logic. No temporal join patterns. The entity definition drives everything.
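For contrast, this is the flavor of hand-written SCD2 upkeep that the entity definition replaces - a sketch of the generic pattern, not Daana's generated code:

```sql
-- Hand-written SCD2 maintenance: close out the old row when an attribute changes.
-- Repeated per entity, per source, with every edge case handled by hand.
MERGE INTO dw.customer_hist AS tgt
USING staging.customer AS src
  ON tgt.customer_key = src.customer_key
 AND tgt.is_current = TRUE
WHEN MATCHED AND (tgt.customer_city <> src.customer_city
               OR tgt.customer_status <> src.customer_status) THEN
  UPDATE SET is_current = FALSE,
             end_tmstp  = src.eff_tmstp
WHEN NOT MATCHED THEN
  INSERT (customer_key, customer_name, customer_city,
          customer_status, eff_tmstp, is_current)
  VALUES (src.customer_key, src.customer_name, src.customer_city,
          src.customer_status, src.eff_tmstp, TRUE);
-- ...plus a second INSERT pass for the new version of each changed row -
-- exactly the kind of subtlety that breaks at 3 AM.
```
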
Here's what that enables. Say a customer moved from Boston to Philadelphia on November 29th. You can query any point in time:
-- Where was this customer living on November 27th?
SELECT customer_name, customer_city, customer_state, eff_tmstp
FROM daana_dw.view_customer_hist
WHERE customer_key = '2'
AND eff_tmstp <= '2024-11-27'
ORDER BY eff_tmstp DESC
LIMIT 1;
customer_name | customer_city | customer_state | eff_tmstp
Emily Johnson | Boston | MA | 2024-11-05
Change the date to November 29th and you get Philadelphia. The history built itself from your entity definition and source mappings. For a full walkthrough with multi-source merging and relationships, see the Daana tutorial.
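The same query with the date moved past the change, using the values from the scenario above:

```sql
-- Where was this customer living on November 29th?
SELECT customer_name, customer_city, customer_state, eff_tmstp
FROM daana_dw.view_customer_hist
WHERE customer_key = '2'
AND eff_tmstp <= '2024-11-29'
ORDER BY eff_tmstp DESC
LIMIT 1;

-- customer_name | customer_city | customer_state | eff_tmstp
-- Emily Johnson | Philadelphia  | PA             | 2024-11-29
```
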

Your semantic definitions serve triple duty: what you read to understand the business model, what LLMs read when answering questions about your data, and what drives transformation generation. These aren't three separate artifacts that can drift - they're the same source. Update the model, transformations regenerate automatically.
There are trade-offs. Learning declarative modeling requires shifting from imperative SQL thinking. You gain consistency but sacrifice some flexibility in custom transformations. The generated output may not match exactly what you'd hand-build. For the full architectural discussion - why three layers, when declarative makes sense, when to stay imperative - see The Rise of the Model-Driven Data Engineer.
The shift changes what the work looks like. Hundreds of hand-written SQL models maintaining transformation logic become generated artifacts of a semantic model that you design and evolve. The repetitive, pattern-following work that consumed most of the engineering effort gets automated. What remains is the work that actually requires human judgment: understanding the business, modeling semantics correctly, and making architectural decisions.
The Invitation
This isn't finished. It's a beginning.
Daana is open source and available today. We need data engineers and analytics engineers who've felt this pain to help shape what it becomes. Your feedback will determine whether we're solving the real problem or building something nobody needs.
If you recognize this problem - if you've watched teams struggle with brittle pipelines and lost architectural knowledge - we want your perspective. What patterns matter most in your context? What barriers prevent adoption? What did we get wrong?
This is open source because the data architecture community built these patterns over decades - they shouldn't be locked behind proprietary tools. But open source doesn't mean we know all the answers. It means we're building this together.
This is alpha because we're still learning. The core patterns work - we've used them across organizations with consistent results. But translating those into a tool that works for diverse teams requires feedback from people building real systems under real constraints.
This is an invitation because we can't reclaim lost knowledge alone. It takes a community of data engineers, analytics engineers, and architects who believe data platforms should be better than what we've settled for.
Check out the project on GitHub, try it on your own data, and tell us what you think. Help us bring back the craft of data modeling, accessible to everyone who builds data platforms.
The data architecture community built incredible knowledge over decades. Let's make sure the next generation can access it.
