Growing companies aren’t short on data. They’re short on data they can trust. The question most teams are asking is: how do we use AI to get better insights?
The question they should be asking is: why can’t we get reliable insights right now? The answer is almost always the same. Not strategy. Not tools. Not budget. Data infrastructure.
Why AI projects stall
There’s a gap between how companies describe their data situation and what’s actually there when you look.
The story tends to follow a familiar pattern. Data lives in a dozen disconnected tools. Marketing runs its own spreadsheets. Finance has a different number for the same metric. The engineering team is too busy keeping existing systems running to connect any of it. Leadership is making decisions based on gut feel, or worse, debating which spreadsheet is correct instead of using data to move forward.
Meanwhile, the board is asking about AI.
Gartner estimates that organizations lacking AI-ready data will abandon 60% of their AI projects through 2026, not because of bad models or the wrong tools, but because the underlying data isn’t ready.
McKinsey’s 2025 State of AI report found that 88% of organizations are using AI in some form, but only 6% are extracting meaningful value (≥5% EBIT impact from AI). Data infrastructure is the most common first blocker, and the one that’s entirely within your control to fix.
You can’t build a RAG pipeline on top of stale spreadsheets. You can’t train an agent to act on your operations data if that data lives in manual exports that someone runs on Friday afternoons. You can’t get AI to answer questions about your business if no one agrees on the numbers in the first place.
What “data modernization” actually means
The term is used loosely, so let me clarify what it means in practice.
Traditional data architecture was built around ETL: Extract, Transform, Load. You pulled data from source systems, cleaned and shaped it before storage, then loaded it into a warehouse. The problem is that the transformation happened before you knew what questions you’d eventually want to ask, including questions you couldn’t have predicted when AI barely existed.
Modern data architecture flips this. ELT: Extract, Load, Transform. You ingest raw data first, store everything, then transform it on demand. Storage is highly scalable and no longer a binding constraint. Compute is consumption-based, meaning you pay for what you use, but that also means runaway query costs are real if you don’t partition tables and manage access carefully.
The value of ELT isn’t just cheapness; it’s flexibility: you can reshape data for new AI use cases without rebuilding pipelines from scratch.
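The flip is easiest to see in a few lines of code. This is a toy sketch, assuming pandas and made-up records (`raw_orders` and its fields are illustrative, not a real pipeline):

```python
import pandas as pd

# Extract + Load: land the raw records untouched, messy formatting and all.
raw_orders = pd.DataFrame([
    {"id": 1, "amount": "120.50", "region": "emea "},
    {"id": 2, "amount": "80.00",  "region": "AMER"},
])

# Transform: happens on demand, after loading. Because the raw data is
# still there, this step can be re-run or reshaped as new questions appear.
orders = raw_orders.assign(
    amount=raw_orders["amount"].astype(float),
    region=raw_orders["region"].str.strip().str.upper(),
)

total = orders["amount"].sum()
```

In an ETL world, the string-to-float cleanup would have happened before storage, and any information the cleanup discarded would be gone for good.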
The most durable way to organize this is through a three-layer structure called the medallion architecture:
Bronze is raw ingested data. Exactly what came from the source system. Untouched.
Silver is where the real work happens. Data gets cleaned, joined, enriched, and given business context. This is the layer that matters most for AI: structured enough for models to use, but close enough to raw that you haven’t over-engineered it for a single use case.
Gold is business-ready data: structured, aggregated, and modeled for specific reporting needs. Executive dashboards, KPI tracking, board-level metrics.
Most companies think of gold as the finish line. In an AI context, silver is often more valuable. It’s the layer that feeds RAG pipelines, knowledge graphs, and agentic workflows, because AI needs context and nuance, not just summary statistics.
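The bronze-to-silver step can be sketched concretely. Assuming pandas and two hypothetical source exports (the table and column names are invented for illustration):

```python
import pandas as pd

# Bronze: raw, untouched exports from two source systems.
bronze_crm = pd.DataFrame([
    {"account_id": "a1", "name": "Acme ", "plan": "pro"},
    {"account_id": "a2", "name": "Globex", "plan": None},
])
bronze_billing = pd.DataFrame([
    {"acct": "a1", "mrr_usd": 500},
    {"acct": "a2", "mrr_usd": 120},
])

# Silver: cleaned, joined, and given business context. One reliable
# accounts table instead of two systems that disagree.
silver_accounts = (
    bronze_crm.assign(
        name=bronze_crm["name"].str.strip(),
        plan=bronze_crm["plan"].fillna("unknown"),
    )
    .merge(
        bronze_billing.rename(columns={"acct": "account_id"}),
        on="account_id", how="left",
    )
)
```

The gold layer would aggregate this further (total MRR by plan, say), but a RAG pipeline or agent would retrieve from `silver_accounts` directly.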
A note on unstructured data: not everything lives in a warehouse. Many RAG use cases retrieve from PDFs, support tickets, contracts, call transcripts, and internal wikis. The principle is the same (well-chunked, access-controlled documents with clean metadata), but the pipeline looks different. Structured and unstructured data need to be solved in parallel, not sequentially.
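For the unstructured side, "well-chunked with clean metadata" can look as simple as the following sketch. The chunking strategy, sizes, and field names (`source`, `acl`) are assumptions for illustration, not a recommended production setup:

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows so no sentence
    is stranded at a chunk boundary without context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc_text = "Renewal term: 12 months. Termination requires 60 days notice. " * 8
doc_meta = {"source": "contract_2024.pdf", "acl": ["legal", "finance"]}

# Every chunk carries its source and access-control metadata, so the
# retrieval layer can filter by who is allowed to see it.
chunks = [{"text": c, **doc_meta} for c in chunk(doc_text)]
```

Real systems typically chunk by tokens or semantic boundaries rather than characters, but the principle, small retrievable units with provenance and access metadata attached, is the same.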
The stack isn’t complicated
One reason companies delay data modernization is the assumption that it requires a massive infrastructure overhaul. It doesn’t.
A well-designed modern data stack is intentionally lean:
For extraction and loading, Airbyte is open-source, connects to hundreds of sources natively, and costs a fraction of incumbent options like Fivetran. For teams with the engineering bandwidth, custom Python connectors work well for specific sources.
For storage, BigQuery is a strong default for most mid-market companies, with consumption-based billing and deep integration with the Google ecosystem. Snowflake and Databricks are solid alternatives depending on your existing infrastructure.
For transformation and modeling, dbt is the default choice for most teams in this space: open-source, with a massive community and a proven track record for turning raw data into trustworthy, reusable models.
For visualization, the choice depends on your team’s needs. Sigma is warehouse-native and AI-enabled, built for real exploration. Omni leads with a semantic layer. Hex is better suited for data science workflows. Looker scales well at the enterprise level.
The goal isn’t to build the perfect architecture on day one. It’s to demonstrate that the data is trustworthy before asking anyone to make decisions on it.
Where AI comes in
Once you have a clean silver layer, the AI use cases become straightforward to build.
RAG pipelines work by retrieving relevant chunks of your data and feeding them to an LLM as context. Without a structured, governed silver layer, RAG systems either hallucinate (when they don’t have good data to retrieve) or return inconsistent results (when the same concept is modeled differently across sources). The silver layer is the retrieval database; clean it once, use it everywhere.
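The retrieve-then-prompt loop looks like this in miniature. A hedged sketch: keyword overlap stands in for real embedding similarity, and the rows and question are invented:

```python
def retrieve(question: str, rows: list[dict], k: int = 2) -> list[dict]:
    """Rank rows by naive keyword overlap with the question.
    A real system would use embedding similarity instead."""
    q_terms = set(question.lower().split())
    scored = sorted(
        rows,
        key=lambda r: len(q_terms & set(r["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

# The silver layer is the retrieval database.
silver = [
    {"id": 1, "text": "Q3 churn rose to 4 percent in EMEA"},
    {"id": 2, "text": "New pricing launched in AMER"},
    {"id": 3, "text": "EMEA churn driven by onboarding gaps"},
]

context = retrieve("why did EMEA churn rise", silver)
prompt = "Answer using only this context:\n" + "\n".join(r["text"] for r in context)
```

The key point is what the LLM never sees: anything that wasn’t cleaned and agreed on upstream. If the silver layer carries two conflicting churn numbers, the model will confidently pick one.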
Agentic workflows are where the real operational leverage lives. An agent that monitors customer health scores and triggers a renewal conversation. An agent that analyzes weekly allocation data and flags margin risk before it becomes a problem. An agent that pulls together cross-source data into an executive brief without anyone spending Friday afternoon in spreadsheets. These aren’t science fiction; they’re n8n workflows sitting on top of a well-modeled data layer.
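The core of the health-score example is a small rule over well-modeled data. The thresholds and field names below are assumptions made for illustration; the point is the pattern: query the silver layer, apply a rule, emit an action:

```python
def flag_renewal_risks(accounts: list[dict], threshold: int = 60) -> list[dict]:
    """Return follow-up actions for at-risk accounts nearing renewal."""
    return [
        {"account_id": a["account_id"], "action": "open_renewal_conversation"}
        for a in accounts
        if a["health_score"] < threshold and a["renews_in_days"] <= 90
    ]

accounts = [
    {"account_id": "a1", "health_score": 42, "renews_in_days": 60},
    {"account_id": "a2", "health_score": 85, "renews_in_days": 30},
]
actions = flag_renewal_risks(accounts)  # flags a1 only
```

In practice the rule might live in an n8n workflow and the action might be a Slack message or CRM task, but none of it works if `health_score` means three different things in three different tools.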
The pattern is consistent: the companies that can actually deploy these use cases aren’t the ones that spent the most on AI tooling. They’re the ones who did the unglamorous work of cleaning their data first.
Governance
One piece the lean stack above doesn’t cover: governance.
Who owns this data? Who can query it? What happens when an AI agent surfaces a customer’s PII to the wrong employee, or hallucinates a revenue number into an executive brief?
Before you layer agents on top of your silver layer, you need at a minimum: column-level access controls, data quality tests (dbt tests are a solid starting point), and a defined review process for agentic outputs before they reach decision-makers.
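A minimal quality gate can mirror what dbt’s built-in generic tests (`not_null`, `unique`, `accepted_values`) check. This is a plain-Python sketch with assumed field names, not a replacement for dbt:

```python
def quality_gate(rows: list[dict]) -> list[str]:
    """Return a list of failures; an empty list means the batch passes."""
    failures = []
    ids = [r["account_id"] for r in rows]
    if len(ids) != len(set(ids)):          # unique
        failures.append("account_id not unique")
    if any(r["mrr_usd"] is None for r in rows):  # not_null
        failures.append("mrr_usd contains nulls")
    if any(r["plan"] not in {"free", "pro", "enterprise"} for r in rows):
        failures.append("plan outside accepted values")  # accepted_values
    return failures

bad_rows = [
    {"account_id": "a1", "mrr_usd": 500, "plan": "pro"},
    {"account_id": "a1", "mrr_usd": None, "plan": "gold"},
]
failures = quality_gate(bad_rows)
```

Wire a gate like this in front of any agentic workflow: if the batch fails, the agent doesn’t run, and a human gets the failure list instead of a decision-maker getting a wrong number.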
Start small. Build trust. Then expand.
The biggest mistake isn’t choosing the wrong tools. It’s trying to boil the ocean in phase one.
Pick one high-pain use case. Connect the sources. Build the model. Show that the dashboard is trustworthy and that it saves real time. That proof of value is what earns budget and organizational buy-in for the next phase.
Then expand: more sources, more business unit views, more sophisticated modeling. By the time you’re adding AI agents and RAG pipelines, the data foundation is already there, and the team already trusts it.
That sequence matters. AI on top of bad data doesn’t produce bad AI. It produces confident wrong answers. And confident wrong answers are worse than no answers at all.
The reframe
When your board asks, “What’s our AI strategy?” the honest answer might be: “We’re building it, and it starts with making our data trustworthy.”
That’s not a delay. That’s the strategy.
Data modernization isn’t the thing you do before AI. It’s the work that makes AI possible. The teams treating these as separate projects are running the same project twice, just slower and more expensively. The teams that understand they’re the same project are the ones actually shipping.
Frontier models are a commodity. Every company, including your competitors, has access to the same LLMs. The differentiation was never the model. It was always whether your data was good enough to use one.
At Cheesecake Labs, we help growing companies build the data foundation that makes AI possible, from architecture and pipeline engineering to analytics and agentic workflows. If your team is data-rich but insight-poor, let’s talk.