Growing companies aren’t short on data. They’re short on data they can trust. The question most teams are asking is: how do we use AI to get better insights?
The question they should be asking is: why can’t we get reliable insights right now? The answer is almost always the same. Not strategy. Not tools. Not budget. Data infrastructure.
Why AI projects stall
There’s a gap between how companies describe their data situation and what’s actually there when you look.
The story tends to follow a familiar pattern. Data lives in a dozen disconnected tools. Marketing runs its own spreadsheets. Finance has a different number for the same metric. The engineering team is too busy keeping existing systems running to connect any of it. Leadership is making decisions based on gut feel, or worse, debating which spreadsheet is correct instead of using data to move forward.
Meanwhile, the board is asking about AI.
Gartner estimates that organizations lacking AI-ready data will abandon 60% of their AI projects through 2026, not because of bad models or the wrong tools, but because the underlying data isn’t ready.
McKinsey’s 2025 State of AI report found that 88% of organizations are using AI in some form, but only 6% are extracting meaningful value (≥5% EBIT impact from AI). Data infrastructure is the most common first blocker, and the one that’s entirely within your control to fix.
You can’t build a RAG pipeline on top of stale spreadsheets. You can’t train an agent to act on your operations data if that data lives in manual exports that someone runs on Friday afternoons. You can’t get AI to answer questions about your business if no one agrees on the numbers in the first place.
What “data modernization” actually means
The term is used loosely, so let me clarify what it means in practice.
Traditional data architecture was built around ETL: Extract, Transform, Load. You pulled data from source systems, cleaned and shaped it before storage, then loaded it into a warehouse. The problem is that the transformation happened before you knew what questions you’d eventually want to ask, including questions you couldn’t have predicted when AI barely existed.
Modern data architecture flips this. ELT: Extract, Load, Transform. You ingest raw data first, store everything, then transform it on demand. Storage is highly scalable and no longer a binding constraint. Compute is consumption-based, meaning you pay for what you use, but that also means runaway query costs are real if you don’t partition tables and manage access carefully.
The value of ELT isn’t just cheapness; it’s flexibility: you can reshape data for new AI use cases without rebuilding pipelines from scratch.
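The flip is easiest to see in a few lines of code. This is a toy sketch, assuming pandas and made-up records (`raw_orders` and its fields are illustrative, not a real pipeline):

```python
import pandas as pd

# Extract + Load: land the raw records untouched, messy formatting and all.
raw_orders = pd.DataFrame([
    {"id": 1, "amount": "120.50", "region": "emea "},
    {"id": 2, "amount": "80.00",  "region": "AMER"},
])

# Transform: happens on demand, after loading. Because the raw data is
# still there, this step can be re-run or reshaped as new questions appear.
orders = raw_orders.assign(
    amount=raw_orders["amount"].astype(float),
    region=raw_orders["region"].str.strip().str.upper(),
)

total = orders["amount"].sum()
```

In an ETL world, the string-to-float cleanup would have happened before storage, and any information the cleanup discarded would be gone for good.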
The most durable way to organize this is through a three-layer structure called the medallion architecture:
Bronze is raw ingested data. Exactly what came from the source system. Untouched.
Silver is where the real work happens. Data gets cleaned, joined, enriched, and given business context. This is the layer that matters most for AI: structured enough for models to use, but close enough to raw that you haven’t over-engineered it for a single use case.
Gold is business-ready data: structured, aggregated, and modeled for specific reporting needs. Executive dashboards, KPI tracking, board-level metrics.
Most companies think of gold as the finish line. In an AI context, silver is often more valuable. It’s the layer that feeds RAG pipelines, knowledge graphs, and agentic workflows, because AI needs context and nuance, not just summary statistics.
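The bronze-to-silver step can be sketched concretely. Assuming pandas and two hypothetical source exports (the table and column names are invented for illustration):

```python
import pandas as pd

# Bronze: raw, untouched exports from two source systems.
bronze_crm = pd.DataFrame([
    {"account_id": "a1", "name": "Acme ", "plan": "pro"},
    {"account_id": "a2", "name": "Globex", "plan": None},
])
bronze_billing = pd.DataFrame([
    {"acct": "a1", "mrr_usd": 500},
    {"acct": "a2", "mrr_usd": 120},
])

# Silver: cleaned, joined, and given business context. One reliable
# accounts table instead of two systems that disagree.
silver_accounts = (
    bronze_crm.assign(
        name=bronze_crm["name"].str.strip(),
        plan=bronze_crm["plan"].fillna("unknown"),
    )
    .merge(
        bronze_billing.rename(columns={"acct": "account_id"}),
        on="account_id", how="left",
    )
)
```

The gold layer would aggregate this further (total MRR by plan, say), but a RAG pipeline or agent would retrieve from `silver_accounts` directly.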
A note on unstructured data: not everything lives in a warehouse. Many RAG use cases retrieve from PDFs, support tickets, contracts, call transcripts, and internal wikis. The principle is the same (well-chunked, access-controlled documents with clean metadata), but the pipeline looks different. Structured and unstructured data need to be solved in parallel, not sequentially.
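For the unstructured side, "well-chunked with clean metadata" can look as simple as the following sketch. The chunking strategy, sizes, and field names (`source`, `acl`) are assumptions for illustration, not a recommended production setup:

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows so no sentence
    is stranded at a chunk boundary without context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc_text = "Renewal term: 12 months. Termination requires 60 days notice. " * 8
doc_meta = {"source": "contract_2024.pdf", "acl": ["legal", "finance"]}

# Every chunk carries its source and access-control metadata, so the
# retrieval layer can filter by who is allowed to see it.
chunks = [{"text": c, **doc_meta} for c in chunk(doc_text)]
```

Real systems typically chunk by tokens or semantic boundaries rather than characters, but the principle, small retrievable units with provenance and access metadata attached, is the same.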
The stack isn’t complicated
One reason companies delay data modernization is the assumption that it requires a massive infrastructure overhaul. It doesn’t.
A well-designed modern data stack is intentionally lean:
For extraction and loading, Airbyte is open-source, connects to hundreds of sources natively, and costs a fraction of incumbent options like Fivetran. For teams with the engineering bandwidth, custom Python connectors work well for specific sources.
For storage, BigQuery is a strong default for most mid-market companies, with consumption-based billing and deep integration with the Google ecosystem. Snowflake and Databricks are solid alternatives depending on your existing infrastructure.
For transformation and modeling, dbt is the default choice for most teams in this space: open-source, with a massive community and a proven track record for turning raw data into trustworthy, reusable models.
For visualization, the choice depends on your team’s needs. Sigma is warehouse-native and AI-enabled, built for real exploration. Omni leads with a semantic layer. Hex is better suited for data science workflows. Looker scales well at the enterprise level.
The goal isn’t to build the perfect architecture on day one. It’s to demonstrate that the data is trustworthy before asking anyone to make decisions on it.
Where AI comes in
Once you have a clean silver layer, the AI use cases become straightforward to build.
RAG pipelines work by retrieving relevant chunks of your data and feeding them to an LLM as context. Without a structured, governed silver layer, RAG systems either hallucinate (when they don’t have good data to retrieve) or return inconsistent results (when the same concept is modeled differently across sources). The silver layer is the retrieval database; clean it once, use it everywhere.
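The retrieve-then-prompt loop looks like this in miniature. A hedged sketch: keyword overlap stands in for real embedding similarity, and the rows and question are invented:

```python
def retrieve(question: str, rows: list[dict], k: int = 2) -> list[dict]:
    """Rank rows by naive keyword overlap with the question.
    A real system would use embedding similarity instead."""
    q_terms = set(question.lower().split())
    scored = sorted(
        rows,
        key=lambda r: len(q_terms & set(r["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

# The silver layer is the retrieval database.
silver = [
    {"id": 1, "text": "Q3 churn rose to 4 percent in EMEA"},
    {"id": 2, "text": "New pricing launched in AMER"},
    {"id": 3, "text": "EMEA churn driven by onboarding gaps"},
]

context = retrieve("why did EMEA churn rise", silver)
prompt = "Answer using only this context:\n" + "\n".join(r["text"] for r in context)
```

The key point is what the LLM never sees: anything that wasn’t cleaned and agreed on upstream. If the silver layer carries two conflicting churn numbers, the model will confidently pick one.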
Agentic workflows are where the real operational leverage lives. An agent that monitors customer health scores and triggers a renewal conversation. An agent that analyzes weekly allocation data and flags margin risk before it becomes a problem. An agent that pulls together cross-source data into an executive brief without anyone spending Friday afternoon in spreadsheets. These aren’t science fiction; they’re n8n workflows sitting on top of a well-modeled data layer.
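The core of the health-score example is a small rule over well-modeled data. The thresholds and field names below are assumptions made for illustration; the point is the pattern: query the silver layer, apply a rule, emit an action:

```python
def flag_renewal_risks(accounts: list[dict], threshold: int = 60) -> list[dict]:
    """Return follow-up actions for at-risk accounts nearing renewal."""
    return [
        {"account_id": a["account_id"], "action": "open_renewal_conversation"}
        for a in accounts
        if a["health_score"] < threshold and a["renews_in_days"] <= 90
    ]

accounts = [
    {"account_id": "a1", "health_score": 42, "renews_in_days": 60},
    {"account_id": "a2", "health_score": 85, "renews_in_days": 30},
]
actions = flag_renewal_risks(accounts)  # flags a1 only
```

In practice the rule might live in an n8n workflow and the action might be a Slack message or CRM task, but none of it works if `health_score` means three different things in three different tools.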
The pattern is consistent: the companies that can actually deploy these use cases aren’t the ones that spent the most on AI tooling. They’re the ones who did the unglamorous work of cleaning their data first.
Governance
One piece the lean stack above doesn’t cover: governance.
Who owns this data? Who can query it? What happens when an AI agent surfaces a customer’s PII to the wrong employee, or hallucinates a revenue number into an executive brief?
Before you layer agents on top of your silver layer, you need at a minimum: column-level access controls, data quality tests (dbt tests are a solid starting point), and a defined review process for agentic outputs before they reach decision-makers.
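A minimal quality gate can mirror what dbt’s built-in generic tests (`not_null`, `unique`, `accepted_values`) check. This is a plain-Python sketch with assumed field names, not a replacement for dbt:

```python
def quality_gate(rows: list[dict]) -> list[str]:
    """Return a list of failures; an empty list means the batch passes."""
    failures = []
    ids = [r["account_id"] for r in rows]
    if len(ids) != len(set(ids)):          # unique
        failures.append("account_id not unique")
    if any(r["mrr_usd"] is None for r in rows):  # not_null
        failures.append("mrr_usd contains nulls")
    if any(r["plan"] not in {"free", "pro", "enterprise"} for r in rows):
        failures.append("plan outside accepted values")  # accepted_values
    return failures

bad_rows = [
    {"account_id": "a1", "mrr_usd": 500, "plan": "pro"},
    {"account_id": "a1", "mrr_usd": None, "plan": "gold"},
]
failures = quality_gate(bad_rows)
```

Wire a gate like this in front of any agentic workflow: if the batch fails, the agent doesn’t run, and a human gets the failure list instead of a decision-maker getting a wrong number.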
Start small. Build trust. Then expand.
The biggest mistake isn’t choosing the wrong tools. It’s trying to boil the ocean in phase one.
Pick one high-pain use case. Connect the sources. Build the model. Show that the dashboard is trustworthy and that it saves real time. That proof of value is what earns budget and organizational buy-in for the next phase.
Then expand: more sources, more business unit views, more sophisticated modeling. By the time you’re adding AI agents and RAG pipelines, the data foundation is already there, and the team already trusts it.
That sequence matters. AI on top of bad data doesn’t produce bad AI. It produces confident wrong answers. And confident wrong answers are worse than no answers at all.
The reframe
When your board asks, “What’s our AI strategy?” the honest answer might be: “We’re building it, and it starts with making our data trustworthy.”
That’s not a delay. That’s the strategy.
Data modernization isn’t the thing you do before AI. It’s the work that makes AI possible. The teams treating these as separate projects are running the same project twice, just slower and more expensively. The teams that understand they’re the same project are the ones actually shipping.
Frontier models are a commodity. Every company, including your competitors, has access to the same LLMs. The differentiation was never the model. It was always whether your data was good enough to use one.
At Cheesecake Labs, we help growing companies build the data foundation that makes AI possible, from architecture and pipeline engineering to analytics and agentic workflows. If your team is data-rich but insight-poor, let’s talk.