Your AI Strategy Has a Data Problem
Marcelo Gracietti | Mar 20, 2026
Most articles about the modern data architecture stack start with the tools. Every engagement we run starts somewhere different: your sources and your outputs. The architecture follows the data, not the other way around.
What follows are the engineering decisions that actually move the needle. Not the concepts, but the tradeoffs. Where things break, what holds up under pressure, and how to avoid the mistakes that quietly sink data projects before they ever reach a dashboard.
The first thing we anchor on in any data engagement is a two-column exercise: in one column, the source systems where your data actually lives; in the other, the outputs the business needs, specific dashboards, reports, and decisions.
The entire stack flows from the answers to those two questions.
This sounds obvious, but it’s easy to skip. Teams get excited about BigQuery or Databricks or dbt and start designing infrastructure before they’ve mapped what they’re actually trying to surface. The result is a platform that may be best in class, but doesn’t address what the business actually needs.
Before recommending a single tool, we document the target analytics outputs: specific dashboards, reports, or decisions. Then we trace backwards to the source systems. Everything in between is plumbing.
We almost always find spreadsheets when we trace back to sources.
Finance data lives in a spreadsheet. Cost per employee lives in a spreadsheet. That weekly allocation report someone runs every Monday? A spreadsheet, pulling from another spreadsheet, linked to a third via a VLOOKUP that breaks every time someone changes a column header.
Spreadsheets are not a bad starting point. They’re a bad permanent state.
Treat spreadsheets as a bridge, not a source of truth. You can connect Google Sheets directly to BigQuery via native connectors. It works, and it lets you stand up your first models and get data into a warehouse without changing anything in the existing workflow. Those connections are fragile, though: credentials expire, column names shift, someone deletes a row they shouldn’t have. Plan for maintenance time and plan to replace them.
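Those failure modes can be caught before they corrupt downstream models. A minimal sketch of a pre-load header check for a spreadsheet bridge, run before rows are loaded into the warehouse (the column names here are hypothetical examples, not from any real sheet):

```python
# Pre-load check for a spreadsheet bridge: verify the expected headers
# are still present before loading rows into the warehouse.
# The column names below are hypothetical examples.
EXPECTED_HEADERS = {"employee_id", "cost_center", "monthly_cost"}

def validate_headers(rows: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the load can proceed."""
    if not rows:
        return ["sheet is empty"]
    actual = set(rows[0].keys())
    missing = EXPECTED_HEADERS - actual
    unexpected = actual - EXPECTED_HEADERS
    problems = []
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if unexpected:
        problems.append(f"renamed or extra columns: {sorted(unexpected)}")
    return problems
```

Failing the pipeline loudly on a renamed column is far cheaper than silently loading rows that a downstream model misinterprets.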
Before connecting anything, ask: Is the spreadsheet the actual source, or is there an upstream system that someone is manually transcribing into it? This is one of the most valuable questions to surface early. If someone in finance is manually entering salary data into a sheet once a month, you’ll get better stability going directly to the payroll or HRIS system that feeds them, even if it requires a custom connector. A stable API beats a manual spreadsheet every time.
The goal is to migrate the logic, not just the data. Spreadsheets contain business rules: margin calculations, FX conversions, and allocation formulas. When you move to a warehouse, those rules need to move too and live in your data architecture models. The person who owns the spreadsheet process is your most valuable resource during that migration. They know where the edge cases are.
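Moving those rules out of cell formulas and into versioned code is the core of the migration. A sketch of what a spreadsheet's FX and margin logic might look like as plain functions (field names and rates are illustrative; in a real stack this logic would live in dbt models rather than Python):

```python
def to_usd(amount: float, currency: str, fx_rates: dict[str, float]) -> float:
    """Convert an amount to USD using a rate table that used to live in a sheet."""
    if currency == "USD":
        return amount
    return amount * fx_rates[currency]

def engagement_margin(revenue_usd: float, cost_usd: float) -> float:
    """Margin as a fraction of revenue; mirrors the spreadsheet's cell formula."""
    if revenue_usd == 0:
        return 0.0
    return (revenue_usd - cost_usd) / revenue_usd
```

The point is not the functions themselves but that the rules are now versioned, testable, and defined once, instead of copied across workbook tabs.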
The BigQuery vs. Snowflake vs. Databricks question gets a lot of ink. Our take is based on what we see across the companies we work with.
BigQuery is the right default for most mid-market and growth-stage companies, especially if you’re already in the Google ecosystem. Consumption-based billing means costs stay near zero until you’re moving real volume. Native integration with Google Sheets reduces friction on the source side. The free tier and startup credits make it easy to validate before committing to anything.
The tradeoff: BigQuery is powerful but relatively bare metal compared to all-in-one platforms. You'll compose it with other tools: Airbyte for ingestion, dbt for transformation, and a separate BI layer. That composability is an advantage for teams with engineering capacity, and pure overhead for teams without it.
Snowflake and Databricks make more sense when you’re dealing with very high data volumes, machine learning workloads, or lakehouse use cases: product telemetry, streaming data, and unstructured content. Both are excellent platforms, but both have higher entry costs and tend to favor larger committed volumes before pricing becomes competitive.
Databricks is worth evaluating when your data includes high-volume unstructured content alongside structured operational data. The unified lakehouse simplifies architecture considerably for companies that would otherwise need to manage a separate data lake and a warehouse in parallel.
On Microsoft Fabric: We recommend it if you’re already running within the Microsoft ecosystem. If your organization is standardized on Azure, your team works in the Microsoft toolchain, and your data stays within that environment, Fabric is a legitimate choice, and the tech is solid. The concern isn’t the platform itself; it’s adopting it outside that context.
Airbyte is where you start. Open-source, hundreds of native connectors, low entry cost, active community. The cloud version adds managed infrastructure with usage-based pricing, accessible enough for early-stage data work and scalable to capacity-based plans as pipelines grow. If a connector doesn’t exist natively, you can write a custom one in Python without much effort.
Fivetran has the best connector reliability in the market, but the pricing model got significantly more expensive in March 2025. They shifted from account-level to connector-level MAR (Monthly Active Rows) billing, eliminating the bulk discounts that previously made multi-source setups manageable. Teams with several integrations are reporting 40-70% cost increases as a result.
Initial loads are still free; everything incremental is metered per connector. Use it when reliability is non-negotiable, and the budget is clearly there, but run the numbers for your specific connector count before committing.
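The effect of that billing change is easy to model. A back-of-the-envelope sketch of why per-connector metering eliminates the bulk discount (the tier breaks and per-row rates below are invented placeholders, not Fivetran's actual prices):

```python
# Hypothetical tiered pricing: cheaper per row as volume grows.
# These rates are illustrative placeholders, NOT real Fivetran prices.
TIERS = [(1_000_000, 0.0010), (10_000_000, 0.0005), (float("inf"), 0.0002)]

def tiered_cost(rows: int) -> float:
    """Cost when rows are priced down a volume-tier schedule."""
    cost, prev_cap = 0.0, 0
    for cap, rate in TIERS:
        in_tier = min(rows, cap) - prev_cap
        if in_tier > 0:
            cost += in_tier * rate
        prev_cap = cap
        if rows <= cap:
            break
    return cost

connectors = [2_000_000, 2_000_000, 2_000_000]  # MAR per connector

# Account-level billing: volumes pool, so most rows reach the cheaper tiers.
pooled = tiered_cost(sum(connectors))
# Connector-level billing: every connector restarts at the expensive first tier.
per_connector = sum(tiered_cost(c) for c in connectors)
```

Under any tiered schedule, metering each connector separately means no connector's volume ever pushes another's rows into a cheaper tier, which is exactly why multi-source setups got more expensive.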
Custom Python connectors are underrated. If you have a handful of specific source systems and the engineering capacity to maintain lightweight Python scripts, this is often the most cost-effective long-term path. It also builds internal ownership of the data flow, which matters when something breaks at 2 am. Pair custom connectors with Airflow for orchestration: open-source, well-documented, and flexible enough for most scheduling and dependency requirements.
The practical framework: Airbyte for standard SaaS sources. Native cloud DW connectors, when available, and the source is high-frequency. Custom connectors for niche or proprietary internal systems. Fivetran only when reliability requirements and budget both point in that direction.
The three-layer structure, bronze (raw), silver (cleaned and joined), gold (business-ready), is the right framework. What it actually looks like from an engineering standpoint:
Bronze is just a dump. Raw tables from your sources, untransformed. You never query Bronze directly for analytics. Its job is to exist and be immutable. If something goes wrong downstream, you rebuild from bronze without re-extracting from the source. Keep everything.
Silver is where the real engineering happens. This is where you define what a “customer” means in your data architecture model. Where the FX conversion logic that used to live in a spreadsheet now lives in a dbt model. Where allocation data and cost get joined into a clean, reliable view of margin per engagement. The silver layer should be structured enough to answer most questions, but not so aggregated that you’ve locked into a single use case.
There’s something important about the silver layer that often doesn’t get said: it’s also your AI layer. When you deploy RAG pipelines, agentic workflows, or any LLM-based automation, you want models pulling context from clean, governed silver-layer data, not hitting source systems directly. Source APIs have rate limits, and source data is noisy. Silver data has been cleaned, joined, and given business context; that’s what a model needs to produce reliable outputs. Building the silver layer with AI access in mind from day one means you won’t have to rebuild it when AI use cases arrive.
Gold is for specific stakeholders. Sales mart. Finance mart. Operations mart. These are highly structured, aggregated views that answer known, recurring questions. Build them after the silver layer is stable. Don’t build gold for every possible use case upfront; that’s how you end up with stale marts nobody uses.
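In miniature, the three layers look something like this (table and field names are illustrative; in a real stack, silver and gold would be dbt models over warehouse tables, not Python):

```python
# Bronze: raw rows kept exactly as extracted, duplicates and all.
bronze_allocations = [
    {"engagement": "acme", "person": "ana", "hours": 80, "_loaded": "run-1"},
    {"engagement": "acme", "person": "ana", "hours": 80, "_loaded": "run-2"},  # re-extract dup
    {"engagement": "beta", "person": "bob", "hours": 40, "_loaded": "run-1"},
]
bronze_rates = [{"person": "ana", "rate": 50.0}, {"person": "bob", "rate": 60.0}]

def build_silver(allocations: list[dict], rates: list[dict]) -> list[dict]:
    """Silver: dedupe, join, and apply business rules exactly once."""
    latest = {}
    for row in allocations:  # keep the most recent load per (engagement, person)
        latest[(row["engagement"], row["person"])] = row
    rate_by_person = {r["person"]: r["rate"] for r in rates}
    return [
        {
            "engagement": e,
            "person": p,
            "hours": row["hours"],
            "cost": row["hours"] * rate_by_person[p],
        }
        for (e, p), row in latest.items()
    ]

def build_gold(silver: list[dict]) -> dict[str, float]:
    """Gold: an aggregated mart answering one known question (cost per engagement)."""
    totals: dict[str, float] = {}
    for row in silver:
        totals[row["engagement"]] = totals.get(row["engagement"], 0.0) + row["cost"]
    return totals
```

Bronze stays untouched, silver carries the joins and business rules, and gold is a narrow aggregate built for one audience, which is the same division of labor at warehouse scale.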
Once data is landing in the warehouse, you need a modeling layer. dbt is where the data engineering community has broadly landed, and for good reason.
dbt Core (free, open-source) is sufficient for most engagements. You write SQL models, run them with the CLI, and get clean, versioned, testable transformation logic. If your engineering team is comfortable in a terminal, Core gets you 90% of the way there.
dbt Cloud adds the visual DAG layer: graphical lineage, a UI for running jobs, and managed scheduling. If stakeholders outside the data team need visibility into what’s running and why, Cloud is worth the cost. For a purely technical audience, Core is enough.
The governance story lives inside dbt’s model architecture. When every metric is defined once, one model that says “this is what an active engagement means,” you eliminate the situation where two teams compute the same number differently and spend a board meeting arguing about it. Not a policy document, but a single source of logic that everyone queries from. dbt’s native tests (not_null, unique, accepted_values) catch data quality issues before they reach a dashboard. Start with basic tests on every model from day one.
Looker Studio is free and useful for early validation. The real limitations aren’t in the number of charts; they’re in the absence of a semantic or modeling layer, basic role-only governance with no row-level security or workspaces on the free tier, performance degradation with large or multi-source datasets, and third-party connectors that break more than you’d expect. It works for confirming that data is landing correctly before committing to a paid tool. For anything stakeholders depend on daily, the fragility shows up fast.
Sigma is a strong default for warehouse-native analytics. Good AI-assisted exploration, intuitive enough for business users after a short ramp-up, and it doesn’t require data to be moved out of the warehouse to generate reports.
One thing to account for: Sigma live-queries the warehouse for every interaction rather than using extracts or caching, so it can drive up compute costs on pay-per-query billing models like BigQuery or Snowflake if exploration patterns are heavy. For teams on reserved capacity, it’s a non-issue. For consumption-billed warehouses, monitor query patterns early.
Hex fits data science workflows better than pure operational BI. If your team blends notebooks with dashboards for ML experimentation, ad-hoc analysis, and sharing results, Hex fits well. For standard operations or business analytics, it’s more surface area than you need.
Looker (the full product) is expensive and has capable alternatives now. We’d evaluate it primarily for organizations already committed to Google Cloud that want to consolidate vendor relationships and have the budget for enterprise tooling.
For internal tooling where flexibility matters more than licensing, a lightweight React frontend querying your gold layer directly is often faster to build and easier to maintain than bending a BI tool to fit an unusual use case. This works particularly well for operational tools: allocation dashboards, margin trackers, team capacity views, anywhere the UI needs to match an internal workflow that no off-the-shelf product quite fits.
When starting a data architecture engagement, the decisions flow in this order. Is the company already committed to a cloud provider? Start there and build within that ecosystem.
If no preference exists, are the primary sources in Google Workspace? BigQuery and native connectors are the path of least resistance.
What are the source types? Mostly standard SaaS tools with existing connectors like Salesforce, HubSpot, or Amplitude? Airbyte. Mostly custom or proprietary internal systems? Custom Python connectors with Airflow. Mixed? Airbyte for the standard sources, custom for the rest.
Are there spreadsheet sources? Connect them as a bridge. Map the upstream systems. Begin planning the migration to stable API connections. What are the target outputs? Design silver layer models around those outputs, not around what’s easy to build. Build gold only when the silver layer is stable, and the audience is defined.
Is AI a near-term use case? Design the silver layer to be AI-accessible from day one. A gold-only architecture will need rework when LLM or agentic workflows arrive.
Understanding your data well enough to know which tools are justified is the hard part. Picking the tools is easy once you do.
Map your outputs first. Trace your sources. Design the silver layer around the decisions that matter. Build governance in from the start with dbt. Treat spreadsheets as a bridge while you identify the stable systems behind them. Keep the stack lean; every additional tool is a surface area for failure and a vendor relationship to manage.
The companies that get AI working in production aren’t the ones with the most sophisticated architectures. They’re the ones who got their data trustworthy first, then layered AI on top.
At Cheesecake Labs, we help companies build the data architecture foundation that makes AI possible, from architecture and pipeline engineering to analytics and agentic workflows. If your team is data-rich but insight-poor, let’s talk.

Douglas started as a Senior Full-Stack Developer at Cheesecake Labs and is currently a Partner and CBDO at the company.