Jidono

The Modern Data Lakehouse: Architecture Decisions That Matter

The lakehouse architecture is no longer a debate. The combination of cheap object storage, open table formats like Iceberg and Delta, and query engines that can read both structured and semi-structured data has settled the data warehouse versus data lake question for most organizations.

The remaining decisions are about how to build a lakehouse, not whether to. Those decisions still matter, and getting them wrong creates the same data swamps that data lakes were notorious for a decade ago.

The architectural shape

A modern lakehouse has four layers, regardless of which vendor stack you pick:

  1. Storage: Object storage (S3, GCS, ADLS) holding files in an open columnar format like Parquet
  2. Table format: Iceberg, Delta, or Hudi — providing ACID transactions, schema evolution, and time travel over the raw files
  3. Catalog: A metadata service that exposes those tables to query engines and governs access
  4. Compute: Multiple query engines (warehouse-style, streaming, ML) reading the same tables through the catalog

The decisions that matter live at the boundaries between these layers, especially the table format and catalog choices.
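
To make the layering concrete, here is a minimal Python sketch of how the four layers relate. Every name in it (TableEntry, Catalog, Engine, the example bucket path) is hypothetical and chosen for illustration; it models the idea that engines find tables only through the catalog, and is not any real catalog's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableEntry:
    """Catalog record tying a logical table name to its format and files."""
    name: str          # logical name, e.g. "sales.orders"
    table_format: str  # layer 2: "iceberg", "delta", or "hudi"
    location: str      # layer 1: object-storage prefix holding the Parquet files


@dataclass
class Catalog:
    """Layer 3: the only place engines learn where a table lives."""
    tables: dict = field(default_factory=dict)

    def register(self, entry: TableEntry) -> None:
        self.tables[entry.name] = entry

    def resolve(self, name: str) -> TableEntry:
        return self.tables[name]


@dataclass
class Engine:
    """Layer 4: a compute engine that never hard-codes storage paths."""
    name: str
    catalog: Catalog

    def plan_scan(self, table_name: str) -> str:
        entry = self.catalog.resolve(table_name)
        return f"{self.name}: scan {entry.table_format} table at {entry.location}"


catalog = Catalog()
catalog.register(
    TableEntry("sales.orders", "iceberg", "s3://example-bucket/sales/orders")
)

# Two different engines read the same table through the same catalog entry.
sql_engine = Engine("sql", catalog)
ml_engine = Engine("ml", catalog)
print(sql_engine.plan_scan("sales.orders"))
print(ml_engine.plan_scan("sales.orders"))
```

Because both engines resolve the table through one catalog entry, moving or reformatting the table is a change in one place rather than in every workload.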

Open table formats: pick one and commit

Iceberg, Delta, and Hudi all solve roughly the same problems. The differences matter at the margins, but the bigger risk is using all three. We’ve seen organizations end up with Delta tables for their Databricks workloads, Iceberg for their analytics platform, and Hudi for their streaming ingest — and then spend the next two years reconciling the three.

Pick one. The criteria that should drive the decision:

  • Which engines you actually use (Iceberg has the broadest engine support; Delta is strongest in Databricks)
  • Whether you have streaming requirements (Hudi was designed for them)
  • Where your team’s skills currently live

The wrong answer is “all of the above.”
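
One way to keep that commitment honest is a periodic check of which formats are actually in use. The sketch below assumes you can export (table name, format) pairs from your catalog; the inventory and function names are hypothetical.

```python
from collections import Counter

# Hypothetical inventory: (table name, table format) pairs exported from a catalog.
INVENTORY = [
    ("sales.orders", "iceberg"),
    ("sales.customers", "iceberg"),
    ("ml.features", "delta"),
    ("ingest.clickstream", "hudi"),
]


def format_usage(inventory):
    """Count how many tables use each table format."""
    return Counter(fmt.lower() for _name, fmt in inventory)


def off_standard_tables(inventory, chosen_format):
    """Return the names of tables that are not in the chosen format."""
    chosen = chosen_format.lower()
    return sorted(name for name, fmt in inventory if fmt.lower() != chosen)


usage = format_usage(INVENTORY)
stragglers = off_standard_tables(INVENTORY, "iceberg")
print(dict(usage))
print(stragglers)
```

A check like this could run in CI or on a schedule, so a second format shows up as a flagged exception rather than a surprise.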

The catalog is the most underrated decision

Most lakehouse projects spend three months debating the table format and three days picking a catalog. That ratio is backwards.

The catalog is where your governance, lineage, access control, and engine federation live. It’s the most permanent decision you’ll make because migrating it later means migrating every workload that depends on it. The catalog choices that matter:

  • Does it support fine-grained access control (column, row, masking)?
  • Does it integrate with your identity provider?
  • Does it expose tables to every engine you need, or just one vendor’s?
  • Does it support both batch and streaming workloads natively?

Vendor-specific catalogs are seductive because they integrate well with their own engines. They become limiting the moment you adopt a second engine, which most organizations eventually do.
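
To illustrate what fine-grained access control means in practice, here is a small Python sketch of column-level, row-level, and masking rules applied to query results. It is a toy model under our own assumptions (the Policy class, apply_policy function, and sample data are all hypothetical); real catalogs enforce these rules inside the engine, not in application code.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MASK = "***"


@dataclass
class Policy:
    """Hypothetical per-role policy: which columns and rows a reader may see."""
    allowed_columns: set
    masked_columns: set = field(default_factory=set)
    row_filter: Optional[Callable[[dict], bool]] = None


def apply_policy(rows, policy):
    """Return only the rows and columns the policy allows, masking where required."""
    result = []
    for row in rows:
        if policy.row_filter is not None and not policy.row_filter(row):
            continue  # row-level rule: drop rows the reader may not see
        visible = {}
        for column, value in row.items():
            if column not in policy.allowed_columns:
                continue  # column-level rule: hide the column entirely
            visible[column] = MASK if column in policy.masked_columns else value
        result.append(visible)
    return result


CUSTOMERS = [
    {"id": 1, "region": "EU", "email": "a@example.com", "ssn": "111"},
    {"id": 2, "region": "US", "email": "b@example.com", "ssn": "222"},
]

eu_analyst = Policy(
    allowed_columns={"id", "region", "email"},
    masked_columns={"email"},
    row_filter=lambda row: row["region"] == "EU",
)

visible_rows = apply_policy(CUSTOMERS, eu_analyst)
print(visible_rows)
```

The question to ask of a catalog is whether it can express all three rule types once and have every engine honor them.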

Bronze, silver, gold: still useful

The medallion architecture (bronze for raw, silver for cleaned, gold for serving) gets dismissed as too rigid by some teams. We still recommend it as a default, with one important nuance: the gold layer should be modeled for specific use cases, not as a single “serving” layer.

A reporting gold table looks different from an ML feature table, which in turn looks different from a customer-facing API table. Trying to make one gold layer serve all three usually produces something that serves none of them well.
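
As a rough illustration, the Python sketch below derives two differently shaped gold tables from the same silver data: one aggregated by day for reporting, one keyed by customer for ML features. The sample rows and function names are hypothetical; in a real lakehouse these would be SQL or DataFrame transformations.

```python
from collections import defaultdict

# Hypothetical silver table: cleaned, one row per order.
SILVER_ORDERS = [
    {"order_id": 1, "customer_id": "c1", "day": "2024-01-01", "amount": 20.0},
    {"order_id": 2, "customer_id": "c2", "day": "2024-01-01", "amount": 30.0},
    {"order_id": 3, "customer_id": "c1", "day": "2024-01-02", "amount": 5.0},
]


def gold_daily_revenue(orders):
    """Reporting gold table: one row per day, shaped for dashboards."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["day"]] += order["amount"]
    return dict(totals)


def gold_customer_features(orders):
    """ML gold table: one row per customer, shaped for model training."""
    features = {}
    for order in orders:
        row = features.setdefault(
            order["customer_id"], {"order_count": 0, "total_spend": 0.0}
        )
        row["order_count"] += 1
        row["total_spend"] += order["amount"]
    return features


daily_revenue = gold_daily_revenue(SILVER_ORDERS)
customer_features = gold_customer_features(SILVER_ORDERS)
print(daily_revenue)
print(customer_features)
```

Both tables come from the same silver source, but their grain and columns differ because their consumers differ.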

Streaming and batch: stop treating them as separate

The historical pattern was Lambda architecture: a batch pipeline producing one set of tables, a streaming pipeline producing another, and reconciliation logic in the middle. Most modern table formats support unified streaming and batch ingest into the same tables. Use that.

If your architecture still has parallel batch and streaming paths, you’re carrying complexity that isn’t necessary anymore.
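
The core idea can be sketched in a few lines of Python: one keyed merge routine that both the batch backfill and the streaming micro-batches call, so there is a single table and no reconciliation step. This is a simplified in-memory model with hypothetical names (merge_into, updated_at); table formats implement the equivalent with transactional MERGE or upsert operations.

```python
def merge_into(table, records, key="id", version="updated_at"):
    """Upsert records into table (a dict keyed by `key`), keeping the newest version."""
    for record in records:
        current = table.get(record[key])
        # Only overwrite when the incoming record is strictly newer.
        if current is None or record[version] > current[version]:
            table[record[key]] = record
    return table


# Hypothetical inputs: a batch backfill and a later streaming micro-batch.
backfill = [
    {"id": "a", "updated_at": 1, "status": "new"},
    {"id": "b", "updated_at": 1, "status": "new"},
]
micro_batch = [
    {"id": "a", "updated_at": 2, "status": "shipped"},
    {"id": "c", "updated_at": 2, "status": "new"},
]

# The same routine serves both paths, and arrival order does not change the result.
batch_first = merge_into(merge_into({}, backfill), micro_batch)
stream_first = merge_into(merge_into({}, micro_batch), backfill)
print(batch_first == stream_first)
```

Because the merge keeps the newest version per key, a late backfill cannot overwrite fresher streaming data. The sketch assumes versions for a given key are distinct; ties would need an explicit rule.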

Cost discipline matters more than performance tuning

The fastest way to run up a lakehouse bill is to let analysts run unrestricted queries against ungoverned tables. The fastest way to fix it is governance, not query tuning:

  • Tag every table with an owner and a cost center
  • Set query timeouts and result-size caps by default
  • Auto-tier cold partitions to cheaper storage classes
  • Monitor cost per query and surface the top offenders weekly

A well-governed lakehouse with mediocre query plans usually costs less than a poorly-governed one with optimized queries.
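
Two of the checks above are simple enough to sketch in Python: flagging tables that lack an owner or cost center, and ranking queries by cost for a weekly report. The tag names, sample tables, and query log format are assumptions made for illustration; real numbers would come from your catalog and your engine's query history.

```python
REQUIRED_TAGS = ("owner", "cost_center")

# Hypothetical catalog export: table name -> tags.
TABLE_TAGS = {
    "sales.orders": {"owner": "sales-eng", "cost_center": "cc-100"},
    "tmp.scratch": {"owner": "unknown"},
    "ml.features": {},
}

# Hypothetical query history with a cost already attributed to each query.
QUERY_LOG = [
    {"query_id": "q1", "user": "ana", "cost_usd": 0.5},
    {"query_id": "q2", "user": "bo", "cost_usd": 42.0},
    {"query_id": "q3", "user": "ana", "cost_usd": 7.25},
]


def missing_tags(table_tags):
    """Map each non-compliant table to the required tags it lacks."""
    report = {}
    for table, tags in table_tags.items():
        missing = [tag for tag in REQUIRED_TAGS if not tags.get(tag)]
        if missing:
            report[table] = missing
    return report


def top_offenders(query_log, limit=2):
    """Return the most expensive queries, highest cost first."""
    return sorted(query_log, key=lambda q: q["cost_usd"], reverse=True)[:limit]


untagged = missing_tags(TABLE_TAGS)
offenders = top_offenders(QUERY_LOG)
print(untagged)
print([q["query_id"] for q in offenders])
```

Neither check touches a query plan, which is the point: both are governance controls that can run on metadata alone.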

What we tell new clients

Three principles, in order of importance:

  1. Pick a single table format and commit to it across all workloads
  2. Treat the catalog as the most strategic decision in the stack
  3. Don’t ship without governance — bolting it on later costs three times more

The lakehouse pattern itself is sound. The mistakes happen at the implementation level, and they’re consistent enough that you can avoid them by being deliberate about the decisions above.
