The Anatomy of Self-Service Analytics You Can Actually Trust (Humans and AI Agents)


It is Monday, 10:10 am. The data analysts on your team are overwhelmed. They have to write, fast, the ad-hoc SQL queries that landed on their desk this morning, and the numbers have to be ready for the managers’ meeting at 11:15. The whole week looks like this. Margin by category for one manager, returns by channel for another, every request landing in the same queue, answered by the people you hired to do far more valuable work. The strategy waits.

So you turn to self-service. You have two options: a drag-and-drop BI tool for the business teams, or an AI agent plugged straight into your warehouse. You decide to test the agent.

You ask a question in natural language and get an answer that is well formatted and confident. Confidently wrong.

Your agent is not dumb, it is blind.1 It lacks the implicit business knowledge that lives in the heads of your experienced people, the conversations between colleagues, the tribal knowledge it has no access to. The context.

So for now you cannot trust it. The queue of ad-hoc requests keeps growing, and the strategic problems your analysts should be solving have to wait. Again. Your company is slowed down by a low-value to-do list that never stops growing.

But it is almost a solved problem. Anthropic automated 95% of its internal analytics queries with Claude, at roughly 95% accuracy, and its lesson is blunt: reliable self-service is a context problem, not a model problem.

I built that same anatomy on a stack anyone can replicate at almost zero cost. Here it is, layer by layer, gaps included. Not a showcase, an honest audit.

What “reliable” self-service actually means

Look again at those two options. A drag-and-drop BI tool and an AI agent are not two competing choices. They are two ways into the same thing: one governed foundation, reached through two different front doors. The business user clicks through governed metrics and builds a dashboard, no SQL required. The agent reads that same foundation, plus a layer of context, and answers a question asked in plain language. Same definitions, same numbers. You do not pick one. They stack, the agent’s door built on top of the foundation the human already uses.

But the two doors do not fail for the same reason, and fixing one does not fix the other. The human needs a single place where every KPI is computed once, the same way for everyone: a semantic layer, the single source of truth. Give the business that, and metric drift disappears. The agent needs that same semantic layer, and one more thing on top of it. A semantic layer holds the metrics and their definitions, and nothing else. It does not hold the knowledge that turns a vague question into the right query: which metric the user actually means, which filter to apply, the business rule that lives only in a senior analyst’s head. That knowledge is the context layer. The human needs one layer. The agent needs two, stacked.

That second layer is not optional, and Anthropic, which automated most of its internal analytics this way, shows why. It names three ways an agent fails without context, worth borrowing as shared vocabulary: concept-to-entity ambiguity (which table really is “a customer”?), data staleness (the number is real, but out of date), and retrieval failure (the agent never finds the right source). None of the three is a reasoning problem. Each one is missing context.

So here is what “reliable” means for the rest of this article: the same trustworthy answer, whether a human clicks or an agent asks. Not a prettier dashboard. Not a smarter model. The same number, defined once, served to everyone who asks, in any tool. What follows is the anatomy that makes that true: five layers, and each one removes a reason you still cannot step back and let your people serve themselves.

A 3D stack of five layers: data foundations (dbt), semantic layer (Cube), context layer (ktx), procedural skills, and validation and maintenance. A business user consumes the semantic layer by drag and drop; an AI agent consumes the context layer above it; both reach the same number.

Layer 1: Data foundations, the groundwork you cannot skip

Self-service is delegation, and you cannot delegate on top of a mess. Before any semantic layer, before any agent, the warehouse has to become something a person can reason about: raw sources modeled into clean tables, tested, and documented. This is the layer everyone wants to skip because it is not glamorous. Skip it, and every layer above inherits the mess.

There is nothing exotic here. I use dbt to model the raw sources into clean tables the Kimball way: fact tables at a clear grain, with dimensions around them. Tests run on every build, so a broken assumption fails loudly instead of silently. Metadata and lineage let anyone see where a number comes from. In my stack, the raw data is a simulated direct-to-consumer retailer: messy source tables mapped down into a handful of clean marts. The point is not the tool. The point is that the warehouse stops being a pile of tables and becomes a model.

The single most important move on this layer: each metric is defined exactly once, in the model, by the human who owns it. Here is net revenue, defined one time in the finance fact table.

File: transformation/models/marts/finance/fct_order_items.sql

-- net_revenue defined ONCE here; everything downstream references it by name.
case when status = 'Returned' then 0 else sale_price end as net_revenue

One line decides that a returned order contributes zero. Every dashboard, every query, every agent that ever reports net revenue references this column by name. The definition cannot drift because it exists in exactly one place.

It is tempting to let an LLM write these definitions for you. Anthropic tried exactly that, and it failed: the result was “plausible-looking definitions that encoded the very ambiguities we were trying to eliminate.” The lesson is sharp, and it holds for the whole article: use an LLM to draft the documentation around a metric, but a human owns the metric itself. The definition is the one thing you do not delegate.

That removes the first reason you cannot yet trust self-service: the data underneath is no longer ambiguous. It is modeled, tested, and defined once. But a definition sitting in a SQL file is no use to a business user clicking through a dashboard or to an agent answering a question. It has to be exposed once, as a governed metric every tool reaches the same way. That is the semantic layer.

Layer 2: The semantic layer, define once and prove it

Layer 1 defined net revenue once, but it is still trapped in a SQL file. Only someone who can read and write SQL can use it. The semantic layer lifts that definition out and exposes it as something you click, not something you query. I use Cube for this, on Cube Cloud’s free tier. A business user opens a list of governed metrics and dimensions, picks net revenue, slices it by category and month, and gets a chart. No SQL written, ever.

There are three places a semantic layer can live. Close to the data, like Snowflake’s semantic views or the dbt semantic layer. Inside a BI tool, like Looker’s LookML or Power BI’s models. Or on its own, in front of the warehouse and tied to no single tool. Cube is the third kind, what the industry calls a headless or universal semantic layer. Headless, because it ships no charts of its own. Universal, because it exposes the same governed metrics through SQL, REST, and GraphQL to anything downstream: a BI tool, a spreadsheet, an embedded app, or an AI agent.

That independence is why I chose it. A semantic layer locked inside one BI tool defines the metric for that tool alone; add a second tool or an agent and you re-author the definition, so “define once” quietly dies. A layer bound to the warehouse or the modeling tool can couple you to one vendor and narrow what is able to query it. A headless layer in front of the warehouse is the only option that serves the human’s BI door and the agent’s door from the exact same definition, through the same APIs. It is, almost literally, the architecture this article argues for.

The payoff lands on three fronts: speed, trust, and governance. Speed, because a non-technical user answers their own simple questions, which is exactly the low-value ad-hoc work that used to pile up on your analysts. And for the analyses that still need an analyst, the analyst is faster too: they click the metric instead of hand-writing the whole query. Trust, because the number is computed from one governed definition, and because the generated SQL is right there to inspect. No black box. Governance, because the user can only ever touch metrics you chose to expose. They cannot invent an ungoverned definition by accident.

Here is net revenue in the semantic layer. Notice it does not redefine anything. It points at the same column the dbt model already produced.

File: semantic/model/cubes/order_items.yml

- name: net_revenue
  sql: net_revenue          # same column, no re-definition
  type: sum
  format: currency
# rates are MEASURES computed at the aggregation level, never row averages:
- name: gross_margin_rate
  sql: "SUM({CUBE}.gross_margin) / NULLIF(SUM({CUBE}.net_revenue), 0)"

There is a second discipline hiding in that file. A rate like gross margin is computed as a measure, at the aggregation level, never as an average of row-level rates. Averaging rates is one of the most common ways a self-service number quietly goes wrong, and the semantic layer is where you forbid it. More on that when an agent gets involved, in Layer 3.

Two architectural rules make the result provable. The semantic layer exposes a view, not the raw tables, and every metric is built at a single grain, one row level, so an aggregation can never double-count. The effect is that the same question returns the same number everywhere. I checked: net revenue for one category in one month comes back as roughly 38,322 USD in Cube’s own explorer, in Metabase, and in a direct SQL query against the warehouse. The same figure, to the cent, on all three surfaces.

That is the entire non-agentic mode, working and provable. A human can now self-serve with trust, and the simple ad-hoc stops reaching your team. But notice what the human still does that the agent cannot: they know which metric they want, and they click it. An agent gets a question in plain language, vague and human, and has to work out which metric, which filter, which grain. The semantic layer tells it what exists. It does not tell it what the question means. That is the next layer.

Layer 3: The context layer, hard and soft semantics

The semantic layer told the agent what exists: these metrics, these dimensions, these grains. It did not tell the agent what a question means. Ask “how did margin look last quarter?” and the agent still has to decide which margin, which dates count as last quarter, whether returns are in or out, and which unwritten rule applies. That knowledge is the context layer, and it sits on top of the semantic layer, the second of the two stacked layers an agent needs.

Simon Spati splits this context in two.1 Hard semantics are structured: tables, joins, measures, and filters that compile to SQL. That is largely what the semantic layer already gives you. Soft semantics are the prose: caveats, historical decisions, policies, the things that live in a senior analyst’s head and in old Slack threads. Both are kept as plain files in Git, “diffable, mergeable, and reviewable exactly like code.”

I use an open-source tool called ktx2 for this layer. ktx is the context layer: it holds the soft semantics, the caveats, policies, and prose definitions the semantic layer has no place for, and serves them to an agent over MCP, the protocol agents use to call tools. ktx is normally pointed straight at a warehouse. I made a deliberate, slightly unusual choice and connected it to Cube instead, through Cube’s SQL API. I have not seen anyone else wire it this way, and I did it for one reason: I wanted humans and agents to read from the exact same governed metrics, defined once in Cube, not from two different copies of the truth.

A useful consequence falls out of that choice. ktx never touches the warehouse. Cube stays the single execution path and the single auth boundary: the agent asks ktx, ktx asks Cube, Cube queries Snowflake. ktx never sees a Snowflake credential. One door in, one place where access is enforced, whether a human or an agent is knocking.

Architecture flow: an AI agent calls ktx over MCP, ktx queries Cube over the SQL API, a business user reaches Cube through Metabase, and Cube queries Snowflake. An auth boundary shows that Snowflake credentials stay with Cube and ktx never sees them.

This is also where you govern what the agent is allowed to do, and the trick is disarmingly simple: an agent will average anything you let it see, so you control behavior by controlling visibility. Remember the gross margin rate from the last section. You hide it from the agent, and instead expose the additive parts and recompose the ratio in the context layer.

File: semantic-context/semantic-layer/cube/finance.yaml

column_overrides:
  - name: gross_margin_rate
    visibility: hidden        # hidden on purpose: an agent cannot average a rate
measures:
  - name: gross_margin_ratio  # recomposed from additive sums
    expr: sum(gross_margin) / nullif(sum(net_revenue), 0)

The agent literally cannot average a rate, because it cannot see one. It sees additive sums and a ratio built correctly from them. Governance here is not a policy document nobody reads. It is the shape of what the agent can touch.

Because all of this is plain files in Git, every change to the agent’s context goes through a human review, the same as any code change. Spati calls the result a warm start: instead of pointing an agent at a cold database and hoping, you give it “a reviewed and governed starting point built from the knowledge your team has already created.” The human stays in the loop once, at review time, not on every question.

Here is one soft-semantic file, small and boring, that will matter more than anything else in this article.

File: semantic-context/wiki/global/analytical-horizon.md

“Any query touching dates earlier than 2023-01-01 returns nothing by design. Do not interpret the absence of 2022 data as missing data or an ingestion failure.”

It tells the agent that the data has a start date, and that an empty 2022 is expected, not broken. Hold on to this page. In Layer 5 it is the one thing that catches a bug four other surfaces miss.

With this layer in place, the agent draws on the same governed definitions a human does, plus the why and the guardrails, all reviewed and version-controlled. Is its door now as reliable as the human’s? Honestly, I cannot prove it. I showed you the human’s number to the cent; I cannot yet do the same for the agent, because I have not built the evals that would measure it. That gap, and how a working system quietly rots without it, is Layer 5. But there is a more basic gap to name first. Everything in the context layer so far is declarative: it tells the agent what things are. It does not tell the agent how to run a hard investigation, step by step. That is the next layer, and the one I have not finished.

Layer 4: Procedural knowledge, the gap I have not closed

Look back at everything in the last three layers. It is all declarative. It tells the agent what a metric means, what tables exist, what caveats apply. Declarative knowledge answers “what is this?” But a good analyst runs on a different kind of knowledge: procedural. Which sources to consult, in what order. How to investigate a revenue drop. What to check first when a number looks wrong. The steps, not the definitions.

This is not a small gap, and Anthropic put a number on it. “Without skills, Claude’s ability to answer analytics questions accurately didn’t exceed 21% on our evals. Adding skills gets these numbers consistently above 95%.” Read that again. Their declarative foundation, the same kind I have spent three layers building, got them to 21%. The procedural skills are what took them to 95%.

So here is where I honestly stand, and it splits cleanly in two. The human, non-agentic side is finished and proven: the same number to the cent across every surface, no model in the loop, nothing left to close. The agentic side is another story. I have a rich declarative context and no domain skills yet, which by Anthropic’s own measure puts that side much closer to 21% than to 95%. The layers underneath are real and they work. The procedural knowledge that turns a capable agent into a reliable one is the part I have not built.

Procedural knowledge gets packaged as skills: small, reusable playbooks an agent can follow, like “how to investigate a margin drop” written as steps, not prose. I would build these as Claude Code skills, or with nao3, an open-source framework for building and testing analytics agents. That is the next build, and I am writing it up on its own.

And there is a quieter problem underneath. How would I even know whether the agentic side sits at 21% or 95%? You measure it, with evals: a frozen set of real questions, each with a known-good answer, run against the agent and graded. Anthropic adds a sharp refinement, grade the agent’s query, not just its number, because a number can look right and still be wrong. I have not built that eval set yet, so on the agentic side I am partly flying blind. And measurement is only the start, because even a system you have measured at 95% will not stay there. That is the last layer, and the one that surprised me most.

Layer 5: Validation and maintenance, or your 95 percent rots to 65

Everything so far has been about building. This layer is about the fact that what you build does not stay built. Anthropic measured it: “We watched our offline accuracy drift from ~95% at launch to ~65% over a month before we treated this as an engineering problem.” The data changes, the models change, the context stays still, and the gap between them quietly widens. Validation is not a launch gate you pass once. It is maintenance, and skipping it is how 95% rots to 65% with nobody noticing.

Let me show you how well this can hide, with the worst day I had building this stack. I rebuilt my Snowflake warehouse from scratch. A subtle bug in my dlt4 pipeline’s incremental state meant that instead of reloading three and a half years of history, it loaded only the last two weeks. I did not know that yet. So I did what you are supposed to do: I checked net revenue across every surface I had, Cube’s explorer, Metabase, a direct SQL query, and the agent. All four returned the same number, to the cent.

Four independent surfaces, complete consensus, and all four were wrong. The number described two weeks of data wearing the label of three and a half years. Every dbt test was green, because referential tests check that relationships hold, and relationships hold just fine inside a two-week window. The governance I had spent four layers building had done its job too well: it made a wrong answer consistent, confident, and identical everywhere.

Two timeline bars from 2023 to 2026. The expected bar is full, three and a half years; after the rebuild only the last two weeks loaded, a thin red sliver. Consistency is not completeness.

But one of those four surfaces did something the others could not. The agent returned the same number, and then it questioned it. Remember the analytical-horizon wiki page from Layer 3, the boring file that told the agent data begins on 2023-01-01. Because it carried that context, the agent noticed the data no longer reached back to where it was supposed to start, and it flagged the period as suspect. Cube, Metabase, and raw SQL had no such reflex; they reported what was there. The agent knew what should have been there. That is the lesson, and the line I want you to keep: consistency validates your definitions, not your completeness. Four surfaces agreeing proves they share one definition. It proves nothing about whether the data underneath is whole.

The bug itself took a single setting to fix, a forced full refresh of the pipeline. After it, the warehouse held 149,483 line items, first order on 2023-01-01, about 8.02M USD in net revenue, the real figures. But the fix that matters is the one that stops it coming back. I added dbt tests that check what consistency cannot. A minimum-volume test that fails the build when a table holds far fewer rows than a healthy month, which is exactly what two weeks of data trips. And a recency test that fails when the newest record is older than it should be, guarding the opposite failure, a pipeline that quietly stopped. It was the volume test this bug needed.

This is why the layer is called maintenance, not validation. Anthropic’s version of the same lesson is to colocate the agent’s context with the data models and let CI block any model change that ships without a matching context update; they report that roughly 90% of their data-model pull requests now carry a context change in the same diff. Mine is smaller, recency and volume tests instead of a CI policy, but the principle is identical: the context and the data have to be kept honest together, continuously, or the whole thing drifts. So where does that leave the anatomy? Time to score it honestly.

The maturity map

I have walked you through five layers. Some are solid, one is missing, one is half-built. Here is the honest scorecard, mapped against the anatomy Anthropic describes, so you can see where a real self-service stack stands, and where mine still has work.

LayerIn Anthropic’s anatomyWhere my stack stands
1. Data foundationsModeled, tested dataDone. Modeled in dbt, tested, each metric defined once.
2. Semantic layerSingle source of truthDone, and proven. Same number to the cent across Cube, Metabase, and SQL.
3. Context layerBusiness context, sources of truthBuilt, not yet proven. Hard and soft semantics in Git; no evals to measure the effect.
4. Procedural skillsSkills (21% to 95%)Missing. No domain skills written yet.
5. Validation and maintenanceOffline evals, drift controlPartial. Recency and volume dbt tests; no agent evals.

Two things stand out when you lay it flat. First, reliable self-service is a system, not a product you buy. No single tool on this list makes analytics trustworthy; the trust comes from the layers working together. Second, look at where the gaps are. Almost everyone builds the first two layers, solid data and a BI tool, and stops. That is the part that makes a human productive and an agent dangerous. Reliability lives in the three layers built on top, context, skills, and validation, the part most stacks never reach. I have built more of it than most, and I am honest that I have not built all of it.

Takeaways

If you take three things from this, take these.

Context engineering is the new analytics engineering. For a decade the job was modeling data correctly. That still matters, it is Layer 1, but it is now the floor, not the ceiling. The work that makes analytics trustworthy, for humans and agents alike, is engineering the context around the data: the definitions, the soft semantics, the guardrails. That is the new craft.

The human owns the definition. An LLM can draft your documentation, speed up your modeling, and answer your questions. But the moment you let it decide what a metric means, you have encoded your ambiguity instead of removing it. A person owns each definition. Everything else can be assisted.

Consistency is not completeness. Four surfaces agreeing to the cent felt like proof, and was not. Agreement tells you your definitions match. It says nothing about whether the data underneath is whole. Test for completeness on purpose, because consistency will not do it for you.

I have been candid about the gaps, and each is the next thing I am building in the open: the procedural skills that would move my agent from 21% toward 95%, and the eval set that would let me prove any of it. Those are the next pieces of this series.

None of this is a slide deck. The whole stack is live and open, and you are welcome to pull on it:

If reliable self-service analytics is the problem you are chewing on, follow along on LinkedIn: https://www.linkedin.com/in/gaelmukunde/. The next pieces are the gaps I just admitted to.

Footnotes

  1. Simon Spati is a data engineer who writes the pipeline2insights newsletter. The quotes here are from his article Introduction to ktx: The Open-Source Context Layer for Data Agents. 2

  2. ktx is an open-source context layer for data agents, built by Kaelio. See github.com/Kaelio/ktx.

  3. nao is an open-source framework for building and testing analytics agents, by nao Labs. See getnao.io.

  4. dlt (data load tool) is an open-source Python library for building data pipelines, by dltHub. See dlthub.com.