Every company sitting on a data warehouse wants the same thing: let anyone ask questions in plain English and get reliable answers. OpenAI published how they built their internal data agent, and the open-source community responded fast. Here’s a quick summary of three projects pushing this forward.
OpenAI’s In-House Data Agent: 6 Layers of Context
Source: Inside OpenAI’s In-House Data Agent by OpenAI, January 2026
OpenAI built a bespoke data agent serving 3,500+ internal users across 600 petabytes and 70k datasets. The core insight: context is everything. Without it, even strong models hallucinate column names and misinterpret business terminology.
The agent grounds itself in 6 layers of context:
| Layer | What It Provides | How |
|---|---|---|
| Metadata Grounding | Schema, columns, data types, table lineage | Warehouse metadata |
| Query Inference | Historical query patterns, common joins | Ingested past queries |
| Curated Descriptions | Business meaning, caveats, intent | Domain expert annotations |
| Code-Level Definitions | How tables are built, freshness, scope | Codex-powered code crawling |
| Institutional Knowledge | Launches, incidents, metric definitions | Slack, Docs, Notion (RAG) |
| Memory | Corrections, discovered filters, nuances | Self-learning from conversations |
Plus a runtime context layer for live schema inspection when existing info is stale.
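To make the layering concrete, here is a minimal sketch of how such context might be assembled into a single prompt prefix. All class and layer-loader names are hypothetical stand-ins, not OpenAI's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    """Accumulates layered context to prepend to the agent's prompt."""
    sections: list[str] = field(default_factory=list)

    def add(self, layer: str, content: str) -> None:
        # Each layer becomes a labeled section the model can cite.
        self.sections.append(f"## {layer}\n{content}")

    def render(self) -> str:
        return "\n\n".join(self.sections)

# Illustrative contents only; real layers would be loaded from the
# warehouse, query logs, annotations, code crawls, RAG, and memory.
ctx = ContextBundle()
ctx.add("Metadata Grounding", "orders(order_id INT, amount DECIMAL, status TEXT)")
ctx.add("Query Inference", "-- common join: orders JOIN users ON orders.user_id = users.id")
ctx.add("Curated Descriptions", "amount excludes refunds; see finance docs")
ctx.add("Memory", "always filter test accounts: is_internal = FALSE")

prompt = ctx.render()  # prepended to the user's question
```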
Key Lessons
- Tool consolidation matters - overlapping tools confuse agents. Restrict and consolidate.
- Less prescriptive prompting works better - rigid instructions pushed the agent down wrong paths. Higher-level guidance + model reasoning = more robust results.
- Code > metadata - pipeline logic captures assumptions and business intent that never surface in SQL or table schemas. Crawling the codebase with Codex was a game-changer.
- Memory is non-negotiable - stateless agents repeat the same mistakes. The self-learning memory stores corrections, non-obvious filters, and constraints critical for correctness.
Evaluation
OpenAI uses curated question-answer pairs with “golden” SQL. Generated SQL is compared both syntactically and by result set, using an LLM grader that accounts for acceptable variation. These run continuously as canaries in production.
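The result-set half of that comparison can be sketched in a few lines. This is an illustrative stand-in (using SQLite for the demo), not OpenAI's grader; in their setup an LLM grader additionally allows acceptable variation such as aliasing or equivalent formatting:

```python
import sqlite3

def result_sets_match(conn, generated_sql: str, golden_sql: str) -> bool:
    """Execute both queries and compare rows order-insensitively."""
    gen = conn.execute(generated_sql).fetchall()
    gold = conn.execute(golden_sql).fetchall()
    return sorted(gen) == sorted(gold)

# Toy fixture standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.0)])

golden = "SELECT SUM(amount) FROM orders"
generated = "SELECT SUM(amount) AS total FROM orders"  # alias differs, rows match
print(result_sets_match(conn, generated, golden))  # True
```

Syntactic comparison and LLM grading would sit alongside this check; result-set equality alone is the strictest and cheapest signal.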
Dash: Open-Source Self-Learning Data Agent
Source: Dash: Self-learning data agent by Ashpreet Bedi
GitHub: agno-agi/dash
Dash is an open-source implementation directly inspired by OpenAI’s architecture. It implements the same 6-layer context approach:
| Layer | Source |
|---|---|
| Table Usage | knowledge/tables/*.json |
| Human Annotations | knowledge/business/*.json |
| Query Patterns | knowledge/queries/*.sql |
| Institutional Knowledge | MCP (optional) |
| Memory | LearningMachine |
| Runtime Context | introspect_schema tool |
Self-Learning Loop
Dash learns through two systems:
- Static Knowledge - validated queries, business context, table schemas, metric definitions. Curated by your team, maintained alongside the agent.
- Continuous Learning - patterns discovered through trial and error: column mappings, team focus areas, business term disambiguation. Implemented with just ~5 lines of code via LearningMachine.
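The idea behind a learning store like this can be illustrated with a minimal stand-in. This is not Agno's actual LearningMachine API, just a sketch of the record-and-replay pattern:

```python
class SimpleLearningStore:
    """Toy stand-in for a self-learning memory (not the real LearningMachine).

    Stores corrections keyed by topic and replays them into future prompts,
    so the agent stops repeating the same mistake."""

    def __init__(self) -> None:
        self.lessons: dict[str, str] = {}

    def record(self, topic: str, lesson: str) -> None:
        self.lessons[topic] = lesson  # later corrections overwrite earlier ones

    def as_context(self) -> str:
        return "\n".join(f"- {t}: {l}" for t, l in sorted(self.lessons.items()))

store = SimpleLearningStore()
store.record("revenue", "exclude refunded orders (status != 'refunded')")
store.record("users", "active means last_login within 30 days")
memory_block = store.as_context()  # injected into the next prompt
```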
Quick Start
```shell
git clone https://github.com/agno-agi/dash && cd dash
```
Ships with F1 race data (1950-2020), a built-in UI, and an evaluation suite (string matching, LLM grading, golden SQL comparison). Built with the Agno framework.
Nao: Open-Source Analytics Agent
GitHub: getnao/nao (Y Combinator backed)
Nao takes a different approach - it’s a framework-first analytics agent focused on context building and deployment.
Two-Step Architecture
- nao-core CLI - build and manage agent context (data, metadata, modeling, rules, docs, tools, MCPs)
- nao chat UI - deploy a conversational interface for anyone to query
Key Differentiators
| Feature | Detail |
|---|---|
| Data stack agnostic | Works with any warehouse, any LLM |
| File-system context | Context organized as files - no limit on what you include |
| Agent reliability testing | Unit test agent performance before deploying |
| Version tracking | Version context and track performance over time |
| User feedback loop | Built-in thumbs up/down for continuous improvement |
| Self-hosted | Use your own LLM keys, full data privacy |
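The "file-system context" approach can be pictured as a directory tree. The category names come from the nao-core description above; the exact layout shown here is hypothetical and depends on nao-core's actual conventions:

```
context/
├── data/       # sample rows, profiling output
├── metadata/   # schemas, lineage
├── modeling/   # pipeline or dbt model code
├── rules/      # business rules, metric definitions
├── docs/       # institutional knowledge
└── tools/      # custom tools and MCP configs
```

Because context is just files, it can be versioned with git, which is what enables the version tracking and reliability testing listed above.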
Quick Start
```shell
pip install nao-core
```
Also available via Docker:
```shell
docker pull getnao/nao:latest
```
Stack: Fastify + Drizzle + tRPC (backend), React + TanStack Query + shadcn (frontend).
Comparison
| Aspect | OpenAI Agent | Dash | Nao |
|---|---|---|---|
| Open Source | No (internal) | Yes (Apache 2.0) | Yes (Apache 2.0) |
| Context Layers | 6 layers | 6 layers (same model) | File-system based |
| Self-Learning | Memory system | LearningMachine | User feedback loop |
| Evaluation | Golden SQL + LLM grader | String match + LLM grader + golden SQL | Unit testing framework |
| LLM Support | GPT-5 / Codex | OpenAI | Any LLM |
| Data Stack | OpenAI internal | Configurable | Agnostic |
| UI | Internal tool | Agno platform | Built-in chat |
| Best For | - | Teams wanting OpenAI’s architecture OSS | Teams wanting stack-agnostic framework |
Takeaway
The pattern is clear: context-rich, self-learning data agents are becoming the standard. OpenAI proved the architecture at scale, Dash made it accessible, and Nao provides a framework-first alternative. The shared insight across all three: without deep, layered context, even the best models produce unreliable results.
References
- Inside OpenAI’s In-House Data Agent - OpenAI (Jan 2026)
- Dash: Self-learning data agent - Ashpreet Bedi
- agno-agi/dash - Dash GitHub repo
- getnao/nao - Nao GitHub repo