Building Reliable AI - Chains, Gates, and Self-Improvement

AI systems that work once under ideal conditions are interesting. AI systems that work reliably in production are valuable. In this final post of the series, I’ll share techniques for building robust AI workflows - connecting multiple reasoning steps, validating outputs along the way, and creating systems that improve through iteration.

The Limits of Single-Shot Prompts

Complex tasks rarely succeed with a single prompt. Ask an AI to “research a topic, summarize findings, and write a LinkedIn post about it” and you’ll get… something. But probably not something good.

The problem is cognitive overload. Even powerful language models perform better when they can focus on one thing at a time. It’s similar to how humans struggle when given too many instructions at once.

Breaking Work into Manageable Steps

The solution is sequential prompting - decomposing complex work into discrete steps, each handled by a focused prompt.

Instead of one massive request:

Research AI agents, summarize key concepts, and write a LinkedIn post.

Break it into stages:

Step 1: "Research the key concepts of AI agents."
→ Get research output

Step 2: "Summarize these research findings: {research_output}"
→ Get summary

Step 3: "Write a LinkedIn post based on this summary: {summary}"
→ Get final post

Each step does one thing well. The output of each becomes input for the next.

flowchart LR
    P1[Prompt 1<br/>Research] --> R1[Response 1]
    R1 --> P2[Prompt 2<br/>Summarize]
    P2 --> R2[Response 2]
    R2 --> P3[Prompt 3<br/>Draft Post]
    P3 --> Final[Final Output]

    style P1 fill:#e3f2fd
    style P2 fill:#e3f2fd
    style P3 fill:#e3f2fd
    style Final fill:#c8e6c9

Why This Works Better

  • Focus: Each prompt has a clear, singular purpose
  • Debuggability: When something fails, you know which step caused it
  • Iterability: You can refine individual prompts without rebuilding everything
  • Quality: Smaller, focused tasks produce more reliable outputs

Connecting the Chain Programmatically

In practice, we automate these chains with code. The orchestrator handles:

  1. Calling the LLM with each prompt
  2. Capturing responses
  3. Injecting outputs into subsequent prompts
  4. Managing the overall flow
# Simplified example of prompt chaining
def run_chain(topic):
    # Step 1: Research
    research = call_llm(f"Research key concepts about {topic}")

    # Step 2: Summarize
    summary = call_llm(f"Summarize these findings: {research}")

    # Step 3: Draft
    post = call_llm(f"Write a LinkedIn post from: {summary}")

    return post
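
The call_llm helper is left abstract above. As a minimal sketch, it could be implemented with the OpenAI Python SDK (an assumption here; any client with a text-in, text-out interface works the same way):

# Hypothetical call_llm helper, assuming the OpenAI Python SDK (openai >= 1.0).
# Swap in whichever provider or local model you prefer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send a single-turn prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content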

The Problem: Errors Cascade

Here’s the issue with basic chaining - if one step produces bad output, everything downstream suffers. An error in step 1 propagates through steps 2 and 3, compounding the problem.

Language models are inherently unpredictable. They sometimes:

  • Hallucinate information
  • Produce wrong formats
  • Miss instructions
  • Generate incomplete outputs

We need quality control between steps.

Introducing Validation Gates

Gate checks are programmatic validations between chain steps. Before passing output forward, we verify it meets our criteria.

flowchart LR
    P1[Step 1] --> G1{Gate Check}
    G1 -->|Pass| P2[Step 2]
    G1 -->|Fail| Retry1[Retry / Handle Error]
    P2 --> G2{Gate Check}
    G2 -->|Pass| P3[Step 3]
    G2 -->|Fail| Retry2[Retry / Handle Error]

    style G1 fill:#fff3e0
    style G2 fill:#fff3e0

Types of Validation

Format checks: Does the output match expected structure?

  • Valid JSON/XML
  • Required fields present
  • Correct length ranges

Content checks: Does it contain what we need?

  • Required keywords present
  • Topic relevance
  • No prohibited content

Logic checks: Does it make sense?

  • Code compiles
  • Numbers are reasonable
  • References exist
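
A minimal sketch of what checks like these might look like in code (the field names, keywords, and length limits below are illustrative assumptions, not a fixed API):

import json

# Illustrative gate checks; the expected fields and limits are assumptions
# made for this example.
def check_format(output: str) -> bool:
    """Format check: valid JSON with the required fields present."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return "title" in data and "body" in data

def check_content(output: str, required_keywords: list[str]) -> bool:
    """Content check: every required keyword appears in the output."""
    lowered = output.lower()
    return all(keyword.lower() in lowered for keyword in required_keywords)

def check_length(output: str, min_chars: int = 200, max_chars: int = 3000) -> bool:
    """Length check: output falls inside a reasonable size range."""
    return min_chars <= len(output) <= max_chars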

Handling Failures

When a gate check fails, you have options:

  1. Halt: Stop execution and report the error (good for sensitive operations)
  2. Retry: Run the step again (works for transient issues)
  3. Retry with feedback: Include the failure reason in the retry prompt
def run_with_validation(prompt, validator, max_retries=3):
    for attempt in range(max_retries):
        output = call_llm(prompt)

        if validator(output):
            return output

        # Include error in retry prompt
        prompt = f"{prompt}\n\nPrevious attempt failed: {get_error(output)}"

    raise Exception("Max retries exceeded")

Building Self-Improving Systems

The retry-with-feedback pattern is powerful enough to deserve its own section. It’s the foundation of feedback loops - systems that learn from their mistakes within a single task.

The Feedback Loop Pattern

Instead of one-shot execution, we iterate:

  1. Generate: Produce initial output
  2. Evaluate: Check against criteria
  3. Feedback: If criteria not met, provide specific guidance
  4. Refine: Generate improved version using feedback
  5. Repeat: Continue until success or max iterations
flowchart TD
    Gen[Generate Output] --> Eval{Evaluate}
    Eval -->|Pass| Done[Complete]
    Eval -->|Fail| Feed[Generate Feedback]
    Feed --> Refine[Refine with Feedback]
    Refine --> Eval

    style Done fill:#c8e6c9
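
As a sketch, the loop can be written generically: you pass in a generate function and an evaluate function, and the loop keeps refining until evaluation passes (the function names and signatures here are assumptions for illustration):

# Generic feedback loop sketch. `generate` takes an optional feedback string
# and returns a candidate; `evaluate` returns (passed, feedback).
def feedback_loop(generate, evaluate, max_iterations=5):
    feedback = None
    for iteration in range(1, max_iterations + 1):
        candidate = generate(feedback)           # Generate / Refine
        passed, feedback = evaluate(candidate)   # Evaluate
        if passed:
            return candidate, iteration          # Complete
    raise RuntimeError(f"No passing output after {max_iterations} iterations")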

Feedback Sources

The evaluation feedback can come from multiple sources:

Self-critique: Ask the same LLM to review its own work

Review this email for professionalism. List any issues.
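
In code, self-critique is simply a second call to the same model over the first call's output, with the critique fed back into a refinement pass (a rough sketch; the prompt wording is an assumption):

# Rough self-critique sketch: the same LLM reviews its own draft,
# and the critique becomes feedback for a rewrite.
draft = call_llm("Write a short email to a client explaining a release delay.")
critique = call_llm(f"Review this email for professionalism. List any issues:\n\n{draft}")
revised = call_llm(f"Rewrite the email, addressing these issues:\n{critique}\n\nEmail:\n{draft}")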

External tools: Run code, execute tests, validate against schemas

result = run_tests(generated_code)
if result.failures:
    feedback = f"Tests failed: {result.failures}"

Validation rules: Programmatic checks for format, length, content

if "required_field" not in json.loads(output):
feedback = "Missing required_field in response"

Human input: Direct feedback from users (useful for subjective criteria)

A Complete Example: Self-Correcting Code Generation

Let’s build a system that generates code until it passes tests:

def generate_code_with_feedback(task, test_cases, max_iterations=5):
    code = None
    feedback = None

    for i in range(max_iterations):
        # Generate (or refine) code
        if code is None:
            prompt = f"Write Python code to: {task}"
        else:
            prompt = f"""Fix this code based on feedback.
Task: {task}
Current code: {code}
Feedback: {feedback}"""

        code = call_llm(prompt)

        # Evaluate by running tests
        results = run_tests(code, test_cases)

        if results.all_passed:
            return code, i + 1  # Success!

        # Generate feedback from test failures
        feedback = format_test_failures(results)

    raise Exception(f"Could not generate passing code after {max_iterations} attempts")

A typical iteration might look like:

Iteration   Tests Passed   Feedback
1           1/4            “Edge case: empty list not handled”
2           2/4            “Negative numbers cause IndexError”
3           3/4            “Off-by-one error in loop bounds”
4           4/4            Success!

Monitoring and Observability

For production systems, you need visibility into what’s happening:

Essential Metrics

  • Success rate: What percentage of chains complete successfully?
  • Iteration count: How many feedback loops before success?
  • Step timing: Where are the bottlenecks?
  • Failure modes: What types of errors occur most?
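
A minimal sketch of how these might be collected in-process (a real system would emit them to a metrics or tracing backend; the structure below is an assumption for illustration):

import time
from collections import Counter

# Illustrative in-memory metrics store.
metrics = {
    "runs": 0,
    "successes": 0,
    "iteration_counts": [],      # feedback-loop iterations per successful run
    "step_seconds": Counter(),   # cumulative wall-clock time per step name
    "failure_modes": Counter(),  # error category -> count
}

def timed_step(name, fn, *args, **kwargs):
    """Run one chain step and record how long it took."""
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        metrics["step_seconds"][name] += time.perf_counter() - start

def record_run(success: bool, iterations: int, failure_mode: str = None):
    """Record the outcome of one full chain run."""
    metrics["runs"] += 1
    if success:
        metrics["successes"] += 1
        metrics["iteration_counts"].append(iterations)
    elif failure_mode:
        metrics["failure_modes"][failure_mode] += 1

def success_rate() -> float:
    return metrics["successes"] / metrics["runs"] if metrics["runs"] else 0.0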

Detailed Tracing

Log every step for debugging:

{
  "step": "generate_code",
  "iteration": 2,
  "prompt": "...",
  "response": "...",
  "validation": {
    "passed": false,
    "feedback": "Tests failed: ..."
  }
}

This trace is invaluable when things go wrong. You can see exactly what the model received, what it produced, and why validation failed.
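
One way to collect these traces is to wrap each chain step in a small logging helper, for example (a sketch; the file destination and helper names are assumptions, and in practice the feedback field would carry the validator's actual message rather than a placeholder):

import json
import time

TRACE_PATH = "trace.jsonl"  # assumption: one JSON object per line

def log_trace(record: dict):
    """Append one structured trace record as a JSON line."""
    with open(TRACE_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def traced_step(step_name: str, prompt: str, validator, iteration: int = 1) -> str:
    """Call the LLM, run its gate check, and log the whole exchange."""
    start = time.perf_counter()
    response = call_llm(prompt)
    passed = validator(response)
    log_trace({
        "step": step_name,
        "iteration": iteration,
        "prompt": prompt,
        "response": response,
        "seconds": round(time.perf_counter() - start, 3),
        "validation": {"passed": passed, "feedback": None if passed else "gate check failed"},
    })
    return response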

Putting It All Together

A robust AI workflow combines everything we’ve covered:

  1. Sequential prompting: Break complex tasks into focused steps
  2. Prompt chaining: Connect steps programmatically
  3. Gate checks: Validate outputs between steps
  4. Feedback loops: Iterate on failures with specific guidance
  5. Monitoring: Track success rates and debug failures
flowchart TD
    subgraph Chain["Prompt Chain"]
        S1[Step 1] --> V1{Validate}
        V1 -->|Pass| S2[Step 2]
        V1 -->|Fail| F1[Feedback Loop 1]
        F1 --> S1

        S2 --> V2{Validate}
        V2 -->|Pass| S3[Step 3]
        V2 -->|Fail| F2[Feedback Loop 2]
        F2 --> S2

        S3 --> V3{Final Check}
        V3 -->|Pass| Done[Complete]
        V3 -->|Fail| F3[Feedback Loop 3]
        F3 --> S3
    end

    Monitor[Monitoring & Logging] -.-> Chain
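
Sketched in code, the full workflow is just the run_with_validation helper from earlier applied to each step of the chain (the validators below are the illustrative ones from the gate-check sketch; swap in whatever checks your task needs):

def run_linkedin_pipeline(topic):
    """Chain + gates + feedback for the research -> summarize -> post flow."""
    research = run_with_validation(
        f"Research the key concepts of {topic}.",
        validator=lambda out: check_length(out, min_chars=300),
    )
    summary = run_with_validation(
        f"Summarize these research findings: {research}",
        validator=lambda out: check_content(out, required_keywords=[topic]),
    )
    post = run_with_validation(
        f"Write a LinkedIn post based on this summary: {summary}",
        validator=lambda out: check_length(out, max_chars=3000),
    )
    return post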

Key Takeaways

  1. Decompose complexity: Multiple focused prompts beat one massive prompt
  2. Validate early, validate often: Gate checks prevent error cascades
  3. Feedback enables improvement: Let AI learn from its mistakes within tasks
  4. Multiple feedback sources: Tools, rules, self-critique, and users all help
  5. Instrument everything: You can’t improve what you can’t measure

These patterns form the foundation for building AI systems that actually work in production - not just demos that succeed under perfect conditions.


This wraps up the foundational prompting techniques. In the upcoming posts, I’ll explore agentic workflows - how to design and implement AI agents that can handle complex, multi-step tasks autonomously.
