RAG and Evaluation for Financial Agents

Financial agents need two critical capabilities that set them apart from simple chatbots: the ability to ground responses in authoritative documents through intelligent retrieval, and continuous evaluation to ensure accuracy in regulated environments. Agentic RAG transforms retrieval from a passive lookup into an active reasoning loop, while long-term memory enables personalization across sessions. Together with robust evaluation frameworks, these capabilities create production-ready financial assistants.

From Basic RAG to Agentic RAG

Traditional Retrieval-Augmented Generation follows a simple pattern: accept a query, retrieve documents, generate an answer. This works for straightforward questions but falls short in complex financial scenarios where initial retrieval might miss critical context.

flowchart TB
    subgraph Agentic["Agentic RAG"]
        direction LR
        Q2[Query] --> R2[Retrieve]
        R2 --> E[Evaluate]
        E --> D{Sufficient?}
        D -->|No| RF[Reformulate]
        RF --> R2
        D -->|Yes| G2[Generate]
        G2 --> A2[Answer]
    end

    subgraph Traditional["Traditional RAG"]
        direction LR
        Q1[Query] --> R1[Retrieve Once]
        R1 --> G1[Generate]
        G1 --> A1[Answer]
    end

    classDef pinkClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class Traditional pinkClass
    class Agentic greenClass

Agentic RAG introduces a feedback loop where the agent:

  • Assesses whether retrieved documents are relevant
  • Identifies missing information
  • Reformulates queries when needed
  • Retries retrieval with improved strategies before answering

The Retrieve-Reason-Retry Pattern

Consider a compliance question about investment restrictions. A basic RAG might surface general policy documents and produce a vague answer. An agentic RAG agent would:

  1. Retrieve initial documents about investment policies
  2. Reason about what's missing (specific asset class restrictions, regulatory citations)
  3. Retry with refined queries targeting compliance frameworks
from dataclasses import dataclass
from typing import List

@dataclass
class RAGResponse:
    """Response from an agentic RAG agent"""
    query: str
    answer: str
    sources: List[str]
    retrieved_chunks: int
    needed_retry: bool
    confidence: str  # "high", "medium", "low"

Financial Policy Assistant

Here's how an agentic RAG agent handles employee benefits queries:

class PolicyRAGAgent:
    """Autonomous RAG agent with self-improvement capabilities"""

    def __init__(self, collection):
        self.collection = collection
        self.llm = OpenAI()  # assumes a thin client whose .complete() wraps a chat completion call

    def process_query(self, question: str) -> RAGResponse:
        """Full RAG workflow with reflection and retry"""

        # Step 1: Initial retrieval
        docs = self.retrieve_documents(question)

        # Step 2: Generate initial answer
        answer = self.generate_answer(question, docs)

        # Step 3: Self-evaluate - does the answer need more context?
        if self.should_retry(question, answer, docs):
            # Step 4: Improve the query and retry
            better_query = self.improve_query(question, answer)
            docs = self.retrieve_documents(better_query)
            answer = self.generate_answer(question, docs)
            needed_retry = True
        else:
            needed_retry = False

        # Step 5: Assess confidence
        confidence = self.assess_confidence(answer, docs)

        return RAGResponse(
            query=question,
            answer=answer,
            sources=[d.metadata['source'] for d in docs],
            retrieved_chunks=len(docs),
            needed_retry=needed_retry,
            confidence=confidence
        )

    def should_retry(self, question: str, answer: str,
                     docs: List) -> bool:
        """Use the LLM to evaluate whether the answer is sufficient"""

        prompt = f"""
        Question: {question}
        Answer: {answer}
        Documents used: {len(docs)}

        Is this answer complete and well-grounded?
        Consider:
        - Does it directly address the question?
        - Are there gaps in the information?
        - Does it rely on speculation vs. retrieved facts?

        Respond with just YES or NO.
        """

        response = self.llm.complete(prompt)
        return "NO" in response.upper()

    def improve_query(self, original: str, initial_answer: str) -> str:
        """Generate a better search query based on gaps"""

        prompt = f"""
        Original question: {original}
        Initial answer (incomplete): {initial_answer}

        Generate a more specific search query that would find
        the missing information. Focus on what the answer lacks.
        """

        return self.llm.complete(prompt).strip()
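
The class above leaves retrieve_documents, generate_answer, and assess_confidence undefined. A minimal sketch of those helpers, assuming a LangChain-style vector store whose similarity_search returns Document objects with page_content and metadata (the heuristics here are illustrative, not the author's implementation):

    def retrieve_documents(self, query: str, k: int = 5) -> List:
        """Semantic search over the policy collection (LangChain-style API assumed)"""
        return self.collection.similarity_search(query, k=k)

    def generate_answer(self, question: str, docs: List) -> str:
        """Generate an answer grounded only in the retrieved chunks"""
        context = "\n\n".join(d.page_content for d in docs)
        prompt = f"""
        Answer the question using ONLY the policy excerpts below.
        Cite the source document for each claim.

        Excerpts:
        {context}

        Question: {question}
        """
        return self.llm.complete(prompt)

    def assess_confidence(self, answer: str, docs: List) -> str:
        """Crude confidence heuristic: more supporting chunks means higher confidence"""
        if not docs or "don't have enough information" in answer.lower():
            return "low"
        return "high" if len(docs) >= 3 else "medium"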

Handling Out-of-Scope Questions

A well-designed agentic RAG agent gracefully handles questions outside its knowledge base:

def answer_with_boundaries(self, question: str) -> RAGResponse:
    """Answer with awareness of knowledge boundaries"""

    response = self.process_query(question)

    # If confidence is low even after a retry, acknowledge limitations
    if response.needed_retry and response.confidence == "low":
        response.answer = (
            "I don't have enough information in the policy documents "
            "to answer that question. This may require consulting "
            "with the relevant department directly."
        )

    return response
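
A hypothetical exchange, assuming answer_with_boundaries is attached to PolicyRAGAgent and only benefits policies are indexed (policy_store is an illustrative, pre-built vector store):

agent = PolicyRAGAgent(collection=policy_store)
response = agent.answer_with_boundaries("What is our Bitcoin custody policy?")

print(response.confidence)    # likely "low": no crypto custody documents are indexed
print(response.needed_retry)  # True: the first retrieval pass came back thin
print(response.answer)        # the graceful "I don't have enough information..." fallback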

Long-Term Memory for Financial Agents

While short-term memory maintains coherence within a session, long-term memory enables agents to recall facts and preferences across sessions - essential for personalized financial advice.

flowchart TD
    subgraph Memory["Long-Term Memory Types"]
        S[Semantic Memory<br/>Facts & Preferences]
        E[Episodic Memory<br/>Past Interactions]
        P[Procedural Memory<br/>Behavioral Adaptations]
    end

    U[User Interaction] --> S
    U --> E
    U --> P

    S --> R[Recall for Future Sessions]
    E --> R
    P --> R

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class S blueClass
    class E orangeClass
    class P greenClass

Three Types of Long-Term Memory

Semantic Memory stores factual information learned from users (see the typed-record sketch after these lists):

  • Client names and account numbers
  • Risk tolerance preferences
  • Investment goals and constraints

Episodic Memory captures past interactions:

  • Previous advisory conversations
  • Successful problem resolutions
  • Transaction history context

Procedural Memory encodes behavioral adaptations:

  • Preferred communication style (formal vs. casual)
  • Report formatting preferences
  • Escalation thresholds
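
To make the three types concrete, here is a minimal sketch of typed memory records (the MemoryType enum and field names are illustrative, not from a specific library):

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class MemoryType(Enum):
    SEMANTIC = "semantic"      # durable facts: "risk tolerance: conservative"
    EPISODIC = "episodic"      # events: "2024-03-01 discussed bond ladder"
    PROCEDURAL = "procedural"  # behaviors: "prefers bullet-point summaries"

@dataclass
class TypedMemory:
    memory_type: MemoryType
    content: str
    user_id: str
    created_at: datetime = field(default_factory=datetime.utcnow)

# One example of each type for the same treasury client:
memories = [
    TypedMemory(MemoryType.SEMANTIC, "Client risk tolerance: conservative", "u-42"),
    TypedMemory(MemoryType.EPISODIC, "Resolved a duplicate wire issue on the last call", "u-42"),
    TypedMemory(MemoryType.PROCEDURAL, "Wants reports as one-page bullet summaries", "u-42"),
]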

Memory-Enabled Treasury Assistant

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional

from sqlalchemy import (Boolean, Column, DateTime, Float, Integer,
                        JSON, String, Text)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MemoryEntry(Base):
    """Database model for agent memory"""
    __tablename__ = 'treasury_memory'

    id = Column(String(36), primary_key=True)

    # Core memory data
    topic = Column(String(100), nullable=False, index=True)
    fact_text = Column(Text, nullable=False)
    source = Column(String(50), nullable=False)  # "cfo", "policy", "interaction"

    # Temporal metadata
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, onupdate=datetime.utcnow)
    ttl_days = Column(Integer, nullable=True)  # Time-to-live

    # Prioritization
    weight = Column(Float, default=1.0)
    pinned = Column(Boolean, default=False)
    access_count = Column(Integer, default=1)

    # Organization
    tags = Column(JSON, default=list)

    def is_expired(self) -> bool:
        """Check if the memory has expired based on its TTL"""
        if self.ttl_days is None:
            return False
        expiry = self.created_at + timedelta(days=self.ttl_days)
        return datetime.utcnow() > expiry


@dataclass
class MemoryPolicy:
    """Configuration for memory management"""

    # TTL by topic (days) - None means permanent
    ttl_policies: Dict[str, Optional[int]] = field(default_factory=lambda: {
        'cash_policy': None,           # Permanent
        'investment_rules': None,      # Permanent
        'counterparty_limits': None,   # Permanent
        'market_opportunities': 30,    # Expire after 30 days
        'temporary_approvals': 90,     # Expire after 90 days
    })

    # Weighting formula
    recency_boost: float = 0.1
    frequency_boost: float = 0.05
    pinned_boost: float = 2.0

    max_digest_size: int = 12

Intelligent Memory Manager

The memory manager uses an LLM to canonicalize facts for consistent storage:

import uuid

class TreasuryMemoryManager:
    """Intelligent memory manager with LLM integration"""

    def __init__(self, policy: MemoryPolicy = None):
        self.policy = policy or MemoryPolicy()
        self.session = SessionLocal()  # SessionLocal: a standard SQLAlchemy session factory
        self.llm = OpenAI()            # needed by canonicalize_fact below

    def canonicalize_fact(self, fact_text: str, topic: str) -> str:
        """Normalize fact text for consistent storage"""

        prompt = f"""
        Canonicalize this {topic} policy into a clear statement:

        Original: {fact_text}

        Rules:
        - Use standard financial terminology
        - Format amounts as $X,XXX,XXX or $XM
        - Use standard percentage format
        - Keep compliance language precise
        - Maintain core meaning

        Canonicalized fact:
        """

        response = self.llm.complete(prompt, temperature=0.1)
        return response.strip()

    def upsert_memory(self, topic: str, fact: str,
                      source: str = "interaction") -> MemoryEntry:
        """Add or update a memory entry"""

        # Canonicalize for consistency
        canonical_fact = self.canonicalize_fact(fact, topic)

        # Check for an existing similar memory
        existing = self.find_similar(canonical_fact, topic)

        if existing:
            # Update the existing memory
            existing.fact_text = canonical_fact
            existing.access_count += 1
            existing.updated_at = datetime.utcnow()
            return existing

        # Create a new memory
        entry = MemoryEntry(
            id=str(uuid.uuid4()),
            topic=topic,
            fact_text=canonical_fact,
            source=source,
            ttl_days=self.policy.ttl_policies.get(topic)
        )
        self.session.add(entry)
        self.session.commit()
        return entry

    def get_policy_digest(self, max_items: int = None) -> List[str]:
        """Get a prioritized list of active policies"""

        max_items = max_items or self.policy.max_digest_size

        # Query non-expired memories
        memories = self.session.query(MemoryEntry).all()
        active = [m for m in memories if not m.is_expired()]

        # Score and sort by priority
        scored = []
        for m in active:
            score = self.calculate_priority(m)
            scored.append((score, m))

        scored.sort(reverse=True, key=lambda x: x[0])

        # Return top items with a pinned indicator
        result = []
        for _, m in scored[:max_items]:
            prefix = "📌 " if m.pinned else ""
            result.append(f"{prefix}{m.fact_text}")

        return result

    def calculate_priority(self, memory: MemoryEntry) -> float:
        """Calculate a dynamic priority score"""

        base = memory.weight

        # Recency boost
        days_old = (datetime.utcnow() - memory.created_at).days
        recency = self.policy.recency_boost * max(0, 30 - days_old)

        # Frequency boost
        frequency = self.policy.frequency_boost * memory.access_count

        # Pinned boost
        pinned = self.policy.pinned_boost if memory.pinned else 0

        return base + recency + frequency + pinned
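
find_similar is not defined above. A minimal sketch, assuming exact-topic filtering plus an LLM same-fact check; a production version would more likely use embedding similarity:

    def find_similar(self, canonical_fact: str, topic: str) -> Optional[MemoryEntry]:
        """Return an existing entry that states the same fact, if any (illustrative)"""
        candidates = (self.session.query(MemoryEntry)
                      .filter(MemoryEntry.topic == topic)
                      .all())
        for entry in candidates:
            verdict = self.llm.complete(
                f"Do these two statements express the same policy? "
                f"Answer YES or NO.\nA: {entry.fact_text}\nB: {canonical_fact}"
            )
            if "YES" in verdict.upper():
                return entry
        return None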

Memory Scoping

Memories can be scoped at different levels:

class ScopedMemoryManager:
    """Memory manager with access scoping"""

    def retrieve_memories(self, user_id: str, team_id: str = None,
                          include_global: bool = True) -> List[MemoryEntry]:
        """Retrieve memories at the appropriate scope levels"""

        memories = []

        # User-specific memories (highest priority)
        memories.extend(
            self.get_by_scope("user", user_id)
        )

        # Team-shared memories
        if team_id:
            memories.extend(
                self.get_by_scope("team", team_id)
            )

        # Global organizational memories
        if include_global:
            memories.extend(
                self.get_by_scope("global", "org")
            )

        return self.deduplicate(memories)
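
get_by_scope and deduplicate are left undefined above; a minimal sketch, assuming the memory model gains scope and scope_id columns (not present in MemoryEntry as shown earlier):

    def get_by_scope(self, scope: str, scope_id: str) -> List[MemoryEntry]:
        """Fetch entries for one scope level (assumes scope/scope_id columns)"""
        return (self.session.query(MemoryEntry)
                .filter(MemoryEntry.scope == scope,
                        MemoryEntry.scope_id == scope_id)
                .all())

    def deduplicate(self, memories: List[MemoryEntry]) -> List[MemoryEntry]:
        """Keep the first occurrence of each fact; earlier (narrower) scopes win"""
        seen, unique = set(), []
        for m in memories:
            if m.fact_text not in seen:
                seen.add(m.fact_text)
                unique.append(m)
        return unique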

Agent Evaluation

Building agents is only half the challenge - ensuring they work correctly, safely, and efficiently requires systematic evaluation. This is especially critical in financial services where errors have regulatory and monetary consequences.

Four Dimensions of Evaluation

flowchart TD
    E[Agent Evaluation] --> TC[Task Completion]
    E --> QC[Quality Control]
    E --> TI[Tool Interaction]
    E --> SM[System Metrics]

    TC --> TC1[Goal achieved?]
    TC --> TC2[Steps required?]
    TC --> TC3[Human intervention?]

    QC --> QC1[Correct format?]
    QC --> QC2[Instructions followed?]
    QC --> QC3[Context used?]

    TI --> TI1[Right tools?]
    TI --> TI2[Valid arguments?]
    TI --> TI3[Appropriate sequence?]

    SM --> SM1[Latency]
    SM --> SM2[Token usage]
    SM --> SM3[Error rate]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef pinkClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff

    class TC blueClass
    class QC orangeClass
    class TI greenClass
    class SM pinkClass
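
One way to carry these four dimensions through a test run is a simple per-run scorecard; this sketch is illustrative, not from any particular framework:

from dataclasses import dataclass

@dataclass
class AgentScorecard:
    # Task completion
    goal_achieved: bool
    steps_taken: int
    needed_human: bool
    # Quality control
    format_valid: bool
    followed_instructions: bool
    # Tool interaction
    correct_tool_rate: float    # fraction of calls using the expected tool
    valid_argument_rate: float  # fraction of calls with valid arguments
    # System metrics
    latency_ms: float
    total_tokens: int
    error_count: int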

Three Evaluation Strategies

1. Final Response Evaluation
Judges only the end result - ideal for end-to-end testing but provides no insight into how the result was produced.

from typing import Dict, List

def evaluate_final_response(agent_answer: str,
                            correct_answer: str,
                            question: str) -> Dict:
    """Black-box evaluation of the final output"""

    prompt = f"""
    Evaluate the agent's answer against the correct answer.

    Question: {question}
    Correct Answer: {correct_answer}
    Agent Answer: {agent_answer}

    Score accuracy 0-100 based on:
    - 90-100: Fully correct, all key facts present
    - 70-89: Mostly correct, minor omissions
    - 50-69: Partially correct, significant gaps
    - Below 50: Incorrect or misleading

    Return JSON: {{"score": <number>, "reasoning": "<explanation>"}}
    """

    # `llm` is a module-level client with the same .complete() helper used earlier
    return llm.complete(prompt, response_format="json")

2. Single-Step Evaluation
Focuses on individual decisions - fast and ideal for debugging specific issues.

def evaluate_tool_choice(context: str, available_tools: List[str],
                         chosen_tool: str, expected_tool: str) -> Dict:
    """Evaluate a single tool-selection decision"""

    return {
        "correct": chosen_tool == expected_tool,
        "chosen": chosen_tool,
        "expected": expected_tool,
        "context_summary": context[:200]
    }

3. Trajectory Evaluation
Reviews the full execution path - richest insight but requires more setup.

def evaluate_trajectory(execution_trace: List[Dict],
                        expected_trajectory: List[Dict]) -> Dict:
    """Evaluate the complete agent execution path"""

    results = {
        "tool_accuracy": 0,
        "argument_accuracy": 0,
        "sequence_score": 0,   # placeholder: order-sensitive matching not shown here
        "efficiency_score": 0
    }

    # Compare tool calls step by step; arguments only count when the tool matches
    for actual, expected in zip(execution_trace, expected_trajectory):
        if actual['tool'] == expected['tool']:
            results['tool_accuracy'] += 1
            if actual.get('args') == expected.get('args'):
                results['argument_accuracy'] += 1

    # Normalize scores
    total = len(expected_trajectory)
    results['tool_accuracy'] /= total
    results['argument_accuracy'] /= total

    # Efficiency: fewer steps than expected is better (guard against empty traces)
    if execution_trace:
        results['efficiency_score'] = min(1.0, total / len(execution_trace))

    return results
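
The trace format implied above is a list of {'tool': ..., 'args': ...} dicts; a hypothetical run (tool names are illustrative):

expected = [
    {"tool": "lookup_counterparty", "args": {"name": "ACME Corp"}},
    {"tool": "get_exposure_limit", "args": {"counterparty_id": "CP-104"}},
]
actual = [
    {"tool": "lookup_counterparty", "args": {"name": "ACME Corp"}},
    {"tool": "get_exposure_limit", "args": {"counterparty_id": "CP-104"}},
    {"tool": "get_exposure_limit", "args": {"counterparty_id": "CP-104"}},  # redundant retry
]

print(evaluate_trajectory(actual, expected))
# tool_accuracy 1.0, argument_accuracy 1.0, efficiency_score ~0.67 (one wasted step)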

Financial Agent Evaluation Framework

For insurance claims processing, we evaluate three weighted metrics:

import re

class InsuranceAgentEvaluator:
    """Evaluation framework for an insurance claims assistant"""

    def evaluate_complete_response(self, agent_response: Dict,
                                   gold_item: Dict) -> Dict:
        """Comprehensive evaluation with weighted scoring"""

        # Metric 1: Factual Accuracy (40% weight)
        accuracy = self.evaluate_factual_accuracy(
            agent_response['answer'],
            gold_item['correct_answer'],
            gold_item['question']
        )

        # Metric 2: Citation Compliance (30% weight)
        citation = self.evaluate_citation_compliance(
            agent_response['answer'],
            gold_item['requires_citation'],
            agent_response.get('sources', [])
        )

        # Metric 3: Retrieval Relevance (30% weight)
        retrieval = self.evaluate_retrieval_relevance(
            gold_item['question'],
            agent_response['retrieved_docs'],
            gold_item['expected_doc_ids']
        )

        # Weighted composite score
        composite = (
            accuracy['score'] * 0.40 +
            citation['score'] * 0.30 +
            retrieval['score'] * 0.30
        )

        return {
            'composite_score': composite,
            'factual_accuracy': accuracy,
            'citation_compliance': citation,
            'retrieval_relevance': retrieval
        }

    def evaluate_citation_compliance(self, answer: str,
                                     should_cite: bool,
                                     sources: List[str]) -> Dict:
        """Check for proper source attribution"""

        citation_patterns = [
            r'\[.*?\]',        # [Source Name]
            r'according to',   # According to policy...
            r'source:',        # Source: Document
            r'per the',        # Per the guidelines
            r'as stated in'    # As stated in...
        ]

        citations_found = []
        for pattern in citation_patterns:
            matches = re.findall(pattern, answer, re.IGNORECASE)
            citations_found.extend(matches)

        has_citations = len(citations_found) > 0

        if should_cite:
            score = 100 if has_citations else 0
        else:
            score = 100  # No citation required

        return {
            'score': score,
            'citations_found': citations_found,
            'sources_used': sources
        }

    def evaluate_retrieval_relevance(self, question: str,
                                     retrieved_docs: List,
                                     expected_doc_ids: List[str]) -> Dict:
        """Measure retrieval quality with precision/recall"""

        retrieved_ids = [doc.metadata.get('doc_id') for doc in retrieved_docs]

        expected_set = set(expected_doc_ids)
        retrieved_set = set(retrieved_ids)

        if retrieved_set:
            precision = len(expected_set & retrieved_set) / len(retrieved_set)
        else:
            precision = 0

        if expected_set:
            recall = len(expected_set & retrieved_set) / len(expected_set)
        else:
            recall = 1.0

        # F1 score: harmonic mean of precision and recall
        if precision + recall > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0

        return {
            'score': f1 * 100,
            'precision': precision,
            'recall': recall,
            'retrieved_count': len(retrieved_ids),
            'expected_count': len(expected_doc_ids)
        }
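
evaluate_factual_accuracy is called above but not shown; a minimal LLM-as-judge sketch in the style of evaluate_final_response (the module-level llm client and its JSON response mode are the same assumptions as before):

    def evaluate_factual_accuracy(self, answer: str, correct_answer: str,
                                  question: str) -> Dict:
        """LLM-as-judge accuracy scoring, 0-100 (illustrative)"""
        prompt = f"""
        Question: {question}
        Correct Answer: {correct_answer}
        Agent Answer: {answer}

        Score factual accuracy 0-100 using the rubric from evaluate_final_response.
        Return JSON: {{"score": <number>, "reasoning": "<explanation>"}}
        """
        return llm.complete(prompt, response_format="json")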

Running Evaluation Suites

import pandas as pd

def run_evaluation_suite(agent, golden_dataset: List[Dict]) -> Dict:
    """Execute the full evaluation across a test dataset"""

    evaluator = InsuranceAgentEvaluator()
    results = []

    for item in golden_dataset:
        # Get the agent's response
        response = agent.answer_question(item['question'])

        # Evaluate it against the gold item
        eval_result = evaluator.evaluate_complete_response(response, item)
        eval_result['question_id'] = item['id']
        eval_result['category'] = item['category']
        results.append(eval_result)

    # Aggregate metrics
    df = pd.DataFrame(results)

    return {
        'avg_composite': df['composite_score'].mean(),
        'avg_accuracy': df['factual_accuracy'].apply(lambda x: x['score']).mean(),
        'avg_citation': df['citation_compliance'].apply(lambda x: x['score']).mean(),
        'avg_retrieval': df['retrieval_relevance'].apply(lambda x: x['score']).mean(),
        'results_by_category': df.groupby('category')['composite_score'].mean().to_dict(),
        'detailed_results': results
    }
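
The suite implies a golden-dataset schema; a hypothetical entry using the field names the evaluator reads (the claim content itself is invented for illustration):

golden_dataset = [
    {
        "id": "claims-001",
        "category": "auto_claims",
        "question": "What is the deductible for a windshield-only claim?",
        "correct_answer": "Glass-only claims carry a $0 deductible under the comprehensive rider.",
        "requires_citation": True,
        "expected_doc_ids": ["policy-auto-glass-2024"],
    },
    # ... more items per category ...
]

metrics = run_evaluation_suite(agent, golden_dataset)
print(f"Composite: {metrics['avg_composite']:.1f}")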

Generating Improvement Recommendations

Evaluation drives improvement by identifying specific weaknesses:

def generate_recommendations(metrics: Dict, results_df) -> List[str]:
    """Generate actionable recommendations from evaluation results"""

    recommendations = []

    # Low retrieval relevance
    if metrics['avg_retrieval'] < 70:
        recommendations.append(
            "RETRIEVAL: Improve document chunking strategy. Current chunks "
            "may be too large or missing key metadata. Consider semantic "
            "chunking based on policy sections."
        )

    # Low citation compliance
    if metrics['avg_citation'] < 80:
        recommendations.append(
            "CITATIONS: Add explicit citation instructions to system prompt. "
            "Require agent to reference specific policy documents for all "
            "regulatory guidance."
        )

    # Low accuracy on specific categories
    for category, score in metrics['results_by_category'].items():
        if score < 75:
            recommendations.append(
                f"CATEGORY-{category.upper()}: Performance is below threshold. "
                f"Consider adding more {category} examples to training data "
                f"or improving {category}-specific retrieval."
            )

    return recommendations

Takeaways

  1. Agentic RAG transforms retrieval from passive lookup to active reasoning with retrieve-reason-retry loops

  2. Self-evaluation enables quality control - agents assess their own answers and retry when confidence is low

  3. Long-term memory creates personalization through three types: semantic (facts), episodic (interactions), and procedural (behaviors)

  4. Memory management requires policies - TTL settings, priority scoring, and scope levels prevent bloat and maintain relevance

  5. Evaluation spans four dimensions: task completion, quality control, tool interaction, and system metrics

  6. Three evaluation strategies serve different needs: final response for end-to-end, single-step for debugging, trajectory for comprehensive analysis

  7. Weighted composite scores provide holistic assessment while highlighting specific areas for improvement


This is the eleventh post in my Applied Agentic AI for Finance series. Next: Multi-Agent Architecture for Trading, where we'll explore coordinating multiple specialized agents for complex financial operations.
