Financial agents need two critical capabilities that set them apart from simple chatbots: the ability to ground responses in authoritative documents through intelligent retrieval, and continuous evaluation to ensure accuracy in regulated environments. Agentic RAG transforms retrieval from a passive lookup into an active reasoning loop, while long-term memory enables personalization across sessions. Together with robust evaluation frameworks, these capabilities create production-ready financial assistants.
From Basic RAG to Agentic RAG
Traditional Retrieval-Augmented Generation follows a simple pattern: accept a query, retrieve documents, generate an answer. This works for straightforward questions but falls short in complex financial scenarios where initial retrieval might miss critical context.
```mermaid
flowchart TB
    subgraph Agentic["Agentic RAG"]
        direction LR
        Q2[Query] --> R2[Retrieve]
        R2 --> E[Evaluate]
        E --> D{Sufficient?}
        D -->|No| RF[Reformulate]
        RF --> R2
        D -->|Yes| G2[Generate]
        G2 --> A2[Answer]
    end
    subgraph Traditional["Traditional RAG"]
        direction LR
        Q1[Query] --> R1[Retrieve Once]
        R1 --> G1[Generate]
        G1 --> A1[Answer]
    end
    classDef pinkClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    class Traditional pinkClass
    class Agentic greenClass
```
Agentic RAG introduces a feedback loop where the agent:
Assesses whether retrieved documents are relevant
Identifies missing information
Reformulates queries when needed
Retries retrieval with improved strategies before answering
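The loop above can be sketched as a small driver function. The `retrieve`, `generate`, `is_sufficient`, and `reformulate` callables are placeholders for your own retriever and LLM calls, not a specific framework API:

```python
from typing import Callable, List, Tuple

def agentic_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_sufficient: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, str], str],
    max_retries: int = 2,
) -> Tuple[str, bool]:
    """Retrieve-evaluate-reformulate-retry loop; returns (answer, needed_retry)."""
    needed_retry = False
    current = query
    for _ in range(max_retries + 1):
        docs = retrieve(current)
        answer = generate(current, docs)
        # Self-evaluation gate: stop as soon as the answer is well-grounded
        if is_sufficient(answer, docs):
            return answer, needed_retry
        needed_retry = True
        # Reformulate against the original question, using the gaps in
        # the current answer as a hint for what is still missing
        current = reformulate(query, answer)
    return answer, needed_retry
```

Capping `max_retries` matters in practice: without it, a question the corpus cannot answer would loop forever instead of falling through to the boundary-handling path described later.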
The Retrieve-Reason-Retry Pattern
Consider a compliance question about investment restrictions. A basic RAG might surface general policy documents and produce a vague answer. An agentic RAG agent would:
Retrieve initial documents about investment policies
Reason about what's missing (specific asset class restrictions, regulatory citations)
Retry with refined queries targeting compliance frameworks
```python
from dataclasses import dataclass
from typing import List

@dataclass
class RAGResponse:
    """Response from an agentic RAG agent"""
    query: str
    answer: str
    sources: List[str]
    retrieved_chunks: int
    needed_retry: bool
    confidence: str  # "high", "medium", "low"
```
Financial Policy Assistant
Here's how an agentic RAG agent handles employee benefits queries:
```python
# Inside process_query(): package the final structured response
return RAGResponse(
    query=question,
    answer=answer,
    sources=[d.metadata['source'] for d in docs],
    retrieved_chunks=len(docs),
    needed_retry=needed_retry,
    confidence=confidence,
)
```

```python
def should_retry(self, question: str, answer: str, docs: List) -> bool:
    """Use LLM to evaluate if answer is sufficient"""
    prompt = f"""
    Question: {question}
    Answer: {answer}
    Documents used: {len(docs)}

    Is this answer complete and well-grounded? Consider:
    - Does it directly address the question?
    - Are there gaps in the information?
    - Does it rely on speculation vs. retrieved facts?

    Respond with just YES or NO.
    """
    return self.llm.complete(prompt).strip().upper() == "NO"

def improve_query(self, original: str, initial_answer: str) -> str:
    """Generate a better search query based on gaps"""
    prompt = f"""
    Original question: {original}
    Initial answer (incomplete): {initial_answer}

    Generate a more specific search query that would find the
    missing information. Focus on what the answer lacks.
    """
    return self.llm.complete(prompt).strip()
```
Handling Out-of-Scope Questions
A well-designed agentic RAG gracefully handles questions outside its knowledge base:
```python
def answer_with_boundaries(self, question: str) -> RAGResponse:
    """Answer with awareness of knowledge boundaries"""
    response = self.process_query(question)

    # If confidence is low after retry, acknowledge limitations
    if response.needed_retry and response.confidence == "low":
        response.answer = (
            "I don't have enough information in the policy documents "
            "to answer that question. This may require consulting "
            "with the relevant department directly."
        )
    return response
```
Long-Term Memory for Financial Agents
While short-term memory maintains coherence within a session, long-term memory enables agents to recall facts and preferences across sessions - essential for personalized financial advice.
```mermaid
flowchart TD
    subgraph Memory["Long-Term Memory Types"]
        S["Semantic Memory<br/>Facts & Preferences"]
        E["Episodic Memory<br/>Past Interactions"]
        P["Procedural Memory<br/>Behavioral Adaptations"]
    end
    U[User Interaction] --> S
    U --> E
    U --> P
    S --> R[Recall for Future Sessions]
    E --> R
    P --> R
    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    class S blueClass
    class E orangeClass
    class P greenClass
```
Three Types of Long-Term Memory
Semantic Memory stores factual information learned from users:
```python
def canonicalize_fact(self, fact_text: str, topic: str) -> str:
    """Normalize fact text for consistent storage"""
    prompt = f"""
    Canonicalize this {topic} policy into a clear statement:
    Original: {fact_text}

    Rules:
    - Use standard financial terminology
    - Format amounts as $X,XXX,XXX or $XM
    - Use standard percentage format
    - Keep compliance language precise
    - Maintain core meaning

    Canonicalized fact:
    """
    return self.llm.complete(prompt).strip()
```

Recall applies TTL expiry and priority scoring so the most relevant memories surface first:

```python
# Query non-expired memories
memories = self.session.query(MemoryEntry).all()
active = [m for m in memories if not m.is_expired()]

# Score and sort by priority
scored = []
for m in active:
    score = self.calculate_priority(m)
    scored.append((score, m))
scored.sort(reverse=True, key=lambda x: x[0])

# Return top items with pinned indicator
result = []
for _, m in scored[:max_items]:
    prefix = "📌 " if m.pinned else ""
    result.append(f"{prefix}{m.fact_text}")
```

Scope levels let memories be shared beyond a single user:

```python
# Team-shared memories
if team_id:
    memories.extend(self.get_by_scope("team", team_id))

# Global organizational memories
if include_global:
    memories.extend(self.get_by_scope("global", "org"))

return self.deduplicate(memories)
```
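The snippets above lean on a memory model with expiry, pinning, and priority scoring. A minimal sketch of what such a `MemoryEntry` might look like - the field names and scoring weights here are illustrative assumptions, not the exact schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class MemoryEntry:
    """A single long-term memory with TTL, priority, and scope."""
    fact_text: str
    scope: str = "user"           # "user", "team", or "global"
    pinned: bool = False          # pinned memories always rank first
    importance: float = 0.5       # 0..1, assigned when the fact is stored
    access_count: int = 0         # bumped on every recall
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    ttl_days: Optional[int] = 90  # None means the memory never expires

    def is_expired(self) -> bool:
        if self.ttl_days is None:
            return False
        age = datetime.now(timezone.utc) - self.created_at
        return age > timedelta(days=self.ttl_days)

def calculate_priority(m: MemoryEntry) -> float:
    """Blend importance with usage; pinned entries outrank everything."""
    base = m.importance + 0.1 * min(m.access_count, 10)
    return base + 100.0 if m.pinned else base
```

The TTL default is a policy decision: benefit elections might warrant a short TTL tied to the plan year, while regulatory facts can be pinned with no expiry.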
Agent Evaluation
Building agents is only half the challenge - ensuring they work correctly, safely, and efficiently requires systematic evaluation. This is especially critical in financial services where errors have regulatory and monetary consequences.
Four Dimensions of Evaluation
```mermaid
flowchart TD
    E[Agent Evaluation] --> TC[Task Completion]
    E --> QC[Quality Control]
    E --> TI[Tool Interaction]
    E --> SM[System Metrics]
    TC --> TC1[Goal achieved?]
    TC --> TC2[Steps required?]
    TC --> TC3[Human intervention?]
    QC --> QC1[Correct format?]
    QC --> QC2[Instructions followed?]
    QC --> QC3[Context used?]
    TI --> TI1[Right tools?]
    TI --> TI2[Valid arguments?]
    TI --> TI3[Appropriate sequence?]
    SM --> SM1[Latency]
    SM --> SM2[Token usage]
    SM --> SM3[Error rate]
    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef pinkClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
    class TC blueClass
    class QC orangeClass
    class TI greenClass
    class SM pinkClass
```
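These four dimensions can be rolled into a single weighted composite score for dashboarding while keeping the per-dimension breakdown for diagnosis. A minimal sketch - the weights are illustrative assumptions, not a standard:

```python
from typing import Dict, Optional

# Illustrative weights: task completion dominates because a fast, cheap
# agent that fails its goal is still a failure in a financial context.
DEFAULT_WEIGHTS = {
    "task_completion": 0.40,
    "quality_control": 0.30,
    "tool_interaction": 0.20,
    "system_metrics": 0.10,
}

def composite_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average over dimension scores (each 0-100).

    Normalizes by the weights actually present, so a run that skipped
    a dimension still yields a comparable 0-100 number.
    """
    if weights is None:
        weights = DEFAULT_WEIGHTS
    total_w = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_w
```

Normalizing by the present weights is a deliberate choice: it keeps partial evaluation runs on the same 0-100 scale instead of silently deflating them.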
Three Evaluation Strategies
1. Final Response Evaluation judges only the end result - ideal for end-to-end testing but provides no insight into how the result was produced.
2. Single-Step Evaluation judges one decision at a time - such as a single tool call - making it well suited for debugging specific failures.
3. Trajectory Evaluation judges the full sequence of steps the agent took, giving the most comprehensive analysis at the highest cost.
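As a concrete example, final response evaluation can be approximated with a rubric that checks fact coverage and citation compliance against a reference. This keyword-based grader is a simplified sketch; the field names and source identifiers are assumptions:

```python
from typing import Dict, List

def grade_final_response(
    answer: str,
    expected_facts: List[str],
    required_sources: List[str],
    cited: List[str],
) -> Dict[str, float]:
    """Score an end result only: did the answer cover the expected
    facts, and did it cite the required policy documents?"""
    text = answer.lower()
    fact_hits = sum(f.lower() in text for f in expected_facts)
    citation_hits = sum(s in cited for s in required_sources)
    return {
        "fact_coverage": 100 * fact_hits / max(len(expected_facts), 1),
        "citation_compliance": 100 * citation_hits / max(len(required_sources), 1),
    }
```

In production an LLM judge usually replaces the keyword check, since paraphrased facts ("twenty-three thousand dollars") would defeat literal matching.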
```python
# Low retrieval relevance
if metrics['avg_retrieval'] < 70:
    recommendations.append(
        "RETRIEVAL: Improve document chunking strategy. Current chunks "
        "may be too large or missing key metadata. Consider semantic "
        "chunking based on policy sections."
    )

# Low citation compliance
if metrics['avg_citation'] < 80:
    recommendations.append(
        "CITATIONS: Add explicit citation instructions to system prompt. "
        "Require agent to reference specific policy documents for all "
        "regulatory guidance."
    )

# Low accuracy on specific categories
for category, score in metrics['results_by_category'].items():
    if score < 75:
        recommendations.append(
            f"CATEGORY-{category.upper()}: Performance is below threshold. "
            f"Consider adding more {category} examples to training data "
            f"or improving {category}-specific retrieval."
        )

return recommendations
```
Takeaways
Agentic RAG transforms retrieval from passive lookup to active reasoning with reflect-reformulate-retry loops
Self-evaluation enables quality control - agents assess their own answers and retry when confidence is low
Long-term memory creates personalization through three types: semantic (facts), episodic (interactions), and procedural (behaviors)
Memory management requires policies - TTL settings, priority scoring, and scope levels prevent bloat and maintain relevance
Evaluation spans four dimensions: task completion, quality control, tool interaction, and system metrics
Three evaluation strategies serve different needs: final response for end-to-end, single-step for debugging, trajectory for comprehensive analysis
Weighted composite scores provide holistic assessment while highlighting specific areas for improvement
This is the eleventh post in my Applied Agentic AI for Finance series. Next: Multi-Agent Architecture for Trading, where we'll explore coordinating multiple specialized agents for complex financial operations.