RAG and Evaluation for Financial Agents

Financial agents need two critical capabilities that set them apart from simple chatbots: the ability to ground responses in authoritative documents through intelligent retrieval, and continuous evaluation to ensure accuracy in regulated environments. Agentic RAG transforms retrieval from a passive lookup into an active reasoning loop, while long-term memory enables personalization across sessions. Together with robust evaluation frameworks, these capabilities create production-ready financial assistants.

From Basic RAG to Agentic RAG

Traditional Retrieval-Augmented Generation follows a simple pattern: accept a query, retrieve documents, generate an answer. This works for straightforward questions but falls short in complex financial scenarios where initial retrieval might miss critical context.

flowchart TB
    subgraph Agentic["Agentic RAG"]
        direction LR
        Q2[Query] --> R2[Retrieve]
        R2 --> E[Evaluate]
        E --> D{Sufficient?}
        D -->|No| RF[Reformulate]
        RF --> R2
        D -->|Yes| G2[Generate]
        G2 --> A2[Answer]
    end

    subgraph Traditional["Traditional RAG"]
        direction LR
        Q1[Query] --> R1[Retrieve Once]
        R1 --> G1[Generate]
        G1 --> A1[Answer]
    end

    classDef pinkClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class Traditional pinkClass
    class Agentic greenClass

Agentic RAG introduces a feedback loop where the agent:

  • Assesses whether retrieved documents are relevant
  • Identifies missing information
  • Reformulates queries when needed
  • Retries retrieval with improved strategies before answering

The Retrieve-Reason-Retry Pattern

Consider a compliance question about investment restrictions. A basic RAG might surface general policy documents and produce a vague answer. An agentic RAG agent would:

  1. Retrieve initial documents about investment policies
  2. Reason about what's missing (specific asset class restrictions, regulatory citations)
  3. Retry with refined queries targeting compliance frameworks
from dataclasses import dataclass
from typing import List

@dataclass
class RAGResponse:
    """Response from an agentic RAG agent"""
    query: str
    answer: str
    sources: List[str]
    retrieved_chunks: int
    needed_retry: bool
    confidence: str  # "high", "medium", "low"

Financial Policy Assistant

Here's how an agentic RAG agent handles employee benefits queries:

class PolicyRAGAgent:
    """Autonomous RAG agent with self-improvement capabilities"""

    def __init__(self, collection):
        self.collection = collection
        self.llm = OpenAI()  # assumes a thin client whose .complete() wraps a chat completion call

    def process_query(self, question: str) -> RAGResponse:
        """Full RAG workflow with reflection and retry"""

        # Step 1: Initial retrieval
        docs = self.retrieve_documents(question)

        # Step 2: Generate initial answer
        answer = self.generate_answer(question, docs)

        # Step 3: Self-evaluate - does the answer need more context?
        if self.should_retry(question, answer, docs):
            # Step 4: Improve the query and retry
            better_query = self.improve_query(question, answer)
            docs = self.retrieve_documents(better_query)
            answer = self.generate_answer(question, docs)
            needed_retry = True
        else:
            needed_retry = False

        # Step 5: Assess confidence
        confidence = self.assess_confidence(answer, docs)

        return RAGResponse(
            query=question,
            answer=answer,
            sources=[d.metadata['source'] for d in docs],
            retrieved_chunks=len(docs),
            needed_retry=needed_retry,
            confidence=confidence
        )

    def should_retry(self, question: str, answer: str,
                     docs: List) -> bool:
        """Use the LLM to evaluate whether the answer is sufficient"""

        prompt = f"""
        Question: {question}
        Answer: {answer}
        Documents used: {len(docs)}

        Is this answer complete and well-grounded?
        Consider:
        - Does it directly address the question?
        - Are there gaps in the information?
        - Does it rely on speculation vs. retrieved facts?

        Respond with just YES or NO.
        """

        response = self.llm.complete(prompt)
        return "NO" in response.upper()

    def improve_query(self, original: str, initial_answer: str) -> str:
        """Generate a better search query based on gaps"""

        prompt = f"""
        Original question: {original}
        Initial answer (incomplete): {initial_answer}

        Generate a more specific search query that would find
        the missing information. Focus on what the answer lacks.
        """

        return self.llm.complete(prompt).strip()
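
The class above leaves retrieve_documents, generate_answer, and assess_confidence undefined. A minimal sketch of those helpers, assuming a LangChain-style vector store whose similarity_search returns Document objects with page_content and metadata (the heuristics here are illustrative, not the author's implementation):

    def retrieve_documents(self, query: str, k: int = 5) -> List:
        """Semantic search over the policy collection (LangChain-style API assumed)"""
        return self.collection.similarity_search(query, k=k)

    def generate_answer(self, question: str, docs: List) -> str:
        """Generate an answer grounded only in the retrieved chunks"""
        context = "\n\n".join(d.page_content for d in docs)
        prompt = f"""
        Answer the question using ONLY the policy excerpts below.
        Cite the source document for each claim.

        Excerpts:
        {context}

        Question: {question}
        """
        return self.llm.complete(prompt)

    def assess_confidence(self, answer: str, docs: List) -> str:
        """Crude confidence heuristic: more supporting chunks means higher confidence"""
        if not docs or "don't have enough information" in answer.lower():
            return "low"
        return "high" if len(docs) >= 3 else "medium"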

Handling Out-of-Scope Questions

A well-designed agentic RAG agent gracefully handles questions outside its knowledge base:

def answer_with_boundaries(self, question: str) -> RAGResponse:
    """Answer with awareness of knowledge boundaries"""

    response = self.process_query(question)

    # If confidence is low even after a retry, acknowledge limitations
    if response.needed_retry and response.confidence == "low":
        response.answer = (
            "I don't have enough information in the policy documents "
            "to answer that question. This may require consulting "
            "with the relevant department directly."
        )

    return response
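
A hypothetical exchange, assuming answer_with_boundaries is attached to PolicyRAGAgent and only benefits policies are indexed (policy_store is an illustrative, pre-built vector store):

agent = PolicyRAGAgent(collection=policy_store)
response = agent.answer_with_boundaries("What is our Bitcoin custody policy?")

print(response.confidence)    # likely "low": no crypto custody documents are indexed
print(response.needed_retry)  # True: the first retrieval pass came back thin
print(response.answer)        # the graceful "I don't have enough information..." fallback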

Long-Term Memory for Financial Agents

While short-term memory maintains coherence within a session, long-term memory enables agents to recall facts and preferences across sessions - essential for personalized financial advice.

flowchart TD
    subgraph Memory["Long-Term Memory Types"]
        S[Semantic Memory<br/>Facts & Preferences]
        E[Episodic Memory<br/>Past Interactions]
        P[Procedural Memory<br/>Behavioral Adaptations]
    end

    U[User Interaction] --> S
    U --> E
    U --> P

    S --> R[Recall for Future Sessions]
    E --> R
    P --> R

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff

    class S blueClass
    class E orangeClass
    class P greenClass

Three Types of Long-Term Memory

Semantic Memory stores factual information learned from users (see the typed-record sketch after these lists):

  • Client names and account numbers
  • Risk tolerance preferences
  • Investment goals and constraints

Episodic Memory captures past interactions:

  • Previous advisory conversations
  • Successful problem resolutions
  • Transaction history context

Procedural Memory encodes behavioral adaptations:

  • Preferred communication style (formal vs. casual)
  • Report formatting preferences
  • Escalation thresholds
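
To make the three types concrete, here is a minimal sketch of typed memory records (the MemoryType enum and field names are illustrative, not from a specific library):

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class MemoryType(Enum):
    SEMANTIC = "semantic"      # durable facts: "risk tolerance: conservative"
    EPISODIC = "episodic"      # events: "2024-03-01 discussed bond ladder"
    PROCEDURAL = "procedural"  # behaviors: "prefers bullet-point summaries"

@dataclass
class TypedMemory:
    memory_type: MemoryType
    content: str
    user_id: str
    created_at: datetime = field(default_factory=datetime.utcnow)

# One example of each type for the same treasury client:
memories = [
    TypedMemory(MemoryType.SEMANTIC, "Client risk tolerance: conservative", "u-42"),
    TypedMemory(MemoryType.EPISODIC, "Resolved a duplicate wire issue on the last call", "u-42"),
    TypedMemory(MemoryType.PROCEDURAL, "Wants reports as one-page bullet summaries", "u-42"),
]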

Memory-Enabled Treasury Assistant

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional

from sqlalchemy import (Boolean, Column, DateTime, Float, Integer,
                        JSON, String, Text)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MemoryEntry(Base):
    """Database model for agent memory"""
    __tablename__ = 'treasury_memory'

    id = Column(String(36), primary_key=True)

    # Core memory data
    topic = Column(String(100), nullable=False, index=True)
    fact_text = Column(Text, nullable=False)
    source = Column(String(50), nullable=False)  # "cfo", "policy", "interaction"

    # Temporal metadata
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, onupdate=datetime.utcnow)
    ttl_days = Column(Integer, nullable=True)  # Time-to-live

    # Prioritization
    weight = Column(Float, default=1.0)
    pinned = Column(Boolean, default=False)
    access_count = Column(Integer, default=1)

    # Organization
    tags = Column(JSON, default=list)

    def is_expired(self) -> bool:
        """Check if the memory has expired based on its TTL"""
        if self.ttl_days is None:
            return False
        expiry = self.created_at + timedelta(days=self.ttl_days)
        return datetime.utcnow() > expiry


@dataclass
class MemoryPolicy:
    """Configuration for memory management"""

    # TTL by topic (days) - None means permanent
    ttl_policies: Dict[str, Optional[int]] = field(default_factory=lambda: {
        'cash_policy': None,           # Permanent
        'investment_rules': None,      # Permanent
        'counterparty_limits': None,   # Permanent
        'market_opportunities': 30,    # Expire after 30 days
        'temporary_approvals': 90,     # Expire after 90 days
    })

    # Weighting formula
    recency_boost: float = 0.1
    frequency_boost: float = 0.05
    pinned_boost: float = 2.0

    max_digest_size: int = 12

Intelligent Memory Manager

The memory manager uses an LLM to canonicalize facts for consistent storage:

import uuid

class TreasuryMemoryManager:
    """Intelligent memory manager with LLM integration"""

    def __init__(self, policy: MemoryPolicy = None):
        self.policy = policy or MemoryPolicy()
        self.session = SessionLocal()  # SessionLocal: a standard SQLAlchemy session factory
        self.llm = OpenAI()            # needed by canonicalize_fact below

    def canonicalize_fact(self, fact_text: str, topic: str) -> str:
        """Normalize fact text for consistent storage"""

        prompt = f"""
        Canonicalize this {topic} policy into a clear statement:

        Original: {fact_text}

        Rules:
        - Use standard financial terminology
        - Format amounts as $X,XXX,XXX or $XM
        - Use standard percentage format
        - Keep compliance language precise
        - Maintain core meaning

        Canonicalized fact:
        """

        response = self.llm.complete(prompt, temperature=0.1)
        return response.strip()

    def upsert_memory(self, topic: str, fact: str,
                      source: str = "interaction") -> MemoryEntry:
        """Add or update a memory entry"""

        # Canonicalize for consistency
        canonical_fact = self.canonicalize_fact(fact, topic)

        # Check for an existing similar memory
        existing = self.find_similar(canonical_fact, topic)

        if existing:
            # Update the existing memory
            existing.fact_text = canonical_fact
            existing.access_count += 1
            existing.updated_at = datetime.utcnow()
            return existing

        # Create a new memory
        entry = MemoryEntry(
            id=str(uuid.uuid4()),
            topic=topic,
            fact_text=canonical_fact,
            source=source,
            ttl_days=self.policy.ttl_policies.get(topic)
        )
        self.session.add(entry)
        self.session.commit()
        return entry

    def get_policy_digest(self, max_items: int = None) -> List[str]:
        """Get a prioritized list of active policies"""

        max_items = max_items or self.policy.max_digest_size

        # Query non-expired memories
        memories = self.session.query(MemoryEntry).all()
        active = [m for m in memories if not m.is_expired()]

        # Score and sort by priority
        scored = []
        for m in active:
            score = self.calculate_priority(m)
            scored.append((score, m))

        scored.sort(reverse=True, key=lambda x: x[0])

        # Return top items with a pinned indicator
        result = []
        for _, m in scored[:max_items]:
            prefix = "📌 " if m.pinned else ""
            result.append(f"{prefix}{m.fact_text}")

        return result

    def calculate_priority(self, memory: MemoryEntry) -> float:
        """Calculate a dynamic priority score"""

        base = memory.weight

        # Recency boost
        days_old = (datetime.utcnow() - memory.created_at).days
        recency = self.policy.recency_boost * max(0, 30 - days_old)

        # Frequency boost
        frequency = self.policy.frequency_boost * memory.access_count

        # Pinned boost
        pinned = self.policy.pinned_boost if memory.pinned else 0

        return base + recency + frequency + pinned
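
find_similar is not defined above. A minimal sketch, assuming exact-topic filtering plus an LLM same-fact check; a production version would more likely use embedding similarity:

    def find_similar(self, canonical_fact: str, topic: str) -> Optional[MemoryEntry]:
        """Return an existing entry that states the same fact, if any (illustrative)"""
        candidates = (self.session.query(MemoryEntry)
                      .filter(MemoryEntry.topic == topic)
                      .all())
        for entry in candidates:
            verdict = self.llm.complete(
                f"Do these two statements express the same policy? "
                f"Answer YES or NO.\nA: {entry.fact_text}\nB: {canonical_fact}"
            )
            if "YES" in verdict.upper():
                return entry
        return None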

Memory Scoping

Memories can be scoped at different levels:

class ScopedMemoryManager:
    """Memory manager with access scoping"""

    def retrieve_memories(self, user_id: str, team_id: str = None,
                          include_global: bool = True) -> List[MemoryEntry]:
        """Retrieve memories at the appropriate scope levels"""

        memories = []

        # User-specific memories (highest priority)
        memories.extend(
            self.get_by_scope("user", user_id)
        )

        # Team-shared memories
        if team_id:
            memories.extend(
                self.get_by_scope("team", team_id)
            )

        # Global organizational memories
        if include_global:
            memories.extend(
                self.get_by_scope("global", "org")
            )

        return self.deduplicate(memories)
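
get_by_scope and deduplicate are left undefined above; a minimal sketch, assuming the memory model gains scope and scope_id columns (not present in MemoryEntry as shown earlier):

    def get_by_scope(self, scope: str, scope_id: str) -> List[MemoryEntry]:
        """Fetch entries for one scope level (assumes scope/scope_id columns)"""
        return (self.session.query(MemoryEntry)
                .filter(MemoryEntry.scope == scope,
                        MemoryEntry.scope_id == scope_id)
                .all())

    def deduplicate(self, memories: List[MemoryEntry]) -> List[MemoryEntry]:
        """Keep the first occurrence of each fact; earlier (narrower) scopes win"""
        seen, unique = set(), []
        for m in memories:
            if m.fact_text not in seen:
                seen.add(m.fact_text)
                unique.append(m)
        return unique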

Agent Evaluation

Building agents is only half the challenge - ensuring they work correctly, safely, and efficiently requires systematic evaluation. This is especially critical in financial services where errors have regulatory and monetary consequences.

Four Dimensions of Evaluation

flowchart TD
    E[Agent Evaluation] --> TC[Task Completion]
    E --> QC[Quality Control]
    E --> TI[Tool Interaction]
    E --> SM[System Metrics]

    TC --> TC1[Goal achieved?]
    TC --> TC2[Steps required?]
    TC --> TC3[Human intervention?]

    QC --> QC1[Correct format?]
    QC --> QC2[Instructions followed?]
    QC --> QC3[Context used?]

    TI --> TI1[Right tools?]
    TI --> TI2[Valid arguments?]
    TI --> TI3[Appropriate sequence?]

    SM --> SM1[Latency]
    SM --> SM2[Token usage]
    SM --> SM3[Error rate]

    classDef blueClass fill:#4A90E2,stroke:#333,stroke-width:2px,color:#fff
    classDef orangeClass fill:#F39C12,stroke:#333,stroke-width:2px,color:#fff
    classDef greenClass fill:#27AE60,stroke:#333,stroke-width:2px,color:#fff
    classDef pinkClass fill:#E74C3C,stroke:#333,stroke-width:2px,color:#fff

    class TC blueClass
    class QC orangeClass
    class TI greenClass
    class SM pinkClass
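
One way to carry these four dimensions through a test run is a simple per-run scorecard; this sketch is illustrative, not from any particular framework:

from dataclasses import dataclass

@dataclass
class AgentScorecard:
    # Task completion
    goal_achieved: bool
    steps_taken: int
    needed_human: bool
    # Quality control
    format_valid: bool
    followed_instructions: bool
    # Tool interaction
    correct_tool_rate: float    # fraction of calls using the expected tool
    valid_argument_rate: float  # fraction of calls with valid arguments
    # System metrics
    latency_ms: float
    total_tokens: int
    error_count: int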

Three Evaluation Strategies

1. Final Response Evaluation
Judges only the end result - ideal for end-to-end testing but provides no insight into how the result was produced.

from typing import Dict, List

def evaluate_final_response(agent_answer: str,
                            correct_answer: str,
                            question: str) -> Dict:
    """Black-box evaluation of the final output"""

    prompt = f"""
    Evaluate the agent's answer against the correct answer.

    Question: {question}
    Correct Answer: {correct_answer}
    Agent Answer: {agent_answer}

    Score accuracy 0-100 based on:
    - 90-100: Fully correct, all key facts present
    - 70-89: Mostly correct, minor omissions
    - 50-69: Partially correct, significant gaps
    - Below 50: Incorrect or misleading

    Return JSON: {{"score": <number>, "reasoning": "<explanation>"}}
    """

    # `llm` is a module-level client with the same .complete() helper used earlier
    return llm.complete(prompt, response_format="json")

2. Single-Step Evaluation
Focuses on individual decisions - fast and ideal for debugging specific issues.

def evaluate_tool_choice(context: str, available_tools: List[str],
                         chosen_tool: str, expected_tool: str) -> Dict:
    """Evaluate a single tool-selection decision"""

    return {
        "correct": chosen_tool == expected_tool,
        "chosen": chosen_tool,
        "expected": expected_tool,
        "context_summary": context[:200]
    }

3. Trajectory Evaluation
Reviews the full execution path - richest insight but requires more setup.

def evaluate_trajectory(execution_trace: List[Dict],
                        expected_trajectory: List[Dict]) -> Dict:
    """Evaluate the complete agent execution path"""

    results = {
        "tool_accuracy": 0,
        "argument_accuracy": 0,
        "sequence_score": 0,   # placeholder: order-sensitive matching not shown here
        "efficiency_score": 0
    }

    # Compare tool calls step by step; arguments only count when the tool matches
    for actual, expected in zip(execution_trace, expected_trajectory):
        if actual['tool'] == expected['tool']:
            results['tool_accuracy'] += 1
            if actual.get('args') == expected.get('args'):
                results['argument_accuracy'] += 1

    # Normalize scores
    total = len(expected_trajectory)
    results['tool_accuracy'] /= total
    results['argument_accuracy'] /= total

    # Efficiency: fewer steps than expected is better (guard against empty traces)
    if execution_trace:
        results['efficiency_score'] = min(1.0, total / len(execution_trace))

    return results
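
The trace format implied above is a list of {'tool': ..., 'args': ...} dicts; a hypothetical run (tool names are illustrative):

expected = [
    {"tool": "lookup_counterparty", "args": {"name": "ACME Corp"}},
    {"tool": "get_exposure_limit", "args": {"counterparty_id": "CP-104"}},
]
actual = [
    {"tool": "lookup_counterparty", "args": {"name": "ACME Corp"}},
    {"tool": "get_exposure_limit", "args": {"counterparty_id": "CP-104"}},
    {"tool": "get_exposure_limit", "args": {"counterparty_id": "CP-104"}},  # redundant retry
]

print(evaluate_trajectory(actual, expected))
# tool_accuracy 1.0, argument_accuracy 1.0, efficiency_score ~0.67 (one wasted step)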

Financial Agent Evaluation Framework

For insurance claims processing, we evaluate three weighted metrics:

import re

class InsuranceAgentEvaluator:
    """Evaluation framework for an insurance claims assistant"""

    def evaluate_complete_response(self, agent_response: Dict,
                                   gold_item: Dict) -> Dict:
        """Comprehensive evaluation with weighted scoring"""

        # Metric 1: Factual Accuracy (40% weight)
        accuracy = self.evaluate_factual_accuracy(
            agent_response['answer'],
            gold_item['correct_answer'],
            gold_item['question']
        )

        # Metric 2: Citation Compliance (30% weight)
        citation = self.evaluate_citation_compliance(
            agent_response['answer'],
            gold_item['requires_citation'],
            agent_response.get('sources', [])
        )

        # Metric 3: Retrieval Relevance (30% weight)
        retrieval = self.evaluate_retrieval_relevance(
            gold_item['question'],
            agent_response['retrieved_docs'],
            gold_item['expected_doc_ids']
        )

        # Weighted composite score
        composite = (
            accuracy['score'] * 0.40 +
            citation['score'] * 0.30 +
            retrieval['score'] * 0.30
        )

        return {
            'composite_score': composite,
            'factual_accuracy': accuracy,
            'citation_compliance': citation,
            'retrieval_relevance': retrieval
        }

    def evaluate_citation_compliance(self, answer: str,
                                     should_cite: bool,
                                     sources: List[str]) -> Dict:
        """Check for proper source attribution"""

        citation_patterns = [
            r'\[.*?\]',        # [Source Name]
            r'according to',   # According to policy...
            r'source:',        # Source: Document
            r'per the',        # Per the guidelines
            r'as stated in'    # As stated in...
        ]

        citations_found = []
        for pattern in citation_patterns:
            matches = re.findall(pattern, answer, re.IGNORECASE)
            citations_found.extend(matches)

        has_citations = len(citations_found) > 0

        if should_cite:
            score = 100 if has_citations else 0
        else:
            score = 100  # No citation required

        return {
            'score': score,
            'citations_found': citations_found,
            'sources_used': sources
        }

    def evaluate_retrieval_relevance(self, question: str,
                                     retrieved_docs: List,
                                     expected_doc_ids: List[str]) -> Dict:
        """Measure retrieval quality with precision/recall"""

        retrieved_ids = [doc.metadata.get('doc_id') for doc in retrieved_docs]

        expected_set = set(expected_doc_ids)
        retrieved_set = set(retrieved_ids)

        if retrieved_set:
            precision = len(expected_set & retrieved_set) / len(retrieved_set)
        else:
            precision = 0

        if expected_set:
            recall = len(expected_set & retrieved_set) / len(expected_set)
        else:
            recall = 1.0

        # F1 score: harmonic mean of precision and recall
        if precision + recall > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0

        return {
            'score': f1 * 100,
            'precision': precision,
            'recall': recall,
            'retrieved_count': len(retrieved_ids),
            'expected_count': len(expected_doc_ids)
        }
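
evaluate_factual_accuracy is called above but not shown; a minimal LLM-as-judge sketch in the style of evaluate_final_response (the module-level llm client and its JSON response mode are the same assumptions as before):

    def evaluate_factual_accuracy(self, answer: str, correct_answer: str,
                                  question: str) -> Dict:
        """LLM-as-judge accuracy scoring, 0-100 (illustrative)"""
        prompt = f"""
        Question: {question}
        Correct Answer: {correct_answer}
        Agent Answer: {answer}

        Score factual accuracy 0-100 using the rubric from evaluate_final_response.
        Return JSON: {{"score": <number>, "reasoning": "<explanation>"}}
        """
        return llm.complete(prompt, response_format="json")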

Running Evaluation Suites

import pandas as pd

def run_evaluation_suite(agent, golden_dataset: List[Dict]) -> Dict:
    """Execute the full evaluation across a test dataset"""

    evaluator = InsuranceAgentEvaluator()
    results = []

    for item in golden_dataset:
        # Get the agent's response
        response = agent.answer_question(item['question'])

        # Evaluate it against the gold item
        eval_result = evaluator.evaluate_complete_response(response, item)
        eval_result['question_id'] = item['id']
        eval_result['category'] = item['category']
        results.append(eval_result)

    # Aggregate metrics
    df = pd.DataFrame(results)

    return {
        'avg_composite': df['composite_score'].mean(),
        'avg_accuracy': df['factual_accuracy'].apply(lambda x: x['score']).mean(),
        'avg_citation': df['citation_compliance'].apply(lambda x: x['score']).mean(),
        'avg_retrieval': df['retrieval_relevance'].apply(lambda x: x['score']).mean(),
        'results_by_category': df.groupby('category')['composite_score'].mean().to_dict(),
        'detailed_results': results
    }
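
The suite implies a golden-dataset schema; a hypothetical entry using the field names the evaluator reads (the claim content itself is invented for illustration):

golden_dataset = [
    {
        "id": "claims-001",
        "category": "auto_claims",
        "question": "What is the deductible for a windshield-only claim?",
        "correct_answer": "Glass-only claims carry a $0 deductible under the comprehensive rider.",
        "requires_citation": True,
        "expected_doc_ids": ["policy-auto-glass-2024"],
    },
    # ... more items per category ...
]

metrics = run_evaluation_suite(agent, golden_dataset)
print(f"Composite: {metrics['avg_composite']:.1f}")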

Generating Improvement Recommendations

Evaluation drives improvement by identifying specific weaknesses:

def generate_recommendations(metrics: Dict, results_df) -> List[str]:
    """Generate actionable recommendations from evaluation results"""

    recommendations = []

    # Low retrieval relevance
    if metrics['avg_retrieval'] < 70:
        recommendations.append(
            "RETRIEVAL: Improve document chunking strategy. Current chunks "
            "may be too large or missing key metadata. Consider semantic "
            "chunking based on policy sections."
        )

    # Low citation compliance
    if metrics['avg_citation'] < 80:
        recommendations.append(
            "CITATIONS: Add explicit citation instructions to system prompt. "
            "Require agent to reference specific policy documents for all "
            "regulatory guidance."
        )

    # Low accuracy on specific categories
    for category, score in metrics['results_by_category'].items():
        if score < 75:
            recommendations.append(
                f"CATEGORY-{category.upper()}: Performance is below threshold. "
                f"Consider adding more {category} examples to training data "
                f"or improving {category}-specific retrieval."
            )

    return recommendations

Takeaways

  1. Agentic RAG transforms retrieval from passive lookup to active reasoning with retrieve-reason-retry loops

  2. Self-evaluation enables quality control - agents assess their own answers and retry when confidence is low

  3. Long-term memory creates personalization through three types: semantic (facts), episodic (interactions), and procedural (behaviors)

  4. Memory management requires policies - TTL settings, priority scoring, and scope levels prevent bloat and maintain relevance

  5. Evaluation spans four dimensions: task completion, quality control, tool interaction, and system metrics

  6. Three evaluation strategies serve different needs: final response for end-to-end, single-step for debugging, trajectory for comprehensive analysis

  7. Weighted composite scores provide holistic assessment while highlighting specific areas for improvement


This is the eleventh post in my Applied Agentic AI for Finance series. Next: Multi-Agent Architecture for Trading, where we'll explore coordinating multiple specialized agents for complex financial operations.
