February 18, 2025

Building a smarter retrieval system: Lessons from Vanta AI

Written by Walton Seymour


At Vanta, we power a suite of AI products that enable thousands of customers worldwide to make critical business decisions. These products rely on the ability to quickly search through millions of customer documents to surface relevant information and drive accurate outcomes.

Building a retrieval system capable of handling this scale and complexity was no small feat. Along the way, we learned valuable lessons that we’re excited to share. Here’s how we designed and optimized Vanta’s AI retrieval system.

#1 Start with a robust evaluation system

The foundation of any successful project is defining what success looks like. As the saying goes, “You can’t improve what you don’t measure.” For us, this meant creating a comprehensive evaluation dataset that included:

  • A wide variety of document types
  • Sample queries tailored to those documents
  • Relevancy annotations provided by internal subject matter experts (SMEs)

To ensure our system could handle the nuances of compliance-related queries, we included specific terms and concepts such as “encryption at rest” vs. “encryption in transit,” “MFA (multi-factor authentication)” vs. “2FA (two-factor authentication),” and acronyms like “BYOD (bring your own device) policies.”

We approached this process with a test-driven development (TDD) mindset, where we defined success metrics and evaluation criteria upfront. This allowed us to iteratively test and refine our retrieval system against clear benchmarks. We measured retrieval quality using both isolated metrics—precision@k, recall@k, Mean Reciprocal Rank (MRR), and F1—and end-to-end performance. 

This dual approach allowed us to evaluate not only the accuracy of retrieved results but also their interpretability and overall impact on the system. By incorporating varying levels of difficulty in our dataset, we identified areas where the system excelled and where it needed improvement, significantly reducing the effort required to address gaps.
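To make those metrics concrete, here is a minimal sketch of per-query precision@k, recall@k, and reciprocal rank, assuming retrieval returns a ranked list of chunk IDs and the SME annotations give the set of relevant IDs. The function names and data shapes are illustrative, not our internal API:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: one annotated query from the evaluation set
retrieved = ["chunk_12", "chunk_7", "chunk_3", "chunk_44"]
relevant = {"chunk_7", "chunk_44"}
print(precision_at_k(retrieved, relevant, k=3))  # 0.33 (1 of top 3 is relevant)
print(recall_at_k(retrieved, relevant, k=3))     # 0.5  (1 of 2 relevant found)
print(reciprocal_rank(retrieved, relevant))      # 0.5  (first hit at rank 2)
```

MRR is then the mean of reciprocal_rank across all queries in the evaluation set, and F1 is the harmonic mean of precision and recall at a chosen k.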

#2 Optimize your chunking strategies

One of the most critical decisions in building a retrieval system is how to segment documents into smaller, manageable chunks of text. Naive approaches, such as fixed-size or page-based chunking, often break apart related concepts, leading to reduced retrieval quality—especially for compliance-related queries. For example, long tables spanning multiple pages lose context if not properly chunked.

We experimented with several strategies, including:

Types of chunking strategies

| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| Fixed-size chunking | Splits content into chunks of a predefined size (e.g., 512 tokens) | Simple, fast, and ensures uniform chunk sizes | May split sentences or ideas mid-context |
| Recursive chunking | Splits content recursively using natural breaks (e.g., paragraphs, sentences) | Preserves paragraphs and sentences, balancing structure and size | Complexity increases with nested splits |
| Section-based chunking | Splits content based on document sections (e.g., headings, chapters) | Maintains document hierarchy; ideal for structured content | Requires well-defined sections |
| Hybrid chunking | Combines multiple strategies (e.g., section-based first, then fixed or recursive) | Flexible; adapts to document complexity | Requires careful tuning |
| Context-aware chunking | Uses NLP/ML to dynamically split content based on layout and semantics | Maximizes contextual relevance; avoids arbitrary splits | Computationally expensive |
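To make the recursive strategy above concrete, here is a minimal sketch that splits on progressively finer natural breaks (paragraphs, then lines, then sentences, then words) until each chunk fits a token budget. The separator order and the whitespace-based token count are simplifying assumptions, not our production implementation:

```python
# Recursive chunking: split on coarse separators first, falling back to
# finer ones only when a piece still exceeds the size budget.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_chunk(text: str, max_tokens: int = 512, depth: int = 0) -> list[str]:
    # Base case: the text fits, or we have run out of separators to try.
    if len(text.split()) <= max_tokens or depth >= len(SEPARATORS):
        return [text.strip()] if text.strip() else []

    chunks: list[str] = []
    buffer = ""
    for piece in text.split(SEPARATORS[depth]):
        candidate = (buffer + SEPARATORS[depth] + piece) if buffer else piece
        if len(candidate.split()) <= max_tokens:
            buffer = candidate  # keep packing pieces into the current chunk
        else:
            # Flush the buffer, splitting it further if it is still too big.
            chunks.extend(recursive_chunk(buffer, max_tokens, depth + 1))
            buffer = piece
    chunks.extend(recursive_chunk(buffer, max_tokens, depth + 1))
    return chunks
```

Section-based and context-aware variants follow the same shape, but choose split points from document structure (headers, table boundaries) or a layout/semantics model rather than from a fixed separator list.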

Through our evaluation framework, we discovered that preserving hierarchical elements—such as headers, subheaders, and structured data like tables—significantly improved retrieval performance. For compliance documents, section headers provide crucial metadata that helps disambiguate similar concepts appearing in different contexts.

By adopting context-aware chunking, we maintained continuity and minimized the risk of splitting key information across chunks. This change alone resulted in a 16% improvement in recall and a 15% improvement in MRR.

#3 Don’t overlook lexical search

While embedding-based semantic search is powerful, we found that combining it with traditional lexical search (BM25) significantly enhanced retrieval performance, particularly for compliance-specific terminology and acronyms.

Lexical search excels at exact matches and specialized terminology that may not be well-represented in general-purpose embedding models. For example, compliance framework references like “SOC2 CC6.1” or “ISO 27001 A.9.4” are better handled by lexical search.

We implemented a hybrid search approach that combines semantic and lexical search, using reciprocal rank fusion to merge results. This allowed us to capture both specific terms and broader conceptual queries effectively. After introducing hybrid search, we saw a 9% improvement in recall, while MRR remained stable.
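Reciprocal rank fusion is straightforward to implement: each document earns a score of 1/(k + rank) from every ranked list it appears in, and results are merged by total score. A minimal sketch follows; the constant k = 60 is the value commonly cited in the RRF literature, not necessarily what we run in production:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each list contributes 1 / (k + rank) per item; higher fused scores
    rank first. k dampens the dominance of top positions in any one list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse semantic (embedding) and lexical (BM25) result lists
semantic = ["c3", "c1", "c7", "c9"]
lexical = ["c1", "c4", "c3"]
print(reciprocal_rank_fusion([semantic, lexical]))  # c1 and c3 rise to the top
```

Because RRF operates on ranks rather than raw scores, it merges BM25 and embedding results without any score normalization between the two systems.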

#4 Enhance results with reranking

To further improve retrieval quality, we introduced a reranking step. Using Cohere’s v3.5 multilingual rerank model, which leverages cross-attention to better capture the relationship between queries and documents, we addressed cases where initial retrieval returned semantically similar but contextually irrelevant results.

For example, reranking helped distinguish between employee security training requirements across different compliance frameworks. While reranking adds some latency, running it on a smaller set of initial results (top-k=50) struck a balance between quality and performance. This step delivered a 6% improvement in recall and a 21% improvement in MRR.
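The reranking call itself is a thin layer over the initial retrieval. Below is a minimal sketch using Cohere's Python SDK; the model name follows Cohere's public rerank API, while the surrounding function and variables are illustrative rather than our exact pipeline:

```python
import cohere

co = cohere.ClientV2()  # assumes the CO_API_KEY environment variable is set

def rerank_candidates(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    """Rescore hybrid-search candidates with a cross-attention reranker."""
    response = co.rerank(
        model="rerank-v3.5",   # Cohere's multilingual rerank model
        query=query,
        documents=candidates,  # e.g., the top-k=50 chunks from hybrid search
        top_n=top_n,
    )
    # Each result carries the index of the original document; results are
    # returned in descending order of relevance score.
    return [candidates[r.index] for r in response.results]
```

Keeping the candidate set small (50 chunks rather than the full corpus) is what keeps the added latency of cross-attention scoring acceptable.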

Conclusion

Building an effective retrieval system requires addressing domain-specific challenges through careful evaluation and iteration. By optimizing our chunking strategy, combining lexical and semantic search, and introducing reranking, we improved our system’s recall by 40% and MRR by 22%. These enhancements are now live in production, powering thousands of customer decisions daily across our AI products.

As we look to the future, we’re excited to explore new ways to push the boundaries of retrieval technology. From advancements in multimodal and personalized search to leveraging the latest innovations in AI, we’re committed to evolving our system to meet the ever-changing needs of our customers.
