Blog / Engineering
April 17, 2025

How we navigated database limits with a growing product

Written by
Lavanya Kannan


In 2024, one of Vanta’s engineering goals was to improve quality while maintaining our rapid pace of product development. Around the same time, we discovered we were months away from reaching our MongoDB Atlas database storage limit. Once that threshold was reached, we would no longer be able to write new data, and the Vanta product would have been heavily degraded. This was a clear signal that we needed to invest more in our infrastructure and storage solution.

We began by investigating short-term patches but quickly realized this was a foundational problem. To address it effectively, we rethought our overall data strategy and designed a long-term plan. By the end of the effort, we had reduced storage by 30% and created a strategic plan to scale even further with minimal disruption to the product and future roadmap.

The problem: approaching storage capacity

Vanta uses MongoDB Atlas as our primary monolithic database that powers the majority of our product workflows. In early 2024, we were on the second largest cluster tier and already consuming 80% of its available disk space. The largest tier only offers 4TB, and our projections showed we would exceed that capacity in a few more months. We had to act fast before the database became completely full.
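
Back-of-the-envelope projections like this drove the urgency. A minimal sketch of the runway math (all numbers here are illustrative, not Vanta’s actual figures):

```python
def months_of_runway(current_tb: float, limit_tb: float,
                     monthly_growth_tb: float) -> float:
    """Months until the cluster hits its storage limit, assuming linear growth."""
    if monthly_growth_tb <= 0:
        return float("inf")
    return max(0.0, (limit_tb - current_tb) / monthly_growth_tb)

# Illustrative: a cluster 80% full against a 4 TB ceiling,
# growing a quarter terabyte per month.
print(months_of_runway(current_tb=3.2, limit_tb=4.0, monthly_growth_tb=0.25))
```

In practice growth was closer to exponential than linear, which only shortens the runway — hence the need to act fast.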

We started by digging into the problem and assessing whether all the stored data was actually necessary and relevant. Unsurprisingly, the largest portion of our storage was tied to Vanta’s continuous control monitoring (CCM), a core part of our offering. This entails scanning customers’ resources via their connected integrations, running compliance and security checks on them, and tracking these resources for the duration of a typical audit period of one to two years. With 300+ integrations added over the previous year, the related data consumed two-thirds of the available disk space, and we expected it to grow exponentially.

Clean-ups and tactical improvements

To quickly address the bottleneck, we kicked off some quick wins: cleaning up unused indexes, deleting deprecated data models, and manually “compacting” collections with recently deleted data. This saved us about 100 GB of disk space and gave us a little breathing room.
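
For the index cleanup, MongoDB’s `$indexStats` aggregation stage reports per-index usage counters. A sketch of the filtering step, run here against handmade stats documents rather than a live `db.collection.aggregate([{"$indexStats": {}}])` call:

```python
def find_unused_indexes(index_stats: list[dict]) -> list[str]:
    """Return names of non-_id indexes with zero recorded operations.

    `index_stats` mirrors documents produced by MongoDB's $indexStats
    aggregation stage: each has a `name` and an `accesses.ops` counter.
    """
    return [
        stat["name"]
        for stat in index_stats
        if stat["name"] != "_id_" and stat["accesses"]["ops"] == 0
    ]

stats = [
    {"name": "_id_", "accesses": {"ops": 0}},           # always keep the _id index
    {"name": "orgId_1_createdAt_-1", "accesses": {"ops": 15023}},
    {"name": "legacyField_1", "accesses": {"ops": 0}},  # candidate for removal
]
print(find_unused_indexes(stats))  # ['legacyField_1']
```

One caveat worth noting: `$indexStats` counters reset on node restart, so "zero ops" should be confirmed over a long enough observation window before dropping anything.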


Next, we analyzed the largest and fastest growing collections, which led us to one of our cloud asset collections containing AWS EC2 instances. This was 8x the size of the next largest collection! 

On closer inspection, we made some interesting observations:

  • Over 99% of the EC2 instances had been deleted in customer environments, but we needed to retain them for audits.
  • Most of them were ephemeral. About 87% of them were active for under three hours, which explains the massive data footprint here. 
  • At least half of the EC2 instances belonged to AWS Auto Scaling groups (ASG), which frequently spin up and down identical instances. This means that the majority of managed instances are functionally duplicates! 

We coordinated with our security and audit teams to validate a deduplicating strategy that would still meet compliance and monitoring requirements. After aligning, we implemented a new approach that grouped these instances by their ASGs by default.
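
A minimal sketch of that grouping, with illustrative field names (`asg_name` and `ami` are assumptions, not Vanta’s actual schema): instances sharing an Auto Scaling group collapse into one representative record plus a member count.

```python
from collections import defaultdict

def group_by_asg(instances: list[dict]) -> list[dict]:
    """Collapse EC2 instance records that share an Auto Scaling group.

    Standalone instances are kept as-is; ASG members are replaced by a
    single representative record carrying the member count.
    """
    standalone, grouped = [], defaultdict(list)
    for inst in instances:
        if inst.get("asg_name"):
            grouped[inst["asg_name"]].append(inst)
        else:
            standalone.append(inst)
    representatives = [
        {"asg_name": name, "ami": members[0]["ami"], "instance_count": len(members)}
        for name, members in grouped.items()
    ]
    return standalone + representatives

instances = [
    {"instance_id": "i-1", "asg_name": "web-asg", "ami": "ami-abc"},
    {"instance_id": "i-2", "asg_name": "web-asg", "ami": "ami-abc"},
    {"instance_id": "i-3", "asg_name": None, "ami": "ami-xyz"},
]
# Three stored records shrink to two: one standalone, one ASG representative.
print(group_by_asg(instances))
```

Since ASG members are launched from the same template, the representative record preserves everything a compliance check needs while the count preserves the audit trail of how many instances existed.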

The result: The growth rate of EC2 instances was halved and the growth rate of inspector vulnerabilities, a dependent type, dropped another 40%. This was a huge win in slowing data footprint, and our projected runway increased by a few more months.

Planning for long-term scale

Even after EC2 deduplication, the database continued to grow quickly and unpredictably. We decided to evaluate broader architectural options, starting by outlining estimated data storage needs over the next 12 months—analyzing growth rates, relationships between data categories, and overall access patterns. An interesting pattern emerged: most of our data separated cleanly by access frequency.

Next, we considered some well-known approaches to scaling a database:

1. Vertical scaling

We could simply increase storage capacity. However, we were already on M400, the largest available NVMe cluster tier. Scaling beyond it would require self-hosting, which would be a major investment.

2. Vertical partitioning

Moving some collections to another database is another option. However, this would limit our ability to run join queries or transactional writes across the split collections, which would impose a future product limitation.

3. Horizontal partitioning

Splitting collections by access frequency or another dimension could relieve some pressure on the primary database. However, similar to vertical partitioning, this would limit querying capabilities on the split data.

4. Sharding

MongoDB offers native sharding for either the entire database or specific collections, but this also comes with operational complexity and querying constraints across shards.

5. Adopting a new database

Switching to another database or data model entirely is a possible option, but this would involve high upfront cost and would introduce significant migration and data quality risk. There are some possible mitigations for this, like moving a smaller category of data, but the overall effort would still be a major engineering investment with unclear benefit.

Decision making framework

To make more progress, we had to align on a decision making framework. Our product roadmap and upcoming big bets were still evolving, so we wanted to avoid one-way doors and any major investments based solely on product assumptions. The goal was to work in small and incremental steps while maintaining flexibility for as long as possible. 

Based on these criteria, we ruled out vertical scaling due to the engineering complexity. We deferred vertical partitioning and adopting a new database to maintain flexibility. Finally, we postponed sharding in favor of optimizing the existing database first. This process pointed us to horizontally partitioning our data by access frequency.

Archiving low activity data

Large amounts of historical data that needed to be retained for compliance were rarely accessed by product workflows. We investigated a few options for archiving this data.

Build vs. buy

The MongoDB Atlas Online Archive offering was a potential option, but we found this too limiting for our purposes: 

  • Archived data was slow and expensive to query. The read layer was not built to support live product needs.
  • Data could not be selectively deleted; entire collection archives had to be removed together.

Ultimately, we decided to build a solution ourselves.

Custom archival system

We considered a few options for the archival store:

  • Amazon S3: Has virtually unlimited scale but would support limited query patterns
  • Snowflake: High scalability but also high cost
  • Amazon Aurora: Scales up to 128TB but has high migration complexity and requires creating new data models

We settled on S3 as the most flexible option with the least risk. S3 buckets have virtually unlimited storage and can be easily ported to other stores or data lakes when needed. 

The key pieces of the system include: 

  • An archival service to migrate documents that were inactive past some time window. We aligned with our product team on data retention windows that would minimize customer impact.
  • A scheduling piece that periodically enqueues archival tasks per collection.
  • An S3 bucket with intelligent tiering, storing compressed JSON files per document.
  • A read layer for on-demand access, with S3 paths designed to support our limited query patterns.
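
A simplified sketch of the archival path (the retention window, field names, and key layout here are illustrative assumptions, not Vanta’s actual implementation): eligibility is based on inactivity, and the S3 key encodes the lookups the read layer supports — by collection, organization, and document id.

```python
import gzip
import json
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=365)  # illustrative retention window

def is_archivable(doc: dict, now: datetime) -> bool:
    """A document is eligible once it has been inactive past the window."""
    return now - doc["last_active_at"] > ARCHIVE_AFTER

def s3_key(collection: str, doc: dict) -> str:
    """The key layout encodes the limited query patterns we need:
    lookup by collection, organization, and document id."""
    return f"{collection}/{doc['org_id']}/{doc['_id']}.json.gz"

def serialize(doc: dict) -> bytes:
    """Compressed JSON payload, as stored in the bucket."""
    return gzip.compress(json.dumps(doc, default=str).encode())

now = datetime(2025, 4, 1, tzinfo=timezone.utc)
doc = {"_id": "abc123", "org_id": "org-42",
       "last_active_at": now - timedelta(days=400)}
print(is_archivable(doc, now))          # True
print(s3_key("ec2_instances", doc))     # ec2_instances/org-42/abc123.json.gz
```

Keeping one object per document is what makes selective deletion possible — the exact limitation that ruled out Online Archive.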

We also built many guardrails: per-collection configurations, partial rollouts, and toggles for safely testing the migration and deletion pieces in isolation. Upon initial rollout, we temporarily scaled up the number of worker containers to quickly migrate existing historical data. After the backfill was complete, we tuned them based on continuous archival needs.
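
Those guardrails can be modeled as a per-collection configuration with independent toggles and a stable partial-rollout bucket. A sketch with hypothetical field names, not Vanta’s actual config:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchivalConfig:
    """Per-collection knobs for a safe, incremental rollout."""
    collection: str
    retention_days: int
    migration_enabled: bool = False   # copy documents to S3
    deletion_enabled: bool = False    # delete from MongoDB after copy
    rollout_percent: int = 0          # fraction of orgs in scope (0-100)

def should_archive(cfg: ArchivalConfig, org_bucket: int) -> bool:
    """Gate archival on the toggle and the partial-rollout percentage.

    `org_bucket` is a stable hash of the org id mapped into 0-99, so the
    same orgs stay in (or out of) the rollout between runs.
    """
    return cfg.migration_enabled and org_bucket < cfg.rollout_percent

cfg = ArchivalConfig("ec2_instances", retention_days=365,
                     migration_enabled=True, rollout_percent=10)
print(should_archive(cfg, org_bucket=3))   # True
print(should_archive(cfg, org_bucket=42))  # False
```

Separating the migration and deletion toggles means S3 copies can be verified in production before a single MongoDB document is removed.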

After the historical backfill, we deleted almost 1.5 billion documents from our largest MongoDB collections!

Learnings and next steps

This effort collectively cleared about 30% of disk space (over 1TB) from the MongoDB cluster in our largest region and halved the growth rate of most large collections. We saw some last-minute, unexpected growth in other product areas, as shown in the large spikes below. However, the archival work offset this jump and bought us another one to two years of runway.

As a growing organization, we have learned so much from this initiative:

  • It is important to continuously invest in infrastructure, especially when supporting rapid product growth.
  • There is a careful balance: making thoughtful, flexible decisions that support future workflows without letting the platform get ahead of the product. Always evaluate quick wins and medium-term optimizations before planning major revamps.
  • When approaching major projects, always assess the impact on infrastructure. Build the right communication channels for teams to continuously share potential risks and learnings with each other.

Looking ahead, we know our database will continue to grow, but we have a plan. Our next step is to use MongoDB’s native sharding to split up our largest collections. As the product roadmap takes shape, we’ll revisit the remaining scoped options—vertically partitioning or adopting a new database. We don’t know exactly which path we’ll take yet, but we’ve set ourselves up with the flexibility to choose when the time comes. We’re proud of what we’ve built and even more excited about what’s ahead.

If you want to solve meaningful problems like this, check out our open roles at Vanta.
