On what was probably a foggy San Francisco day in late 2017, one of Vanta’s co-founders made a fateful commit:
That day, we did indeed serve some GraphQL requests. Since then we’ve been on a quest to get GraphQL working just how we like it.
Vanta chose GraphQL as our API layer early in our company’s history. It was the hot new thing and had a lot of potential, but industry best practices hadn’t yet been established. No one on the team had any GraphQL experience, but we had a vague notion that many ideas in operational security were best expressed through a graph. Ultimately, after some investments in tooling and culture, GraphQL ended up being the perfect tool for us.
In this post, we’ll admit some early pitfalls that we encountered in our initial implementation and explain how we make sure not to repeat the mistakes of the past.
If you want to skip to the good stuff, we’ve also open-sourced our GraphQL style guide along with a corresponding ESLint plugin.
What is GraphQL?
GraphQL is a query language for APIs. It allows an API designer to define a nested schema, or “graph.” The cool thing about GraphQL is that a client can drill down to multiple levels of depth in a single query, requesting only exactly the data it wants.
A schema might look like this:
Which lets a client who doesn’t care about age fetch just the name and favorite character:
On the server side, the snippets of code that populate the requested fields are called resolvers.
Read more about GraphQL basics in the official documentation.
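To make the schema, query, and resolver ideas concrete, here is a minimal TypeScript sketch. The schema in the comments, the sample values, and the tiny `execute` function are illustrative assumptions, not Vanta’s code and not a real GraphQL engine. The only point is that resolvers run for the fields a client selects and for nothing else.

```typescript
// Hypothetical schema (for reference):
//   type Character { name: String! }
//   type Person { name: String!  age: Int  favoriteCharacter: Character }
//   type Query { person: Person }
// Hypothetical query (for reference):
//   { person { name favoriteCharacter { name } } }

// A selection mirrors the query: `true` marks a leaf, an object marks nesting.
type Selection = { [field: string]: true | Selection };
type Source = { [field: string]: unknown };

// Toy executor: resolves only the selected fields. A field may be a plain
// value or a zero-argument resolver function.
function execute(source: Source, selection: Selection): Source {
  const result: Source = {};
  for (const field of Object.keys(selection)) {
    const raw = source[field];
    const value = typeof raw === "function" ? (raw as () => unknown)() : raw;
    const sub = selection[field];
    result[field] = sub === true ? value : execute(value as Source, sub);
  }
  return result;
}

// Records which resolvers actually ran.
const resolverCalls: string[] = [];

const root: Source = {
  person: () => ({
    name: () => { resolverCalls.push("name"); return "Ada"; },
    age: () => { resolverCalls.push("age"); return 36; },
    favoriteCharacter: () => {
      resolverCalls.push("favoriteCharacter");
      return { name: "Totoro" };
    },
  }),
};

// The client does not ask for `age`, so the `age` resolver never runs.
const data = execute(root, {
  person: { name: true, favoriteCharacter: { name: true } },
});
```

A production server would parse the query text and validate it against the schema; this sketch skips both steps to keep the resolver idea visible.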
Pitfall #1: Not enough tooling
What we did wrong
As the API surface area grows, GraphQL development can become painful without the right tooling. Vanta defines its GraphQL schema using the Schema Definition Language rather than letting resolvers implicitly define the schema, since we want API designers to think about the schema without worrying about implementation details. The downside is that the schema and its resolvers can drift apart: forgetting to define a resolver for a field in the schema can cause disruptive runtime failures.
How we fixed it
We use TypeScript for all of Vanta’s microservices, and we’re proud that our codebase is fully typed. GraphQL schemas are typed, but they’re not typed in TypeScript. We needed to bridge the gap and somehow convert our schema into TypeScript types so the typechecker could enforce the shapes of our resolvers.
Luckily, other people have also tried to convert GraphQL types to TypeScript types, so we can use an off-the-shelf tool for most of the heavy lifting. We’ve had great success with GraphQL Code Generator – whenever the schema changes, we run a command that generates all of the types we need. Not only do we get resolver types, but we also generate client types and React hooks for our GraphQL client that are fully typed and ready to use.
We use Apollo to power our GraphQL server. Apollo generates default resolvers for every type defined in our schema, which is generally convenient. Taking an example from the Apollo docs, let’s say you have a schema that looks like this:
If you write a resolver for the books field that returns an array of objects with a title field, Apollo is smart enough to return that field by default instead of requiring a developer to implement a trivial resolver.
Unfortunately, this behavior means that TypeScript cannot consistently check whether a resolver is missing or mistyped (since the default case might work). To replace Apollo’s implicit resolvers, we wrote a custom code generator that generates explicit default resolvers which can be overridden. This lets TypeScript ensure that all resolvers are defined without requiring any extra boilerplate.
Between our generated resolvers and our generated types, we’ve eliminated whole classes of bugs that are now caught by the TypeScript type system.
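As a rough illustration of explicit default resolvers, consider the sketch below. It assumes a `Book` type with `title` and `author` fields, as in the Apollo docs example mentioned above. Every name in it (`BookParent`, `BookResolvers`, `passThrough`, `generatedBookDefaults`, `authorNames`) is hypothetical and stands in for whatever a real code generator would emit.

```typescript
// Assumed schema: type Book { title: String  author: String }

// Shape of the object returned by the parent `books` resolver (an assumption:
// the database row stores an author ID rather than the author's name).
interface BookParent {
  title: string;
  authorId: string;
}

// Generated resolver type: every schema field is required, so a missing
// resolver becomes a compile-time error instead of a runtime surprise.
interface BookResolvers {
  title: (parent: BookParent) => string;
  author: (parent: BookParent) => string;
}

// Explicit version of the "return the property with the same name" default.
function passThrough<P, K extends keyof P>(key: K): (parent: P) => P[K] {
  return (parent) => parent[key];
}

// What a generator could emit: defaults only for fields that exist on the parent.
const generatedBookDefaults = {
  title: passThrough<BookParent, "title">("title"),
};

// Hypothetical lookup table standing in for a database call.
const authorNames: Record<string, string> = { a1: "Kate Chopin" };

// Hand-written resolvers fill in, or override, the generated defaults.
// Deleting `author` here makes TypeScript report that it is missing.
const bookResolvers: BookResolvers = {
  ...generatedBookDefaults,
  author: (parent) => authorNames[parent.authorId] ?? "Unknown author",
};
```

Because the defaults are ordinary object properties, overriding one is just a matter of listing the field again after the spread.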
Now, any engineer can easily:
1. Add a field to the schema
2. See where TypeScript is angry
3. Fix the errors from step 2
4. Rinse and repeat
Type systems aren’t the only way to increase development velocity. We also want to make sure that every change to our schema conforms to the norms that we’ve defined in our style guide without having to go through several rounds of code review.
Norms are only as good as their enforcement. With a quickly growing engineering team and exponentially expanding requirements, we can’t afford to have a GraphQL czar who reviews every single change to ensure that it lines up with her mental model. Instead, we’ve found that linters and other automated tools are the best solution.
Linters and autoformatters are much more effective than code reviewers when it comes to enforcing norms that are automatically checkable. Code reviewers are still expected to review changes with the style guide in mind, but the majority of rules in the style guide are enforced by our ESLint plugin, which runs automatically. On the flip side, if it’s too tricky to write a lint rule for some aspect of the style guide, it may suggest that the style guide is not prescriptive enough.
Pitfall #2: REST-ful GraphQL isn’t restful for devs
What we did wrong
Since GraphQL is so flexible, it’s easy to accidentally superimpose a REST-ful mindset on top of a GraphQL schema. GraphQL isn’t REST, and shouldn’t be shoehorned into traditional REST patterns.
A REST API tends to have endpoints that return data about one kind of thing along with pointers to other related data. /org/:orgId/users might return a list of each user in some organization with some metadata about those users, and /users/:userId/posts might return a list of posts for one particular user. This is easy to reason about but each endpoint tends to be a totally distinct resource with its own logic.
GraphQL allows an API designer to express an API as, well, a graph. This only makes sense when the data has a graph structure – but lots of business data does. For example, an organization might have a bunch of employees with computers, each with a list of installed applications. This could be expressed in a graph like this:
Superimposing a REST mindset on this graph works, but it requires lots of extra frontend logic and sometimes even extra network requests. In the most extreme case, a schema might look like this:
If a client wants the names of all of the applications installed on the computers in some organization, they have to make at least four round trips to the server:
- Query for the list of user IDs in the current organization
- Query for the list of computers owned by each user from the previous step
- Query for the list of applications installed on each computer from the previous step
- Query for the metadata about each application from the previous step
Of course, the API designer might add an applicationsByOrganizationId query to get this data directly, but that doesn’t solve the general problem. Every time a new use case is discovered, either an API designer has to add a new endpoint and write custom code to support it, or the client has to make multiple round trips and do complicated joining logic on its own side.
How we fixed it
Using GraphQL how it was intended solves this problem beautifully. Instead of returning IDs, which are basically pointers to other parts of the graph, a more reasonable schema would look like this:
Now, the client can make one straightforward query to get all the data it needs:
This also makes the server side implementation of each one of these types much easier – resolver logic only has to be implemented once per type, instead of once per business use case.
To see a more complete example, take a gander at the relevant section of our style guide.
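The difference between the two schema styles can be sketched in TypeScript. Everything below is a hypothetical model: the in-memory tables, the `request` helper that counts client-to-server round trips, and the resolver names are assumptions chosen to mirror the organization, user, computer, and application example above.

```typescript
// Hypothetical in-memory tables standing in for a database.
const organizations: Record<string, { userIds: string[] }> = {
  o1: { userIds: ["u1", "u2"] },
};
const users: Record<string, { computerIds: string[] }> = {
  u1: { computerIds: ["c1"] },
  u2: { computerIds: ["c2"] },
};
const computers: Record<string, { applicationIds: string[] }> = {
  c1: { applicationIds: ["a1", "a2"] },
  c2: { applicationIds: ["a2"] },
};
const applications: Record<string, { name: string }> = {
  a1: { name: "Editor" },
  a2: { name: "Browser" },
};

function flatten<T>(lists: T[][]): T[] {
  return lists.reduce((all, list) => all.concat(list), [] as T[]);
}

// Every call to `request` models one client-to-server round trip.
let roundTrips = 0;
function request<T>(handler: () => T): T {
  roundTrips += 1;
  return handler();
}

// ID-centric schema: the client follows ID "pointers" one level at a time
// and joins the results itself.
function appNamesViaIds(orgId: string): string[] {
  const userIds = request(() => organizations[orgId].userIds);
  const computerIds = request(() => flatten(userIds.map((id) => users[id].computerIds)));
  const appIds = request(() => flatten(computerIds.map((id) => computers[id].applicationIds)));
  return request(() => appIds.map((id) => applications[id].name));
}

// Graph-shaped schema: one resolver per edge, each returning whole objects.
const resolvers = {
  Organization: {
    users: (org: { userIds: string[] }) => org.userIds.map((id) => users[id]),
  },
  User: {
    computers: (user: { computerIds: string[] }) => user.computerIds.map((id) => computers[id]),
  },
  Computer: {
    applications: (computer: { applicationIds: string[] }) =>
      computer.applicationIds.map((id) => applications[id]),
  },
};

// The server walks the graph during a single request.
function appNamesViaGraph(orgId: string): string[] {
  return request(() =>
    flatten(
      resolvers.Organization.users(organizations[orgId]).map((user) =>
        flatten(
          resolvers.User.computers(user).map((computer) =>
            resolvers.Computer.applications(computer).map((app) => app.name),
          ),
        ),
      ),
    ),
  );
}
```

In this model the ID-centric path costs four round trips and the graph-shaped path costs one, while the three edge resolvers can be reused by any other query that touches the same types.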
How we enforce the fix
Of the three pitfalls discussed in this post, this is the one with the least automated enforcement, since – more than anything – it’s just a new mindset about how to design GraphQL APIs. However, we did come up with a few guidelines that we always look for in code review:
- Rarely offer id fields in GraphQL types. Instead, just add an edge to the whole object. If the client just needs the ID, they can query for the ID on the type itself.
- Don’t be afraid to add extra fields to some type. Unlike a traditional API where the same logic runs every time, the code backing these fields only gets executed when someone wants the field.
- There should be one type per platonic ideal of a business object. Instead of returning a “UserById” type, return a “User” type. The client decides what fields are important to them – and permissions should be enforced at a different layer.
Pitfall #3: Friendly denials of service
What we did wrong
Unlike a traditional REST API, which has a finite number of possible routes, GraphQL allows a client to request arbitrary combinations of data in a practically unlimited number of ways. This is nice for the client, but it makes it hard to ensure that even a friendly client doesn’t accidentally send a query that triggers a million database requests. I can neither confirm nor deny whether I accidentally wrote a query that did just that.
It’s relatively straightforward to estimate the cost of a query if you know exactly how many resources the query will return. For example, GitHub’s API docs explain how to do GraphQL costing in a pretty clever way.
However, since our GraphQL schema used to include lists of arbitrary length, it was impossible to estimate the cost of a query ahead of time. When you have a users field that returns all of the users in some organization, it’s ok when there are 100 users, but is likely to cause a problem when there are 100,000. We didn’t worry at all about pagination in the early days of Vanta, but as our customer base and complexity grew, we recognized a need for it.
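To see why bounded lists make costing tractable, here is a simplified worst-case estimate. It assumes every list field takes an explicit `first` argument and multiplies page sizes down the query tree. The `ConnectionSelection` shape and the `maxNodes` function are hypothetical, and this is a sketch of the general idea, not GitHub’s exact formula.

```typescript
// One paginated field in a query, with the page size the client asked for.
interface ConnectionSelection {
  field: string;
  first: number;
  children?: ConnectionSelection[];
}

// Worst-case number of nodes a query can return: each nested list is fetched
// once per parent node, so page sizes multiply as the query gets deeper.
function maxNodes(selection: ConnectionSelection, parentCount = 1): number {
  const here = parentCount * selection.first;
  const below = (selection.children ?? []).reduce(
    (sum, child) => sum + maxNodes(child, here),
    0,
  );
  return here + below;
}

// Models: users(first: 100) { computers(first: 10) { applications(first: 50) } }
const exampleQuery: ConnectionSelection = {
  field: "users",
  first: 100,
  children: [
    {
      field: "computers",
      first: 10,
      children: [{ field: "applications", first: 50 }],
    },
  ],
};

// 100 users + 1,000 computers + 50,000 applications
const worstCase = maxNodes(exampleQuery);
```

With an unbounded list there is no `first` to multiply by, which is exactly why the estimate breaks down.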
How we fixed it
We wouldn’t know which queries to optimize and which code-paths are hot without monitoring. We use Datadog APM fairly heavily to monitor our GraphQL API’s performance and understand when queries are performing slower than expected. We can even see which parts of the query are taking especially long to resolve. This monitoring let us know that we should focus on two major themes: pagination and dataloaders.
There are some holy wars when it comes to GraphQL pagination, but we landed on the Relay spec since it met our needs for cursor-based pagination.
When we started using the Relay spec for pagination, we noticed that well-intentioned engineers were adding new, un-paginated fields faster than we were converting old fields to paginated versions! Once again, tooling came to our aid.
We introduced a lint rule that complains whenever someone adds a new list field that isn’t exposed as a Relay connection. We cap the number of nodes returned by each connection, so this ensures that we’re never returning lists of unbounded length.
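A capped, cursor-based page in the shape the Relay spec describes (a connection holding `edges` and `pageInfo`) might look like the sketch below. The `paginate` helper, the index-based cursor encoding, and the `MAX_PAGE_SIZE` value of 100 are assumptions for illustration. A real implementation would typically derive cursors from a stable sort key, not from an array index.

```typescript
interface Edge<T> {
  cursor: string;
  node: T;
}

interface PageInfo {
  hasNextPage: boolean;
  hasPreviousPage: boolean;
  startCursor: string | null;
  endCursor: string | null;
}

interface Connection<T> {
  edges: Edge<T>[];
  pageInfo: PageInfo;
}

// Hypothetical server-side cap on how many nodes one page may contain.
const MAX_PAGE_SIZE = 100;

// Cursors are opaque to clients; this sketch simply encodes the item index.
function toCursor(index: number): string {
  return `idx:${index}`;
}

function fromCursor(cursor: string): number {
  return parseInt(cursor.slice("idx:".length), 10);
}

// Forward pagination: return up to `first` items after the `after` cursor,
// never exceeding MAX_PAGE_SIZE no matter what the client requests.
function paginate<T>(items: T[], args: { first: number; after?: string }): Connection<T> {
  const start = args.after === undefined ? 0 : fromCursor(args.after) + 1;
  const size = Math.max(0, Math.min(args.first, MAX_PAGE_SIZE));
  const edges = items
    .slice(start, start + size)
    .map((node, offset) => ({ cursor: toCursor(start + offset), node }));
  return {
    edges,
    pageInfo: {
      hasNextPage: start + size < items.length,
      hasPreviousPage: start > 0,
      startCursor: edges.length > 0 ? edges[0].cursor : null,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null,
    },
  };
}
```

A client keeps passing the previous page’s `endCursor` as `after` until `hasNextPage` is false, so even a request for a million items is served in bounded chunks.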
We found, though, that not all lists need to be paginated. Sometimes, we know a list is going to be small no matter what. For those cases, we introduced a @tinylist directive which lets the linter know “this list is of small, constant-ish length, no need to paginate.”
Now, a developer who needs a new list type either must implement pagination or face the kindly wrath of a code reviewer asking why a list that is definitely not of constant length is marked as a @tinylist.
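The spirit of such a lint rule can be sketched with a simple line-based check. This is not the code in our ESLint plugin: the `findUnpaginatedLists` function is hypothetical, it assumes one field per line, and a real rule would walk the parsed schema AST instead of using regular expressions. It also assumes that Relay connection types are recognizable by a `Connection` name suffix.

```typescript
// Returns "Type.field" for every list field that is neither inside a
// connection type nor marked with @tinylist.
function findUnpaginatedLists(sdl: string): string[] {
  const violations: string[] = [];
  let currentType = "";
  for (const rawLine of sdl.split("\n")) {
    const line = rawLine.trim();

    // Track which type the following fields belong to.
    const typeMatch = /^type\s+(\w+)/.exec(line);
    if (typeMatch) {
      currentType = typeMatch[1];
      continue;
    }

    // The `edges` list inside a connection type is the paginated list itself.
    if (/Connection$/.test(currentType)) continue;

    // A field (with optional arguments) whose return type starts with "[".
    const fieldMatch = /^(\w+)(\(.*\))?\s*:\s*\[/.exec(line);
    if (fieldMatch && line.indexOf("@tinylist") === -1) {
      violations.push(`${currentType}.${fieldMatch[1]}`);
    }
  }
  return violations;
}

// Hypothetical schema fragment used to exercise the check.
const exampleSdl = `
type Organization {
  users: [User!]!
  usersConnection(first: Int!, after: String): UserConnection!
  admins: [User!]! @tinylist
}
type UserConnection {
  edges: [UserEdge!]!
}
`;

// Only the plain, unmarked list is reported.
const violations = findUnpaginatedLists(exampleSdl);
```

Here only `Organization.users` is flagged: the connection field and the `@tinylist` field both pass.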
Queries often request the same data many times in the same request. Consider the following query:
If there are n users in the system and every user is friends with every other user, then a naive implementation will make n^2 database calls to serve this query, since every user needs to look up the name of each of their friends. However, since friends are shared among users, this is quite redundant; once you’ve looked up a name for some user, you shouldn’t have to look it up again in the same request.
The dataloader pattern resolves this problem. Instead of greedily making all of the expensive calls when we need them, we queue up the requests, de-duplicate them, and then make them all at once. Our key insight was that the dataloader pattern is not an “as needed” pattern – since clients can make arbitrary requests, we want to dataload nearly all the time. Wherever possible, we enforce that our resolvers use dataloaders to load the data they need. To maintain development velocity while requiring dataloading, we’ve invested in some generic higher-order functions to make it easy to convert a database query into a dataloader. Other companies have taken this a step further and autogenerated dataloaders.
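The queue, de-duplicate, and batch behavior can be sketched from scratch in a few lines. The `TinyLoader` class below is a hypothetical illustration of the pattern, not the implementation we use and not the API of any particular library. It assumes the batch function returns values in the same order as the keys it receives.

```typescript
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (error: unknown) => void;
}

class TinyLoader<K, V> {
  private batchFn: BatchFn<K, V>;
  private cache = new Map<K, Promise<V>>();
  private queue: Pending<K, V>[] = [];

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  // Returns a promise for one key. Repeated keys share the same promise, and
  // new keys are queued until the current synchronous work has finished.
  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) return cached;
    const promise = new Promise<V>((resolve, reject) => {
      if (this.queue.length === 0) {
        // First key of a new batch: dispatch once the call stack unwinds.
        Promise.resolve().then(() => this.dispatch());
      }
      this.queue.push({ key, resolve, reject });
    });
    this.cache.set(key, promise);
    return promise;
  }

  // Sends every queued key to the batch function in a single call.
  private dispatch(): void {
    const batch = this.queue;
    this.queue = [];
    this.batchFn(batch.map((item) => item.key)).then(
      (values) => batch.forEach((item, index) => item.resolve(values[index])),
      (error) => batch.forEach((item) => item.reject(error)),
    );
  }
}

// Hypothetical data: user names keyed by ID, plus a log of database batches.
const userNames: Record<string, string> = { u1: "Ann", u2: "Bo", u3: "Cy" };
const batches: string[][] = [];

const nameLoader = new TinyLoader<string, string>((ids) => {
  batches.push(ids);
  return Promise.resolve(ids.map((id) => userNames[id]));
});

// Friend lists overlap, so the same IDs are requested several times.
const requestedIds = ["u1", "u2", "u1", "u3", "u2", "u3"];
const lookups = Promise.all(requestedIds.map((id) => nameLoader.load(id)));
```

Six lookups turn into one batched call containing three unique IDs. A loader like this should be created per request so that cached values never leak between users.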
Along with this blog post, we’ve open-sourced our GraphQL style guide and ESLint plugin to share with weary travelers. Not all of these rules will make sense for everyone, and some probably don’t make sense to anyone. But please feel free to use them as an inspiration for your own GraphQL journey. We welcome pull requests – and if you’re interested in working with us, check out the available jobs on our jobs page!
Special thanks to Ellen Finch, Utsav Shah, Neil Patil, and the whole Vanta engineering team for their help editing this blog post and – more importantly – implementing these ideas in our product.