Writing Good Design Docs

Date published: September 1, 2019

Every major technical project failure can be root-caused to a poorly hashed-out design doc. But why is this the case? It seems like a truism, but we still get design docs wrong all the time. I want to take a stab at this by explaining the intent of a good design doc and how to write one. There are a bunch of these already written, but here's my take.

What Role Does a Design Doc Play?

Communicate intent and effects

What does "this" try to do? Why does a lack of it cause problems? In a short sentence, how will this work? What will the end product look / work like?

Aggregator for cross functional needs/requirements

If you've ever worked on a team with more than 10 ICs, you've seen how fast communication breaks down as the software stack grows in complexity. As the number of engineers working on different parts of the stack grows, so does the probability that introducing a new system will have cascading effects on others. The design doc serves as an early-stage contract to draw lines and resolve such discrepancies in words rather than in code.

Good Design Doc Traits

Source of Truth before the product and its documentation are complete

Offline discussions will diverge and email threads will get lost, but having a design doc anchors key decision making onto the doc itself. This is especially true when said feature(s) hinge on cross functional efforts with cascading effects if things go off the rails.

If there's a disagreement (and it will happen!), it can be discussed, iterated upon and resolved on the document itself. Another benefit, especially for projects with long time horizons, is that newer team members get better context on the Whys while ramping up.

A new but experienced engineer should be able to read it and implement a system that's almost correct without jumping over many informational hurdles.

A design doc should theoretically have 80-90% of the technicalities ironed out.

We know unknown unknowns happen, but that's also the whole point of having a design doc: scoping the work involved, surfacing the unknowns and scoping them before the first line of code is written. In other words, you should have all the pieces laid out, like an IKEA manual. An engineer should be able to read the design doc, gain full context and implement it autonomously.

How

"Ok ok, just cut out the whys and tell me how to write a good design doc already!"

Here's a template that I follow quite religiously:

Title

First impressions matter a lot. Make the title understandable. People are going to search for it in the engineering/product shared team drive, so a clear title helps with discoverability.

Some examples include:

"Adding Remote Build Execution by Default Support for stack X"
"Migrating our Billing Services from cloud provider A to B"

I like my titles prefixed with "DESIGN DOC:", so "DESIGN DOC: <title here>", but that's just me :)

You should also insert a small table / section that lists the author(s), team(s) and dates associated with this design doc.

Introduction

In 3 sentences or fewer, describe the concrete feature itself. This should briefly touch on who will be using the system and how the system is going to serve them.

Problems / Pain Points

Describe the pain points here. Is there a suboptimal solution today? Or is this a new feature meant to enable something that doesn't exist yet? Examples include:

"Current machine learning model owners need to redownload unchanged model assets and maintain consistency between the two storage systems." Go on to describe, if applicable, the workaround that is needed today and how it scales poorly as more ML engineers iterate on different experiments.
"Current build times take up to x hours. This is a huge engineering productivity loss." Go on to describe the n-order effects, such as developers being unwilling to test some modules rigorously before pushing them out for review, which causes build errors further downstream in the development workflow, where they are more costly to fix. (Yes, this is still a thing!)

Proposed Solution

This is like a prelude of how the system is going to work. In one paragraph, describe how this system will be "plugged into" the existing architecture.

System Diagram

This should complement the Proposed Solution section above. It serves as a visual aid to contextualize and help the reviewer digest the section above.

There are probably lots of tools out there, but I personally like draw.io.

Generally speaking, by the time reviewers get to this section, they will already be biased toward liking or hating the system, which leads me to the next section...

Implementation Details

This is the long one. In my experience, this section will take up over 50% of the design doc itself.

Be as detailed as possible. Break down the system into separate sections, and dive deep into the details. Be very concrete on the technical tools and decisions made to tackle the subcomponents that make up the sum of the system here.

Timelines

This is very important when teams operate on different sprint wavelengths and the project requires cross functional effort within your organization. At larger companies, this might require executive / upper management signoff, as it has implications for higher level company roadmaps. More on signoffs below.

Signoff

Yay, bureaucracy! This shouldn't be a surprise if you're at a bigger startup or a big co. I think of it more as a discipline: run the doc by key stakeholders. You'd be surprised how often engineers write a design doc and then run off to implement it without notifying others.

Just set up a 3xn table like below:

Team	Approver	Date (approver fills in the date of approval)
Security
Infrastructure
Payments

Alternatives Considered

Why should you not do this? What are other alternative solutions out there? List them here.

This could be something like choosing a particular library over another. For example:

"We decided to use Cap'n Proto over Protobuf as our RPC framework." Describe why Protobuf was, at one point, considered, and why you did not choose it in the end (hint: Cap'n Proto has zero encoding cost!).
"We could use Twilio over Zendesk API. However, Twilio is operationally speaking more expensive than Zendesk. Here is the breakdown chart (or a link to it). This also means we'll need to maintain a call center and roll our own hotline service. It's capital inefficient to do so."

If there are more talking points, add Pros and Cons subsections under each alternative so you can get into the nitty-gritty details that were showstoppers and kept it from being the first choice. You get the idea.

Open Questions

The ??? section where you list known unknowns that could have some impact on the project. What are some big question marks regarding the proposal above that you haven't figured out yet? Examples include:

"Should we use TOML or YAML for configurations in this microservice?"
"When will the Google Cloud Platform T4 GPU instances be ready? Check with infrastructure team"

References/Notes

This is where you can stick all related:

Google Doc links
product documentation (if any)
meeting notes/minutes
whiteboard scribblings

Feel free to make this section as verbose as possible regarding your findings. These serve as a great resource for people wanting to dig deeper on materials that you've used to source your findings/support your decisions.

You may choose to add scripts/prototypes here too if you already have some kind of MVP.
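If you'd rather not retype the template every time, here's a minimal sketch of a script that scaffolds an empty doc with the sections from this post. Everything in it is my own illustration rather than an established tool: the function name render_skeleton, the "TODO" placeholder and the metadata layout are assumptions, so adapt them to whatever your team actually uses.

```python
"""Scaffold an empty design doc using the section list from this post."""

from datetime import date

# Section headings from the template above, in the order they appear.
SECTIONS = [
    "Introduction",
    "Problems / Pain Points",
    "Proposed Solution",
    "System Diagram",
    "Implementation Details",
    "Timelines",
    "Signoff",
    "Alternatives Considered",
    "Open Questions",
    "References/Notes",
]


def render_skeleton(title, authors, teams, signoff_teams, today=None):
    """Return the skeleton as plain text (hypothetical helper)."""
    today = today or date.today()
    lines = [
        f"DESIGN DOC: {title}",
        "",
        f"Author(s): {', '.join(authors)}",
        f"Team(s): {', '.join(teams)}",
        f"Date: {today.isoformat()}",
        "",
    ]
    for section in SECTIONS:
        lines += [section, ""]
        if section == "Signoff":
            # Three-column table: approver and date are left blank to fill in.
            lines.append("Team\tApprover\tDate")
            lines += [f"{team}\t\t" for team in signoff_teams]
            lines.append("")
        else:
            lines += ["TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_skeleton(
            "Migrating our Billing Services from cloud provider A to B",
            authors=["Your Name"],
            teams=["Payments"],
            signoff_teams=["Security", "Infrastructure", "Payments"],
        )
    )
```

Running it prints a plain-text skeleton you can paste into your doc tool of choice and start filling in.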

Closing Thoughts

Cool, you now know the major pieces. I wish I could link you to a couple of good design docs that my team and I have written, but that'll probably get me in trouble with past and current employers. Maybe you'll have to wait until I write one up for an open source project/idea I have. Until then, happy hacking.

Zheng Hao Tan