Photo credit: Neramitevent @ Pixabay
Many different companies and posts talk about why and how to run Game Days. I won’t rehash all of that here; instead, I will give a basic intro, link to some sources, and then dive into some more specific Game Days I have recently been involved with.
A Game Day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would take as if an exceptional event had happened.
Below are a few recent Game Day examples, with some details of what we did, what we learned, how we broke something in a “safe” manner, and a runbook so you can run your own. In all cases, one important factor of running the Game Day is having a way to stop if the simulated incident starts to cause a real incident. When planning a simulated emergency, make sure you have a planned way to bail out of the test if something unexpected starts happening.
For various reasons, some of our systems were relatively slow to deploy. This meant that if something bad happened, it could take a while to revert, pass CI, deploy, etc… We finally got access to a rollback tool that was much faster than the normal deploy process for jumping back to a recently deployed version of the code. Although the tool had been available for a while, many folks had their first experience using it during a real incident! Trying to find docs and understand the commands while stressed about a broken service is not the ideal way to learn… We got a group together at a preplanned and scheduled time, broke a ‘hidden’ endpoint on production, used the rollback tool, fixed the code, and showed how to “roll forward” back into the standard development flow. Since our tool did a bunch of cool things like freezing deploys while you are rolled back and alerting various folks that a rollback was in progress, we got to see and feel the full experience. We had folks running the commands who had never done a rollback in production before, which made the exercise really valuable.
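To make the “hidden endpoint” idea concrete, here is a minimal sketch of what such an endpoint might look like; the route name and framework (Flask) are my own illustration, not the actual code we broke.

```python
# A minimal sketch of the kind of "hidden" endpoint used for the rollback
# Game Day. The route and framework are assumptions for illustration; the
# real endpoint is internal and not visible to users.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/internal/gameday/canary")
def gameday_canary():
    # During the Game Day we deployed a one-line change that made this
    # endpoint raise, confirmed the failure in production, and then used
    # the rollback tool to jump back to the previous deploy.
    # raise RuntimeError("Game Day: intentionally broken")  # the "bad" deploy
    return jsonify(status="ok")
```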
We have a feature that I will cover more in a future post, called chaos traces; similar to Chaos Monkey, it lets you inject some chaos into your system to see how it behaves. In this Game Day, after the release of a new partner integration, we used our chaos trace tool to validate how the integration handled timeouts and errors. The code author and the PR reviewer paired up following the deploy and used chaos traces to verify that when the integration hit an API error, or was running slow, the app handled it the way everyone expected. By injecting, say, 3 seconds of latency into the response time, we verified that the app would handle the timeout and give the customer a reasonable experience.
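Our chaos trace tooling is internal, but the general shape of the exercise looks roughly like the sketch below; the env-var-driven fault injection and the function names are assumptions for illustration, not our actual tool.

```python
# A rough sketch of chaos-style fault injection around a partner API call.
# The env vars (CHAOS_ERROR_RATE, CHAOS_TIMEOUT_RATE) and names are
# illustrative only.
import os
import random
import requests

def chaotic_get(url, timeout=2.0):
    """Fetch from the partner API, optionally injecting faults."""
    if random.random() < float(os.environ.get("CHAOS_ERROR_RATE", "0")):
        raise requests.HTTPError("injected partner API error")
    if random.random() < float(os.environ.get("CHAOS_TIMEOUT_RATE", "0")):
        # Simulate ~3 seconds of added latency blowing past our 2s budget.
        raise requests.Timeout("injected slow response")
    return requests.get(url, timeout=timeout)

def fetch_partner_data(url):
    # The behavior under test: on error or timeout, degrade gracefully
    # instead of breaking the customer-facing experience.
    try:
        return chaotic_get(url).json()
    except (requests.Timeout, requests.HTTPError):
        return {"status": "unavailable", "fallback": True}
```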
I didn’t participate directly in this Game Day, but I helped write some of the code and reviewed PRs. A new service was being integrated into a high-traffic flow, and before it went live the team wanted to load test it. In the past we have often used simulated load, but it is hard to realistically generate traffic that looks like production… We tried something new and ran a Game Day around production load… The areas of the site that would soon surface the new experience fired off fire-and-forget GET requests to the new service without displaying any results… They even had a scale factor, so on a normal day we generated 3X production traffic against the new backing service… Since all the requests were fire and forget, they didn’t slow down the user experience, and they let us work on performance tuning the new service under real, heavy load.
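I don’t have the team’s actual code to share, but a fire-and-forget mirror with a scale factor might look something like this sketch; the SHADOW_SCALE name and thread-pool approach are assumptions on my part.

```python
# A minimal sketch of fire-and-forget traffic mirroring with a scale factor.
# The real implementation lived inside our app; the names and mechanism here
# are illustrative.
import os
from concurrent.futures import ThreadPoolExecutor

import requests

SHADOW_SCALE = int(os.environ.get("SHADOW_SCALE", "0"))  # e.g. 3 for 3X traffic
_pool = ThreadPoolExecutor(max_workers=8)

def shadow_request(new_service_url, params):
    """Mirror a production GET to the new service without blocking the user."""
    for _ in range(SHADOW_SCALE):
        _pool.submit(_fire_and_forget, new_service_url, params)

def _fire_and_forget(url, params):
    try:
        # Results are thrown away; we only care about the load on the new service.
        requests.get(url, params=params, timeout=5)
    except requests.RequestException:
        pass  # never let the shadow traffic affect the real request path
```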
Caching is hard, and eventually you might have to break a bad cache… If the cache holds data for a few million users, very bad things can happen to performance while the system tries to rebuild it. This is a future Game Day that we have not yet run, but are currently planning after running into a cache issue. We were able to quickly bust the cache and fix a service that was in an invalid state, but it got us thinking… As the service and its data set keep growing, how long can we rely on that? At what point will busting the cache start a cascade of timeouts throughout all of our client services as we buckle under the load? Seems like a perfect thing to Game Day: we plan to put a feature toggle in place that lets us ramp up cache_skips for a percentage of requests, turn up the dial, and ensure we can hit 100% without too large an impact on performance and stability.
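Since we have not built this yet, here is only a rough sketch of what a percentage-based cache-skip toggle could look like; the parameter name and cache interface are hypothetical.

```python
# A minimal sketch of a percentage-based cache-skip toggle, assuming a simple
# cache/db interface. Names (cache_skip_percent, load_profile) are made up.
import random

def get_user_profile(user_id, cache, db, cache_skip_percent=0):
    # For the configured percentage of requests, bypass the cache entirely so
    # we can watch how the backing store behaves as the dial goes up to 100%.
    skip_cache = random.random() * 100 < cache_skip_percent

    if not skip_cache:
        cached = cache.get(user_id)
        if cached is not None:
            return cached

    profile = db.load_profile(user_id)
    cache.set(user_id, profile)
    return profile
```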
After some recent issues with our Event Bus, we put a number of protections in place that we thought would ensure our service could handle a significant unplanned outage without causing any interruption to clients. We made several changes to how we were handling events.
OK, great: next time there is an incident our service will keep humming along without any issue… Well, would it really hold up? How could we be sure? Of course, by now y’all realize that instead of waiting for the next time our event queue has an incident, we can run a Game Day! That is what we did: we made a toggle to disable our access to the event queue server… Then we flipped it off and got to see whether our protections worked, verifying that each did what was expected… If we were wrong, we could very quickly restore the system… During the incident the week prior, our internal background job storage had overflowed in about 15 minutes as we helplessly watched in horror… As we watched for 15 minutes during this Game Day, our new monitor and its connected alert showed the estimated time remaining before buffering all events would become a problem, showing we could now handle hours of downtime and giving the teams plenty of time to get paged and work on a resolution if needed.
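The pattern we were verifying looks roughly like the sketch below: a toggle that cuts off the bus, a local buffer, and an estimate of remaining headroom that a monitor can alert on. The names, capacity, and headroom math are all illustrative assumptions, not our real implementation.

```python
# A minimal sketch of toggle-guarded event publishing with local buffering
# and a headroom estimate. All names and numbers are assumptions.
import collections
import time

EVENT_BUS_ENABLED = True          # the Game Day toggle we flipped off
BUFFER_CAPACITY = 1_000_000       # how many events local storage can hold

_buffer = collections.deque()
_buffer_started_at = None

def publish(event, bus_client):
    global _buffer_started_at
    if EVENT_BUS_ENABLED:
        bus_client.publish(event)
        return
    # Bus is "down" (or toggled off): buffer locally and drain later.
    if _buffer_started_at is None:
        _buffer_started_at = time.time()
    _buffer.append(event)

def estimated_minutes_remaining():
    """Rough headroom estimate that a monitor/alert could report."""
    if not _buffer or _buffer_started_at is None:
        return float("inf")
    elapsed_min = (time.time() - _buffer_started_at) / 60
    rate = len(_buffer) / max(elapsed_min, 0.01)   # events per minute
    return (BUFFER_CAPACITY - len(_buffer)) / rate
```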
Game Days can be used to work through what could be a difficult or exceptional situation in a controlled manner, at a planned time, and with the support of a team. When done well, they teach folks how to be better prepared for incidents, better understand their systems, and have more empathy for how other folks process and understand systems under the stress of a real incident. They are a great way to ensure everyone on the team is ready when something unexpected next happens.