29 - The Scoop: Inside the Longest Atlassian Outage of All Time

Credit to @SeriousBug#6848 for sharing the link for this week's Dendron Reading Series in our #what-are-you-reading channel.

After two weeks of outages, Atlassian's status page now lists Jira as operational again. In this article, Gergely Orosz details the timeline of events, what caused the outage (only one little script!), who was impacted, and what we can learn. Written a little over a week into the outage, this reflection diagnoses not only what caused the outage, but also what went wrong in Atlassian's response.

Both a lack of transparency and delays in communication have done little to comfort those affected. If you are responsible for keeping infrastructure up and running, what plans might your team put in place to ensure that your customers stay informed and up-to-date when your services are unavailable? If customers depend on you in any capacity, how can you communicate to them when you will be unavailable and when service will resume?

Interested in more tales of outage disasters? @kevins8#0590 shared how one of AWS's largest S3 outages was the result of a typo.
