John Allspaw Taking Human Performance Seriously In Software
The conference started off with a heady, fulfilling talk from John Allspaw about the impact of human performance in automated systems. Framing the crux of his argument around the Three Mile Island nuclear accident, he drew parallels to how automation in software suffers from the same issues and how automators should attack those problems. The main takeaways: spend resources on understanding why engineers went down the rabbit holes they did during an incident, because some valid thought process led them there; treat each individual on an incident as someone bringing unique knowledge to that particular incident; and spend time on postmortems to extract more value than just trying to make sure something doesn't happen again.
Nora Jones Chaos Engineering Traps
Nora Jones continued down the incident response maturity path, sharing stories from her time as a chaos engineer and the lessons she took from them. Structuring the talk around the traps that chaos engineering exploration can fall into, she offered actionable insight on how to set up chaos engineering experiments so that they provide the most value to your team. The biggest takeaways from these traps for me were: design your experiments together with everyone who will be on the experiment; don't hide what the experiments will be doing from your team, so they can focus on the vulnerabilities rather than the response; and create an organizational model where chaos engineers don't come in and create action items for product teams, but instead act as a separate entity that presents its findings to the teams they run experiments with.
David Calavera – Observability and Performance Analysis with BPF
The Berkeley Packet Filter (BPF) is an incredibly powerful tool, but historically it has been difficult to get started with and required knowledge of C. David did a great job showing how accessible the tooling has become and how you can leverage it to get visibility into where your application is actually spending its time. His quick overview of flame graphs, and of how you can generate them by sampling stack traces with BPF, illustrated how easily you can discover low-hanging fruit when looking to improve your application's performance. Whether that's an inefficient algorithm (on-CPU time) or a slow external dependency (off-CPU time), you have a great starting place to spot those wins and understand their composition.
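To make the flame graph step concrete: once you've sampled stack traces (e.g. at a fixed frequency with a BPF profiling tool), they get aggregated into the "folded" one-line-per-stack format that flame graph generators consume. Here's a minimal sketch of that aggregation step; the sample data and function name are illustrative, not from the talk.

```python
from collections import Counter

def fold_stacks(samples):
    """Aggregate sampled stack traces into the 'folded' format used by
    flame graph tooling: one line per unique stack, frames joined by ';'
    (root frame first), followed by the number of times it was sampled."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical stack samples captured at a fixed frequency:
samples = [
    ["main", "parse", "tokenize"],
    ["main", "parse", "tokenize"],
    ["main", "render"],
]
for line in fold_stacks(samples):
    print(line)  # stacks that appear in more samples get wider flame-graph boxes
```

The relative sample counts are what become box widths in the rendered flame graph, which is why hot code paths jump out visually.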
Nida Farrukh The Power and Creativity of Postmortem Repair Items
Nida Farrukh gave a fantastic talk about extracting maximum value from your incidents. Starting with a simple reframing, she described outages as unplanned investments that your company can extract value from, provided your postmortem process actively supports data collection and creative action-item writing. The three main pillars of data collection in the postmortem process are user impact, timeline refinement, and root cause analysis. By collecting these three things along with your MTT* metrics (mean time to detect, mean time to resolve, and so on), you can start to create action items that fix more than just the symptoms of your outages and drill into why your response wasn't as good as it could have been.
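The MTT* metrics fall straight out of a refined timeline. As a rough sketch (with invented incident data and field names, not anything from the talk), once each incident records when it started, was detected, and was resolved, the means are simple to compute:

```python
from datetime import datetime

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# Hypothetical incident timeline data pulled from postmortems:
incidents = [
    {"start": datetime(2019, 5, 1, 10, 0),
     "detected": datetime(2019, 5, 1, 10, 12),
     "resolved": datetime(2019, 5, 1, 11, 0)},
    {"start": datetime(2019, 5, 3, 2, 0),
     "detected": datetime(2019, 5, 3, 2, 4),
     "resolved": datetime(2019, 5, 3, 2, 34)},
]

# Mean time to detect: how long outages go unnoticed.
mttd = mean_minutes([i["detected"] - i["start"] for i in incidents])
# Mean time to resolve: total customer-facing duration.
mttr = mean_minutes([i["resolved"] - i["start"] for i in incidents])
```

A large gap between MTTD and MTTR is exactly the kind of signal that turns into a response-focused action item rather than a symptom fix.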
Piotr Holubowicz A story of one SLO
In this talk Piotr Holubowicz spoke about how Google created an SLO for a service that previously had only an implicit one. Walking through all of the product's use cases, he described how, starting from very little direction, they cut their metrics down to SLOs on only the things that actually impacted the customer experience.
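Once an explicit SLO exists, the usual next step is tracking the error budget it implies. A minimal sketch of that arithmetic, assuming a simple request-based availability SLO (the numbers and function are illustrative, not from the talk):

```python
def error_budget_remaining(good, total, slo=0.999):
    """Fraction of the error budget still unspent in this window.

    A 99.9% SLO over `total` requests allows total * 0.001 failures;
    this returns how much of that allowance remains (can go negative
    when the SLO is blown)."""
    allowed_failures = total * (1 - slo)
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures

# e.g. 999,600 successful requests out of 1,000,000 against a 99.9% SLO:
# the budget is 1,000 failures, 400 are spent, so 60% remains.
remaining = error_budget_remaining(999_600, 1_000_000)
```

Framing the SLO as a budget is what lets a team trade reliability work against feature work with a number instead of an argument.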
Dave Cadwallader Explain it Like I’m Five – What I Learned Teaching Observability to My Kids
A definite audience favorite, Dave Cadwallader brought up his son to talk about how they started learning observability concepts around Prometheus together. Built around the idea of sharing the tools of your livelihood with those you care about, he taught his son the ideas behind Prometheus by building a monitoring and alerting platform based on whether or not their garage door was open. Along the way they troubleshot issues like engineers and got to play with graphing in Grafana.
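A garage-door alert like theirs could be expressed as an ordinary Prometheus alerting rule. This is my own guess at what such a rule might look like; the metric name `garage_door_open` and the thresholds are assumptions, not details from the talk:

```yaml
groups:
  - name: garage
    rules:
      - alert: GarageDoorLeftOpen
        # Hypothetical gauge: 1 when the door sensor reports open, 0 when closed.
        expr: garage_door_open == 1
        # Only fire once the door has been open for a while, so normal
        # comings and goings don't page anyone.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "The garage door has been open for more than 10 minutes"
```

The `for:` clause is a nice teaching moment in itself: alert on sustained conditions, not momentary blips.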
Kenny Johnston Automation + Artisanship: the Future of Runbooks
Kenny Johnston continued the day with a really fantastic talk on developing runbooks in a way that lets your engineers continue to be artisans. Finding the balance between the automatable parts of an ops role and the artisanal decisions that go into solving complex software problems is a challenge, not only because of the race to automate everything but also because most engineers get job satisfaction from intrinsic motivations that too much automation can strip away. Building runbooks with this in mind lets your engineers feel less like button pushers and more like complex decision makers who let automated systems take the logistical challenges off their plates.
Matty Stratton Fight, Flight, or Freeze — Releasing Organizational Trauma
Giving a talk on trauma is hard; relating personal trauma to organizational trauma is even harder. Matty Stratton drew excellent parallels between the trauma people feel in post-traumatic situations and the way organizations behave after incidents. In the same way that post-traumatic stress disrupts the normal sympathetic nervous system response, incidents can create an organizational depression or hyperarousal that makes teams respond poorly to future incidents, and can leave them unable to keep making changes to their application out of fear of causing another one. By creating a blameless postmortem culture, many of these traumatic events can be dealt with in a healthy way, allowing your organization to return to normal operations.
Andrew Newdigate Practical Anomaly Detection using Prometheus
In a talk built around using statistics to trigger Prometheus alerts, Andrew Newdigate walked through how GitLab built a model that accounts for the seasonality of its data and sets standard-deviation error bounds, letting alerts fire in much the same way that humans do visual anomaly detection on graphs.
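The core idea can be sketched in a few lines: compare the current value against samples taken at the same point in previous seasons (say, the same hour on the last few Mondays) and flag it if it falls outside a standard-deviation band. This is a simplified illustration of the general technique, not GitLab's actual implementation:

```python
from statistics import mean, stdev

def is_anomalous(value, seasonal_samples, n_sigma=3):
    """Flag `value` if it lies more than n_sigma standard deviations
    from the mean of samples observed at the same point in past seasons.

    seasonal_samples: e.g. request rates at this hour on the last
    four Mondays (hypothetical data source)."""
    mu = mean(seasonal_samples)
    sigma = stdev(seasonal_samples)
    return abs(value - mu) > n_sigma * sigma

# Normal Mondays saw roughly ~100 req/s at this hour:
history = [100, 104, 98, 102]
is_anomalous(120, history)  # well outside the band -> anomalous
is_anomalous(103, history)  # within normal variation -> fine
```

Comparing against the same seasonal slot is what makes this beat a naive static threshold: a quiet Sunday night and a busy Monday morning each get their own baseline.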
Evan Chan – Histograms at Scale
Evan wrapped the conference up by walking us through how histograms are stored in Prometheus, illustrating their immense storage requirements when the data is bucketed correctly. Promoting histograms to a first-class type in your metrics store can save tremendous amounts of storage (50x in his organization's use case) while increasing their resolution and utility.
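For context on why those buckets matter: a quantile is estimated from cumulative bucket counts by finding the bucket containing the target rank and linearly interpolating within it, which is the same idea behind Prometheus's `histogram_quantile`. A small sketch with made-up bucket data (more, finer buckets mean better estimates, hence the storage pressure he described):

```python
def quantile_from_buckets(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound.
    Finds the bucket containing rank q * total and linearly
    interpolates within it."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical latency buckets (seconds, cumulative request counts):
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
p50 = quantile_from_buckets(0.5, buckets)  # lands in the first bucket
p90 = quantile_from_buckets(0.9, buckets)  # lands in the second bucket
```

The interpolation is only as accurate as the bucket boundaries, which is exactly the resolution-versus-storage trade-off that makes first-class histogram support so valuable.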
Jason Dixon provided some closing remarks explaining the history of the conference and expressing gratitude to everyone involved in hosting us. I’m glad we were able to finally attend this conference and hope that it continues to thrive. The single-track format and intimate venue really does provide shared context for all attendees and allows one to develop new connections with a lot of great people.