Outages are inevitable. How we respond to them can make or break a company. In this post, we will talk about how a service catalog can make your incident response process more effective.
Track Team Ownership
When a company has just a handful of services, it can be relatively easy to figure out who to call when something breaks. But when companies are at the stage of having dozens of services to manage, figuring out who to page or reach out to can be a challenge. And when you’re being impacted by an incident, time is of the essence.
Within FireHydrant, you can create teams and tie each team to the services it owns. When services are impacted, this relationship lets you quickly find a subject matter expert. According to Carbonite, an incident costs SMB organizations at least $147 per minute, while Vertiv puts the cost for Enterprise organizations at up to $9,000 per minute. Some companies fall lower or higher on that spectrum, but either way, the sooner you can get the right person resolving the incident, the better.
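To make the ownership idea concrete, here is a minimal sketch of a service-to-team lookup. It is not FireHydrant's API or data model; the `Team` class, the `OWNERS` table, the service names, and the on-call handles are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    on_call: str  # hypothetical contact handle, not a real FireHydrant field


# Hypothetical catalog: service name -> owning team
OWNERS = {
    "auth-api": Team("Identity", "@identity-oncall"),
    "session-store": Team("Identity", "@identity-oncall"),
    "billing-worker": Team("Payments", "@payments-oncall"),
}


def teams_to_page(impacted_services):
    """Return the distinct owning teams for the impacted services, in first-seen order."""
    teams = []
    for service in impacted_services:
        team = OWNERS.get(service)  # services missing from the catalog are skipped
        if team is not None and team not in teams:
            teams.append(team)
    return teams


if __name__ == "__main__":
    for team in teams_to_page(["auth-api", "session-store", "billing-worker"]):
        print(f"Page {team.name} via {team.on_call}")
```

The point of the sketch is that once ownership is recorded, "who do we page?" becomes a lookup instead of a search through chat history.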
Identify Suspicious Code
We don’t believe that a service catalog should be limited to just microservices, so we allow you to relate services, environments, and functionalities to one another and tie them all to the teams that manage them.
We’ve found that change logs are critical to identifying suspicious code, which can be essential to incident resolution. When you integrate tools like GitHub or Kubernetes with FireHydrant, we can start to relate code changes to the various services in the catalog, too.
Tracking changes with something like the FireHydrant change log will help your team identify suspicious deployments faster than digging through logs or recent GitHub pull requests. By integrating your entire CI/CD pipeline into FireHydrant, you can see how a change propagated through each step of your pipeline, which gives you insight into the deployments that triggered unplanned downtime.
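As a rough illustration of why a change log speeds this up, the sketch below filters changes down to the ones that touched an impacted service shortly before the incident began. The `CHANGE_LOG` entries, the two-hour lookback window, and the function name are assumptions for the example, not FireHydrant behavior.

```python
from datetime import datetime, timedelta

# Hypothetical change log entries: (service, description, deployed_at)
CHANGE_LOG = [
    ("auth-api", "Rotate session signing key", datetime(2024, 1, 10, 14, 5)),
    ("auth-api", "Bump dependency versions", datetime(2024, 1, 9, 9, 0)),
    ("billing-worker", "Add retry to invoice job", datetime(2024, 1, 10, 14, 20)),
]


def suspicious_changes(impacted_services, incident_start, lookback=timedelta(hours=2)):
    """Changes to impacted services deployed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [
        entry
        for entry in CHANGE_LOG
        if entry[0] in impacted_services and window_start <= entry[2] <= incident_start
    ]
    return sorted(hits, key=lambda entry: entry[2], reverse=True)


if __name__ == "__main__":
    started = datetime(2024, 1, 10, 14, 30)
    for service, description, deployed_at in suspicious_changes({"auth-api"}, started):
        print(f"{deployed_at:%H:%M} {service}: {description}")
```

A time window is only one possible heuristic; a real setup might also weigh which environment a change reached or how far it got through the pipeline.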
If an incident is kicked off to say that “Logging In” is down, we can now suggest the services that power that end-user functionality and may be causing the outage, along with the code changes that may be responsible.
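Here is a small sketch of that functionality mapping, assuming a hand-written table from functionality to services and from service to team. Every name in it (`FUNCTIONALITY_SERVICES`, `SERVICE_TEAM`, `suggest`, the services, the teams) is hypothetical; it only illustrates the shape of the lookup.

```python
from dataclasses import dataclass, field

# Hypothetical mapping: end-user functionality (lowercased) -> services that power it
FUNCTIONALITY_SERVICES = {
    "logging in": ["auth-api", "session-store"],
    "checkout": ["cart-api", "billing-worker"],
}

# Hypothetical mapping: service -> owning team
SERVICE_TEAM = {
    "auth-api": "Identity",
    "session-store": "Platform",
    "cart-api": "Storefront",
    "billing-worker": "Payments",
}


@dataclass
class Suggestion:
    services: list = field(default_factory=list)
    teams: list = field(default_factory=list)


def suggest(functionality):
    """Resolve a reported functionality to candidate services and the teams that own them."""
    key = functionality.strip().casefold()  # tolerate "Logging In" vs "logging in"
    services = list(FUNCTIONALITY_SERVICES.get(key, []))
    teams = []
    for service in services:
        team = SERVICE_TEAM.get(service)
        if team is not None and team not in teams:
            teams.append(team)
    return Suggestion(services=services, teams=teams)


if __name__ == "__main__":
    print(suggest("Logging In"))
```

The person reporting the problem only needs to know what is broken from the customer's point of view; the mapping does the translation into services and owners.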
Enable Non-Engineers to Kick Off Incidents More Effectively
This idea is closely related to functionality mapping, and it touches on a topic mentioned in our last blog post: “Pitchfork alerting.” Frequently, alerting like this comes from Sales, Customer Success, or Support rather than from your alerting or monitoring providers. It becomes a challenge when there isn’t an effective way to rally the appropriate teams to respond to the incident efficiently.
When your team builds out thorough functionality mapping, you’re enabling the entire company to help with incident response. Now customer support can kick off an incident to alert you that “Logging In” is down, and they can rest assured that they’re reaching the correct subject matter expert with minimal effort.
Gain Valuable Insights Through Reporting
Many engineering teams keep Error Budgets top of mind when working through incidents and improving resilience, but finding the data isn’t always quick and easy.
Whenever you kick off an incident using FireHydrant, we automatically track which services are impacted (whether they’re microservices, functionalities, or environments). Tracking incidents in this way can give you insights into the services that are most brittle across your application, giving you easy reports to back up why you need to focus on the reliability of certain parts of your application.
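To show what such a report boils down to, the sketch below counts how many incidents touched each service. The `INCIDENTS` records and the `most_impacted` helper are invented for the example and do not reflect FireHydrant's reporting format.

```python
from collections import Counter

# Hypothetical incident records, each listing the services it impacted
INCIDENTS = [
    {"id": 1, "services": ["auth-api", "session-store"]},
    {"id": 2, "services": ["auth-api"]},
    {"id": 3, "services": ["billing-worker"]},
]


def most_impacted(incidents, top_n=3):
    """Count incidents per service (each service at most once per incident), most frequent first."""
    counts = Counter(
        service
        for incident in incidents
        for service in set(incident["services"])  # de-duplicate within one incident
    )
    return counts.most_common(top_n)


if __name__ == "__main__":
    for service, count in most_impacted(INCIDENTS):
        print(f"{service}: {count} incident(s)")
```

Even a simple count like this gives you something concrete to point to when arguing for reliability work on a specific service.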
Leveraging a service catalog is an excellent way to enable your team to work more efficiently during an incident and, ultimately, to reduce time to resolution.