In the previous post, we took a look at what observability is and how to build it into your application. Just to recap, observability is developing insights into the system based on external signs. For example, the car dashboard has a service engine light, low-pressure indicators, RPM for the engine, etc. All of this can quickly help us determine if the car is in a condition to drive. We don’t need to actually look at every individual part of the car every single day.
As with anything in our industry, software exists to serve business needs. We always need to be cognizant of that. Any initiative that doesn’t align with the business needs loses steam quickly, and rightly so. After all, we have a limited amount of resources to spend at any given time. The domain-driven design movement plays right into this. So, we need to understand how a more observable system can help the business.
Confidence in the system
When we hear people say their system works, what does that mean? Does it mean their smoke tests worked? Does it mean there are zero errors? Can it withstand Chaos Monkey? I highly recommend taking a look at Principles of Chaos Engineering to better understand the Chaos Monkey tool. That deserves a separate discussion in and of itself for some other time.
Sometimes a system may not be throwing any errors but may actually be doing something it is not supposed to. How do we catch that? Very often, this unexpected behavior is not captured in the logs or the monitoring tools. So, going by a “no news is good news” attitude, the team claims that the system is working. It is similar to walking into a messy room with the lights turned off: since I can’t see anything, I am conveniently going to assume everything is just great.
Having multiple levels of health checks, connectivity checks, and performance checks along with observing demo data points helps us provide a basis for the “it’s working” claim. It is not a claim anymore, since we have data to back it up. It becomes a fact. This can reduce the number of customer complaints related to code issues and help the business in building confidence in the system.
For example, on an e-commerce site, in addition to the typical telemetry around errors, availability, and response times, we can periodically create a dummy order and check the data around it to make sure everything is fine. If any part of the order process goes beyond the acceptable thresholds, we can raise the appropriate alarms.
This is how we can build confidence in our system with knowledge instead of guesses.
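The dummy-order check described above could be sketched roughly as follows. This is only an illustration: the step names, the threshold values, and the `find_breaches` helper are all hypothetical, and a real check would read the timings from your own telemetry rather than a hard-coded dictionary.

```python
# Hypothetical per-step latency limits in seconds. Real limits would come
# from baseline measurements of your own order pipeline.
THRESHOLDS = {"add_to_cart": 0.5, "checkout": 1.0, "payment": 2.0}


def find_breaches(step_durations, thresholds=THRESHOLDS):
    """Return (step, duration, limit) for each step that is too slow or missing."""
    breaches = []
    for step, limit in thresholds.items():
        duration = step_durations.get(step)
        if duration is None:
            # The dummy order never reported this step at all.
            breaches.append((step, None, limit))
        elif duration > limit:
            breaches.append((step, duration, limit))
    return breaches


# Example timings (assumed values) recorded for one dummy order.
sample = {"add_to_cart": 0.2, "checkout": 1.4, "payment": 1.1}

for step, duration, limit in find_breaches(sample):
    print(f"ALERT: {step} took {duration}s (limit {limit}s)")
```

Treating a missing step as a breach is a deliberate choice here: a dummy order that silently stops halfway is exactly the kind of failure that produces no errors in the logs.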
Higher confidence in the system paves the way for more advanced deployment models such as canary deployment and blue-green deployment. Both of these deployment models require some level of testing before the new features go completely live. If we can subject new nodes to production loads and observe how the system behaves with the new changes, we reduce the friction between the existing codebase and the new changes coming in. All of this means we can deploy new code more reliably and rapidly with minimal to no downtime, thus achieving true continuous deployment for the system.
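As a rough sketch of what observing a canary might look like, the snippet below compares the error rate of the new nodes against the existing ones. The function name, the `max_ratio` and `min_requests` defaults, and the 0.1% error-rate floor are assumptions made for illustration, not values from any particular deployment tool.

```python
def canary_is_healthy(baseline_errors, baseline_requests,
                      canary_errors, canary_requests,
                      max_ratio=1.5, min_requests=100):
    """Compare canary and baseline error rates.

    Returns True or False once the canary has served enough traffic,
    or None when there is too little data to decide.
    """
    if canary_requests < min_requests:
        return None  # not enough traffic on the new nodes yet

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Assumed floor of 0.1% so a zero-error baseline does not fail
    # every canary that sees a single error.
    allowed_rate = max(baseline_rate * max_ratio, 0.001)
    return canary_rate <= allowed_rate


# Baseline: 10 errors in 10,000 requests. Canary: 2 errors in 1,000 requests.
print(canary_is_healthy(10, 10_000, 2, 1_000))
```

Returning `None` for low traffic keeps "we don't know yet" separate from "it's healthy", which is the same distinction the post draws between a claim and a fact.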
Understand changes that affect business KPIs (key performance indicators)
KPIs tell us how the business is doing, as they point to key health issues that need to be addressed. Some examples of KPIs could be the number of active customers, cost per customer, customer attrition, etc. Let’s think of customer attrition from the perspective of a social media site, Twitter. What if it took more than 30-40 seconds to make a tweet live? That would significantly impact how many tweets can be generated, affect Twitter’s popularity, and eventually cause customer attrition. In this case, we can see a relation between latency and customer attrition. Understanding this, we can see why Twitter would have moved from Rails to Scala and the JVM. This is not a trivial undertaking, and the company’s existence can depend on it. Could they have done it without gathering performance metrics? Could anyone do this without having a before and after picture?
Observability brings the backend problems to the forefront by making them measurable, which is beneficial because those problems can be absolutely detrimental to the business. On the other hand, fixing those issues proactively can drive the business forward.
We saw a relation between technical metrics and KPIs. This means if I want to improve my KPIs, I can target my technical metrics because I can tie them to a piece of code. So when I hear the system is slow, I can quickly create an understanding of what that means based on the metrics and start the analysis.
Let’s take a look at an example. If my response time for a request averages 100 milliseconds over a week but jumps to 2-3 seconds after I push a new feature, then I know the system’s performance has been affected significantly. It might not be a rollback-worthy deployment, but it is certainly worth a look to ensure it doesn’t get any worse. I am also likely to know the cause based on the timeline and other useful observability metrics and logs. What does that do?
I can now clearly see what I need to achieve. If I can throw hardware at something, I can try that and I have ways to test that. If it requires a code change, I can push those changes through the same rigor. I know that I am not done until I bring the response time down to the acceptable range. Without realistically knowing that range, this would have been hard to achieve.
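To make the before-and-after comparison concrete, here is a minimal sketch that checks whether the average response time after a deployment is still within an acceptable range of the baseline. The function name, the sample numbers, and the `tolerance` multiplier are assumptions. In practice you would likely compare percentiles pulled from your metrics store rather than a plain average.

```python
from statistics import mean


def response_time_regression(baseline_ms, current_ms, tolerance=2.0):
    """Return (regressed, baseline_avg, current_avg).

    `regressed` is True when the current average exceeds the baseline
    average by more than the assumed `tolerance` multiplier.
    """
    baseline_avg = mean(baseline_ms)
    current_avg = mean(current_ms)
    return current_avg > baseline_avg * tolerance, baseline_avg, current_avg


# Assumed samples: roughly 100 ms before the release, 2-3 s after it.
before = [95, 100, 105]
after = [2000, 2500, 3000]

regressed, baseline_avg, current_avg = response_time_regression(before, after)
print(f"baseline={baseline_avg} ms, current={current_avg} ms, regressed={regressed}")
```

The same check can be rerun after each hardware or code change, which gives a clear signal for when the work is done.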
Faster software reflects positively on sales
With a clear understanding of what fast or slow means for the application, let’s take a look at how that can affect the bottom line of the business.
On a social media site, how expensive is a complete shutdown of the site? How many users does the site lose to regular sluggish performance alone? It is very easy to get a bad reputation, and the downward spiral begins there. Would you buy claims of performance tuning from a software consulting firm if their own site often suffers from serious lag, random crashes, etc.? I may see an item I am interested in on a vendor site but end up on Amazon anyway because they made it incredibly easy and fast to search for items and purchase them. They couldn’t have arrived at a premier experience without gathering tons of metrics about usage, page performance, etc. What would happen if Amazon experienced just a 10% slowdown in their checkout process? How many carts would get abandoned? All of this has a direct impact on the business and its viability. As software engineers, it is important that we understand the impact of these things on the business.
Here’s a result from Pinterest’s frontend performance project in March 2017: a 40% drop in perceived wait time and a 15% increase in SEO traffic yielded a 15% increase in signups. They go into the details in this Pinterest Engineering blog post.
Both of these examples were originally mentioned in the Practical Monitoring book by Mike Julian.
Business priorities determination
It is very common for development teams to be in discussion with product teams about the next set of features. We can always make these conversations more data-driven if we have the usage statistics. In our e-commerce example, if we find that most users rely on the site’s search functionality for discovery instead of the navigation system, we can use this insight to make the search faster and more useful. We can then reduce the priority of everything related to the navigation system.
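As a small illustration of the usage statistics mentioned above, the sketch below computes each feature's share of discovery events. The event names and the `feature_share` helper are hypothetical. Real data would come from whatever analytics or logging pipeline the site already has.

```python
from collections import Counter


def feature_share(events):
    """Return each feature's fraction of the recorded usage events."""
    counts = Counter(events)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


# Assumed sample: one entry per discovery action taken by a user.
events = ["search", "search", "navigation", "search"]
print(feature_share(events))
```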
Justification for refactoring
Everyone cringes when they look at their old code. Engineers want to immediately start refactoring. The business doesn’t see any value in it, since nothing seems to change from the end user’s perspective. The maintenance argument works only with folks who have some development experience. The friction creeps in. How can we justify time for this activity from the business perspective? We can find a middle ground in the observability metrics. For example, we can prioritize activities that are going to improve the performance of the login process over other types of changes.
Generate a complete picture for A/B testing
The observable nature of the system can contribute to A/B test experiments effectively. Technical data metrics can be tied to usage to generate information that can help understand the stress points in the system. Fortunately, there are tons of tools to conduct these experiments. Optimizely is one such tool I have seen being used effectively.
Provide steps towards auto-healing
Observability is knowing the speed of your car by looking at the speedometer and not the spinning wheel. It won’t necessarily fix problems, but it will provide good insight into them, which could prove crucial in resolving them. Auto-healing can be hard to generalize because it changes per context, per architecture, per tech stack, etc. Coming up with a truly auto-healing system can be a daunting task, but the path to that destination goes through observability.
We have seen how observability helps you build better software to drive value for users and the business. When we understand a system’s behavior we can operate better. We can deploy faster, build greater confidence in our system, understand KPIs better, and drive sales.