Klever’s Continuous Software Monitoring to Improve Service Quality, by Site Reliability Engineer Kadu Barral
How Klever keeps its eyes constantly open to ensure high performance and availability of services, and how this monitoring helps not just to solve problems but also to enable continuous performance improvement
You've probably heard about DevOps practices and how they help to continuously deliver better products and services. If not, I invite you to read this article, which explains them in a little more depth.
Today we’ll be talking about Monitoring, which is an important part of the application lifecycle used in DevOps.
For years, monitoring was thought of only as a way to discover when problems happen. It had one function: to alert the responsible team so they could fix the issue. Teams usually worked as firefighters, and afterwards nobody talked about it.
We don’t think that is a Klever way to work. Monitoring gives us a lot of insight, and using it only during troubleshooting is a waste of precious data and information.
In this article series our goal is not (at least for now 😉) to write about how we implement Site Reliability Engineering practices. Instead, we want to show you two performance optimizations that the Klever DevOps team recently delivered to the Klever ecosystem, and why they were only possible with Observability.
Globally Distributed Services
One method that we use to monitor the availability of our services is to test each one of them with synthetic tests.
Synthetic monitoring is a technique that emulates user behavior to verify that the monitored systems are working.
User action flows are emulated by other software, usually scripts, that run repeatedly at specified intervals to measure functionality, availability and response time.
In short, we have “robots” in different parts of the world emulating our application's functionality. If any bot can’t complete its task, an alert is sent to our team, who then take proactive measures.
This works great for discovering problems as they happen, and it also gives us the response time of our services in every location. Better still, the monitoring tool splits out the time spent in each part of an HTTP transaction.
With this data we could observe that, for some locations, requests were taking too long in three phases: connect, TLS and transfer.
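To make the phase breakdown concrete, here is a minimal sketch of a synthetic probe written with the Python standard library only. It is not the monitoring tool Klever uses. The names `PhaseTimings`, `probe` and `slow_phases`, and the budget values in the usage note, are assumptions made for illustration. In this sketch "connect" includes the DNS lookup, and "transfer" covers sending the request, waiting for the server and downloading the response.

```python
import socket
import ssl
import time
from dataclasses import dataclass


@dataclass
class PhaseTimings:
    """Seconds spent in each phase of one HTTPS request."""
    connect: float
    tls: float
    transfer: float

    @property
    def total(self) -> float:
        return self.connect + self.tls + self.transfer


def probe(host: str, path: str = "/", port: int = 443, timeout: float = 10.0) -> PhaseTimings:
    """Run one synthetic HTTPS GET and time the connect, TLS and transfer phases."""
    context = ssl.create_default_context()

    t0 = time.perf_counter()
    # DNS lookup + TCP handshake
    raw = socket.create_connection((host, port), timeout=timeout)
    t1 = time.perf_counter()
    try:
        # TLS handshake happens inside wrap_socket
        tls_sock = context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
    t2 = time.perf_counter()
    with tls_sock:
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        tls_sock.sendall(request.encode("ascii"))
        # Read until the server closes the connection
        while tls_sock.recv(65536):
            pass
    t3 = time.perf_counter()
    return PhaseTimings(connect=t1 - t0, tls=t2 - t1, transfer=t3 - t2)


def slow_phases(timings: PhaseTimings, budgets: dict[str, float]) -> list[str]:
    """Return the names of phases that exceeded their latency budget (in seconds)."""
    return [name for name, limit in budgets.items() if getattr(timings, name) > limit]
```

A bot in each location could call something like `slow_phases(probe("example.com"), {"connect": 0.1, "tls": 0.2, "transfer": 0.5})` on a schedule and raise an alert when the list is not empty. The host and the budgets here are placeholders, not values from our setup.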
Solution to Global Latency Problem
Reducing connection and transfer time is challenging. The solution is to “move” the backend applications as close as possible to clients to cut network latency.
As Klever has a worldwide user base with clients in every part of the globe, replicating our services in every region would be expensive and hard to maintain. Because of this, we decided to migrate from our traditional load balancer to Google Cloud Load Balancing (GCLB), which provides cross-region load balancing with points of presence (PoPs) around the world.
Once a connection is made at the nearest PoP, all traffic to our Kubernetes backend and databases travels over Google’s internal network. We also moved TLS termination from ingress-nginx to the edge load balancer to reduce TLS time. With this move, the average response time was cut in half, as shown in the chart below.
Swap API Response Time
This optimization of Klever Swap API response time is inherently different from the first example since it doesn’t involve any infrastructure change.
Our monitoring stack helped us identify the bottlenecks, and together with the Swap squad we made changes to the source code that resulted in a 60% improvement in response time for the most called API method - the one responsible for listing Swap’s active keypairs.
With a Klever move and decreased server CPU consumption, the speed of all Swaps in Klever significantly increased.
In distributed software architectures such as those Klever uses to serve our users globally, it’s sometimes difficult to find exactly which step in the distributed computing process is the root cause of a detected slowdown.
For this reason, distributed tracing is critical for understanding how a request moves across multiple services, packages, and infrastructure components.
In this specific scenario, distributed tracing showed us that the API was being called constantly, even when no Swap transaction was taking place, and that led us to a solution.
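The kind of finding described above can be illustrated with a toy tracer. This is only a sketch under stated assumptions: it runs in a single process, whereas a real distributed tracing system propagates trace context across services, and it is not the tracing stack Klever uses. The `Tracer` and `Span` classes and the span names (`screen:home`, `swap_api.list_active_pairs`) are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Span:
    """One timed operation, linked to the operation that triggered it."""
    name: str
    parent: "Span | None"
    start: float
    end: float = 0.0

    @property
    def duration(self) -> float:
        return self.end - self.start


class Tracer:
    """Toy in-process tracer that records nested spans."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._stack: list[Span] = []
        self.finished: list[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(name, parent, self._clock())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = self._clock()
            self._stack.pop()
            self.finished.append(current)

    def summary(self) -> dict[str, tuple[int, float]]:
        """Map each span name to (call count, total seconds)."""
        totals = defaultdict(lambda: [0, 0.0])
        for s in self.finished:
            totals[s.name][0] += 1
            totals[s.name][1] += s.duration
        return {name: (count, secs) for name, (count, secs) in totals.items()}


if __name__ == "__main__":
    tracer = Tracer()
    # Hypothetical app screens that each end up calling the same API method.
    for screen in ["home", "wallet", "swap"]:
        with tracer.span(f"screen:{screen}"):
            with tracer.span("swap_api.list_active_pairs"):
                time.sleep(0.01)  # stand-in for the real API call
    for name, (count, secs) in tracer.summary().items():
        print(f"{name}: {count} call(s), {secs:.3f}s total")
```

Grouping spans by name and by parent is what makes a pattern like "this method is called from screens that have nothing to do with a Swap" visible at a glance.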
Optimizing Swap API Response Time
This is a typical case of a monolith, where one service is responsible for more than one objective. Our Swap functionality is fantastic, easy to use and was one of our first services in the app. At that time, the functionality to return coin prices was built inside the Swap API.
With our product evolution, user growth and the increasing number of coins available for Swap, we realized that it was necessary to split this functionality out into another service: essentially, to separate and distribute the prices and swap services to make the entire Swap feature faster, more reliable and more efficient.
After this change, other screens in the application can look up current prices without impacting overall Swap performance, and we are able to scale each service separately depending on demand.
In this article we presented two examples of how continuously monitoring our services helps us to bring high quality and constantly evolving products to our community of Klever users.
Of course, monitoring entails much more than what we’ve covered today, and the basics are also very important:
Ensuring that all Klever services are up and running is the heart that keeps our blood flowing.
A good monitoring strategy reduces our MTTA and MTTR and lets our 24x7 on-call globally distributed team always remain aware of possible problems and scale better for the future through proactive measures. Our CTO Bruno talked a little bit about it in his article here.
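For readers who want to track these two metrics themselves, here is a minimal sketch of how they can be computed from incident timestamps. It assumes MTTA means mean time to acknowledge and MTTR means mean time to resolve, both measured from the moment the alert fires. Definitions vary between teams, and the `Incident` fields and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    triggered: datetime     # alert fired
    acknowledged: datetime  # on-call engineer acknowledged the alert
    resolved: datetime      # service restored


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time from alert to acknowledgement."""
    return mean_delta([i.acknowledged - i.triggered for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from alert to resolution."""
    return mean_delta([i.resolved - i.triggered for i in incidents])


# Hypothetical sample data, for illustration only.
incidents = [
    Incident(datetime(2022, 1, 1, 10, 0), datetime(2022, 1, 1, 10, 2), datetime(2022, 1, 1, 10, 30)),
    Incident(datetime(2022, 1, 2, 12, 0), datetime(2022, 1, 2, 12, 4), datetime(2022, 1, 2, 12, 50)),
]
```

With this sample data, `mtta(incidents)` is 3 minutes and `mttr(incidents)` is 40 minutes.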
Keeping track of microservices is a challenge in distributed software architectures, and monitoring is a non-stop cycle of progress for the Klever DevOps team. But that is a talk for another time!
Kadu Relvas Barral
Klever’s Site Reliability Engineer
Kadu has over 15 years of professional experience in IT and has spent the last 10 years working in insurance, telecommunications and financial companies, trying to find what was broken in their systems before others figured it out. A marathon runner in his free time, he likes to run and to keep applications running.