Observability is a broad umbrella that covers many moving parts, and a service mesh can cover a lot of that ground without us having to write a single line of code. The microservices community is abuzz about service meshes and observability. This Trend Report explores how a service mesh and a good observability stack can help us overcome the most pressing issues we face when working with microservices.
Common Microservices Challenges
Adopting microservices is hard, and so is troubleshooting them and keeping them running smoothly. Day 2 operations are a major overhead for microservices. Let’s look at some of the issues that make microservices challenging to operate.
Debugging errors in distributed systems is a nightmare
Errors are difficult to debug due to the distributed nature of microservices. A monolith can be debugged by simply reading through its logs and stack traces. It’s not so simple with microservices: when an error occurs, the logs of a microservice may not reveal the actual problem and may instead point to a failed request or response from a dependent service.
This means we may need to trace a request across the entire network to determine which microservice is the root cause of the problem, which can be a time-consuming task.
It’s not easy to identify bottlenecks within the system
Performance bottlenecks in a monolith are easy to identify by profiling the application. Profiling is often enough to reveal which methods in the codebase take the longest, letting you focus your optimization efforts on a small section of code. In a microservices architecture, however, it can be difficult to identify which microservice is slowing down the whole system.
Each microservice may appear to perform well when tested in isolation. In real-world scenarios, however, each microservice carries a different load, and there may be core microservices that many others rely on. These scenarios are difficult to reproduce in an isolated testing environment.
It is difficult to maintain a microservice dependency tree
Microservices are all about the speed at which new software can be released. However, a release can have many downstream effects, and several events can break functionality:
- Releasing a dependent microservice before its upstream dependency
- Removing an API that a legacy system still depends on
- Releasing a microservice that breaks API compatibility
If there isn’t a clear dependency tree between microservices, these events are difficult to avoid. With a dependency tree, it is easier to notify the right teams and plan releases better.
Observability: The solution to these microservices problems
Observability can solve all of the problems mentioned above. As Jay Livens explains, observability is the ability to determine the current state of a system based on the data it generates, such as logs and metrics. An observability stack monitors the health of our application and raises alerts when failure conditions occur, giving us the information we need to debug problems whenever they arise.
Figure 1 – Components of an observability stack, with open-source examples
Any observability stack will contain these components:
- Metrics/logs source – an agent or library that generates the data
- Log shipper – an agent that transports the data to storage; it is often embedded in the metrics source
- Collector/store – a stateful service that stores the generated data
- Dashboards – fancy charts that make the data easy to understand and digest
- Alert manager – the service that triggers notifications
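To make the flow through these components concrete, here is a minimal sketch of the pipeline in Python. All of the class names (`MetricSource`, `Store`, `AlertManager`) are illustrative stand-ins, not part of any real observability tool:

```python
import time

# A toy observability pipeline: source -> shipper -> store -> alerts.
# Names and the alert rule are made up for illustration only.

class MetricSource:
    """Generates data points, e.g., an app recording request errors."""
    def __init__(self):
        self.buffer = []

    def record(self, name, value):
        self.buffer.append({"name": name, "value": value, "ts": time.time()})

class Store:
    """Stateful collector that keeps the shipped data points."""
    def __init__(self):
        self.points = []

    def ingest(self, points):
        self.points.extend(points)

class AlertManager:
    """Triggers a notification when a failure condition is met."""
    def __init__(self, threshold):
        self.threshold = threshold

    def evaluate(self, store):
        errors = [p for p in store.points if p["name"] == "http_errors"]
        return len(errors) >= self.threshold  # True -> fire notification

# Wire the components together.
source, store, alerts = MetricSource(), Store(), AlertManager(threshold=3)
for _ in range(3):
    source.record("http_errors", 1)
store.ingest(source.buffer)    # the "log shipper" step
print(alerts.evaluate(store))  # prints True
```

In a real stack, each of these boxes is a separate battle-tested system (e.g., Prometheus as the collector/store, Alertmanager for notifications); the value of the sketch is only to show how data flows between them.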
There are many powerful open-source tools that simplify the process of building an observability stack.
Capturing network telemetry is an important aspect of observability, and it can help us solve many of the problems discussed above. Generating this telemetry data has traditionally fallen on developers, which is tedious, costly, and error-prone. On top of that, developers are also expected to implement security features and make communication resilient against failures.
Ideally, developers should focus on application code alone; the complexity of microservices networking should be pushed down to the platform. This can be achieved with a service mesh such as Istio or Linkerd.
A service mesh is an architectural pattern for controlling and monitoring the networking and communication between microservices.
A service mesh has two major components: the control plane and the data plane. The data plane handles all network traffic generated by our microservices. To achieve this, the service mesh injects a proxy sidecar alongside each microservice. This sidecar, usually Envoy, transparently intercepts all traffic flowing through the service. The control plane is responsible for configuring the proxies; it never receives any application traffic.
Figure 2 shows how the service mesh architecture abstracts away all of the complexity we discussed earlier. It is possible to adopt a service mesh without writing a single line of code. A service mesh helps us manage multiple aspects of a microservices-based architecture. Some of its notable benefits include:
- Getting a complete view of traffic flow
- Controlling network traffic
- Securing microservices communication
Get a Complete View of Traffic Flow
Figure 3 shows App A making a request to App B. The Envoy proxies sitting next to each app intercept the request, giving them full visibility into the traffic flowing through these microservices. The proxies can inspect the traffic and gather information such as the number of requests made and the response code of each request.
In other words, a service mesh can answer questions such as:
- Which services is a given microservice talking to?
- What is the microservice’s request throughput?
- What is the error rate for each API?
Control Network Traffic
A service mesh is more than a passive observer: it can actively shape network traffic. The Envoy sidecar proxies are HTTP-aware, and since all requests flow through them, they can be configured to provide useful features such as:
- Automatic retries – the ability to replay a request when a transient network error occurs
- Circuit breaking – ejecting an unhealthy instance of an upstream microservice that has stopped responding
- Request rewriting – the ability to modify request URLs and set headers when certain conditions are met
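To illustrate the first two features, here is a rough Python sketch of retry and circuit-breaking logic. In a real mesh this behavior lives inside the Envoy sidecar, configured declaratively rather than coded by hand; the class, function names, and policy values below are made up for illustration:

```python
# A toy model of a sidecar's retry + circuit-breaking behavior.
# Real meshes implement this in the proxy, not in application code.

class CircuitBreaker:
    """Ejects an upstream after too many consecutive failures."""
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1

def send_with_retries(request_fn, breaker, attempts=3):
    """Replay a failed request, unless the circuit is open."""
    if breaker.open:
        raise RuntimeError("circuit open: upstream ejected")
    for attempt in range(attempts):
        try:
            response = request_fn()
            breaker.record(ok=True)
            return response
        except ConnectionError:
            breaker.record(ok=False)
            if breaker.open or attempt == attempts - 1:
                raise

# Usage: a flaky upstream that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "200 OK"

breaker = CircuitBreaker(max_failures=5)
print(send_with_retries(flaky, breaker))  # prints 200 OK after 2 retries
```

The key design point mirrored here is that retries and circuit breaking interact: retrying forever against a dead upstream makes outages worse, so the breaker caps how long the mesh keeps trying.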
It doesn’t stop there. Proxies can also split traffic according to weights. For example, we can configure the proxy to send 95% of traffic to the stable version of a service and the remaining 5% to a canary. This simplifies the release process and powers advanced practices such as canary deployments.
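The weighted routing described above amounts to a weighted random choice per request. Here is a minimal sketch, assuming a simple list of (backend, weight) pairs; the backend names are illustrative:

```python
import random

# A sketch of weight-based traffic splitting as a proxy might do it.
# The 95/5 split mirrors the stable/canary example in the text.

def pick_backend(weights):
    """Choose a backend according to its weight (weights sum to 100)."""
    point = random.random() * 100
    cumulative = 0
    for backend, weight in weights:
        cumulative += weight
        if point < cumulative:
            return backend
    return weights[-1][0]  # guard against float rounding

weights = [("reviews-stable", 95), ("reviews-canary", 5)]

# Route a batch of requests and count where they land.
random.seed(42)
counts = {"reviews-stable": 0, "reviews-canary": 0}
for _ in range(10_000):
    counts[pick_backend(weights)] += 1
print(counts)  # roughly 9500 stable / 500 canary
```

Shifting the weights gradually (5% → 25% → 50% → 100%) is exactly how a progressive canary rollout is driven, with the golden-signal metrics discussed later deciding whether to continue or roll back.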
Secure Microservices Communication
Security is another great benefit of a service mesh. Sidecar proxies can be configured to use mutual TLS (mTLS), which automatically encrypts all network traffic in transit. The service mesh control plane automates the rotation and management of the certificates required for mTLS.
A service mesh can also help with access control by only allowing certain services to talk to each other. This eliminates a variety of security flaws, such as man-in-the-middle attacks.
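The access-control idea reduces to a deny-by-default policy over (source, destination) pairs. The sketch below models that policy in Python; in a real mesh this is enforced by the sidecars from declarative policy objects, and the service names here are invented for illustration:

```python
# A toy model of service-to-service access control: only explicitly
# allowed (source, destination) pairs may communicate.

ALLOWED = {
    ("frontend", "orders"),
    ("orders", "payments"),
}

def authorize(source, destination):
    """Return True only for explicitly allowed service pairs."""
    return (source, destination) in ALLOWED

print(authorize("frontend", "orders"))    # True
print(authorize("frontend", "payments"))  # False: denied by default
```

The deny-by-default stance is what blocks a compromised or misbehaving service from calling arbitrary internal endpoints.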
Figure 5 – A service mesh can protect network traffic
How Does a Service Mesh Help With Observability?
We have just seen how a service mesh captures telemetry data. Let’s look a little deeper at the kinds of uses this data can power.
Distributed Tracing
We have already discussed the difficulty of debugging microservices. Distributed tracing, which records the entire lifecycle of a request as it travels through the system, helps solve this problem. With a single trace graph, it is possible to pinpoint the root cause of a problem.
Service meshes collect network trace information and send it to tools such as Jaeger. The only change needed in your application code is to forward a few HTTP headers on outbound requests. That’s it!
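The header forwarding looks roughly like this. The header names follow Zipkin’s B3 propagation format, which meshes such as Istio use for tracing; the plain-dict requests are stand-ins for whatever HTTP framework your service uses:

```python
# Sketch: copy the trace context from an incoming request onto an
# outbound request so the mesh can stitch spans into a single trace.

TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
]

def propagate(incoming_headers):
    """Build outbound headers carrying the incoming trace context."""
    return {h: incoming_headers[h] for h in TRACE_HEADERS
            if h in incoming_headers}

incoming = {
    "x-request-id": "abc-123",
    "x-b3-traceid": "80f198ee56343ba8",
    "x-b3-spanid": "e457b5a2e4d86bd1",
    "content-type": "application/json",  # not a trace header
}
outbound = propagate(incoming)
print(sorted(outbound))  # the three trace headers; content-type dropped
```

Without this propagation step, each hop would start a fresh trace and the per-request graph would fall apart, which is why it is the one piece the mesh cannot do for you.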
Traffic Flow Metrics
A service mesh can collect three of the four golden signals needed to monitor a service’s health:
- Request throughput – the number of requests each microservice is serving
- Response errors – the percentage of unsuccessful requests
- Response times – how long a microservice takes to respond; this is collected as a histogram, from which latency percentiles can be extracted
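As a concrete illustration, the three signals can be derived from raw per-request records like so. The record format, time window, and numbers are invented for the example; a mesh’s metrics pipeline computes the same aggregates at scale:

```python
# Derive the three golden signals from (status_code, latency_ms)
# records for one microservice over a 10-second window.

records = [
    (200, 12), (200, 15), (500, 9), (200, 110), (200, 14),
    (200, 18), (503, 7), (200, 16), (200, 13), (200, 21),
]
window_seconds = 10

throughput = len(records) / window_seconds          # requests per second
error_rate = sum(1 for s, _ in records if s >= 500) / len(records)

# Latency percentile via the nearest-rank method on the sorted values.
latencies = sorted(l for _, l in records)
p90 = latencies[int(0.9 * (len(latencies) - 1))]

print(throughput, error_rate, p90)  # 1.0 0.2 21
```

Note how the 110 ms outlier barely moves the p90 but would dominate an average, which is why latency is tracked as a histogram of percentiles rather than a mean.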
Although a service mesh can collect many other metrics, these are the most important. They can be used for many different purposes, including:
- Scaling based on advanced parameters such as request throughput
- Enabling advanced traffic control features such as circuit breaking and rate limiting
- Automating canary deployments and A/B testing
By combining traffic flow metrics with trace data, a service mesh can automatically build a network topology. This, if you ask me, is a lifesaver. The network topology lets us see the entire microservice dependency tree at a glance, and it can also show the health of the network in our cluster. This is a great way to identify bottlenecks in your application.
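Building that topology amounts to aggregating observed calls into a graph, one (caller → callee) edge per proxied request. A minimal sketch, with invented service names:

```python
from collections import defaultdict

# Aggregate observed calls into a caller -> {callees} dependency graph,
# the same structure a mesh's topology view is built from.

observed_calls = [
    ("frontend", "orders"), ("frontend", "catalog"),
    ("orders", "payments"), ("orders", "catalog"),
]

def build_topology(calls):
    """Aggregate observed calls into a caller -> {callees} graph."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
    return graph

def dependents_of(graph, service):
    """Which services would a release of `service` affect?"""
    return {caller for caller, callees in graph.items()
            if service in callees}

topology = build_topology(observed_calls)
print(dependents_of(topology, "catalog"))  # frontend and orders depend on it
```

A query like `dependents_of` is exactly what makes the release-planning problem from the start of this report tractable: before shipping a breaking change to a service, you can see every team that needs to be notified.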
Observability is a broad umbrella that covers many moving parts, and a service mesh is a great tool that covers a lot of that ground without requiring us to write any code. In short, a service mesh helps us by:
- Generating distributed tracing data to simplify debugging
- Serving as a source of critical metrics such as the golden signals for monitoring microservices
- Establishing a network topology