Are you working in a microservice or serverless environment? Is debugging your application a nightmare? Do you think that an APM service will cost you thousands of dollars? If you are in any (or all) of these situations, I think you will find this blog post handy!
A couple of months back, at Sipios, we started a new project to help French companies face the economic consequences of the Coronavirus crisis. French companies could apply for a loan provided by the French “regions”, as long as the total amount requested by all companies remained under a fixed limit.
Traditionally, applying for a loan is really painful and involves lots of paperwork. In this very hard time for companies, we had to create a seamless application that any customer could use and enjoy. That is why performance was important: we did not want to create any additional frustration for our users.
At first, everything went well and the code ran fast. But after a couple of weeks, we started to see latencies of several seconds when computing the total amount requested by the companies. It even started to cause errors because of timeouts.
Looking at our code, it was not doing anything very complex. I have reproduced the situation with the following example:
public Long getAmountSpentByCustomerInCountry(Country country) {
    List<Customer> customers = customerRepository.findByCountry(country);
    return customers
        .stream()
        .map(customer -> customer.getTransactions()
            .stream()
            .map(Transaction::getAmountInCents)
            .reduce(0L, Long::sum))
        .reduce(0L, Long::sum);
}
The implementation seems nice and clean, right? So why is it taking so long?
As we had implemented an APM (Application Performance Management) that traces our requests, we were able to see instantly what was going on under the hood:
Query before optimization:
We can see that we were running hundreds of queries against the database. Even if each of them is really small, together they cause overhead and have a big impact on performance. This is a typical N+1 queries problem. For more information on N+1 queries and how to eliminate them, I invite you to read Yann Briançon's great article: https://www.sipios.com/blog-tech/eliminate-hibernate-n-plus-1-queries.
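For context, here is a minimal sketch of what the entities behind this code could look like (the exact mapping and field names in our project may differ):

@Entity
public class Customer {

    @Id
    private String id;

    @Enumerated(EnumType.STRING)
    private Country country;

    // @OneToMany associations are lazy by default: the transactions of each customer
    // are loaded with a separate query the first time they are accessed, which is
    // exactly what produces the hundreds of queries seen above.
    @OneToMany(mappedBy = "customer")
    private List<Transaction> transactions;

    public List<Transaction> getTransactions() {
        return transactions;
    }
}

@Entity
public class Transaction {

    @Id
    private String id;

    @ManyToOne
    private Customer customer;

    private Long amountInCents;

    public Long getAmountInCents() {
        return amountInCents;
    }
}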
Now that we know what is going on, it is easy to improve our performance by combining our requests into a single one:
@Repository
public interface CustomerRepository extends JpaRepository<Customer, String> {

    @Query("SELECT sum(t.amountInCents) FROM Transaction t JOIN t.customer c WHERE c.country = :country")
    Long getTotalAmount(@Param("country") Country country);
}
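With this aggregated query, the service method from the beginning of the article can then be reduced to a simple delegation, along these lines:

public Long getAmountSpentByCustomerInCountry(Country country) {
    // A single aggregated query replaces the hundreds of small ones.
    return customerRepository.getTotalAmount(country);
}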
And if we look at our performance, we can see a 133-fold reduction in the time of our request (our 23-second request is down to 0.165 seconds).
Query after optimization:
Now that we know why it is incredibly helpful to have an APM solution, let’s see how it can be implemented!
APM solutions are often very expensive. Yet, with Elastic APM, you have a cheap but effective solution at hand. It is open-source, so you can host your Elastic infrastructure on your own machines and be compliant with your data management policy.
To collect and analyze your traces, the Elastic APM solution is made up of a few components: APM agents that instrument your applications, an APM Server that collects the data they send, Elasticsearch to store it, and Kibana to visualize it.
Let’s see how to implement this kind of solution with two Spring microservices!
You can find a configuration to launch the Elastic Infrastructure with docker-compose on the following GitHub repository: https://github.com/clementdessoude/elastic-apm-demo.
Here is the code snippet:
version: '2.2'
services:
  apm-server:
    image: docker.elastic.co/apm/apm-server:7.7.0
    depends_on:
      elasticsearch:
        condition: service_healthy
      kibana:
        condition: service_healthy
    cap_add: ['CHOWN', 'DAC_OVERRIDE', 'SETGID', 'SETUID']
    cap_drop: ['ALL']
    ports:
      - 8200:8200
    networks:
      - elastic
    command: >
      apm-server -e
        -E apm-server.rum.enabled=true
        -E setup.kibana.host=kibana:5601
        -E setup.template.settings.index.number_of_replicas=0
        -E apm-server.kibana.enabled=true
        -E apm-server.kibana.host=kibana:5601
        -E output.elasticsearch.hosts=["elasticsearch:9200"]
    healthcheck:
      interval: 10s
      retries: 12
      test: curl --write-out 'HTTP %{http_code}' --fail --silent --output /dev/null http://localhost:8200/
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.0
    environment:
      - bootstrap.memory_lock=true
      - cluster.name=docker-cluster
      - cluster.routing.allocation.disk.threshold_enabled=false
      - discovery.type=single-node
      - ES_JAVA_OPTS=-XX:UseAVX=2 -Xms1g -Xmx1g
    ulimits:
      memlock:
        hard: -1
        soft: -1
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic
    healthcheck:
      interval: 20s
      retries: 10
      test: curl -s http://localhost:9200/_cluster/health | grep -vq '"status":"red"'
  kibana:
    image: docker.elastic.co/kibana/kibana:7.7.0
    depends_on:
      elasticsearch:
        condition: service_healthy
    environment:
      ELASTICSEARCH_URL: http://elasticsearch:9200
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    ports:
      - 5601:5601
    networks:
      - elastic
    healthcheck:
      interval: 10s
      retries: 20
      test: curl --write-out 'HTTP %{http_code}' --fail --silent --output /dev/null http://localhost:5601/api/status
volumes:
  esdata:
    driver: local
networks:
  elastic:
    driver: bridge
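The docker-compose file above only starts the Elastic infrastructure. Each Spring microservice still needs to run with the Elastic APM Java agent so that its transactions are sent to the APM server, either with the -javaagent JVM flag or programmatically. Here is a minimal sketch of the programmatic approach, using the apm-agent-attach dependency (the class name is just an example):

import co.elastic.apm.attach.ElasticApmAttacher;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class CustomerServiceApplication {

    public static void main(String[] args) {
        // Attaches the Elastic APM Java agent to the current JVM before Spring starts.
        // The agent configuration (service_name, server_urls, ...) can be provided in an
        // elasticapm.properties file on the classpath or through environment variables.
        ElasticApmAttacher.attach();

        SpringApplication.run(CustomerServiceApplication.class, args);
    }
}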
If a process spans several microservices, you will want a single trace covering the whole process, with spans on the different microservices involved in it.
To propagate the trace between the different microservices, a header (called traceparent) is added to the requests made between our components. This header is interpreted by the Elastic APM agent to rebuild the history of the process and link the spans of this process to a single trace.
Elastic follows the W3C Trace Context recommendation to do so.
As this header is not added by default by every library, you will often have to add it yourself. The following example shows how I configured it with Spring WebFlux. Even if this example is specific to Java, the logic is the same in every language.
public class BaseClient {

    protected final ApplicationProperties applicationProperties;
    protected final String baseUrl;
    protected WebClient client;

    public BaseClient(ApplicationProperties applicationProperties, String baseUrl) {
        this.applicationProperties = applicationProperties;
        this.baseUrl = baseUrl;
    }

    @PostConstruct
    protected void setup() {
        this.client = WebClient.builder().filter(addTracingHeaders()).baseUrl(baseUrl).build();
    }

    private ExchangeFilterFunction addTracingHeaders() {
        return (clientRequest, next) -> {
            Span span = ElasticApm.currentSpan();

            AtomicReference<String> headerName = new AtomicReference<>();
            AtomicReference<String> headerValue = new AtomicReference<>();
            span.injectTraceHeaders(
                (name, value) -> {
                    headerName.set(name);
                    headerValue.set(value);
                }
            );

            ClientRequest filtered = ClientRequest.from(clientRequest)
                .header(headerName.get(), headerValue.get())
                .build();
            return next.exchange(filtered);
        };
    }
}
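To give an idea of how this base class can be used, here is a hypothetical client calling a second microservice (the URL and endpoint are made up for the example):

import org.springframework.stereotype.Component;
import reactor.core.publisher.Mono;

@Component
public class LoanClient extends BaseClient {

    public LoanClient(ApplicationProperties applicationProperties) {
        // Hypothetical base URL of the second microservice.
        super(applicationProperties, "http://loan-service:8080");
    }

    public Mono<Long> getTotalAmountAskedInCents() {
        // The traceparent header is added by the filter configured in BaseClient,
        // so this call shows up as a span of the current trace in Elastic APM.
        return client.get()
            .uri("/loans/total-amount")
            .retrieve()
            .bodyToMono(Long.class);
    }
}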
Logs are another way to gain observability into your system and understand what is going on. However, if you run in a microservice environment, with multiple instances of your application, it can be difficult to know which logs belong to a single request. There is a simple solution: enabling log correlation!
If you enable log correlation, the Elastic APM agent will inject the trace id and the span id in the Mapped Diagnostic Context (MDC). Every log in the context of a request will have the same trace id. Thus, it will be possible to filter on this parameter and see the logs of a single request.
Since Elastic APM v7.4, you can even switch from the APM console to the logs in one click!
To enable log correlation, you can set the ELASTIC_APM_ENABLE_LOG_CORRELATION configuration option (or the corresponding environment variable) to true when starting your application with the Elastic APM agent.
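For example, if you configure the agent through an elasticapm.properties file, as mentioned earlier, it could look like this (the service name, server URL and packages are examples to adapt to your setup):

# elasticapm.properties
service_name=customer-service
server_urls=http://localhost:8200
application_packages=com.example
enable_log_correlation=true

With this option enabled, the agent puts trace.id and transaction.id into the MDC, so you can add them to your logging pattern (for example with %X{trace.id} in Logback) and filter your logs by trace.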
Elastic APM is awesome and completes the range of tools provided by Elastic. It is particularly useful if you already have an ELK stack for visualizing your logs.
Nevertheless, it is still a young tool and comes with a few drawbacks. One of the biggest is the lack of integration with some libraries, such as Spring WebFlux, or with Kafka. You can still add your own instrumentation, but it will cost you some time. You can find documentation on how to create and configure your own transactions and spans with the Java APM agent right here: https://www.elastic.co/guide/en/apm/agent/java/current/public-api.html#api-start-transaction
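To illustrate, here is a sketch of how the public API can be used to create a transaction manually, for instance around the handling of a Kafka message (the class and method names are made up for the example):

import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Transaction;

public class LoanMessageHandler {

    public void handle(String message) {
        // Manually start a transaction for a message consumed from Kafka,
        // since it is not instrumented automatically by the agent.
        Transaction transaction = ElasticApm.startTransaction();
        try {
            transaction.setName("process-loan-message");
            transaction.setType(Transaction.TYPE_REQUEST);
            processMessage(message);
        } catch (Exception e) {
            transaction.captureException(e);
            throw e;
        } finally {
            transaction.end();
        }
    }

    private void processMessage(String message) {
        // Business logic goes here.
    }
}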
Other alternatives, like New Relic or Datadog, are easier to integrate with your environment, but they are expensive, particularly when you are dealing with lots of hosts, whereas Elastic APM is free, apart from maintenance costs.
After this article, I hope you are convinced that an APM solution can greatly improve the observability of your systems and help you detect bugs and latencies instantly.
With the Elastic suite, you can easily implement an effective, cheap, self-managed APM solution.
So I hope you are eager to implement one in your projects!
Feel free to comment. I will be happy to answer your questions if you have any.