
Logs, metrics, and the evolution of observability at Coinbase


Until mid-2018, the only supported solution for generating custom metrics for an application at Coinbase was to print to standard output. Every printed line would work its way through a series of pipelines where it could be analyzed, transformed, and parsed until finally landing in an index in our monolithic, self-managed Elasticsearch cluster, where engineers could build visualizations and dashboards with Kibana.

This workflow represents the best of many worlds. Log entries are easy to generate, and since they can contain fields with unlimited cardinality at microsecond granularity, it's possible to slice, dice, and aggregate these entries to diagnose the root cause of even the most complex issues.


In 2016, we reaffirmed our decision to self-manage Elasticsearch. At the time, managed solutions like Amazon Elasticsearch Service were still in their infancy (for example, IAM policy changes required a full redeploy of the cluster), and we were reluctant to trust not-yet-mature third-party vendor technology with the potentially sensitive nature of our applications' standard output data. We built elaborate automation to manage and blue-green deploy our Elasticsearch data, master, and coordinating nodes with our existing deployment pipeline and comply with our self-enforced 30-day rehydration policy.

Visualizing our elaborate Elasticsearch deployment strategy.

Managing this newly automated Elasticsearch ourselves actually worked out fine for over a year. As an organization of thirty engineers serving two moderately trafficked web applications, both our indexing volume (~100GB per day) and search volume (a few queries per minute) were relatively tame.

Backend API requests per minute over the course of 2017.

However, during 2017, interest in cryptocurrency skyrocketed, catapulting our moderately trafficked web applications into the spotlight. This massive surge in traffic to our customer-facing applications (60x!) had the downstream effect of an equal surge in log entries. Earlier technical challenges contributed to this surge of new entries as engineers worked overtime to add diagnostics to our platform (see our earlier scaling blog post for more detail). During this period, a single Coinbase API call could generate up to 100 separate log entries, at times resulting in terabytes of log data per hour.

The cluster became unwieldy. As the size of the engineering team increased dramatically, the combined weight of terabytes per day of logs, a monolithic cluster, and unrestricted access patterns led to outages that halted engineering productivity and created operational hurdles.

The logs experience for engineers during Elasticsearch outages.

It became clear that a monolithic Elasticsearch cluster would not be sufficient for our observability needs. Some of our problems included:

  • A single bad Kibana query (hello, Timelion) could cause a hiccup severe enough to affect the entire engineering organization.
  • Certain issues were difficult to diagnose or prevent — Elasticsearch, Kibana, and X-Pack lack the ability to analyze slow/expensive queries by user, as well as the controls to prevent expensive queries like multi-week aggregations.
  • Many of Elasticsearch's load-related failure modes (like a series of very expensive queries) would leave the cluster in a disabled state even after the load had completely subsided. Our only recourse to recover from these crippled states was to perform a full cluster restart.
  • Elasticsearch doesn't have a separate management interface, meaning that during failures we'd be unable to perform basic diagnostics on the cluster. Opening a support request with Elastic requires diagnostic dumps that we were unable to generate until the cluster had recovered.
  • Alerting on data stored in Elasticsearch was not intuitive — tools like Elastalert and Watcher don't allow engineers to interactively build alerts and were difficult to integrate with PagerDuty and Slack.
  • We were forced to reduce log retention to lessen the impact of massive queries and speed up cluster recovery following failures.
  • In addition to application server standard output, we were also storing and querying security data like AWS VPC Flow Logs and Linux auditd records, which required separate security controls and exhibited different performance characteristics.
  • Our two largest internal applications consumed 80–90% of our log pipeline capacity, reducing the performance and retention available to other, smaller applications in our environment.

We chose to solve these problems in two ways:

  1. Introduce a new platform to provide engineers with features that are impractical to provide with Elasticsearch (specifically inexpensive aggregations, alerting, and downsampling for long retention).
  2. Split our self-managed, monolithic Elasticsearch cluster into multiple managed clusters segmented by business unit and use case.


Our challenges self-managing Elasticsearch, combined with our brief experiments with developing a reliable blue-green deploy strategy for Prometheus backends, influenced our decision to choose a managed service for our metrics provider. After evaluating several providers, we settled on Datadog.

Unlike log platforms, which store data as discrete structured events, metric stores like Datadog store numerical values over time in a matrix of time by tag/metric value. In other words, in Datadog, if you send a single gauge or counter metric 100,000 times in a 10s interval, it will only be stored once per tag value combination, while on a logs platform that same value would result in 100,000 separate documents.
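
This storage tradeoff can be sketched in a few lines. The following is a toy illustration of the aggregation model described above, not Datadog's actual implementation:

```python
from collections import defaultdict

def store_counter(points, interval=10):
    """Toy model of a metric store: collapse raw (timestamp, tags, value)
    submissions into one stored row per interval per tag combination."""
    rows = defaultdict(float)
    for ts, tags, value in points:
        rows[(ts // interval, tags)] += value
    return dict(rows)

# 100,000 counter increments in one 10s window, same tag combination...
points = [(3, ("env:prod", "route:/price"), 1.0) for _ in range(100_000)]
rows = store_counter(points)

# ...collapse into a single stored row; a logs platform would instead
# keep 100,000 separate documents.
print(len(rows))  # 1
```

A logs-platform equivalent of `store_counter` would simply append every point to a list, which is exactly why log storage grows with traffic while metric storage grows only with tag cardinality.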

As a result of this tradeoff, metric stores like Datadog allow for extremely fast aggregations and long retention, at the cost of reduced granularity and cardinality. Specifically with cardinality: because every additional tag value added to a given metric requires a new column to be added to the time vs tag/metric matrix, Datadog makes it expensive to breach their 10k unique tag combination per host limit.
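
The limit is easy to hit because series multiply rather than add. A quick sketch, with hypothetical tag names and counts chosen purely for illustration:

```python
from math import prod

def series_count(tag_cardinalities):
    # Each unique tag-value combination becomes its own column in the
    # time-by-tags matrix, so distinct values multiply across tags.
    return prod(tag_cardinalities.values())

# A modest request-latency metric: 3 * 40 * 5 = 600 series. Fine.
modest = {"env": 3, "route": 40, "status": 5}
print(series_count(modest))  # 600

# Add one high-cardinality tag like user_id and the same metric
# explodes far past any 10k-combinations-per-host budget.
exploded = {**modest, "user_id": 50_000}
print(series_count(exploded))  # 30000000
```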

Despite these tradeoffs, we've found Datadog to be a near-perfect complement to our internal logs pipeline.

  • Creating and iterating on alerts is fast and intuitive. Internally, we've built a tool to automatically monitor and codify our Datadog monitors and dashboards to prevent unintentional changes.
  • Pulling up complex dashboards with metrics from the past several months is guilt-free and near-instantaneous.
  • Features like distribution metrics allow engineers to easily understand the performance of a block of code with global percentiles.
  • Limitations on cardinality can be confusing for engineers who want to keep track of high-cardinality values like user_id or wallet address.
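
Distribution metrics are typically submitted through the local DogStatsD agent. As a rough sketch of timing a block of code (the metric name and tags here are hypothetical, and a real service would use the official Datadog client library rather than hand-building datagrams):

```python
import time
from contextlib import contextmanager

def dogstatsd_datagram(metric, value, mtype, tags):
    # DogStatsD wire format: "name:value|type|#tag1,tag2"
    return f"{metric}:{value}|{mtype}|#{','.join(tags)}"

@contextmanager
def timed_distribution(metric, tags, send=print):
    # In production you'd send the datagram over UDP to the local agent
    # (e.g. with socket.sendto); here we just hand it to `send`.
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = round((time.monotonic() - start) * 1000, 2)
        # "d" is the distribution metric type.
        send(dogstatsd_datagram(metric, elapsed_ms, "d", tags))

with timed_distribution("api.request.duration", ["route:/price"]):
    sum(range(1000))  # the block of code being measured
```

Because the aggregation happens server-side, percentiles computed from a distribution metric are global across hosts rather than per-host averages of averages.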

Metrics vs Logs

A common topic internally, we've found the distinction between Metrics and Logs to be important. At a high level the tools have similar feature sets — both allow applications to send arbitrary data which can be used to combine visualizations into fancy dashboards. However, beyond their basic feature sets, there are major differences in the features, performance, and retention of these tools. Neither tool can support 100% of use cases, so it's likely that engineers will need to leverage both in order to provide full visibility into their applications.

In short, we think that Metrics should be used for dashboards and alerts, while Logs should be used as an investigative tool to find the root cause of issues.

A table we use internally to illustrate the difference between Logs and Metrics.

Securing the Datadog Agent

From a security perspective, we're comfortable using a third-party service for metrics, but not for logs. This is because metrics are generally numeric values associated with string keys and tags, compared to logs, which contain entire lines of standard output.

Datadog offers operators a suite of tight integrations — if you provide the Datadog agent access to a host's /proc directory, Docker socket, and AWS EC2 metadata service, you'll get rich metadata and system stats attached to every metric you generate. Running a third-party agent like Datadog's on every host in your infrastructure, however, carries some security risk regardless of vendor or product, so we chose to take a safer approach to utilizing this technology.

We took several actions in order to gain maximum exposure to these Datadog integrations while reducing the risk associated with running a third-party agent.

  • Rather than use the pre-built Docker container, we built our own stripped-down version with as few optional third-party integrations as possible.
  • By default, the agent is opt-in — projects have to explicitly choose to allow the Datadog container.
  • On every host, we put the container on a separate "untrusted" network bridge without access to other containers on the host, our VPC, or the EC2 AWS metadata service. We hard-code the `DD_HOSTNAME` environment variable to the host's instance ID to allow the AWS integration to continue working.
  • We run a special Docker socket proxy service on hosts to enable the Datadog container integration without exposing containers' potentially secret environment variable values.

An overview of how we lock down the local Datadog agent.


Developing a strong metrics foundation at Coinbase helped to alleviate some of the problems we experienced with Elasticsearch as many workloads naturally migrated to Datadog. Now, at least when an issue did occur on the monolithic cluster, engineers had data they could fall back on.

But the Elasticsearch issues and outages continued. As engineer headcount continued to grow, logging continued to be a source of frustration. There were about 7 new engineering-productivity-impeding incidents in Q4 2018. Each incident would require operational engineers to step through elaborate runbooks to shut down dependent services, fully restart the cluster, and backfill data once the cluster had stabilized.

The root cause of each incident was opaque — could it be a large aggregation query by an engineer? A security service gone rogue? However, the source of our frustration was clear — we'd jammed so many use cases into this single Elasticsearch cluster that operating and diagnosing the cluster had become a nightmare. We needed to separate our workloads in order to speed incident diagnosis and reduce the impact of failures when they did occur.

Functionally sharding the cluster by use case seemed like a great next step. We just needed to decide between investing further in the elaborate automation we'd put in place to manage our existing cluster, or re-approaching a managed solution to handle our log data.

So we chose to reevaluate managed solutions for handling our log data. While we'd previously decided against using Amazon Elasticsearch Service due to what we considered at the time to be a limited feature set and stories of questionable reliability, we found ourselves intrigued by its simplicity, approved vendor status, and AWS ecosystem integration.

We used our existing codification framework to launch several new clusters. Since we leverage AWS Kinesis consumers to write log entries to Elasticsearch, simply launching duplicate consumers pointed at the newly launched clusters allowed us to quickly evaluate the performance of Amazon Elasticsearch Service against our heaviest workloads.
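
The duplicate-consumer approach boils down to mirroring every record. The following is an illustrative sketch, not our actual Kinesis consumer code; `primary` and `candidate` stand in for bulk-index calls to the existing and trial clusters:

```python
def make_dual_writer(primary, candidate, on_error=lambda exc: None):
    """Build a record handler that indexes into the existing cluster and
    mirrors each entry to a candidate cluster under evaluation."""
    def handle(record):
        primary(record)          # the write engineers actually depend on
        try:
            candidate(record)    # best-effort mirror; never block primary
        except Exception as exc:
            on_error(exc)
    return handle

# Lists stand in for the two Elasticsearch clusters in this sketch.
legacy, trial = [], []
handler = make_dual_writer(legacy.append, trial.append)
for entry in ({"msg": "GET /price 200"}, {"msg": "POST /orders 201"}):
    handler(entry)
print(len(legacy), len(trial))  # 2 2
```

Because the candidate write is fire-and-forget, the trial cluster sees production-shaped load without its failures ever affecting the primary pipeline.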

Our evaluation of Amazon Elasticsearch Service went smoothly, indicating that the product had matured significantly over the past two years. Compared to our earlier evaluation, we were happy to see the addition of instance storage, support for newer versions of Elasticsearch (only a minor version or two behind at most), as well as various other small improvements like instant IAM policy modification.

While our monolithic cluster relied heavily on X-Pack to provide authentication and permissions for Kibana, Amazon Elasticsearch Service relies on IAM to handle permissions at a very coarse level (no document- or index-level permissions here!). We were able to work around this lack of granularity by dividing the monolith into seven new clusters: four for the vanilla logs use case, and three for our various security team use cases. Each cluster's access is managed by leveraging a cleverly configured nginx proxy and our existing internal SSO service.

“Kibana Selector” — our way of directing engineers to the appropriate functionally sharded cluster.

Migrating a team of over 200 engineers from a single, easy-to-find Kibana instance (kibana.cb-internal.fakenet) to multiple separate Kibana instances (one for each of our workloads) presented a usability challenge. Our solution is to point a new wildcard domain (*.kibana.cb-internal.fakenet) at our nginx proxy, and use a project's GitHub organization to direct engineers to the appropriate Kibana instance. This way we can point several smaller organizations at the same Elasticsearch cluster with the option to split them out as their usage grows.
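
The routing itself reduces to a subdomain lookup. A sketch of the mapping the proxy performs (the org and cluster names here are hypothetical):

```python
def kibana_upstream(host, org_clusters, default="general-logs"):
    """Map '<github-org>.kibana.cb-internal.fakenet' to the Elasticsearch
    cluster serving that org; unknown orgs share the default cluster."""
    org = host.split(".", 1)[0]
    return org_clusters.get(org, default)

# Hypothetical org-to-cluster assignments:
org_clusters = {"payments": "logs-cluster-1", "identity": "logs-cluster-2"}

print(kibana_upstream("payments.kibana.cb-internal.fakenet", org_clusters))
# logs-cluster-1
print(kibana_upstream("newteam.kibana.cb-internal.fakenet", org_clusters))
# general-logs
```

The default branch is what lets several small organizations share one cluster until their usage justifies a split.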

An overview of our current log pipeline architecture.

Functionally sharding Elasticsearch has not only had a massive impact on the reliability of our logging pipeline, but has dramatically reduced the cognitive overhead required by the team to manage the system. In the end, we're thrilled to hand over the responsibility of managing runbooks, tooling, and a fickle Elasticsearch cluster to AWS so that we can focus on building the next generation of observability tooling at Coinbase.

Today, we're focusing on building the next generation of observability at Coinbase — if these types of challenges sound interesting to you, we're hiring Reliability Engineers in San Francisco and Chicago (see our talk at re:Invent about the kinds of problems the Reliability team is solving at Coinbase). We have many other positions available in San Francisco, Chicago, New York, and London — visit our careers page at http://coinbase.com/careers to see if any positions spark your interest!

This website may contain links to third-party websites or other content for information purposes only ("Third-Party Sites"). The Third-Party Sites are not under the control of Coinbase, Inc., and its affiliates ("Coinbase"), and Coinbase is not responsible for the content of any Third-Party Site, including without limitation any link contained in a Third-Party Site, or any changes or updates to a Third-Party Site. Coinbase is not responsible for webcasting or any other form of transmission received from any Third-Party Site. Coinbase is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement, approval or recommendation by Coinbase of the site or any association with its operators.

Unless otherwise noted, all images provided herein are by Coinbase.

Logs, metrics, and the evolution of observability at Coinbase was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

