In this quick start, we will walk through the initial steps of exploring data using Honeycomb. Once you have finished the tutorial you will be better-equipped to begin sending and visualizing your own data with our integrations and SDKs.
We will assume the role of a DevOps professional debugging reports of application slowness from users. To pinpoint the source of the issue, we will work with the Slow App dataset, to which your account has access by default. The Slow App dataset contains events that describe various aspects of the HTTP requests from users that our application has served.
This walkthrough is also available to watch in video form if you prefer.
In this tutorial, we will:
Direct instructions have been denoted in purple boxes like so:
This is a direct instruction.
The estimated time to completion, including reading, is 25 minutes.
You’ll need a Honeycomb account in order to follow along with this tutorial. If you don’t have one yet, please sign up for one. By default, users have access to the Slow App dataset. If you landed here from the Honeycomb UI, you should already be in the Query Builder for that dataset and ready to begin.
As mentioned above, in this scenario we are playing the role of someone in charge of developing and maintaining a web application. We have received reports of application slowness from quite a few of our users, so we are currently investigating. We assume that the integration to send request data to Honeycomb has already been set up, and we are exploring the data our app has sent (and is continuing to send) to Honeycomb. At the end of the tutorial, we will link to some references to help you send this data yourself, but for now we will focus on querying existing data.
We know that the events we have sent to Honeycomb have several properties which may be relevant to debugging. Here is a short description of some of the fields we will work with here.
| Field Name | Description | Example Value |
| --- | --- | --- |
| `endpoint` | The URL routing pattern matching the request. | `/products/:product_name` |
| `response_time_ms` | The total amount of time in milliseconds spent serving the request. | 200 |
| `fraud_latency_ms` | The amount of time in milliseconds the request spent calling a fraud detection service on which our application depends. | 50 |
| `mysql_latency_ms` | The amount of time in milliseconds the request spent on database queries. | 15 |
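To make the fields above concrete, here is a sketch (with invented values) of what one of these request events might look like as key-value data:

```python
# A hypothetical "Slow App"-style event: one HTTP request described as
# key-value fields. The values here are made up for illustration.
event = {
    "endpoint": "/products/:product_name",  # URL routing pattern
    "response_time_ms": 200,                # total time serving the request
    "fraud_latency_ms": 50,                 # time spent calling the fraud service
    "mysql_latency_ms": 15,                 # time spent on database queries
}

# The latency fields are all in milliseconds; the fraud and MySQL latencies
# are components of the total response time.
print(event["response_time_ms"])  # 200
```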
Make sure you have the Slow App Dataset open in another tab or window to follow along.
Note: For the tutorial, the time range we are querying over (which is usually configurable, defaulting to the last few hours) will be locked to a pre-defined range to ensure consistency.
Honeycomb is built to be fast at calculations such as averages, counts, and percentiles. This encourages a workflow which is exploratory and fluid even with large amounts of information.
The slowness reported to us could have many possible causes, so let’s start asking questions and answering them with queries.
We’ll start with:
“How long are requests taking in general?”
To visualize this question, we will use the `AVG()` (average) function.

Note: If you are concerned about using averages vs. percentiles, we also offer `P99()`-style functions, but we will use averages here for simplicity’s sake.

Click on the CALCULATE box in the UI and select `AVG()`. We are then prompted to select which numeric metric from the data we would like to visualize an average for. In this case we are interested in `response_time_ms`, the total time to serve a request, so click that and then click “Run” on the right-hand side of the Query Builder.
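Conceptually, `AVG(response_time_ms)` computes the arithmetic mean of that field across the events in each time bucket of the selected range. A minimal sketch of that calculation, using invented sample events:

```python
# Sketch of what AVG(response_time_ms) computes, over invented sample events.
events = [
    {"response_time_ms": 180},
    {"response_time_ms": 220},
    {"response_time_ms": 500},  # one slow outlier drags the average up
]

avg = sum(e["response_time_ms"] for e in events) / len(events)
print(avg)  # 300.0
```

This is also why the note above mentions percentiles: a single outlier moves an average noticeably, which can either reveal or mask what typical users experience.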
There seems to be a noteworthy increase in volatility, and a gradual increase in response time, in the latter portion of the graph. This seems to be when our users likely started encountering slowness in the website. Since we can see the slowness is measured on our server side as well, we know that the issue is likely within our app.
Note that you can click View Raw Data, at the top right-hand side of the visualization, to see the raw data that has been passed to Honeycomb.
This can be useful for exploring data in an unfamiliar dataset, or for eyeballing patterns that may be easier to spot in tabular form. You can get back to the graph view by clicking View Graph in the same location.
Click View Raw Data and have a look around at the raw data. When you’re finished, click View Graph in the same location to return to the graph.
Raw data is good to get a feel for things, but there is a lot of it – too much for humans to take in on their own.
By continuing to use Honeycomb’s aggregation and visualization facilities, we can get answers much faster.
Let’s start with asking about one of the usual suspects:
“Did a recent deploy break the app?”
We can see that there is a marker on the graph indicating when our most recent deploy happened. But the increase in latency started happening quite a while before our recent deploy, so this issue seems unrelated to the newly deployed code.
“OK, but is it our app’s fault, or is there an issue with a service we are calling out to?”
We know that our app calls out to a fraud detection service, provided by a third party. The time required to make these requests is tracked in the `fraud_latency_ms` field mentioned above. We can take a look at that side by side with our existing average of total request latency.
Click on the CALCULATE box to switch to edit mode again, so we can add another `AVG`. This time, select `fraud_latency_ms`. Click “Run” again.
Hm, the time to call the fraud service looks uncorrelated. It hovers around 20-50 milliseconds, so that doesn’t explain our issue. We can rule that out.
Click on the CALCULATE box and click the “X” on the left-hand side of `AVG(fraud_latency_ms)` in the Builder to stop including it in our queries.
“What about the database, then?”
We know that we have an additional field, `mysql_latency_ms`, which represents the latency of talking to the database. So let’s take a look at the `AVG` for that.

Click on the CALCULATE text box to add another `AVG`, this time for `mysql_latency_ms`. Click “Run” again.
Interesting! The increase in total request latency seems directly correlated to an increase in MySQL latency.
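The comparison we just did can be sketched numerically: average each latency component separately, and the one that rises along with the total points to the culprit. Using invented events from the slow period:

```python
# Invented events from the "slow" period: total latency is high, the fraud
# service stays flat, and MySQL latency accounts for most of the increase.
slow_events = [
    {"response_time_ms": 480, "fraud_latency_ms": 30, "mysql_latency_ms": 400},
    {"response_time_ms": 520, "fraud_latency_ms": 40, "mysql_latency_ms": 430},
]

def avg(field):
    return sum(e[field] for e in slow_events) / len(slow_events)

print(avg("fraud_latency_ms"))  # 35.0  -- flat, rules out the fraud service
print(avg("mysql_latency_ms"))  # 415.0 -- tracks the total, implicates the DB
```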
Our issue seems to be related to database latency, but what is causing the database to respond sluggishly?
We noted above that there is a field, `endpoint`, which describes the URL routing pattern. Since we have this field available, we can use a BREAK DOWN by `endpoint` to see these averages calculated per endpoint pattern; i.e., we can visualize the average request latency for every reported endpoint pattern all at once.
Additionally, by setting ORDER to sort by round-trip time descending, we will also receive a tabular view (located below the chart) showing which endpoint patterns have the highest latency.
This data will be visualized as many distinct colored lines on the chart. If any route is contributing to the problem more than the others in our app, we should be able to quickly identify it.
Click on the BREAK DOWN box in the Query Builder, and select `endpoint`. Then, click the ORDER box and select `AVG(response_time_ms)` desc. Then, click “Run”.
If we scroll down to the tabular view, ordered by `AVG(response_time_ms)`, we can see that the slowest endpoint routes begin with `/products/:product_name`. We can also mouse over the rows in the table to see the associated lines highlighted in the visualization.
Mouse over the rows in the table below the graphs to highlight which lines they correspond to in the visualization.
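A BREAK DOWN with a descending ORDER is conceptually a group-by, an average per group, and a sort. A minimal sketch of that pipeline over invented events:

```python
from collections import defaultdict

# Invented events. BREAK DOWN groups by "endpoint", CALCULATE averages
# response_time_ms within each group, and ORDER sorts the groups descending.
events = [
    {"endpoint": "/products/:product_name", "response_time_ms": 900},
    {"endpoint": "/products/:product_name", "response_time_ms": 700},
    {"endpoint": "/cart", "response_time_ms": 120},
    {"endpoint": "/", "response_time_ms": 80},
]

groups = defaultdict(list)
for e in events:
    groups[e["endpoint"]].append(e["response_time_ms"])

table = sorted(
    ((ep, sum(ts) / len(ts)) for ep, ts in groups.items()),
    key=lambda row: row[1],
    reverse=True,  # AVG(response_time_ms) desc
)
print(table[0])  # ('/products/:product_name', 800.0)
```

The first row of the sorted table corresponds to the top row of Honeycomb’s tabular view: the slowest endpoint pattern.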
As we can see, Honeycomb allows us to explore fields that have many unique possible values, and to quickly spot patterns in the data. Columns with many possible values are said to have high cardinality, and Honeycomb is well equipped to explore this type of data.
We seem to have significantly narrowed down the source of our slowness: it is related to endpoints with `:product_name` in them, and to sluggish database lookups. Perhaps we need to add an index to the table used to store product information, as `:product_name` is likely translated into the WHERE clause of a SQL query. Once we have rolled out this solution, we can continue to use Honeycomb to verify whether the issue is fixed, or whether further debugging is needed.
If you want to, feel free to continue playing with the Slow App dataset (it should always be accessible to your account via the direct link(s) in this tutorial). For instance, try adding `P99(response_time_ms)` to your CALCULATE to see the high percentiles visualized alongside the averages we have been using. Or, try a BREAK DOWN by other fields, like `hostname`. Have fun; you don’t need to worry about breaking anything in this dataset. It’s meant for experimentation and learning.
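For the `P99` suggestion above: the 99th percentile is the value below which 99% of events fall, which surfaces the worst-case experience an average can hide. One common way to approximate it is the nearest-rank method (Honeycomb’s exact interpolation may differ):

```python
import math

# Approximate P99 via the "nearest-rank" method over invented latencies.
# Honeycomb's exact percentile calculation may differ in interpolation details.
def p99(values):
    ranked = sorted(values)
    rank = math.ceil(0.99 * len(ranked))  # 99th-percentile rank, 1-indexed
    return ranked[rank - 1]

latencies = list(range(1, 101))  # 1..100 ms
print(p99(latencies))  # 99
```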
If you encounter any issues or have questions for us, please send us a message in the Intercom chat box on the right hand side of the screen, or send an e-mail to firstname.lastname@example.org. We’d love to hear from you! If you are struggling or confused, we are happy to help.
Otherwise, proceed to sending your first event or rigging up our various SDKs and integrations for your technology of choice.