Profiling applications’ performance with simple debug logs

It’s a magical moment when a guy wearing a black cape pulls something surprising out of a hat. Could that pigeon have been there the whole time? For me, it’s the same kind of excitement when an advanced log analytics tool pulls a surprising insight from simple debug logs.

In this post, I would like to show how Application Insights Analytics can help you track the performance of code flows in your app using only the simplest debug logs. The post is entirely based on pre-populated data in the Application Insights Analytics demo site, so you can experiment hands-on with everything you read here.

Simple Debug Logs

Let’s start by looking at a short snippet out of a simplified application debug log.

short snippet from a debug log

Commonly, in addition to the factual message, any debug log has at least a timestamp column and a correlation column that keeps trace lines in the right logical sequence at any level of concurrency, even when the code is asynchronous. I've highlighted two lines that seem to begin and end a process of anomaly detection that spans a cache lookup, a storage query, sampling, data analysis, storing in a DB and some sort of enrichment. The following technique allows me to track the performance of the code flow between these two trace lines, or any other two.
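The pairing idea itself is simple, and can be sketched outside of Analytics as well. Here is a minimal illustration in Python, using made-up log lines in a hypothetical (timestamp, correlation id, message) shape; the messages mirror the ones highlighted above:

```python
from datetime import datetime

# Hypothetical log lines: (timestamp, correlation id, message)
log = [
    ("2017-02-01 08:00:00.120", "op-1", "Processing anomaly detection"),
    ("2017-02-01 08:00:00.150", "op-2", "Processing anomaly detection"),
    ("2017-02-01 08:00:01.620", "op-1", "Done processing"),
    ("2017-02-01 08:00:02.900", "op-2", "Done processing"),
]

def flow_durations_ms(lines, start_prefix, end_msg):
    """Pair the start and end line of each operation by correlation id
    and return the elapsed duration per operation, in milliseconds."""
    starts, durations = {}, {}
    for ts, op_id, msg in lines:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
        if msg.startswith(start_prefix):
            starts[op_id] = t
        elif msg == end_msg and op_id in starts:
            durations[op_id] = (t - starts[op_id]).total_seconds() * 1000
    return durations

print(flow_durations_ms(log, "Processing anomaly detection", "Done processing"))
# → {'op-1': 1500.0, 'op-2': 2750.0}
```

The correlation id is what makes this work: it keeps each start line matched to its own end line, no matter how many operations interleave. The queries below express exactly this pairing, but at the scale of hundreds of thousands of flows.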

We'll start by creating a query that keeps just the lines that start and end the flow we want to benchmark over time.

let startDate = startofday(datetime("2017-02-01"));
let endDate = startofday(datetime("2017-02-07"));
traces
| where timestamp > startDate and timestamp < endDate
| where (message startswith "Processing anomaly detection") or (message == "Done processing")
| project timestamp , operation_Id , message 
| order by operation_Id, timestamp asc
| take 10

log snippet 2

Once you remove the closing take 10 statement, the query returns ~290K flows. We'll use that in the next steps of this exploration.

From Text to Metric

The next step is rather trivial: the query will now calculate the duration of the operation by subtracting the earlier timestamp from the later one, per operation. Once converted to long, a timestamp is a count of ticks, where each tick is 100 nanoseconds; to convert ticks to milliseconds, divide by 10,000.
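As a quick sanity check of that arithmetic (sketched here in Python), 10,000 ticks of 100 nanoseconds each add up to exactly one millisecond:

```python
TICKS_PER_MS = 10_000  # one tick = 100 ns, so 10,000 ticks = 1 ms

def ticks_to_ms(ticks: int) -> int:
    """Convert a tick delta (e.g. max(timestamp) - min(timestamp)) to whole ms."""
    return ticks // TICKS_PER_MS

# A 1.5-second operation measured in ticks:
print(ticks_to_ms(15_000_000))  # → 1500
```

This is the same division the summarize line below performs on the timestamp difference.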

let startDate = startofday(datetime("2017-02-01"));
let endDate = startofday(datetime("2017-02-07"));
traces
| where timestamp > startDate and timestamp < endDate
| where (message startswith "Processing anomaly detection") or (message == "Done processing")
| project timestamp , operation_Id , message 
|   summarize count(), durationInMS=tolong(max(timestamp)-min(timestamp))/10000 by operation_Id
//  Filter out operations that do not have both a beginning and an end
|   where count_ == 2
| project-away count_
| take 10

durations1 

And finally, to add the duration column back to the original table, we'll use join. Since the query now references the filtered traces twice, we'll wrap them in materialize() so the subquery is evaluated only once.

let startDate = startofday(datetime("2017-02-01"));
let endDate = startofday(datetime("2017-02-07"));
let T = materialize(
traces
| where timestamp > startDate and timestamp < endDate
| where (message startswith "Processing anomaly detection") or (message == "Done processing")
| project timestamp , operation_Id , message );
T
| where message == "Done processing"
| join kind=inner (T  
|   summarize count(), durationInMS=tolong(max(timestamp)-min(timestamp))/10000 by operation_Id
// Filter out operations that do not have both a beginning and an end
|   where count_ == 2
|   project operation_Id , durationInMS ) on operation_Id
| project-away operation_Id1
| take 10

 durations2

Beware: Metrics Are Addictive

Now that we have moved from the textual to the metric domain, we can dive deeper into a numeric investigation of the code flow's performance. For example, we can draw a 6-day chart of the average durations in time bins of 5 minutes.

let startDate = startofday(datetime("2017-02-01"));
let endDate = startofday(datetime("2017-02-07"));
let T = materialize(
traces
| where timestamp > startDate and timestamp < endDate
| where (message startswith "Processing anomaly detection") or (message == "Done processing")
| project timestamp , operation_Id , message );
T
| where message == "Done processing"
| join kind=inner (T 
| summarize count(), durationInMS=tolong(max(timestamp)-min(timestamp))/10000 by operation_Id
// Filter out operations that do not have both a beginning and an end
| where count_ == 2
| project operation_Id , durationInMS ) on operation_Id
| project-away operation_Id1
| summarize avg(durationInMS) by bin(timestamp, 5m)
| render timechart

durations chart

Eyeballing this chart for a few seconds reveals a suspected daily seasonality. To validate our assumptions regarding the seasonal nature of the code flow’s behavior, we’ll apply advanced time series analysis in our query. The key operator for that purpose is make-series.

let startDate = startofday(datetime("2017-02-01"));
let endDate = startofday(datetime("2017-02-07"));
let T = materialize(
traces
| where timestamp > startDate and timestamp < endDate
| where (message startswith "Processing anomaly detection") or (message == "Done processing")
| project timestamp , operation_Id , message );
T
| where message == "Done processing"
| join kind=inner (T  
|   summarize count(), durationInMS=tolong(max(timestamp)-min(timestamp))/10000 by operation_Id
// Filter out operations that do not have both a beginning and an end
|   where count_ == 2
|   project operation_Id , durationInMS ) on operation_Id
| project-away operation_Id1
// Apply time series analysis to detect hidden seasonality in the result
| make-series duration=avg(durationInMS) default=0 on timestamp in range(startDate, endDate, 5m) by "All"
| project periods = series_periods(duration, 4.0, 500.0, 2)

 periods1

The result is conclusive: the code flow's duration manifests seasonal behavior with two periodicities, 24h and 12h (remember that every time bin in the aggregation stands for 5 minutes). The human eye can also spot a potentially abnormal spike in the duration of the code flow at 2017-02-06T04:05, but given that the count of workflows in that bin is only 6, the motivation for drilling in further is questionable.
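Since the periods are detected over binned data, it is worth double-checking how bin counts translate back to wall-clock time. A quick check of the arithmetic (in Python, with the 5-minute bin size used above):

```python
BIN_MINUTES = 5

def bins_to_hours(period_in_bins: float) -> float:
    """Translate a period expressed in 5-minute bins into hours."""
    return period_in_bins * BIN_MINUTES / 60

# A 24h period spans 288 bins, a 12h period spans 144 bins:
print(bins_to_hours(288))  # → 24.0
print(bins_to_hours(144))  # → 12.0
```

So periods of 288 and 144 bins in the detection output correspond exactly to the daily and half-daily seasonality read off the chart.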

periods2

Wrap Up

In this walkthrough, we saw how 40GB of simple debug logs can be projected from a purely textual domain to a metric domain, where we can measure the elapsed duration between any two points in a repeating sequence of debug lines that delimits a code flow. The duration metric can then be investigated with numeric analysis and advanced time series functions to detect degradations, seasonality and trends.
