Measuring the duration of each step in your code flow using simple debug logs

In my previous blog post I demonstrated how you can use log analytics to measure the duration between arbitrary points in your code flow using simple debug logs only.

Today, I want to take it one step further and demonstrate how you can extend your debug log with an additional column showing the duration between each line and the one preceding it. As is always the case, once the new metric column is in place we have a variety of tools for analyzing its behavior. Here, that means analyzing the performance of the code flow in depth: among other questions, we'll be able to determine which parts of the workflow are the most expensive and how they change over time. The blog post is entirely based on pre-populated data and scripts in the Application Insights Analytics demo site, so you can experiment hands-on with everything you read here.

Here we go

Our starting point is once again a very simple query that returns the following results.

let startDate = startofday(datetime("2017-02-01"));
let endDate = startofday(datetime("2017-02-07"));
traces
| where timestamp > startDate and timestamp < endDate
| where operation_Id == "QaXuZhd2b2c=" 
| project timestamp , operation_Id , message 

[post2-1: query results showing timestamp, operation_Id and message]

After adding the new column, our debug log will look like this

[durations2-2: the same debug log with the additional duration column]

Exciting, right? Well, let’s get to it.


As in the previous walkthrough, we start our journey with a simple query that lists trace lines in a given time range. By ordering the log lines by operation_Id (or any other activity correlator) and then by timestamp, we neutralize the effect of concurrency and asynchronous processing on the sequence.

let startDate = startofday(datetime("2017-02-01"));
let endDate = startofday(datetime("2017-02-07"));
traces
| where timestamp > startDate and timestamp < endDate
| project timestamp , operation_Id , message
| order by operation_Id, timestamp asc
| take 30

For each row, we’ll now create a new column containing the offset that has passed since a fixed point in time (expressed in ticks for now; we’ll convert to milliseconds later). This results in an ascending-ordered column, so subtracting the value in line n from the value in line n+1 produces the duration of the code flow between the two lines, which is exactly what we want. We also create a layman’s hash code (in the form of the string length) per trace line and call it messageId. This value will allow us to uniquely identify a row within the scope of a single activity in a reasonably robust manner. I hope this part will become a little clearer in a few paragraphs.

let startDate = startofday(datetime("2017-02-01"));
let startDateLong=tolong(startDate);
let endDate = startofday(datetime("2017-02-07"));
traces
| where timestamp > startDate and timestamp < endDate
| project timestamp , operation_Id , message
| order by operation_Id, timestamp asc
| extend offsetInMS = tolong(timestamp) - startDateLong + tolong(24h)
| extend messageId = strlen(message)
| take 10

[post2-3: trace lines with the new offsetInMS and messageId columns]
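
As an aside, the string length is only a crude fingerprint. If two different trace lines within the same operation can share a length, a slightly more robust variant (a sketch, assuming the hash() function is available in your environment) would be:

| extend messageId = hash(message) // hypothetical replacement for strlen(message)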

Next, for each operation_Id (which is interchangeable with any other activity correlator you choose, or ‘null’ if you don’t worry about concurrency), we create an ordered list of offsets and a matching list of messageIds with a one-to-one correspondence between the two.

| summarize offsets = makelist(offsetInMS, 1000000), messageIds = makelist(messageId, 1000000) by operation_Id

[post2-4: per-operation lists of offsets and messageIds]

The above screen capture demonstrates nicely that the number of trace lines aggregated per operation isn’t necessarily equal (which makes the challenge more interesting). The next step is to actually compute the sequence of differences for the offsets sequence. The results of this calculation will populate a new column, duration. For that, we’ll use a Finite Impulse Response filter (a.k.a. fir) with (1, -1) as the coefficients, which basically means: Duration(n) = Offset(n) – Offset(n-1)

| extend duration = fir(offsets, dynamic([1,-1]), false)

[post2-5: per-operation duration sequences produced by fir]

Notice how the first value in each of the resulting sequences is actually the first term in the sequence of offsets. That’s simply because Duration(1) = Offset(1) – 0 = Offset(1). If you scroll up a bit, you will see that the offset column was defined as

| extend offsetInMS = tolong(timestamp) - startDateLong + tolong(24h)
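
To convince ourselves of how fir behaves at the boundary, here is a tiny standalone sanity check (hypothetical, not part of the original walkthrough) on a toy sequence:

print offsets = dynamic([100, 250, 600])
| extend duration = fir(offsets, dynamic([1,-1]), false)
// expected output: [100, 150, 350], i.e. each term minus its predecessor, with the first term passed through unchanged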

Now we want to transpose the table so that each tuple of [operation_Id, offsets[i], messageIds[i], duration[i]] becomes a separate row. At the same time, we also set the first term in each sequence to zero and convert the units of the duration column from ticks to milliseconds. Given the above definition of the offset, we can safely assume that a duration bigger than 1 hour necessarily belongs either to the first line in a sequence or to a recycle of the same operation_Id for a new operation, and therefore should be zeroed.

| mvexpand offsets , messageIds, duration limit 1000000
| project operation_Id, offset = tolong(offsets), messageId = messageIds,
  duration = iff(tolong(duration) >= tolong(1h) or isnull(duration), 0, tolong(duration)/10000)

[post2-6: one row per trace line with the duration column in milliseconds]
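
A quick word on the units (a hypothetical sanity check, not in the original post): tolong() of a datetime or timespan yields ticks of 100 nanoseconds each, which is why dividing by 10000 converts a tick count into milliseconds:

print oneMillisecondInTicks = tolong(1ms), oneHourInTicks = tolong(1h)
// expected: 10000 and 36000000000 respectively, so ticks / 10000 = milliseconds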

Now we have the duration column, and we need to glue it back to the original trace logs table using a join operation. Once again we use materialize() to optimize the query by guiding the query engine to reuse cached by-products for the join operation. Here is the full query for that:

// To add the duration column back to the original table we will use join and operation_Id as the correlator.
// Use of materialize() optimizes the performance of the query when joining a set of records with itself. 
let startDate = startofday(datetime("2017-02-01"));
let startDateLong=tolong(startDate);
let endDate = startofday(datetime("2017-02-07"));
let T = materialize(
traces
| where timestamp > (startDate+1s) and timestamp < endDate
| project timestamp , operation_Id , message 
| order by operation_Id, timestamp asc
| extend offsetInMS = tolong(timestamp) - startDateLong + tolong(24h)
| extend messageId = strlen(message) 
| project timestamp , operation_Id , message, offsetInMS , messageId);
T
| join (T 
| summarize offsets = makelist(offsetInMS, 1000000), messageIds = makelist(messageId, 1000000) by operation_Id 
| extend duration = fir(offsets, dynamic([1,-1]), false)
| mvexpand offsets , messageIds, duration limit 1000000
| project tostring(operation_Id), messageId = tolong(messageIds), offsetInMS = tolong(offsets), duration = iff(tolong(duration) >= tolong(1h) or isnull(duration), 0, tolong(duration)/10000)) on operation_Id, offsetInMS, messageId
| project-away operation_Id1, offsetInMS1, messageId, messageId1
| take 100

And we get the result promised above

[durations2-2: the debug log with the new duration column]
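
As a quick sanity check (a hypothetical step, not part of the original post), you could replace the final take 100 with a summarize that compares the sum of the per-line durations with the overall time span of each operation; the two should roughly agree:

| summarize totalDurationMS = sum(duration), spanMS = (max(timestamp) - min(timestamp)) / 1ms by operation_Id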

We’re there!

We are now in the metric domain, which is where we love to be! We can now apply numeric analysis to the code flow at any granularity. For example, we can track how the performance of each step in the workflow changes over time. In other words, we can segment the newly created duration metric by the message field. For an effective segmentation we will need to reduce the cardinality (i.e., the number of distinct values) of the message column. There are several ways to achieve that, and in this sample I’m showing a rather simple one.
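
The full query below maps the known messages to a fixed set of buckets by hand. A cruder but more generic alternative (a sketch of my own, not from the original post) would be to bucket each message by its first couple of words:

| extend message = extract(@"^(\w+\s+\w+)", 1, message)
| extend message = iff(isempty(message), "Other", message)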

let startDate = startofday(datetime("2017-02-01"));
let startDateLong=tolong(startDate);
let endDate = startofday(datetime("2017-02-07"));
let T = materialize(
traces
| where timestamp > startDate and timestamp < endDate
| project timestamp , operation_Id , message 
| order by operation_Id, timestamp asc
| extend offsetInMS = tolong(timestamp) - startDateLong +tolong(24h) 
| extend messageId = strlen(message) 
// Reduce the cardinality of the message column to achieve an efficient segmentation
| extend message = case(
    message has "Query DailyPerformanceAggregation", "Query DailyPerformanceAggregation",
    message has "Publishing finished", "Publishing finished",
    message has "Query Dependency_data_isUniform", "Query Dependency_data_isUniform",
    message has "time series results", "Found time series results",
    message has "Query Request_data_isUniform-False", "Query Request_data_isUniform",
    message has "Enriching ", "Enriching detections",
    message has "DailyEmailPublishTime for app", "DailyEmailPublishTime for app",
    message has "correlated detections for", "Got X correlated detections for app",
    message has "Finished processing anomaly detection", "Finished processing anomaly detection",
    message has "Query AnomalyDetectionDiscoverJob", "Query AnomalyDetectionDiscoverJob",
    message has "Starting data analyzing for profile", "Starting data analyzing for profile",
    "Other")
| project timestamp , operation_Id , message, offsetInMS , messageId);
T
| join (T 
| summarize offsets = makelist(offsetInMS, 1000000), messageIds = makelist(messageId, 1000000) by operation_Id 
| extend duration = fir(offsets, dynamic([1,-1]), false)
| mvexpand offsets , messageIds, duration limit 1000000
| project tostring(operation_Id), messageId = tolong(messageIds), offsetInMS = tolong(offsets), duration = iff(tolong(duration) >= tolong(1h) or isnull(duration), 0, tolong(duration)/10000)) on operation_Id, offsetInMS, messageId
| project-away operation_Id1, offsetInMS1, messageId, messageId1
| summarize avg(duration) by message, bin(timestamp, 1h) 
| render timechart

And finally, this is what we labored for… a chart depicting the duration of all of the sub-activities in my app over the course of 6 days!

[post2-7: timechart of the average duration of each sub-activity over six days]

Several interesting behaviors stand out in the chart. Some sub-activities manifest a seasonal spike with a daily cadence; for example, the duration of ‘Query DailyPerformanceAggregation’ peaks every day at 3 AM. Traces associated with ‘Query Request_data_isUniform’ manifest a clear trend of performance improvement towards the second half of the week. For the most part, however, the chart seems somewhat chaotic because of the wide range of values on the y-axis. Time series analysis lets us pick and choose which of the graphs to keep on the chart based on many characteristics. For example, here is a simple way to keep just those graphs whose maximum value falls within a desired range.

| make-series percentile90=percentile(duration, 90) default=0 on timestamp in range(startDate, endDate, 1h) by message 
| extend st = stat(percentile90)
| where st.max < 2000
| render timechart

[post2-8: timechart showing only the faster sub-activities]

With only the faster sub-activities remaining in the chart, it is now easier to see that ‘Finished processing anomaly detection’ was abnormally slow on February 6th. In the same manner I could rewrite the query to keep just those graphs that represent the slower sub-activities.

| make-series percentile90=percentile(duration, 90) default=0 on timestamp in range(startDate, endDate, 1h) by message 
| extend st = stat(percentile90)
| where st.max > 5000
| render timechart

[post2-9: timechart showing only the slower sub-activities]

Now we can see that ‘Publishing finished’ was slow and somewhat chaotic at the beginning of the week and then flattened out.

This is a good place to end the walkthrough. Thank you for reading this far; I hope the overall coolness of these investigation capabilities has come across. Here’s a challenge for those who would like to take this further: see if you can apply more advanced shape detection to isolate the sub-activities that manifest spikes, step jumps and trends.
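
As a hypothetical starting point (my own sketch, assuming the series_fit_line_dynamic() function is available in your environment), you could keep only the sub-activities whose duration follows a strong linear trend:

| make-series percentile90=percentile(duration, 90) default=0 on timestamp in range(startDate, endDate, 1h) by message 
| extend fit = series_fit_line_dynamic(percentile90) // fits a line to each series and returns rsquare, slope, etc.
| where todouble(fit.rsquare) > 0.8 // keep only series with a pronounced upward or downward trend
| render timechart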
