April 14, 2021

Distributed Tracing: What is the "cost" of instrumentation?

Distributed Tracing

OK, what the hell is "distributed tracing"?

Microservices architecture is on the rise and is extensively used to power applications and services that we use on a daily basis. Netflix, Amazon, and eBay, to name a few, are built on microservices.

With a microservices architecture, a user request will typically span multiple services across different servers before the response is stitched together and sent back to the user. The trade-off is harder monitoring, harder debugging, and reduced global visibility.

Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built on a microservices architecture. It is a diagnostic technique that observes requests as they propagate through a distributed system and reveals how a set of services coordinates to handle each individual user request. Distributed tracing requires that software developers add instrumentation to the application code.

OpenTracing provides an API specification for adding instrumentation to application code in a vendor-neutral manner.
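As a rough illustration of what such instrumentation looks like in Java (the class, method, and operation names here are placeholders of my own, not code from this post's benchmark), a unit of work can be wrapped in an OpenTracing span like this:

import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class CheckoutService {

    // Any OpenTracing-compatible tracer (Jaeger, a no-op tracer, etc.) can be
    // registered with GlobalTracer; the instrumentation below does not change.
    private final Tracer tracer = GlobalTracer.get();

    public void handleCheckout(String orderId) {
        // Start a span representing this unit of work.
        Span span = tracer.buildSpan("handle-checkout").start();
        try {
            span.setTag("order.id", orderId);
            // ... actual business logic goes here ...
        } finally {
            // Always finish the span so it can be reported.
            span.finish();
        }
    }
}

Because the code only talks to the OpenTracing API, the concrete tracer can be swapped without touching the instrumented code.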

Cost Of Instrumentation

Services usually talk to each other through some form of IPC, and there are many frameworks that enable such communication. Many of these frameworks provide built-in support for instrumentation, making it simple to enable distributed tracing, and their usage and performance have been studied extensively. The OpenTracing blog is probably a good place to start.

So now let's come to the important question: what is the cost of instrumentation? What is the performance impact of adding instrumentation to application code?

To measure the cost, we will use the Jaeger tracer and JMH (the Java Microbenchmark Harness) to capture the metrics.

JMH's Blackhole provides an API to consume CPU cycles varying linearly with the specified token value. We will use this to mock a long-running job that will be instrumented.

// Blackhole.consumeCPU(tokens) burns CPU time roughly proportional to the token value.
// We return the result so the JVM cannot dead-code-eliminate the call.
public long processLongJob(long token) {
    Blackhole.consumeCPU(token); // org.openjdk.jmh.infra.Blackhole
    return token;
}

 

We will measure the metrics for three variants: no instrumentation at all, instrumentation with a NoOpTracer, and instrumentation with a JaegerTracer created with default initialization values.
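A minimal sketch of how such a comparison can be set up with JMH is shown below. The class and method names are mine, and the tracer construction assumes the opentracing-noop and jaeger-client libraries, so treat this as an illustration of the approach rather than the exact benchmark behind the numbers that follow.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

import io.jaegertracing.Configuration;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.noop.NoopTracerFactory;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class TracingOverheadBenchmark {

    @Param({"1000", "5000", "10000", "50000", "100000", "500000"})
    public long token;

    // A tracer that implements the OpenTracing API but records nothing.
    private final Tracer noopTracer = NoopTracerFactory.create();

    // A Jaeger tracer built with default initialization values.
    private final Tracer jaegerTracer =
            Configuration.fromEnv("benchmark-service").getTracer();

    @Benchmark
    public long noInstrumentation() {
        Blackhole.consumeCPU(token);
        return token;
    }

    @Benchmark
    public long noopTracerInstrumented() {
        Span span = noopTracer.buildSpan("long-job").start();
        try {
            Blackhole.consumeCPU(token);
            return token;
        } finally {
            span.finish();
        }
    }

    @Benchmark
    public long jaegerTracerInstrumented() {
        Span span = jaegerTracer.buildSpan("long-job").start();
        try {
            Blackhole.consumeCPU(token);
            return token;
        } finally {
            span.finish();
        }
    }
}

Every variant runs the same Blackhole.consumeCPU(token) workload; the only difference is whether a span is started and finished around it, so the measured delta is the per-span instrumentation cost.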

 

Benchmark          | Param: token | Average Time (ns/op) | Error (99.9%) | Cost (impact %)
NoInstrumentation  | 1000         | 1664.84177           | 49.940273     | 0
NoInstrumentation  | 5000         | 8324.19976           | 107.233129    | 0
NoInstrumentation  | 10000        | 16606.7095           | 60.983405     | 0
NoInstrumentation  | 50000        | 83249.7561           | 2113.50206    | 0
NoInstrumentation  | 100000       | 166090.738           | 1693.40512    | 0
NoInstrumentation  | 500000       | 829908.745           | 6188.7579     | 0
NoOpTracer         | 1000         | 1648.86056           | 4.521151      | -0.960
NoOpTracer         | 5000         | 8280.19528           | 86.68164      | -0.529
NoOpTracer         | 10000        | 16569.8697           | 134.783752    | -0.222
NoOpTracer         | 50000        | 83254.7547           | 645.070768    | 0.006
NoOpTracer         | 100000       | 166211.527           | 979.291791    | 0.073
NoOpTracer         | 500000       | 833341.912           | 14473.6203    | 0.414
JaegerTracer       | 1000         | 2373.87521           | 2.188993      | 42.589
JaegerTracer       | 5000         | 10059.0192           | 18.01787      | 20.841
JaegerTracer       | 10000        | 18803.8006           | 63.222248     | 13.230
JaegerTracer       | 50000        | 87511.7573           | 651.314285    | 5.120
JaegerTracer       | 100000       | 172132.755           | 959.72657     | 3.638
JaegerTracer       | 500000       | 846543.298           | 8904.63822    | 2.004
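To make the Cost column concrete: it is the relative change in average time versus the NoInstrumentation baseline at the same token value. For example, for the JaegerTracer at token = 1000, (2373.88 - 1664.84) / 1664.84 * 100 ≈ 42.59%, which matches the 42.589 shown above.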

As seen above, the NoInstrumentation and NoOpTracer scores are comparable, with virtually no impact on performance, while the JaegerTracer adds roughly 40% overhead (about 1.4x) for the smallest job, an overhead that shrinks as the number of CPU cycles consumed grows.

Let's look at the average time per unit of work. Since the CPU cycles consumed vary linearly with the token value, we can divide the average time by the token to get a normalized score; for example, the JaegerTracer run at token = 1000 scores 2373.88 / 1000 ≈ 2.37 ns per token. The table below summarizes the results above into scores for the different instrumentation setups.


Score = Average Time / Token (ns per token)

Token  | NoInstrumentation Score | NoOpTracer Score | JaegerTracer Score
1000   | 1.664841774             | 1.648860562      | 2.373875211
5000   | 1.664839952             | 1.656039056      | 2.011803834
10000  | 1.660670949             | 1.656986968      | 1.880380062
50000  | 1.664995121             | 1.665095094      | 1.750235147
100000 | 1.660907376             | 1.662115265      | 1.721327553
500000 | 1.65981749              | 1.666683825      | 1.693086597

Plotting the table with the token on the X axis and the scores on the Y axis, it becomes clear that the more CPU-intensive the instrumented code is, the smaller the relative impact of instrumentation on performance.

Going by the above measurements, for any function whose total execution time is greater than about 1 ms, the cost of instrumentation is negligible.

That's all, folks. Till next time.