Telemetry and Observability

Overview

The Telemetry and Observability module provides comprehensive tracing and monitoring capabilities for the AI Agent Development Kit (ADK). Its primary role is to collect, manage, and export detailed tracing data related to large language model (LLM) calls, tool executions, and agent lifecycle events. This ensures visibility into the internal workings of agent interactions, enabling developers and operators to diagnose performance issues, understand execution flows, and audit system behavior.

At its core, the module uses OpenTelemetry to instrument operations and emit structured trace spans, which monitoring backends can collect for visualization and analysis. Tracing covers LLM calls, tool executions, and agent lifecycle events.

This observability layer is vital for understanding the dynamic behavior of AI agents, especially in complex workflows involving multiple tools and sub-agents.


Core Concepts and Purpose

Why Telemetry and Observability?

AI agents interact with LLMs and tools asynchronously, often through several levels of nested calls. Without proper observability, it is hard to diagnose performance issues, follow execution flows, or audit system behavior.

By integrating OpenTelemetry, the module provides automatic instrumentation that records spans representing meaningful operations. These spans are enriched with attributes describing the operation's context, inputs, outputs, and identifiers, facilitating detailed trace analysis.

Key Trace Elements


How It Works

Initialization and Span Processor Registration

The module maintains a local tracer provider configured with the registered span processors. These processors define how and where trace data is exported (e.g., to a logging backend or a monitoring service).

Example from internal/telemetry/telemetry.go:

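// AddSpanProcessor queues a span processor for the local tracer provider.
// The queue is only read when RegisterTelemetry first runs.
func AddSpanProcessor(processor sdktrace.SpanProcessor) {
    localTracerConfig.mu.Lock()
    defer localTracerConfig.mu.Unlock()
    localTracerConfig.spanProcessors = append(localTracerConfig.spanProcessors, processor)
}

// RegisterTelemetry builds the local tracer provider exactly once and
// attaches every span processor queued so far.
func RegisterTelemetry() {
    once.Do(func() {
        traceProvider := sdktrace.NewTracerProvider()
        // Read the queued processors under the read lock.
        localTracerConfig.mu.RLock()
        spanProcessors := localTracerConfig.spanProcessors
        localTracerConfig.mu.RUnlock()
        for _, processor := range spanProcessors {
            traceProvider.RegisterSpanProcessor(processor)
        }
        localTracer = tracerProviderHolder{tp: traceProvider}
    })
}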
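One consequence of the once.Do guard in RegisterTelemetry is that ordering matters: a processor queued after initialization is stored but never attached. The following is a minimal, dependency-free sketch of that behavior. The registry type and its Add and Init methods are hypothetical stand-ins for localTracerConfig, AddSpanProcessor, and RegisterTelemetry, and processors are represented by name only.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the module's tracer configuration.
// Processors are plain names here to keep the sketch free of dependencies.
type registry struct {
	mu       sync.RWMutex
	pending  []string  // queued by Add, like AddSpanProcessor
	once     sync.Once // guards one-time setup, like RegisterTelemetry
	attached []string  // snapshot taken the first time Init runs
}

// Add queues a processor. It never touches the attached snapshot.
func (r *registry) Add(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pending = append(r.pending, name)
}

// Init snapshots the queued processors exactly once; later calls are no-ops.
func (r *registry) Init() {
	r.once.Do(func() {
		r.mu.RLock()
		defer r.mu.RUnlock()
		r.attached = append([]string(nil), r.pending...)
	})
}

func main() {
	var r registry
	r.Add("exporter-a")
	r.Init()
	r.Add("exporter-b") // too late: Init has already run
	r.Init()            // no-op
	fmt.Println(r.attached) // prints [exporter-a]
}
```

In practice this suggests registering all span processors during application startup, before the first agent invocation.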

Starting Traces

When an agent begins a significant operation, such as calling an LLM or executing a tool, the telemetry module starts one span per available tracer: one from the local tracer and one from the global tracer, if set. Starting a span on both means the trace reaches the processors registered with this module as well as any globally configured OpenTelemetry provider.

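// StartTrace starts one span named traceName on every available tracer.
func StartTrace(ctx context.Context, traceName string) []trace.Span {
    tracers := getTracers()
    spans := make([]trace.Span, len(tracers))
    for i, tracer := range tracers {
        // The derived context is discarded, so the new span is not
        // carried forward through ctx to later calls.
        _, span := tracer.Start(ctx, traceName)
        spans[i] = span
    }
    return spans
}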

The returned spans are later passed to TraceLLMCall or TraceToolCall, which set attributes on each span and end it once the operation completes.
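The fan-out pattern can be illustrated without the OpenTelemetry SDK. The sketch below is not the module's code: the span and tracer interfaces, the recording types, and the startTrace and finish helpers are all hypothetical, simplified stand-ins. It shows only the shape of the mechanism: one span is started per tracer, and every span in the returned slice is annotated and ended together.

```go
package main

import "fmt"

// span is a hypothetical, minimal stand-in for trace.Span.
type span interface {
	SetAttribute(key, value string)
	End()
}

// tracer is a hypothetical, minimal stand-in for trace.Tracer.
type tracer interface {
	Start(name string) span
}

// recordingSpan remembers what happened to it so it can be inspected.
type recordingSpan struct {
	name  string
	attrs map[string]string
	ended bool
}

func (s *recordingSpan) SetAttribute(key, value string) { s.attrs[key] = value }
func (s *recordingSpan) End()                           { s.ended = true }

// recordingTracer hands out recordingSpans and keeps a reference to each.
type recordingTracer struct{ spans []*recordingSpan }

func (t *recordingTracer) Start(name string) span {
	s := &recordingSpan{name: name, attrs: map[string]string{}}
	t.spans = append(t.spans, s)
	return s
}

// startTrace mirrors the fan-out in StartTrace: one span per tracer.
func startTrace(tracers []tracer, name string) []span {
	spans := make([]span, len(tracers))
	for i, t := range tracers {
		spans[i] = t.Start(name)
	}
	return spans
}

// finish annotates and ends every span, as the Trace* helpers do.
func finish(spans []span, key, value string) {
	for _, s := range spans {
		s.SetAttribute(key, value)
		s.End()
	}
}

func main() {
	local, global := &recordingTracer{}, &recordingTracer{}
	spans := startTrace([]tracer{local, global}, "llm_request")
	finish(spans, "model", "example-model")
	fmt.Println(len(spans), local.spans[0].ended, global.spans[0].ended) // prints 2 true true
}
```

Because each tracer receives its own span, a span left unended on one tracer would be missing from that backend only, which is why the slice is always processed as a whole.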

Tracing LLM Calls

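For each LLM request initiated by an agent, the module records the GenAI system name, the requested model, the invocation, session, and event IDs, and the serialized request and response.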

This detailed capture allows for replaying or analyzing the exact input-output pairs of model calls.

// TraceLLMCall annotates every span with the LLM request and response
// details, then ends it.
func TraceLLMCall(spans []trace.Span, agentCtx agent.InvocationContext, llmRequest *model.LLMRequest, event *session.Event) {
    for _, span := range spans {
        attributes := []attribute.KeyValue{
            // Which system and model handled the request.
            attribute.String(genAiSystemName, systemName),
            attribute.String(genAiRequestModelName, llmRequest.Model),
            // Identifiers that tie the span to an invocation, session, and event.
            attribute.String(gcpVertexAgentInvocationID, event.InvocationID),
            attribute.String(gcpVertexAgentSessionID, agentCtx.Session().ID()),
            attribute.String(gcpVertexAgentEventID, event.ID),
            // Serialized request and response payloads.
            attribute.String(gcpVertexAgentLLMRequestName, safeSerialize(llmRequestToTrace(llmRequest))),
            attribute.String(gcpVertexAgentLLMResponseName, safeSerialize(event.LLMResponse)),
        }
        // Additional attribute settings...
        span.SetAttributes(attributes...)
        span.End()
    }
}
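The excerpt above calls safeSerialize without showing its body. One plausible reading of the name is "serialize to JSON, and never fail the trace if that is impossible". The sketch below illustrates that idea under stated assumptions: the function name safeSerializeSketch and the placeholder string are hypothetical, and the module's actual implementation may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// notSerializable is a hypothetical placeholder value; the real module
// may use a different string.
const notSerializable = "<not serializable>"

// safeSerializeSketch returns v encoded as JSON, or a placeholder when v
// cannot be marshaled, so that building span attributes never fails.
func safeSerializeSketch(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return notSerializable
	}
	return string(b)
}

func main() {
	fmt.Println(safeSerializeSketch(map[string]any{"model": "example-model"})) // prints {"model":"example-model"}
	fmt.Println(safeSerializeSketch(make(chan int)))                           // prints <not serializable>
}
```

Falling back to a placeholder keeps a malformed or unsupported payload from dropping the whole span, at the cost of losing that one attribute's content.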

Tracing Tool Executions

Similarly, when an agent executes a tool (e.g., a function tool or artifact loader), the telemetry module records the tool's name and description, the call ID, and the tool's response.

For merged or aggregated tool calls, a separate tracing method is used to represent collective execution.

func TraceToolCall(spans []trace.Span, tool tool.Tool, fnArgs map[string]any, fnResponseEvent *session.Event) {
    // Sets attributes including tool name, description, call ID, and response.
    // Ends the span after setting attributes.
}

Interactions With Other Modules

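The Telemetry and Observability module sits on the agent's call path and touches several system components: the agent runtime, which starts spans and hands them back for completion; the LLM and tool invocations being traced; and the session layer, which supplies the session and event IDs recorded on each span.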

This integration ensures that telemetry is seamlessly embedded in the agent runtime lifecycle without requiring explicit instrumentation in client code.


Usage and Extension

External systems or developers can extend telemetry capabilities by registering custom span processors. This allows exporting trace data to various backends, such as Jaeger, Zipkin, or cloud monitoring platforms.

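The public API exposed via the telemetry package lets clients register span processors. Registration should happen before the first trace is emitted: RegisterTelemetry attaches the queued processors once, inside once.Do, so based on that excerpt a processor added afterwards would not be attached to the local tracer provider.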

// RegisterSpanProcessor registers a span processor to the local tracer provider.
func RegisterSpanProcessor(processor sdktrace.SpanProcessor)

This flexibility ensures that telemetry data can be routed appropriately per deployment requirements.


Visualization of Telemetry Workflow

sequenceDiagram
participant Agent as Agent
participant Telemetry as Telemetry Module
participant LLM as LLM Model
participant Tool as Tool
participant Session as Session Service
Agent->>Telemetry: StartTrace("llm_request")
Telemetry->>Agent: Return spans
Agent->>LLM: Send LLM Request
LLM-->>Agent: LLM Response
Agent->>Telemetry: TraceLLMCall(spans, context, request, responseEvent)
Telemetry->>Telemetry: Set span attributes and End spans
Agent->>Telemetry: StartTrace("execute_tool")
Telemetry->>Agent: Return spans
Agent->>Tool: Execute with arguments
Tool-->>Agent: Tool Response
Agent->>Telemetry: TraceToolCall(spans, tool, args, responseEvent)
Telemetry->>Telemetry: Set span attributes and End spans
Agent->>Session: Access session ID and event ID

This sequence illustrates the lifecycle of tracing around LLM calls and tool executions within an agent's operation, highlighting the telemetry module's role in span lifecycle management and attribute annotation.


Summary of Key Files


Related Topics