Telemetry and Observability

Overview

The Telemetry and Observability module provides tracing for the AI Agent Development Kit (ADK). It collects, manages, and exports trace data for large language model (LLM) calls, tool executions, and agent lifecycle events. This gives developers and operators visibility into agent interactions so they can diagnose performance issues, follow execution flows, and audit system behavior.

The module uses OpenTelemetry to emit structured trace spans, which monitoring backends can collect for visualization and analysis. The traces capture:

Invocation of LLM models, including request and response details.
Execution of tools called by agents, including input arguments and output.
Aggregated or merged tool call events.
Agent session and invocation identifiers for correlation.

This layer is most useful in complex workflows involving multiple tools and sub-agents, where the sequence of calls is otherwise hard to reconstruct.

Core Concepts and Purpose

Why Telemetry and Observability?

AI agents call LLMs and tools asynchronously, and a single invocation can involve multiple nested calls. Without observability, it is hard to:

Track how input messages propagate through agents and tools.
Understand latency and error patterns in LLM or tool invocations.
Correlate session events with specific agent invocations.
Debug issues in multi-agent or distributed scenarios.

By integrating OpenTelemetry, the module provides automatic instrumentation that records spans representing meaningful operations. These spans are enriched with attributes describing the operation's context, inputs, outputs, and identifiers, facilitating detailed trace analysis.

Key Trace Elements

Spans: Represent discrete timed operations (e.g., an LLM request or a tool execution).
Attributes: Key-value pairs attached to spans that describe metadata (e.g., tool name, request model, session ID).
Tracer Providers: Manage span creation and export.

How It Works

Initialization and Span Processor Registration

The module maintains a local tracer provider instance configured with the registered span processors. These processors define how and where trace data is exported (e.g., to a logging backend or a monitoring service).

Span processors must be registered before any events are emitted.
The module respects both local and global OpenTelemetry tracer configurations.
The initialization is guarded to run once, ensuring a singleton tracer provider.

Example from internal/telemetry/telemetry.go:

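// AddSpanProcessor queues a processor for the local tracer provider.
// It only takes effect if called before RegisterTelemetry runs.
func AddSpanProcessor(processor sdktrace.SpanProcessor) {
    localTracerConfig.mu.Lock()
    defer localTracerConfig.mu.Unlock()
    localTracerConfig.spanProcessors = append(localTracerConfig.spanProcessors, processor)
}

// RegisterTelemetry builds the local tracer provider exactly once and
// attaches every processor queued so far.
func RegisterTelemetry() {
    once.Do(func() {
        traceProvider := sdktrace.NewTracerProvider()
        // Read the queued processors under the read lock.
        localTracerConfig.mu.RLock()
        spanProcessors := localTracerConfig.spanProcessors
        localTracerConfig.mu.RUnlock()
        for _, processor := range spanProcessors {
            traceProvider.RegisterSpanProcessor(processor)
        }
        localTracer = tracerProviderHolder{tp: traceProvider}
    })
}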
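The ordering rule (processors must be registered before initialization) follows from the once-guarded setup. The sketch below models that pattern with the standard library only. The `processor`, `namedProcessor`, and `registry` names are hypothetical stand-ins for illustration and are not part of the module or of OpenTelemetry.

```go
package main

import (
	"fmt"
	"sync"
)

// processor is a hypothetical stand-in for sdktrace.SpanProcessor.
type processor interface {
	Name() string
}

// namedProcessor is a trivial processor used only for this illustration.
type namedProcessor string

func (n namedProcessor) Name() string { return string(n) }

// registry models the pattern above: a mutex-guarded pending list
// plus a sync.Once that freezes it into the active set.
type registry struct {
	mu      sync.RWMutex
	pending []processor
	once    sync.Once
	active  []processor
}

// Add queues a processor; it only takes effect if Init has not run yet.
func (r *registry) Add(p processor) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pending = append(r.pending, p)
}

// Init snapshots the pending processors exactly once.
func (r *registry) Init() {
	r.once.Do(func() {
		r.mu.RLock()
		defer r.mu.RUnlock()
		r.active = append([]processor(nil), r.pending...)
	})
}

// ActiveNames reports which processors were captured by Init.
func (r *registry) ActiveNames() []string {
	names := make([]string, 0, len(r.active))
	for _, p := range r.active {
		names = append(names, p.Name())
	}
	return names
}

func main() {
	r := &registry{}
	r.Add(namedProcessor("early"))
	r.Init()
	r.Add(namedProcessor("late")) // queued, but Init will not run again
	r.Init()
	fmt.Println(r.ActiveNames()) // prints [early]
}
```

In this model, a processor added after the first `Init` is silently ignored, which is why registration has to happen before the first trace is emitted.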

Starting Traces

When an agent begins a significant operation, such as invoking an LLM model or executing a tool, the telemetry module starts one span per available tracer: the local tracer and, if set, the global tracer. Emitting to both means that processors registered with the module and any globally configured OpenTelemetry pipeline each receive the span.

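// StartTrace starts one span per configured tracer and returns them all.
func StartTrace(ctx context.Context, traceName string) []trace.Span {
    tracers := getTracers()
    spans := make([]trace.Span, len(tracers))
    for i, tracer := range tracers {
        // The context returned by Start is discarded; callers work with the spans directly.
        _, span := tracer.Start(ctx, traceName)
        spans[i] = span
    }
    return spans
}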

The returned spans are then used to record attributes and are ended once the operation completes.
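The start, annotate, and end cycle across several tracers can be sketched without the OpenTelemetry SDK. In the sketch below, `span`, `tracer`, `startAll`, and `finishAll` are hypothetical simplifications for illustration; the real module works with `trace.Span` and `trace.Tracer`.

```go
package main

import "fmt"

// span is a hypothetical, simplified stand-in for trace.Span.
type span struct {
	tracer string
	name   string
	attrs  map[string]string
	ended  bool
}

// tracer is a hypothetical, simplified stand-in for trace.Tracer.
type tracer struct{ name string }

// Start opens a new span owned by this tracer.
func (t tracer) Start(name string) *span {
	return &span{tracer: t.name, name: name, attrs: map[string]string{}}
}

// startAll opens one span per tracer, mirroring the fan-out in StartTrace.
func startAll(tracers []tracer, name string) []*span {
	spans := make([]*span, len(tracers))
	for i, t := range tracers {
		spans[i] = t.Start(name)
	}
	return spans
}

// finishAll copies the same attributes onto every span and ends each one.
func finishAll(spans []*span, attrs map[string]string) {
	for _, s := range spans {
		for k, v := range attrs {
			s.attrs[k] = v
		}
		s.ended = true
	}
}

func main() {
	tracers := []tracer{{name: "local"}, {name: "global"}}
	spans := startAll(tracers, "llm_request")
	finishAll(spans, map[string]string{"model": "example-model"})
	for _, s := range spans {
		fmt.Println(s.tracer, s.name, s.ended, s.attrs["model"])
	}
}
```

Every span in the slice receives identical attributes, so each tracer's backend sees the same record of the operation.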

Tracing LLM Calls

For each LLM request initiated by an agent, the module records:

Model name.
Request configuration (e.g., token limits, sampling temperature).
Full request content (filtered to exclude inline data).
Corresponding LLM response.
Invocation and session identifiers for correlation.

This detailed capture allows for replaying or analyzing the exact input-output pairs of model calls.

func TraceLLMCall(spans []trace.Span, agentCtx agent.InvocationContext, llmRequest *model.LLMRequest, event *session.Event) {
    for _, span := range spans {
        attributes := []attribute.KeyValue{
            attribute.String(genAiSystemName, systemName),
            attribute.String(genAiRequestModelName, llmRequest.Model),
            attribute.String(gcpVertexAgentInvocationID, event.InvocationID),
            attribute.String(gcpVertexAgentSessionID, agentCtx.Session().ID()),
            attribute.String(gcpVertexAgentEventID, event.ID),
            attribute.String(gcpVertexAgentLLMRequestName, safeSerialize(llmRequestToTrace(llmRequest))),
            attribute.String(gcpVertexAgentLLMResponseName, safeSerialize(event.LLMResponse)),
        }
        // Additional attribute settings...
        span.SetAttributes(attributes...)
        span.End()
    }
}
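The request attribute above is produced by filtering the request and then serializing it. The sketch below shows one way those two steps could work, using only the standard library. The `part` and `request` types, the choice to drop whole parts that carry inline data, and the placeholder string are assumptions made for illustration; they are not the actual behavior of `llmRequestToTrace` or `safeSerialize`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// part and request are hypothetical, simplified stand-ins for the
// content types carried by an LLM request.
type part struct {
	Text       string `json:"text,omitempty"`
	InlineData []byte `json:"inlineData,omitempty"`
}

type request struct {
	Model string `json:"model"`
	Parts []part `json:"parts"`
}

// withoutInlineData returns a copy of req that keeps only the parts
// that carry no binary payload.
func withoutInlineData(req request) request {
	out := request{Model: req.Model, Parts: []part{}}
	for _, p := range req.Parts {
		if len(p.InlineData) > 0 {
			continue
		}
		out.Parts = append(out.Parts, p)
	}
	return out
}

// serializeOrPlaceholder returns v as JSON, or a fixed placeholder when
// marshaling fails, so that tracing never aborts the traced operation.
func serializeOrPlaceholder(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "<not serializable>"
	}
	return string(b)
}

func main() {
	req := request{
		Model: "example-model",
		Parts: []part{
			{Text: "describe this image"},
			{InlineData: []byte{0xff, 0xd8}},
		},
	}
	fmt.Println(serializeOrPlaceholder(withoutInlineData(req)))
	// prints {"model":"example-model","parts":[{"text":"describe this image"}]}
	fmt.Println(serializeOrPlaceholder(make(chan int)))
	// prints <not serializable>
}
```

Dropping inline data keeps span attributes small, since binary payloads such as images would otherwise be copied into every exported trace.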

Tracing Tool Executions

Similarly, when an agent executes a tool (e.g., a function tool or artifact loader), the telemetry module traces:

Tool name and description.
Input arguments passed to the tool.
Tool response content.
Unique identifiers related to the tool call.

For merged or aggregated tool calls, a separate tracing method is used to represent collective execution.

func TraceToolCall(spans []trace.Span, tool tool.Tool, fnArgs map[string]any, fnResponseEvent *session.Event) {
    // Sets attributes including tool name, description, call ID, and response.
    // Ends the span after setting attributes.
}
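Since the body of TraceToolCall is elided above, the following is only a rough sketch of how a tool call might be flattened into string attributes. The `toolInfo` type, the `toolCallAttributes` helper, and all attribute key names are hypothetical and chosen for illustration; the module's actual constants are not shown in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolInfo is a hypothetical stand-in for the tool.Tool interface.
type toolInfo struct {
	Name        string
	Description string
}

// toJSON serializes v, falling back to a placeholder on failure.
func toJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "<not serializable>"
	}
	return string(b)
}

// toolCallAttributes flattens one tool call into string attributes.
// The key names are illustrative, not the module's actual constants.
func toolCallAttributes(t toolInfo, callID string, args, response map[string]any) map[string]string {
	return map[string]string{
		"tool.name":        t.Name,
		"tool.description": t.Description,
		"tool.call.id":     callID,
		"tool.call.args":   toJSON(args),
		"tool.response":    toJSON(response),
	}
}

func main() {
	attrs := toolCallAttributes(
		toolInfo{Name: "get_weather", Description: "Looks up a forecast"},
		"call-1",
		map[string]any{"city": "Paris", "days": 3},
		map[string]any{"forecast": "sunny"},
	)
	fmt.Println(attrs["tool.call.args"]) // prints {"city":"Paris","days":3}
	fmt.Println(attrs["tool.response"])  // prints {"forecast":"sunny"}
}
```

Serializing arguments and responses to JSON strings keeps each attribute a single scalar value, which span attributes handle more readily than nested maps.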

Interactions With Other Modules

The Telemetry and Observability module is tightly integrated with several key system components:

Agents (agent package): The agent lifecycle and invocation flow trigger telemetry events for LLM calls and tool executions. The agent's context provides session and invocation identifiers used in trace attributes.
Session Management (session package): Session IDs and event IDs from session events are attached to spans for trace correlation.
Tools (tool package): Tool executions are instrumented to emit tracing spans with detailed argument and response data.
LLM Models (model package): The module accesses LLM request and response data structures to serialize and record trace content.
OpenTelemetry SDK: The module configures and utilizes OpenTelemetry SDK components such as tracer providers and span processors.

Because the agent runtime makes these telemetry calls itself, client code does not need any explicit instrumentation.

Usage and Extension

External systems or developers can extend telemetry capabilities by registering custom span processors. This allows exporting trace data to various backends, such as Jaeger, Zipkin, or cloud monitoring platforms.

The public API exposed via the telemetry package allows clients to register span processors before any trace emission:

// RegisterSpanProcessor registers a span processor to the local tracer provider.
func RegisterSpanProcessor(processor sdktrace.SpanProcessor)

This flexibility ensures that telemetry data can be routed appropriately per deployment requirements.

Visualization of Telemetry Workflow

sequenceDiagram
participant Agent as Agent
participant Telemetry as Telemetry Module
participant LLM as LLM Model
participant Tool as Tool
participant Session as Session Service
Agent->>Telemetry: StartTrace("llm_request")
Telemetry-->>Agent: Return spans
Agent->>LLM: Send LLM Request
LLM-->>Agent: LLM Response
Agent->>Telemetry: TraceLLMCall(spans, context, request, responseEvent)
Telemetry->>Telemetry: Set span attributes and End spans
Agent->>Telemetry: StartTrace("execute_tool")
Telemetry-->>Agent: Return spans
Agent->>Tool: Execute with arguments
Tool-->>Agent: Tool Response
Agent->>Telemetry: TraceToolCall(spans, tool, args, responseEvent)
Telemetry->>Telemetry: Set span attributes and End spans
Agent->>Session: Access session ID and event ID

This sequence illustrates the lifecycle of tracing around LLM calls and tool executions within an agent's operation, highlighting the telemetry module's role in span lifecycle management and attribute annotation.

Summary of Key Files

internal/telemetry/telemetry.go: Implements the core telemetry logic, including tracer provider setup, span creation, and attribute enrichment for LLM calls and tool executions.
telemetry/telemetry.go: Provides the public API for registering span processors and delegates to the internal implementation.