
otlpgrpc receiver leaks bufio.NewReaderSize memory under connection churn #15086

@chrisleavoy

Description


Component(s)

receiver/otlp

What happened?

Describe the bug

Memory leak in the gRPC server: the bufio.Reader buffers allocated in newFramer() are never released when connections close, even though the associated goroutines do shut down cleanly.

Steps to reproduce

1. Collector config

Save as collector.yaml:

receivers:
    otlp:
        protocols:
            grpc:
                endpoint: 0.0.0.0:4317
                read_buffer_size: 524288
                write_buffer_size: 32768
                keepalive:
                    enforcement_policy:
                        min_time: 5m
                        permit_without_stream: false
                    server_parameters:
                        time: 2h
                        timeout: 20s
                        max_connection_idle: 3m
                        max_connection_age: 5m
                        max_connection_age_grace: 45s

processors:
    batch:
        send_batch_max_size: 16384

exporters:
    debug: {}

extensions:
    pprof:
        endpoint: 0.0.0.0:1777

service:
    extensions: [pprof]
    pipelines:
        traces:
            receivers: [otlp]
            processors: [batch]
            exporters: [debug]

Run the collector:

otelcol-contrib --config collector.yaml

2. Connection-churn client

Save as main.go:

package main

import (
 "context"
 "log"
 "sync"
 "time"

 "go.opentelemetry.io/otel"
 "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
 "go.opentelemetry.io/otel/sdk/resource"
 sdktrace "go.opentelemetry.io/otel/sdk/trace"
 semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
 "google.golang.org/grpc"
 "google.golang.org/grpc/credentials/insecure"
)

func main() {
 const (
  endpoint = "127.0.0.1:4317"
  workers  = 32
  runtime_ = 45 * time.Minute
 )

 deadline := time.Now().Add(runtime_)
 var wg sync.WaitGroup

 for worker := 0; worker < workers; worker++ {
  wg.Add(1)
  go func(workerID int) {
   defer wg.Done()

   for time.Now().Before(deadline) {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

    exporter, err := otlptracegrpc.New(ctx,
     otlptracegrpc.WithEndpoint(endpoint),
     otlptracegrpc.WithTLSCredentials(insecure.NewCredentials()),
     otlptracegrpc.WithDialOption(grpc.WithBlock()),
    )
    if err != nil {
     cancel()
     log.Printf("worker %d: exporter create failed: %v", workerID, err)
     time.Sleep(250 * time.Millisecond)
     continue
    }

    tp := sdktrace.NewTracerProvider(
     sdktrace.WithBatcher(exporter),
     sdktrace.WithResource(resource.NewWithAttributes(
      semconv.SchemaURL,
      semconv.ServiceName("otlp-grpc-repro"),
     )),
    )

    otel.SetTracerProvider(tp)
    tracer := otel.Tracer("repro")

    for i := 0; i < 10; i++ {
     _, span := tracer.Start(ctx, "repro-span")
     span.End()
    }

    if err := tp.ForceFlush(ctx); err != nil {
     log.Printf("worker %d: force flush failed: %v", workerID, err)
    }
    if err := tp.Shutdown(ctx); err != nil {
     log.Printf("worker %d: shutdown failed: %v", workerID, err)
    }

    cancel()
    time.Sleep(200 * time.Millisecond)
   }
  }(worker)
 }

 wg.Wait()
}

3. Inspect retained heap after GC

After the client has run long enough to create meaningful churn:

curl -s 'http://127.0.0.1:1777/debug/pprof/heap?gc=1' > heap.pb.gz
go tool pprof -top -sample_index=inuse_space heap.pb.gz

What we expect when the issue reproduces:

  • bufio.NewReaderSize remains visible after forced GC
  • the cumulative stack includes:
      google.golang.org/grpc/internal/transport.newFramer
      google.golang.org/grpc/internal/transport.NewServerTransport
      google.golang.org/grpc.(*Server).newHTTP2Transport
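
To make this check repeatable across runs, here is a minimal sketch (not part of the original repro) that fetches the post-GC heap profile from the pprof extension and sums the inuse_space still attributed to transport.newFramer. It assumes the pprof endpoint from the config above and uses the github.com/google/pprof/profile package.

package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/google/pprof/profile"
)

// stackContains reports whether any frame in the sample's stack matches substr.
func stackContains(s *profile.Sample, substr string) bool {
    for _, loc := range s.Location {
        for _, ln := range loc.Line {
            if ln.Function != nil && strings.Contains(ln.Function.Name, substr) {
                return true
            }
        }
    }
    return false
}

func main() {
    // gc=1 forces a garbage collection before the heap profile is taken.
    resp, err := http.Get("http://127.0.0.1:1777/debug/pprof/heap?gc=1")
    if err != nil {
        log.Fatalf("fetch heap profile: %v", err)
    }
    defer resp.Body.Close()

    prof, err := profile.Parse(resp.Body)
    if err != nil {
        log.Fatalf("parse heap profile: %v", err)
    }

    // Locate the inuse_space value index in the profile.
    idx := -1
    for i, st := range prof.SampleType {
        if st.Type == "inuse_space" {
            idx = i
        }
    }
    if idx < 0 {
        log.Fatal("heap profile has no inuse_space sample type")
    }

    // Sum bytes still retained under transport.newFramer after the forced GC.
    var retained int64
    for _, s := range prof.Sample {
        if stackContains(s, "transport.newFramer") {
            retained += s.Value[idx]
        }
    }
    fmt.Printf("inuse_space retained under transport.newFramer: %d bytes\n", retained)
}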

Control experiment that did not reproduce

For comparison, a grpc-go-only harness with the same ReadBufferSize, keepalive settings, and server-driven connection churn showed a flat retained heap after quiescence and a forced GC.

That is the main reason I think this issue should start here instead of in grpc-go.
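
A sketch of what such a harness can look like (illustrative only, not the exact code behind the comparison above): it starts a bare grpc-go server with the same buffer and keepalive settings, churns client connections against it, and exposes pprof on 127.0.0.1:1778 so its heap can be inspected the same way.

package main

import (
    "log"
    "net"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/keepalive"
)

func main() {
    // pprof endpoint for the control harness itself.
    go func() { _ = http.ListenAndServe("127.0.0.1:1778", nil) }()

    lis, err := net.Listen("tcp", "127.0.0.1:5317")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }

    // Same buffer sizes and keepalive settings as the collector config above.
    srv := grpc.NewServer(
        grpc.ReadBufferSize(524288),
        grpc.WriteBufferSize(32768),
        grpc.KeepaliveParams(keepalive.ServerParameters{
            MaxConnectionIdle:     3 * time.Minute,
            MaxConnectionAge:      5 * time.Minute,
            MaxConnectionAgeGrace: 45 * time.Second,
            Time:                  2 * time.Hour,
            Timeout:               20 * time.Second,
        }),
        grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
            MinTime:             5 * time.Minute,
            PermitWithoutStream: false,
        }),
    )
    go func() { _ = srv.Serve(lis) }()

    // Churn connections: dial, force the connection up, then close it.
    for i := 0; i < 5000; i++ {
        conn, err := grpc.NewClient("127.0.0.1:5317",
            grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            log.Printf("dial %d: %v", i, err)
            continue
        }
        conn.Connect()
        time.Sleep(50 * time.Millisecond)
        _ = conn.Close()
    }
    srv.GracefulStop()
}

After the loop finishes and the process has quiesced, the same curl + go tool pprof inspection (against 127.0.0.1:1778) gives a direct comparison point.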

Missing Cleanup in http2Server.Close()

File: internal/transport/http2_server.go:1269-1292

func (t *http2Server) Close(err error) {
    t.mu.Lock()
    if t.state == closing {
        t.mu.Unlock()
        return
    }
    t.state = closing
    streams := t.activeStreams
    t.activeStreams = nil
    t.mu.Unlock()
    t.controlBuf.finish()
    close(t.done)
    if err := t.conn.Close(); err != nil && t.logger.V(logLevel) {
        t.logger.Infof("Error closing underlying net.Conn during Close: %v", err)
    }
    channelz.RemoveEntry(t.channelz.ID)
    for _, s := range streams {
        s.cancel()
    }

    // ❌ MISSING: No cleanup of t.framer!
    // The framer holds a bufio.Reader that is never released
}

The Framer Lifecycle

  1. Created in NewServerTransport (line 172):

    framer := newFramer(conn, writeBufSize, readBufSize, ...)
  2. newFramer allocates bufio.Reader (http_util.go:419):

    func newFramer(conn io.ReadWriter, writeBufferSize, readBufferSize int, ...) *framer {
        var r io.Reader = conn
        if readBufferSize > 0 {
            r = bufio.NewReaderSize(r, readBufferSize)  // ← Allocation
        }
        f := &framer{
            reader: r,  // ← bufio.Reader stored
            // ...
        }
        return f
    }
  3. Stored in http2Server (line 83):

    type http2Server struct {
        framer *framer  // ← Never cleaned up!
        // ...
    }
  4. NOT cleaned up when Close() is called

No Cleanup Method Exists

The framer struct has no cleanup/close method:

type framer struct {
    writer    *bufWriter
    fr        *http2.Framer
    headerBuf []byte
    reader    io.Reader    // ← This is the bufio.Reader!
    dataFrame parsedDataFrame
    pool      mem.BufferPool
    errDetail error
}

// ❌ No cleanup() method exists

I will note that the lack of cleanup methods here doesn't necessarily guarantee a memory leak. The above report was created with the help of both Codex and Claude; neither was able to conclusively identify the real cause or a fix. However, since the issue is easily reproducible in the latest version of the collector, I figured I'd better submit an upstream issue for more guidance.

Collector version

v0.149.0

Environment information

Environment

  • collector v0.148.0
  • Go version: 1.25.1
  • gRPC version: v1.79.0

Additional context

No response
