Prometheus Client Libraries: Instrument Python/Node.js/Go Applications

Choosing a Prometheus client library for your language is a decision that affects reliability, memory footprint, and long-term maintainability. Each library has its own philosophy and constraints: Python offers a dedicated multi-process mode, Go encourages declaring CounterVec label sets up front and registering them once, and Node.js collects metrics asynchronously by default.

This article digs into the internals of three widely used client libraries: prometheus_client for Python and client_golang for Go, both maintained by the Prometheus project, and prom-client for Node.js, which is community-maintained. It also covers advanced patterns that most teams underuse, such as custom collectors, exemplars, the Pushgateway, and multi-process mode.

Comparing the Client Libraries

Feature	Python	Node.js	Go
Package	prometheus_client	prom-client	client_golang
Multi-process mode	✅ (gunicorn/uwsgi)	⚠️ (cluster mode via AggregatorRegistry)	N/A (single process)
Default metrics	Process + platform	Event loop, GC, heap	Go runtime + process
Collection model	Sync	Async (Promise)	Sync
Exemplars support	✅	✅ (v14+)	✅
Pushgateway client	✅	✅	✅
Native histogram	❌	❌	✅ (experimental)

Python: Multi-process Mode

Python has a constraint other languages do not share: when run under gunicorn or uwsgi with multiple workers, each worker is a separate process, so metrics held in memory are not shared between them. A scrape returns only the values of whichever worker happens to answer it.

The fix is multi-process mode, in which every worker writes its metrics to files in a shared directory and the values are aggregated at scrape time:

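# Shell: export the environment variable before starting gunicorn.
# The directory should be emptied between application restarts.
export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus-multiproc
mkdir -p "$PROMETHEUS_MULTIPROC_DIR"

# gunicorn config (gunicorn.conf.py)
def child_exit(server, worker):
    # Tell the client which worker exited so its live-gauge files are cleaned up.
    from prometheus_client import multiprocess
    multiprocess.mark_process_dead(worker.pid)

# In the app (Flask is assumed here, based on @app.route and Response)
from flask import Flask, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, CollectorRegistry, generate_latest, multiprocess,
)

app = Flask(__name__)

@app.route('/metrics')
def metrics():
    # Build a fresh registry per scrape that aggregates every worker's files.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return Response(generate_latest(registry), content_type=CONTENT_TYPE_LATEST)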

Caveat: in multi-process mode every Gauge has to choose its own aggregation mode (all, liveall, livesum, min, max, sum), because there is no single obvious way to combine gauge values across processes the way there is for a Counter or Histogram.

from prometheus_client import Gauge

# Several workers each hold their own value; expose the sum across workers that are still alive
active_users = Gauge(
    'active_users',
    'Active users',
    multiprocess_mode='livesum'
)
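
To make the difference between the modes concrete, the following is a small stdlib-only model of how per-worker gauge values could be collapsed. It is a sketch for illustration, not prometheus_client's actual implementation: `aggregate_gauge`, `per_worker` and `alive` are hypothetical names, and only sum, min, max and their "live" variants are modelled (all and liveall keep one series per process instead of collapsing them).

```python
from typing import Dict, Set


def aggregate_gauge(values_by_pid: Dict[int, float], live_pids: Set[int], mode: str) -> float:
    """Collapse per-worker gauge values into one number for the given mode."""
    if mode.startswith("live"):
        # "live" modes ignore values left behind by workers that have exited.
        values = [v for pid, v in values_by_pid.items() if pid in live_pids]
        mode = mode[len("live"):]
    else:
        values = list(values_by_pid.values())
    if not values:
        return 0.0
    if mode == "sum":
        return float(sum(values))
    if mode == "min":
        return float(min(values))
    if mode == "max":
        return float(max(values))
    raise ValueError(f"mode {mode!r} is not modelled in this sketch")


# Three workers reported a value; worker 103 has since exited.
per_worker = {101: 4.0, 102: 7.0, 103: 9.0}
alive = {101, 102}

total_all = aggregate_gauge(per_worker, alive, "sum")       # includes the dead worker
total_live = aggregate_gauge(per_worker, alive, "livesum")  # live workers only
```

For a gauge such as active_users, the plain sum keeps counting users of a worker that no longer exists, which is why the example above picks livesum.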

Node.js: Async Collection and Cluster Mode

prom-client exposes an async API: register.metrics() returns a Promise, which suits Node.js patterns. Take care when using cluster mode, though, because each worker keeps its own metrics there too.

To handle cluster mode, prom-client provides AggregatorRegistry, with which the primary (master) process collects metrics from every worker over IPC:

const cluster = require('cluster');
const client = require('prom-client');
const express = require('express');

// Created in every process, mirroring prom-client's cluster example: the primary
// uses it to request metrics, and workers use it to answer over IPC.
const aggregator = new client.AggregatorRegistry();

// cluster.isPrimary exists since Node.js 16; older versions use cluster.isMaster.
if (cluster.isPrimary) {
  const numCPUs = require('os').cpus().length;
  for (let i = 0; i < numCPUs; i++) cluster.fork();

  const app = express();
  app.get('/metrics', async (req, res) => {
    try {
      const metrics = await aggregator.clusterMetrics();
      res.set('Content-Type', aggregator.contentType);
      res.send(metrics);
    } catch (ex) {
      res.status(500).send(ex.message);
    }
  });
  app.listen(9090);
} else {
  // Registered on the worker's default registry, which the primary aggregates.
  const counter = new client.Counter({
    name: 'worker_requests_total',
    help: 'Requests per worker',
  });
  require('./worker');
}

AggregatorRegistry combines Counters by summing, Gauges by summing by default (configurable per metric), and Histograms by summing their buckets, which makes it a good fit for exposing the whole app's metrics through one endpoint.
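
Summing works for histograms because every bucket, as well as the running sum and count, is itself a counter. The sketch below models that merge in plain Python; it is only an illustration of the idea, not prom-client code, and `merge_histograms`, `worker_a` and `worker_b` are hypothetical names with made-up sample values.

```python
from collections import Counter
from typing import Dict, List


def merge_histograms(workers: List[Dict[str, float]]) -> Dict[str, float]:
    """Add up bucket counts, sum and count sample-by-sample across workers."""
    merged: Counter = Counter()
    for samples in workers:
        merged.update(samples)  # Counter.update adds the values of matching keys
    return dict(merged)


# Cumulative bucket counts as each worker would expose them.
worker_a = {'le="0.1"': 2, 'le="0.5"': 5, 'le="+Inf"': 6, "sum": 1.9, "count": 6}
worker_b = {'le="0.1"': 1, 'le="0.5"': 1, 'le="+Inf"': 4, "sum": 3.2, "count": 4}

merged = merge_histograms([worker_a, worker_b])
```

The merged buckets are still cumulative, so quantile estimates computed from them describe the application as a whole rather than any single worker.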

Go: Pre-registration and Performance

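client_golang encourages declaring each *Vec once, registering it at startup, and reusing it. Looking up a child metric by label values goes through a hash map, which is fast, but repeating that lookup (and any allocation that comes with it) on every request in a hot path can noticeably reduce throughput.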

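// httpRequests is assumed to be a *prometheus.CounterVec with the labels service, method and path.

// ❌ anti-pattern: pass every label on every request, including a raw, unbounded path
func handler(w http.ResponseWriter, r *http.Request) {
    httpRequests.WithLabelValues("api", r.Method, r.URL.Path).Inc()
}

// ✅ better: curry the constant label once, outside the request path
func makeHandler() http.HandlerFunc {
    counter := httpRequests.MustCurryWith(prometheus.Labels{"service": "api"})
    return func(w http.ResponseWriter, r *http.Request) {
        // In real code, pass a normalized route pattern rather than r.URL.Path.
        counter.WithLabelValues(r.Method, r.URL.Path).Inc()
    }
}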

For hot paths above 100k req/s, consider resolving the child metric for known label values once and caching it:

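// Assumes httpRequests is a *prometheus.CounterVec with the labels service, method and path.
// Curry the fixed labels, then bind the path, so the child is resolved once at startup.
var getCounter = httpRequests.MustCurryWith(prometheus.Labels{"service": "api", "method": "GET"})
var getRoot = getCounter.WithLabelValues("/")

func rootHandler(w http.ResponseWriter, r *http.Request) {
    getRoot.Inc() // no label lookup or allocation in the hot path
}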
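
The idea of resolving the child once and then only incrementing it does not depend on Go. Below is a minimal Python model of it, assuming a toy `CounterFamily` class that is not part of any Prometheus library; the `lookups` field exists only to make the cost of repeated lookups visible.

```python
import threading
from typing import Dict, Tuple


class Child:
    """One time series: a counter value for a fixed label combination."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        with self._lock:
            self.value += amount


class CounterFamily:
    """Toy stand-in for a labelled counter vector (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._children: Dict[Tuple[str, ...], Child] = {}
        self.lookups = 0

    def labels(self, *values: str) -> Child:
        # Every call builds a key, takes the lock and performs a dict lookup.
        with self._lock:
            self.lookups += 1
            child = self._children.get(values)
            if child is None:
                child = self._children[values] = Child()
            return child


requests_total = CounterFamily()

# Per-request lookup: 1000 increments cost 1000 lookups.
for _ in range(1000):
    requests_total.labels("GET", "/").inc()

# Cached child: one lookup, then direct increments.
get_root = requests_total.labels("GET", "/")
for _ in range(1000):
    get_root.inc()
```

Both loops update the same series; the second one simply skips the lookup work on each iteration.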

Custom Collector: When Counter/Gauge Is Not Enough

Sometimes a metric has to be fetched from an external source that should not be polled continuously, such as a database row count or a queue depth in Redis. Updating a Gauge on a timer wastes work; with a custom collector the library calls your code only at scrape time:

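// Go: custom collector that reads queue depth from Redis at scrape time.
// The redis import assumes go-redis v9; adjust the path for other versions.
import (
    "context"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/redis/go-redis/v9"
)

type QueueCollector struct {
    client *redis.Client
    desc   *prometheus.Desc
}

func NewQueueCollector(c *redis.Client) *QueueCollector {
    return &QueueCollector{
        client: c,
        desc: prometheus.NewDesc(
            "queue_depth",
            "Redis queue depth",
            []string{"queue"}, nil,
        ),
    }
}

func (c *QueueCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.desc
}

func (c *QueueCollector) Collect(ch chan<- prometheus.Metric) {
    // Bound the whole collection so a slow Redis cannot stall the scrape.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    for _, name := range []string{"jobs", "emails", "webhooks"} {
        n, err := c.client.LLen(ctx, name).Result()
        if err != nil {
            // Surface the failure to the registry instead of exporting a misleading 0.
            ch <- prometheus.NewInvalidMetric(c.desc, err)
            continue
        }
        ch <- prometheus.MustNewConstMetric(
            c.desc, prometheus.GaugeValue, float64(n), name,
        )
    }
}

// Register at app start (for example inside main):
prometheus.MustRegister(NewQueueCollector(redisClient))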

Benefits: no goroutine is needed to keep values updated, metrics are fetched only at scrape time, load drops when Prometheus scrapes infrequently, and no stale values are held in memory.
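
One trade-off of fetching at scrape time is that several scrapers (for example two Prometheus replicas) each trigger a backend call. A possible mitigation, sketched here in stdlib Python under the assumption that slightly stale values are acceptable, is a short TTL cache inside the collector. `CachedQueueDepths`, `fetch_depths` and the sample numbers are hypothetical and not taken from any library.

```python
import threading
import time
from typing import Callable, Dict, Optional


class CachedQueueDepths:
    """Fetch values at scrape time, but hit the backend at most once per ttl seconds."""

    def __init__(self, fetch: Callable[[], Dict[str, int]], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()  # scrapes may arrive concurrently
        self._cached: Optional[Dict[str, int]] = None
        self._fetched_at = 0.0

    def collect(self) -> Dict[str, int]:
        with self._lock:
            now = self._clock()
            if self._cached is None or now - self._fetched_at >= self._ttl:
                self._cached = self._fetch()  # the only place the backend is called
                self._fetched_at = now
            return dict(self._cached)


# Demo with a fake clock and a fake backend that counts its calls.
calls = []
fake_now = [100.0]


def fetch_depths() -> Dict[str, int]:
    calls.append(1)
    return {"jobs": 12, "emails": 3}


collector = CachedQueueDepths(fetch_depths, ttl=5.0, clock=lambda: fake_now[0])
first = collector.collect()   # backend call
second = collector.collect()  # served from cache
fake_now[0] += 6.0
third = collector.collect()   # TTL expired, backend call again
```

The TTL should stay well below the scrape interval; otherwise the collector starts serving the kind of stale value it was meant to avoid.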

Pushgateway: For Short-lived Jobs

Batch jobs and cron jobs run briefly and exit, so there is no endpoint for Prometheus to scrape. The Pushgateway acts as an intermediary: the job pushes its metrics to the gateway, and Prometheus scrapes the gateway.

# Python: push from a cron job
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge('job_duration_seconds', 'Job duration', registry=registry)
success = Gauge('job_last_success_timestamp', 'Last success', registry=registry)

# Run the job (do_work() stands for the job's own logic)
start = time.time()
do_work()
duration.set(time.time() - start)
success.set_to_current_time()

push_to_gateway(
    'pushgateway.internal:9091',
    job='nightly-etl',
    grouping_key={'instance': 'etl-01'},
    registry=registry,
)

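Pushgateway caveats:

The Pushgateway keeps the last pushed values indefinitely. Delete the group (HTTP DELETE) when a job is retired, otherwise the stale values keep being scraped
Use grouping_key to separate metrics from instances that run in parallel
Do not use it as a replacement for normal scraping. It suits short-lived jobs only, because the Pushgateway is a single point of failure
The Pushgateway has no built-in high-availability mode; if pushed metrics must survive a restart, enable its persistence file (the --persistence.file flag)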
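
The cleanup mentioned in the first caveat is a plain HTTP DELETE on the group's URL, which has the form /metrics/job/<job>/<label>/<value>. prometheus_client ships delete_from_gateway for this; the stdlib-only sketch below just shows what such a request looks like. `pushgateway_delete_request` is a hypothetical helper, plain http is assumed, and values that are empty or contain a slash (which need the gateway's base64 form) are deliberately rejected rather than handled.

```python
from typing import Dict
from urllib.parse import quote
from urllib.request import Request


def pushgateway_delete_request(gateway: str, job: str, grouping_key: Dict[str, str]) -> Request:
    """Build (but do not send) the DELETE request for one Pushgateway group."""
    path = "/metrics/job/" + quote(job, safe="")
    for label, value in sorted(grouping_key.items()):
        if not value or "/" in value:
            raise ValueError(f"value for {label!r} needs base64 encoding, not handled here")
        path += "/" + quote(label, safe="") + "/" + quote(value, safe="")
    return Request("http://" + gateway + path, method="DELETE")


req = pushgateway_delete_request(
    "pushgateway.internal:9091", "nightly-etl", {"instance": "etl-01"}
)
```

Passing `req` to `urllib.request.urlopen` would send it; the grouping key has to match the one used when pushing, because it identifies the group to remove.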

Exemplars: Linking Metrics to Traces

An exemplar is a trace ID attached to a sample Histogram observation. It lets you jump from a Grafana panel straight to the trace, which is much faster than hunting for the trace by hand.

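// Go + OpenTelemetry tracing
import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/trace"
)

// httpDuration is assumed to be a prometheus.Histogram registered elsewhere.
// Exemplars are only exposed when the scrape uses the OpenMetrics format
// (promhttp.HandlerOpts{EnableOpenMetrics: true}).
func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // ... business logic ...

    elapsed := time.Since(start).Seconds()
    spanCtx := trace.SpanContextFromContext(r.Context())

    // Attach an exemplar only for sampled traces, so the link leads to a stored trace.
    if obs, ok := httpDuration.(prometheus.ExemplarObserver); ok && spanCtx.IsSampled() {
        obs.ObserveWithExemplar(elapsed, prometheus.Labels{"trace_id": spanCtx.TraceID().String()})
        return
    }
    httpDuration.Observe(elapsed)
}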

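Grafana v7.4+ draws exemplars as small dots on a graph panel; clicking one opens that trace in Tempo or Jaeger. This is especially useful in production, where trace sampling often keeps only about 1% of all traces: each histogram bucket retains an exemplar (in client_golang, the most recent one observed for that bucket, not necessarily the slowest), so a trace ID for a request in the slow buckets stays one click away.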
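
On the wire, an exemplar is a suffix on a sample line in the OpenMetrics text format: a `#`, a label set, and the observed value. The sketch below formats one histogram bucket line that way. It is an illustration only: `bucket_line_with_exemplar` is a hypothetical helper, label values are not escaped, the optional timestamps are omitted, and the trace ID is made up. The 128-character check reflects the OpenMetrics limit on the combined length of exemplar label names and values.

```python
from typing import Dict

MAX_EXEMPLAR_LABEL_CHARS = 128  # OpenMetrics limit on combined label names and values


def format_labels(labels: Dict[str, str]) -> str:
    """Render a label set as {name="value",...} (no escaping in this sketch)."""
    return "{" + ",".join(f'{k}="{v}"' for k, v in labels.items()) + "}"


def bucket_line_with_exemplar(name: str, le: str, count: int,
                              exemplar_labels: Dict[str, str],
                              exemplar_value: float) -> str:
    """Format one cumulative bucket sample followed by its exemplar."""
    size = sum(len(k) + len(v) for k, v in exemplar_labels.items())
    if size > MAX_EXEMPLAR_LABEL_CHARS:
        raise ValueError("exemplar labels exceed 128 characters")
    return (f'{name}_bucket{{le="{le}"}} {count} '
            f'# {format_labels(exemplar_labels)} {exemplar_value}')


line = bucket_line_with_exemplar(
    "http_request_duration_seconds", "0.5", 3,
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}, 0.32,
)
```

The size limit is the reason exemplars usually carry just a trace ID (and perhaps a span ID) rather than arbitrary request metadata.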

Testing Instrumented Code

Once code is instrumented, test that the metrics are actually emitted. A unit test pattern in Python:

import time

from prometheus_client import CollectorRegistry, Counter, Histogram
from myapp import process_order

def test_order_metric_increments():
    registry = CollectorRegistry()
    # Use a custom registry in tests so global state is left untouched
    counter = Counter('orders_created_total', 'Orders', registry=registry)

    # process_order is expected to accept the counter it should increment
    process_order(order_id='123', _counter=counter)

    assert registry.get_sample_value('orders_created_total') == 1.0

def test_histogram_records_duration():
    registry = CollectorRegistry()
    hist = Histogram('process_duration_seconds', 'Processing time', registry=registry,
                     buckets=(0.1, 0.5, 1.0))

    with hist.time():
        time.sleep(0.2)

    # Buckets are cumulative: a ~0.2 s observation is counted in le=0.5
    assert registry.get_sample_value(
        'process_duration_seconds_bucket',
        {'le': '0.5'}
    ) == 1

Go uses the testutil package from client_golang:

import (
    "net/http/httptest"
    "testing"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/testutil"
)

func TestHandler(t *testing.T) {
    reg := prometheus.NewRegistry()
    counter := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "requests_total",
        Help: "Requests handled.",
    })
    reg.MustRegister(counter)

    // NewHandler is the application's own constructor that takes the counter to increment.
    handler := NewHandler(counter)
    handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))

    if got := testutil.ToFloat64(counter); got != 1 {
        t.Errorf("want 1, got %v", got)
    }
}

Built-in Default Metrics

Each library ships default metrics that need no extra instrumentation. The Python and Go default registries include them automatically, while prom-client needs a call to collectDefaultMetrics():

Python: process_* (CPU, memory, FDs), python_gc_*, python_info
Node.js: nodejs_eventloop_lag_seconds, nodejs_heap_size_total_bytes, nodejs_gc_duration_seconds, process_*
Go: go_goroutines, go_memstats_*, go_gc_duration_seconds, go_info, process_*

Key metrics worth an initial alert: event loop lag (Node.js), goroutine count (Go), resident memory (all languages), and GC pauses (Node.js/Go).

Unofficial Libraries and Other Languages

Beyond the three libraries covered here, other languages have official or community-maintained clients:

Java/Kotlin: micrometer (the Spring Boot default), with built-in adapters for Prometheus, Datadog, and CloudWatch; the Prometheus project also maintains client_java
Rust: prometheus-client, official, with an API similar to Go's
PHP: promphp/prometheus_client_php, with APC and Redis storage backends
Ruby: prometheus-client-mmap, multi-process via mmap
.NET: prometheus-net, with built-in ASP.NET Core middleware

Selection criteria: choose a library that is actively maintained (commits within the last 6 months), has high test coverage, and supports the OpenMetrics format rather than only the older Prometheus text format.

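Common Pitfalls

Duplicate registration: creating the same metric again during a hot reload raises a “Duplicated timeseries” error. Use unregister() or a singleton pattern
Label values from user input: using raw, user-controllable values as labels makes cardinality explode. Validate or normalize them first
Invalid metric names: the classic naming rule accepts only [a-zA-Z_:][a-zA-Z0-9_:]*. Libraries that enforce it reject a hyphenated name when the metric is created or registered, so validate names early instead of relying on the failure being noticed
Blocking I/O in a custom collector: Collect() runs during the scrape, so if it blocks for long, the Prometheus scrape times out
Unhandled concurrency: not every language is thread-safe by default, and state held inside a custom collector is your responsibility. Guard it with a mutex or atomics
Expecting counters to persist: a Counter resets when the process restarts. That is expected, and rate() and increase() compensate for resets; only an absolute lifetime total requires external storage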
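
Two of these pitfalls (invalid names and unbounded label values) can be caught with a few lines of validation. The sketch below is one possible approach in stdlib Python, not a feature of any client library: `is_valid_metric_name` and `normalize_path` are hypothetical helpers, and the `:id` placeholder and the choice to collapse only numeric and UUID segments are assumptions for illustration.

```python
import re

# The classic metric name rule quoted in the list above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

# Path segments that look like identifiers: all digits, or a UUID.
ID_SEGMENT_RE = re.compile(
    r"\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_valid_metric_name(name: str) -> bool:
    """Return True when the whole name matches the classic naming rule."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def normalize_path(path: str) -> str:
    """Drop the query string and replace ID-like segments to bound label cardinality."""
    path = path.split("?", 1)[0]
    segments = [":id" if ID_SEGMENT_RE.fullmatch(s) else s for s in path.split("/")]
    return "/".join(segments)


label_value = normalize_path("/users/42/orders/7?page=2")
```

Here `label_value` becomes "/users/:id/orders/:id". Arbitrary unknown paths (for example from scanners hitting 404s) still pass through unchanged, so using the router's matched route template as the label, where the framework exposes it, is the more robust option.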

Summary

These Prometheus client libraries cover most production use cases, but each language has specific constraints you need to understand: Python's multi-process mode, cluster aggregation in Node.js, and in Go, pre-registered label vectors whose cached children improve performance.

Advanced patterns worth using where they fit: custom collectors for expensive metrics, the Pushgateway for short-lived jobs, exemplars for linking metrics to traces, and tests that use a custom registry separate from the global one for isolation.