Instrumentation: Adding Metrics to Your Application

To know whether an application is healthy, checking that the process is still running is not enough: the app may be up but responding slowly, throwing many errors, or consuming more resources than budgeted. Instrumentation, meaning code added inside the app to collect metrics, is what lets the team see what is actually happening in numbers rather than guessing.

This article covers the concept of application instrumentation, the metric types worth collecting, methodologies for choosing metrics (USE, RED), code examples in Node.js, Python, and Go using Prometheus client libraries, and common pitfalls when instrumenting production workloads.

What Is Instrumentation?

Instrumentation means embedding code in an application to emit data about its behavior (telemetry) as metrics, traces, or logs. Metrics are quantitative values aggregated over time, such as requests per second, p95 API latency, or the number of connections in use from a pool.

Instrumentation differs from external (blackbox) monitoring because it provides a view from inside the app: how the business logic behaves, which queries are slow, which feature flags are on. This information generally cannot be collected from outside.

Metric Types in Prometheus

Prometheus client libraries provide four core metric types that cover most use cases (the OpenMetrics format defines a few additional ones):

Type	Behavior	Example uses
Counter	Only increases; resets when the process restarts	Total requests, total errors
Gauge	Goes up and down with the current state	Goroutine count, memory usage, current connection pool size
Histogram	Sorts observations into buckets by size	Latency distribution, request size, response size
Summary	Computes quantiles inside the client	Latency percentiles when aggregation across instances is not needed

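In most cases, prefer Histogram over Summary. Histogram buckets can be summed across instances and then passed to the PromQL function histogram_quantile(), whereas a Summary computes its quantiles separately in each instance, and those precomputed quantiles cannot be meaningfully combined.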
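
To make the aggregation argument concrete, here is a minimal Python sketch of estimating a quantile from cumulative bucket counts that have first been summed across instances. It uses linear interpolation inside the matching bucket, which is similar in spirit to what histogram_quantile() does but is not Prometheus's actual implementation. The function names and the sample numbers are assumptions made up for illustration.

```python
INF = float("inf")


def merge_buckets(instances):
    """Sum cumulative bucket counts from several instances.

    Each instance is a dict {upper_bound: cumulative_count} and all
    instances are assumed to share the same bucket boundaries.
    """
    merged = {}
    for buckets in instances:
        for upper, count in buckets.items():
            merged[upper] = merged.get(upper, 0) + count
    return merged


def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper in bounds:
        count = buckets[upper]
        if count >= rank:
            if upper == INF:
                # Beyond the last finite boundary nothing more precise is known.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return lower_bound


# Hypothetical cumulative latency buckets (seconds) from two instances.
instance_a = {0.1: 50, 0.5: 90, 1.0: 100, INF: 100}
instance_b = {0.1: 10, 0.5: 50, 1.0: 90, INF: 100}
merged = merge_buckets([instance_a, instance_b])
p50 = estimate_quantile(0.5, merged)
p95 = estimate_quantile(0.95, merged)
```

Because buckets are plain counts, summing them is valid before the quantile is estimated. Averaging two per-instance p95 values, which is all a Summary would offer, has no such guarantee.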

USE Method and RED Method

Before instrumenting, decide which metrics to collect. Two popular frameworks help structure that decision: Brendan Gregg's USE Method for resources and Tom Wilkie's RED Method for services.

USE Method (Utilization, Saturation, Errors)

Utilization: the percentage of time or capacity a resource is in use, e.g. CPU 60%, memory 75%
Saturation: the amount of work waiting in a queue, e.g. run queue length, I/O wait
Errors: the number of error events, e.g. disk errors, network drops

USE fits infrastructure-level metrics: it shows which resource is about to run out and helps pinpoint bottlenecks.

RED Method (Rate, Errors, Duration)

Rate: the number of requests per second
Errors: the number of failed requests per second, or the error percentage
Duration: the time taken to handle a request (latency distribution)

RED fits request-driven services such as HTTP APIs, gRPC services, and job workers, and answers the question “is the service still working well?” from the user's perspective. For a service that does several kinds of work, instrument with both USE and RED.
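
As a rough illustration of how the three RED values fall out of cumulative counters, the Python sketch below compares two snapshots taken some seconds apart. The Snapshot type, its field names, and the numbers are hypothetical. The mean duration it reports is only a stand-in for a full latency distribution, which needs a histogram.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    timestamp: float      # seconds since some reference point
    requests_total: int   # cumulative request count
    errors_total: int     # cumulative failed-request count
    duration_sum: float   # cumulative seconds spent handling requests


def red_summary(prev, curr):
    """Derive Rate, Errors, and Duration from two counter snapshots."""
    elapsed = curr.timestamp - prev.timestamp
    requests = curr.requests_total - prev.requests_total
    errors = curr.errors_total - prev.errors_total
    duration = curr.duration_sum - prev.duration_sum
    return {
        "rate_per_second": requests / elapsed,
        "error_ratio": errors / requests if requests else 0.0,
        "mean_duration_seconds": duration / requests if requests else 0.0,
    }


# Two hypothetical scrapes taken 60 seconds apart.
before = Snapshot(timestamp=0, requests_total=1000, errors_total=10, duration_sum=120.0)
after = Snapshot(timestamp=60, requests_total=1600, errors_total=40, duration_sum=210.0)
summary = red_summary(before, after)
```

Here the service handled 600 requests in the window, so the sketch reports 10 requests per second, a 5% error ratio, and a mean duration of 0.15 seconds.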

Node.js Example with prom-client

prom-client is a widely used community-maintained Prometheus client for Node.js (it is not one of the officially maintained client libraries). It ships with built-in default metrics (event loop lag, heap, GC) and supports custom metrics of every type.

const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();

// Default metrics (CPU, memory, event loop lag, GC)
client.collectDefaultMetrics({ register });

// RED: Rate + Errors — Counter
const httpRequestTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

// RED: Duration — Histogram
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5, 10],
  registers: [register],
});

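// Middleware that records metrics for every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      // Use the route template (e.g. /users/:id). Unmatched requests share one
      // fixed value so that raw paths cannot inflate label cardinality.
      route: req.route ? req.route.path : 'unmatched',
      status_code: res.statusCode,
    };
    httpRequestTotal.inc(labels);
    end(labels);
  });
  next();
});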

// Expose /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.get('/api/orders', (req, res) => {
  res.json({ orders: [] });
});

app.listen(3000);

Once it is running, open http://localhost:3000/metrics to see output in the Prometheus text format, ready for Prometheus to scrape.
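
For readers who have not seen the text format before, the Python sketch below renders one counter by hand to show its general shape: a HELP line, a TYPE line, then one sample line per label combination. The function render_counter is a hypothetical helper written for illustration, not part of any client library, and it skips the escaping of special characters in label values that real clients perform.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are
    assumed to contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


output = render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "route": "/api/orders", "status_code": "200"}, 42)],
)
print(output, end="")
```

Each distinct label combination appears as its own line, which is why every extra label value adds another time series for Prometheus to store.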

Python Example with prometheus_client

In Python, the prometheus_client library works with both sync frameworks (Flask, Django) and async ones (FastAPI, Starlette). The following example uses Flask.

import time

from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=(0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5, 10)
)

IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently being processed',
    ['method', 'endpoint']
)

def endpoint_label():
    # Use the matched route template (e.g. /users/<id>) rather than the raw
    # path, so that path parameters do not create unbounded label values.
    return request.url_rule.rule if request.url_rule else 'unmatched'

@app.before_request
def before():
    # perf_counter() is monotonic, so it is safer than time.time() for durations.
    g.start = time.perf_counter()
    IN_PROGRESS.labels(request.method, endpoint_label()).inc()

@app.after_request
def after(response):
    endpoint = endpoint_label()
    elapsed = time.perf_counter() - g.start
    REQUEST_LATENCY.labels(request.method, endpoint).observe(elapsed)
    REQUEST_COUNT.labels(request.method, endpoint, response.status_code).inc()
    IN_PROGRESS.labels(request.method, endpoint).dec()
    return response

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/api/users')
def users():
    return {'users': []}

if __name__ == '__main__':
    app.run(port=5000)

The http_requests_in_progress Gauge is an example of measuring concurrency. Combined with the Counter and Histogram, it gives a picture of rate, errors, duration, and (as a rough proxy) saturation at the HTTP layer.

Go Example with client_golang

package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequests = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests",
		},
		[]string{"method", "path", "status"},
	)

	httpLatency = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency",
			Buckets: []float64{0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5},
		},
		[]string{"method", "path", "status"},
	)
)

type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

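// instrument wraps the mux and records request count and latency.
// It takes the *http.ServeMux so that the registered pattern, not the raw
// URL path, is used as the "path" label and cardinality stays bounded.
func instrument(mux *http.ServeMux) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		mux.ServeHTTP(rec, r)
		// Look up which pattern matched; it is empty when nothing matched.
		_, pattern := mux.Handler(r)
		if pattern == "" {
			pattern = "unmatched"
		}
		labels := prometheus.Labels{
			"method": r.Method,
			"path":   pattern,
			"status": strconv.Itoa(rec.status),
		}
		httpRequests.With(labels).Inc()
		httpLatency.With(labels).Observe(time.Since(start).Seconds())
	})
}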

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/products", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"products":[]}`))
	})
	mux.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", instrument(mux))
}

The statusRecorder pattern captures the HTTP status code so it can be used as a metric label. This matters because the error rate is computed from series matching status=~"5..", which is impossible unless the status is recorded as its own label.

Business Metrics Beyond the Request Layer

Application metrics are not limited to the HTTP layer. Instrument the key business variables as well, to connect infrastructure behavior to real user impact. Examples:

Orders created successfully per hour (Counter: orders_created_total)
Total sales revenue (Counter: revenue_baht_total)
Currently active users (Gauge: active_users)
Background job queue depth (Gauge: job_queue_depth)
Background job processing time (Histogram: job_duration_seconds)
Conversion funnel rates (Counter: funnel_step_total with a step label)

Business metrics help answer management-level questions such as “did the latest deploy reduce order volume?” right away, without querying the database.

Label Cardinality: The Most Important Caveat

Every labeled metric creates a separate time series for each unique combination of label values. If a label can take millions of values (user ID, request ID, UUID), Prometheus can end up holding tens of millions of time series, which inflates its memory footprint and slows queries dramatically.

❌ Never label by user_id, session_id, order_id, UUID, email, or timestamp
❌ Never label by URLs containing raw path parameters such as /users/12345; normalize them to /users/:id
✅ Use labels with a bounded set of values, such as method (GET/POST), route template, status_code, region, env
✅ To follow individual entities in depth, use distributed tracing (a trace ID is not a metric label)
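
When the web framework cannot supply a route template, one fallback is to normalize the raw path before using it as a label. The Python sketch below replaces purely numeric segments and UUID-shaped segments with placeholders. The function name normalize_path and the :id and :uuid placeholder names are assumptions chosen for illustration, and real applications may need extra rules for their own identifier formats.

```python
import re

_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(path):
    """Replace identifier-like path segments with fixed placeholders."""
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment):
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/users/12345"))
print(normalize_path("/orders/123e4567-e89b-12d3-a456-426614174000/items/7"))
```

A route template taken from the framework is still preferable, since pattern-based rewriting cannot recognize identifiers that look like ordinary words (usernames or slugs, for example).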

A simple rule of thumb: if the number of unique label value combinations for one metric exceeds about 10,000, the design is probably wrong and needs review.
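
A quick way to apply this rule before shipping a metric is to multiply the number of possible values per label. The Python sketch below computes that worst-case upper bound. For histograms it also accounts for the fact that each label combination produces one series per finite bucket, plus the +Inf bucket, the _sum series, and the _count series. The function names and the label value counts are hypothetical.

```python
from math import prod


def max_label_combinations(values_per_label):
    """Worst-case number of label combinations: the product of value counts."""
    return prod(values_per_label.values())


def max_histogram_series(values_per_label, finite_buckets):
    """Worst-case series for a histogram: buckets, +Inf, _sum, and _count."""
    return max_label_combinations(values_per_label) * (finite_buckets + 3)


# Hypothetical counts of distinct values each label can take.
bounded = {"method": 5, "route": 40, "status_code": 12}
with_user_id = {**bounded, "user_id": 100_000}

print(max_label_combinations(bounded))        # within the rule of thumb
print(max_label_combinations(with_user_id))   # far beyond it
print(max_histogram_series(bounded, finite_buckets=9))
```

In practice not every combination occurs, so the real count is usually lower, but a single unbounded label such as user_id is enough to push even the realistic count past any sensible limit.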

Naming Convention

Following Prometheus best practices, metric names should be meaningful and follow one convention across the whole system:

Use snake_case, e.g. http_request_duration_seconds
Prefix with the service or domain namespace, e.g. payment_service_transactions_total
Counters end with _total
Histograms and Summaries that measure time use the _seconds suffix (not ms)
Byte sizes use _bytes (not KB/MB)
Avoid giving the same name to different things in different services; when it really is the same metric across services, keep one name and distinguish services with a label
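
These conventions are easy to check mechanically, for example in a lint step. The Python sketch below is one possible checker covering only the rules listed above. The function check_metric_name, its messages, and the list of discouraged unit tokens are assumptions made for illustration rather than an official Prometheus tool.

```python
import re

# snake_case subset of the characters Prometheus allows in metric names.
SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")
NON_BASE_UNITS = {"ms", "milliseconds", "kb", "mb", "kilobytes", "megabytes"}


def check_metric_name(name, metric_type):
    """Return a list of convention violations (empty when the name is fine)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("name is not snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end with _total")
    for token in name.split("_"):
        if token in NON_BASE_UNITS:
            problems.append(f"use base units (seconds, bytes) instead of '{token}'")
    return problems


print(check_metric_name("http_request_duration_seconds", "histogram"))
print(check_metric_name("httpRequests", "counter"))
print(check_metric_name("request_latency_ms", "histogram"))
```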

OpenTelemetry: A Vendor-Neutral Alternative to the Prometheus SDK

OpenTelemetry (OTel) is a CNCF-backed, vendor-neutral framework that brings metrics, traces, and logs under a common API. Instead of coupling the app directly to the Prometheus SDK, using the OTel SDK lets you export to Prometheus, Datadog, New Relic, or Grafana Cloud by changing exporter configuration rather than application code.

// Node.js + OpenTelemetry Metrics
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');

const exporter = new PrometheusExporter({ port: 9464 });
const meterProvider = new MeterProvider({ readers: [exporter] });
const meter = meterProvider.getMeter('my-service');

const requestCounter = meter.createCounter('http_requests_total');
const requestDuration = meter.createHistogram('http_request_duration_seconds');

// Usage
requestCounter.add(1, { method: 'GET', route: '/api/orders', status: 200 });
requestDuration.record(0.123, { method: 'GET', route: '/api/orders' });

This pattern decouples instrumentation from the backend: if you later move from Prometheus to another vendor's OTLP endpoint, you swap the exporter and leave the business code untouched.

Exposing the Metrics Endpoint Securely

The /metrics endpoint contains internal data and should not be open to the public internet, because it can reveal system structure, usage rates, and timing information useful to an attacker:

Have Prometheus scrape over a private network only
Bind the endpoint to an internal interface only, or put it behind a sidecar proxy
If it must be exposed through an ingress, protect it with basic auth or mTLS
In Kubernetes, /metrics is usually scraped via a Prometheus Operator ServiceMonitor, which already pulls from inside the cluster

Performance Overhead

Instrumentation has overhead, but it is usually very small with a good client library. Still, watch for patterns that slow things down:

Building a new label combination on every request in the hot path → cache the metric handle
Using dynamically concatenated strings as label values → normalize them first
Keeping histograms with more buckets than necessary → choose buckets that fit the SLO
Serving the endpoint synchronously on the event loop → prom-client already exposes register.metrics() as an async call that is awaited
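
On the bucket point, a small bucket list is fine as long as the SLO threshold itself is one of the boundaries. The Python sketch below checks that, and then reads the share of requests that met the SLO straight off the cumulative counts. Both function names and the sample counts are hypothetical.

```python
INF = float("inf")


def slo_is_bucket_boundary(slo_seconds, bucket_bounds):
    """True when the SLO threshold coincides with a bucket boundary."""
    return any(abs(bound - slo_seconds) < 1e-9 for bound in bucket_bounds)


def fraction_within_slo(slo_seconds, cumulative_buckets):
    """Share of observations at or below the SLO threshold.

    cumulative_buckets maps upper bounds to cumulative counts and must
    contain both the SLO threshold and the +Inf bucket.
    """
    total = cumulative_buckets[INF]
    if total == 0:
        return float("nan")
    return cumulative_buckets[slo_seconds] / total


bounds_from_examples = [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5, 10]
coarse_bounds = [0.1, 0.5, 1]

print(slo_is_bucket_boundary(0.3, bounds_from_examples))
print(slo_is_bucket_boundary(0.3, coarse_bounds))

# Hypothetical cumulative counts for a 300 ms SLO.
observed = {0.1: 700, 0.3: 950, 0.5: 990, INF: 1000}
print(fraction_within_slo(0.3, observed))
```

With a boundary exactly at 0.3 seconds the answer is exact. Without one, any figure for the 300 ms threshold would have to be interpolated between neighbouring buckets.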

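Common Pitfalls

Instrumenting only the happy path: forgetting the error path makes the error rate look lower than it really is
Using a Counter where a Gauge is needed, or vice versa: a Counter must only go up; if the logic needs to decrease the value, switch to a Gauge
Poorly chosen Histogram buckets: if the SLO is 300ms but the nearest bucket boundary is 500ms, you cannot tell precisely what fraction of requests met the SLO, and quantile estimates near it are coarse
Misreading counter resets on deploy: a Counter resets on restart, so the raw graph shows a drop; query with rate() or increase(), which compensate for resets, instead of graphing the raw value
Not labeling by status code: lumping 2xx and 5xx together makes the error rate impossible to compute
Exposing /metrics to the internet: risks information disclosure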
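
The counter reset item is easier to see with numbers. The Python sketch below computes the total increase over a series of raw counter samples, treating any decrease as a restart from zero. This mirrors the idea behind how rate() and increase() compensate for resets, but it is a simplification that ignores the extrapolation Prometheus applies at window edges. The function name and sample values are made up.

```python
def increase_with_resets(samples):
    """Total increase across counter samples, tolerating restarts.

    A sample lower than its predecessor is taken to mean the process
    restarted and the counter began again from zero.
    """
    if len(samples) < 2:
        return 0.0
    total = 0.0
    previous = samples[0]
    for current in samples[1:]:
        if current < previous:
            total += current              # counted from zero after the restart
        else:
            total += current - previous
        previous = current
    return total


# Hypothetical scrapes; the process restarted between the 3rd and 4th sample.
scrapes = [100, 150, 200, 20, 70]
naive = scrapes[-1] - scrapes[0]
corrected = increase_with_resets(scrapes)
print(naive, corrected)
```

The naive difference is negative, while the reset-aware total recovers the 170 requests that were actually served.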

Integration with CI/CD

Instrumentation should be part of the deploy pipeline. Whenever a new feature is added, always ask “which metric will tell us this feature is working correctly?” An example pre-merge checklist:

Does the new feature have rate, error, and duration metrics?
Are there alert rules in Prometheus/Alertmanager to cover it?
Do the relevant dashboards have new panels?
Does a runbook explain how to handle the alerts?
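
The first checklist item can be partly automated. One possible approach, sketched below in Python, is for a pipeline step to fetch the /metrics output of a test instance and fail when an expected metric is absent. The sketch only parses "# TYPE" lines from text it is handed. The function names, the required metric list, and the sample text are assumptions, and fetching the text is left out.

```python
def metric_families(exposition_text):
    """Map metric names to types, based on '# TYPE <name> <type>' lines."""
    families = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


def missing_metrics(exposition_text, required):
    """Return the required metric names that the text does not declare."""
    present = metric_families(exposition_text)
    return sorted(name for name in required if name not in present)


# Hypothetical scrape output from a service under test.
sample_text = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/orders",status_code="200"} 42
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
"""

required = ["http_requests_total", "http_request_duration_seconds", "orders_created_total"]
print(missing_metrics(sample_text, required))
```

A non-empty result would fail the build, prompting the author to add the missing instrumentation before merge.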

Summary

Application instrumentation is the foundation of an observable system. Counters, Gauges, and Histograms designed around the USE and RED methods let the team answer analytical questions that reach down to business outcomes, not just process status.

The points that need the most care are label cardinality, choosing histogram buckets that match the SLO, and not exposing the metrics endpoint publicly. Using the OpenTelemetry SDK instead of a Prometheus client directly adds long-term portability.