Prometheus Configuration: scrape_configs, relabeling, targets

Configuration is at the heart of a metrics collection system, because it is what makes monitoring behave the way you want. The config file is where you tell the system where to pull data from, how often, which metrics to filter out, and how to rewrite labels before storing them. Understanding the structure of prometheus.yml and tools such as scrape_configs, relabel_configs, and metric_relabel_configs lets you manage a large deployment in an orderly, flexible way.

This article walks through the whole config file: global settings, the different ways to scrape targets, the powerful relabeling system, and practical examples suitable for production, along with best practices that keep the config scalable and easy to change.

Main Configuration File Structure

The config file is YAML, divided into several top-level sections. The most important are global (defaults applied to every scrape), scrape_configs (the list of targets to collect metrics from), rule_files (recording and alerting rule files), and alerting (settings for sending alerts to Alertmanager).

global:
  scrape_interval: 15s        # default scrape frequency for all jobs
  scrape_timeout: 10s         # timeout per scrape
  evaluation_interval: 15s    # how often rules are evaluated
  external_labels:
    cluster: 'production'
    region: 'ap-southeast-1'

rule_files:
  - 'rules/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-01:9100'
          - 'node-02:9100'

Global Settings Worth Knowing

scrape_interval - how often metrics are collected. Prometheus's built-in default is 1 minute; 15 seconds, as set above, is a common choice for general workloads. For slow-changing metrics, 30-60 seconds saves resources
scrape_timeout - the maximum time to wait for a target to respond. It must not exceed scrape_interval; the default is 10 seconds
evaluation_interval - how often rules are evaluated. Setting it equal to scrape_interval usually works well
external_labels - labels attached to every time series when it is sent to another system (such as remote write); useful for identifying the cluster/region

scrape_configs: Collecting Metrics from Multiple Sources

scrape_configs is a list of jobs. Each job is a group of targets with similar characteristics (for example, all web servers or all database servers). Within a job you can specify its targets, endpoint path, authentication, and relabeling rules that apply to that job only.

scrape_configs:
  - job_name: 'web-servers'
    scrape_interval: 10s         # overrides the global value
    metrics_path: '/metrics'     # the default is /metrics
    scheme: 'https'              # http or https
    basic_auth:
      username: 'prom'
      password_file: '/etc/secrets/prom_password'
    tls_config:
      ca_file: '/etc/ssl/ca.pem'
      insecure_skip_verify: false
    static_configs:
      - targets:
          - 'web-01.example.com:9100'
          - 'web-02.example.com:9100'
        labels:
          environment: 'production'
          service: 'web'

Types of Service Discovery (SD)

Besides static_configs, which hard-codes targets, the system supports dynamic service discovery that finds targets automatically. This suits environments where instances change frequently, such as Kubernetes or cloud auto-scaling.

static_configs - lists targets directly; suited to infrastructure that does not change
file_sd_configs - reads the target list from JSON/YAML files updated by an external tool
http_sd_configs - fetches the target list from an HTTP endpoint
kubernetes_sd_configs - discovers pods/services in a Kubernetes cluster automatically
consul_sd_configs - pulls targets from the Consul service catalog
ec2_sd_configs - discovers EC2 instances in AWS by tag/region
dns_sd_configs - resolves DNS SRV records to find targets
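
As a rough sketch of how two of these mechanisms look inside a job definition, the fragment below uses http_sd_configs and dns_sd_configs. The job names, the URL, and the DNS name are hypothetical placeholders, not values from a real deployment.

```yaml
scrape_configs:
  # Targets come from an HTTP endpoint that returns a JSON list of
  # objects, each with "targets" and optional "labels".
  - job_name: 'http-discovered'
    http_sd_configs:
      - url: 'http://sd.example.internal/targets'   # hypothetical endpoint
        refresh_interval: 60s

  # Targets come from DNS SRV records, which carry both host and port.
  - job_name: 'dns-discovered'
    dns_sd_configs:
      - names: ['_node._tcp.example.internal']      # hypothetical SRV name
        type: 'SRV'
        refresh_interval: 30s
```

The other mechanisms follow the same shape: one `*_sd_configs` key per job, with relabeling applied to whatever targets discovery returns.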

Example: file_sd_configs

# In prometheus.yml
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 30s

# File /etc/prometheus/targets/nodes.json
[
  {
    "targets": ["10.0.1.10:9100", "10.0.1.11:9100"],
    "labels": {
      "env": "production",
      "role": "web"
    }
  },
  {
    "targets": ["10.0.2.10:9100"],
    "labels": {
      "env": "staging",
      "role": "db"
    }
  }
]
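
file_sd_configs also accepts YAML target files. Assuming the files glob were extended to match *.yml as well (the job above only matches *.json), the same two target groups could be written roughly as follows. The file name is hypothetical.

```yaml
# Hypothetical file /etc/prometheus/targets/nodes.yml
- targets:
    - '10.0.1.10:9100'
    - '10.0.1.11:9100'
  labels:
    env: 'production'
    role: 'web'
- targets:
    - '10.0.2.10:9100'
  labels:
    env: 'staging'
    role: 'db'
```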

Relabeling: A Powerful Label-Rewriting System

Relabeling lets you modify, add, or remove labels on targets or metrics before data is stored. There are two kinds: relabel_configs, which runs before the scrape and adjusts target labels, and metric_relabel_configs, which runs after the scrape and adjusts labels on individual metrics or filters metrics out.

Relabeling Actions

replace (default) - overwrites an existing label value or creates a new label from a regex match
keep - keeps only targets/metrics whose labels match the regex
drop - discards targets/metrics whose labels match the regex
labelmap - copies a whole group of labels to new names based on a regex pattern
labeldrop - removes labels whose names match the regex
labelkeep - keeps only labels whose names match the regex (removes the rest)
hashmod - hashes the label values and stores the result modulo a given number, used to split workload
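
The hashmod action is the least obvious of these, so here is a minimal sketch of using it to split one job's targets across two Prometheus servers. The job name, the __tmp_shard label name, and the shard count of 2 are assumptions chosen for illustration.

```yaml
scrape_configs:
  - job_name: 'node-sharded'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
    relabel_configs:
      # Hash each target address and store (hash mod 2) in a temporary label.
      - source_labels: [__address__]
        modulus: 2
        target_label: '__tmp_shard'
        action: 'hashmod'
      # This server keeps shard 0; a second server would use regex '1'.
      - source_labels: [__tmp_shard]
        regex: '0'
        action: 'keep'
```

Because the temporary label starts with __, it is discarded after target relabeling and never appears on stored series.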

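Commonly Used relabel_configs Examples

scrape_configs:
  - job_name: 'node'
    static_configs:
      # Hypothetical host name that encodes the environment, so rule 2 can match
      - targets: ['node-01-prod.example.com:9100']
    relabel_configs:
      # 1. Copy the host part of __address__ into the instance label
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: 'instance'
        replacement: '$1'

      # 2. Derive the environment label from the host name
      #    (the regex is fully anchored, so the host must contain
      #    "-prod.", "-staging." or "-dev.")
      - source_labels: [__address__]
        regex: '.*-(prod|staging|dev)\..*'
        target_label: 'environment'
        replacement: '$1'

      # 3. Keep only production targets
      #    (rule 2 captures "prod", not "production")
      - source_labels: [environment]
        regex: 'prod'
        action: 'keep'

      # 4. Drop unwanted labels (shown for illustration only: labels that
      #    start with __ are removed automatically after target relabeling)
      - regex: '__meta_.*'
        action: 'labeldrop'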

Example: metric_relabel_configs

metric_relabel_configs runs after the scrape. It is used to filter out unwanted metrics or adjust the labels of scraped metrics, and is very useful for reducing cardinality and storage size.

scrape_configs:
  - job_name: 'kubernetes-pods'
    # ... other config ...
    metric_relabel_configs:
      # 1. Drop metrics whose names match unused prefixes
      - source_labels: [__name__]
        regex: 'go_gc_.*|process_virtual_memory_max_bytes'
        action: 'drop'

      # 2. Rename the container_name label to container: copy the value only
      #    when container_name is non-empty, then remove the old label
      - source_labels: [container_name]
        regex: '(.+)'
        target_label: 'container'
        action: 'replace'
      - regex: 'container_name'
        action: 'labeldrop'

      # 3. Remove high-cardinality labels (make sure the remaining labels
      #    still identify each series uniquely)
      - regex: 'id|uid|pod_template_hash'
        action: 'labeldrop'

Special Labels Added Automatically

Before each scrape, the system attaches special labels prefixed with __ (double underscore) to every target for use during relabeling. Labels with this prefix are removed automatically once target relabeling finishes, so they never reach storage. Examples are __address__, __metrics_path__, __scheme__, and __meta_*.

__address__ - the target's address (host:port)
__metrics_path__ - the URL path to scrape (default /metrics)
__scheme__ - http or https
__param_<name> - the query string parameter <name> sent with the scrape request
__meta_* - metadata from service discovery (such as Kubernetes pod labels or EC2 tags); available during relabeling, then removed
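
One common use of these labels is the multi-target exporter pattern, where the address being probed is passed as a URL parameter and the scrape itself goes to the exporter. The sketch below assumes a Blackbox-style exporter reachable at blackbox-exporter:9115 with a module named http_2xx; both are illustrative assumptions rather than part of the setup described here.

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: '/probe'
    params:
      module: ['http_2xx']                 # assumed exporter module name
    static_configs:
      - targets: ['https://example.com']   # the URL to probe, not a scrape address
    relabel_configs:
      # Pass the listed target as the ?target= query parameter.
      - source_labels: [__address__]
        target_label: '__param_target'
      # Keep the probed URL visible as the instance label.
      - source_labels: [__param_target]
        target_label: 'instance'
      # Send the actual scrape request to the exporter (assumed address).
      - target_label: '__address__'
        replacement: 'blackbox-exporter:9115'
```

The order matters: __address__ is copied into __param_target before it is overwritten with the exporter's address.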

Example: Kubernetes Service Discovery

Kubernetes service discovery is a good showcase for relabel_configs, because it discovers every pod/service/endpoint in the cluster and then filters down to the ones you want.

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: 'pod'
    relabel_configs:
      # Keep only pods that carry the annotation scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: 'keep'
        regex: 'true'

      # Use the path from the annotation instead of the default path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: 'replace'
        target_label: '__metrics_path__'
        regex: '(.+)'

      # Use the port from the annotation instead of the default port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: 'replace'
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: '__address__'

      # Copy pod labels onto the target as labels
      - action: 'labelmap'
        regex: '__meta_kubernetes_pod_label_(.+)'

      # Add the namespace and pod name as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: 'namespace'
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: 'pod'

Recording Rules and Alerting Rules

Rule files are separate files that define recording rules (pre-computing complex queries) and alerting rules (conditions that raise alerts). Keeping them out of the main config makes them easier to manage and convenient to version-control in git.

# File rules/app-rules.yml
groups:
  - name: 'app-recording'
    interval: 30s
    rules:
      # Recording rule: pre-compute request rate per service
      - record: 'job:http_requests:rate5m'
        expr: sum by (job) (rate(http_requests_total[5m]))

  - name: 'app-alerts'
    rules:
      - alert: 'HighErrorRate'
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
          > 0.05
        for: 10m
        labels:
          severity: 'warning'
        annotations:
          summary: 'High error rate on {{ $labels.job }}'
          description: 'Error rate is {{ $value | humanizePercentage }}'

Best Practices for Managing Configuration

Split config by responsibility - use scrape_config_files or file-based SD to move each team's/service's config out of the main file
Keep config in version control - every change should go through git and code review so changes can be tracked and rolled back
Test config before applying - run promtool check config prometheus.yml and promtool check rules rules/*.yml before reloading
Reload with SIGHUP - send SIGHUP or call the /-/reload endpoint (which requires --web.enable-lifecycle) to reload the config without restarting the process
Limit cardinality with metric_relabel_configs - filter out high-cardinality or unused metrics to save memory and disk
Use external_labels in HA setups - when running multiple instances of the monitoring system, give each one unique external_labels to tell their data apart
Be careful with regex - a regex in relabel_configs must match the whole string (anchored) by default; it is not a partial match
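
Several of these practices can be seen together in one small sketch. The replica label, its value, and the scrape file path below are hypothetical, and scrape_config_files is only available in Prometheus versions that support that key.

```yaml
# prometheus.yml (sketch)
global:
  external_labels:
    cluster: 'production'
    replica: 'prometheus-a'     # hypothetical; the other HA replica uses a different value

# Load additional scrape jobs from per-team files (hypothetical path).
scrape_config_files:
  - '/etc/prometheus/scrape/*.yml'

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-01:9100', 'node-02:9100']
    relabel_configs:
      # Anchored matching: 'node-01' alone would NOT match 'node-01:9100',
      # so the pattern has to cover the port as well.
      - source_labels: [__address__]
        regex: 'node-01:.*'
        action: 'keep'
```

Each file matched by scrape_config_files is expected to hold its own scrape_configs list, which lets teams own their jobs without editing the main file.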

Validating and Debugging the Config

The following tools help validate the config and debug problems:

# Validate the config file
promtool check config /etc/prometheus/prometheus.yml

# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml

# Lint metrics in exposition format read from stdin
promtool check metrics < metrics.txt

# Reload the config without restarting (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Inspect targets and relabeling in the web UI
# Open http://localhost:9090/targets  to see the status of each target
# Open http://localhost:9090/service-discovery  to see labels before relabeling

Summary

The configuration file is the center of the monitoring system: it determines where metrics are collected from, how, and how source labels are transformed before storage. A solid understanding of scrape_configs, relabel_configs, and metric_relabel_configs lets you manage a large deployment in an orderly, flexible way that scales with the growth of your infrastructure.

Combining service discovery with well-designed relabeling rules reduces the burden of managing targets by hand and lets you filter out unnecessary metrics to save resources, both of which are important qualities of a good monitoring system in a production environment.