Prometheus Service Discovery: Auto-discovering Targets from the Cloud

In modern infrastructure built on Kubernetes, cloud auto-scaling, or container orchestration, maintaining the list of scrape targets by hand is tedious and does not scale, because instances come and go all the time. Service Discovery (SD) is the mechanism Prometheus uses to find targets automatically from a variety of sources, so monitoring keeps up with infrastructure changes without a config edit every time a new instance appears.

This article walks through the supported Service Discovery types: Kubernetes SD for workloads in a cluster, EC2 SD for instances in AWS, Consul SD for service registries, DNS SD, and the more general-purpose File-based SD and HTTP SD, with example configs and best practices for each case.

What Service Discovery Is and Why You Need It

Service Discovery lets Prometheus ask an external source of truth (such as the Kubernetes API, the AWS API, or Consul) which instances and services currently need to be scraped, and then keeps the target list up to date automatically. Polling mechanisms such as EC2 SD re-query on their refresh_interval, while Kubernetes SD watches the API server for changes. Because no IP or hostname is hard-coded, auto-scaling, rolling updates, and container lifecycles are handled without manual intervention.

Service Discovery pays off most in situations such as a Kubernetes cluster where hundreds of pods start and stop every day, an EC2 auto-scaling group that adds and removes instances with load, or a microservice platform with a service registry such as Consul or etcd. In all of these, static config quickly becomes a heavy burden.

Kubernetes Service Discovery

Kubernetes SD is the most commonly used type in cloud-native systems. It can discover cluster resources at several levels (node, service, pod, endpoints, ingress), and every role yields targets that carry metadata (labels and annotations) which relabeling can use to filter targets and shape labels.

Roles in Kubernetes SD

node: discovers every node in the cluster; suited to Node Exporter and kubelet metrics
service: discovers one target per service port, addressed by the service's DNS name; mostly useful for blackbox-style checks
pod: discovers every pod; used for applications that expose a metrics endpoint
endpoints: discovers the endpoints behind a service (pod IP + port); suited to scraping applications that sit behind a service
endpointslice: the newer replacement for endpoints; scales better in large clusters
ingress: discovers ingress objects; suited to checking external endpoints

Full Kubernetes SD Config Example

scrape_configs:
  # Scrape every pod that opts in through an annotation
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: 'pod'
        namespaces:
          names: ['default', 'production', 'monitoring']
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: 'keep'
        regex: 'true'

      # Use the prometheus.io/path annotation as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: 'replace'
        target_label: '__metrics_path__'
        regex: '(.+)'

      # Use the prometheus.io/port annotation as the port in the address
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: 'replace'
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: '__address__'

      # Copy all pod labels onto the target
      - action: 'labelmap'
        regex: '__meta_kubernetes_pod_label_(.+)'

      # Add namespace, pod, and node labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: 'namespace'
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: 'pod'
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: 'node'
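
The address rewrite in the config above is the least obvious rule, so here is a minimal Python sketch of what it does, assuming Prometheus's documented behavior of joining source_labels with ";" and anchoring the regex to the whole string. The function name rewrite_address and the sample addresses are illustrative only and not part of Prometheus.

```python
import re

# Same pattern as the relabel rule; Prometheus matches it against the full string.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Mimic the `replace` rule that swaps in the annotated port (illustrative helper)."""
    # source_labels values are joined with ';', the default separator.
    joined = f"{address};{port_annotation}"
    match = ADDRESS_PORT.fullmatch(joined)
    if match is None:
        # No match: a `replace` action leaves the target label unchanged.
        return address
    # Equivalent of replacement: '$1:$2'
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.1.2.3:8443", "8080"))  # port replaced
    print(rewrite_address("10.1.2.3", "8080"))       # port appended
    print(rewrite_address("10.1.2.3:8443", ""))      # no annotation, unchanged
```

A pod without the prometheus.io/port annotation produces an empty second value, the pattern fails to match, and the discovered address is kept as it was.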

Pod Annotation Pattern

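A popular pattern is to let each pod declare through annotations whether it should be scraped, so developers control for themselves whether their application is monitored. The prometheus.io/* annotations have no built-in meaning; they work only because the relabel rules in the scrape config read them.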

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
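spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: my-app
  template: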
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
      labels:
        app: my-app
        tier: backend
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 8080
              name: metrics

EC2 Service Discovery

EC2 SD discovers AWS instances automatically by region and tag, so there is no IP list to maintain in an auto-scaling environment where instances change frequently.

scrape_configs:
  - job_name: 'ec2-nodes'
    ec2_sd_configs:
      - region: 'ap-southeast-1'
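        # access_key/secret_key are omitted on purpose: Prometheus does not expand
        # ${VAR} in its config file. Without them, credentials come from the
        # AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY environment variables or an IAM role.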
        port: 9100
        filters:
          - name: 'tag:Environment'
            values: ['production']
          - name: 'tag:MonitoringEnabled'
            values: ['true']
    relabel_configs:
      # Use the Name tag as the instance label
      - source_labels: [__meta_ec2_tag_Name]
        target_label: 'instance'

      # Expose the Role tag as the role label
      - source_labels: [__meta_ec2_tag_Role]
        target_label: 'role'

      # Add the availability zone
      - source_labels: [__meta_ec2_availability_zone]
        target_label: 'az'

      # Copy every tag onto the target as a label
      - action: 'labelmap'
        regex: '__meta_ec2_tag_(.+)'

Meta Labels Provided by EC2 SD

__meta_ec2_instance_id: the EC2 instance ID
__meta_ec2_availability_zone: the AZ the instance runs in
__meta_ec2_instance_type: the instance type (t3.medium, c5.xlarge)
__meta_ec2_private_ip: the private IP of the instance
__meta_ec2_public_ip: the public IP (if any)
__meta_ec2_tag_<tagname>: one label per instance tag
__meta_ec2_vpc_id: the VPC the instance belongs to
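
To make the labelmap action used on these tag labels more concrete, the following Python sketch imitates it under two assumptions taken from the Prometheus relabeling documentation: the regex must match the whole label name, and labels starting with "__" are removed once relabeling finishes. The helper names and the sample instance values are hypothetical.

```python
import re


def labelmap(labels: dict, pattern: str, replacement: str = r"\1") -> dict:
    """Copy every label whose name fully matches `pattern` to a new name."""
    regex = re.compile(pattern)
    result = dict(labels)
    for name, value in labels.items():
        match = regex.fullmatch(name)
        if match:
            # The new name comes from the capture group, like '$1' in Prometheus.
            result[match.expand(replacement)] = value
    return result


def drop_internal(labels: dict) -> dict:
    """Drop labels starting with '__', as happens after target relabeling."""
    return {k: v for k, v in labels.items() if not k.startswith("__")}


if __name__ == "__main__":
    # Hypothetical labels for one discovered instance.
    discovered = {
        "__address__": "10.0.0.5:9100",
        "__meta_ec2_instance_id": "i-0abc",
        "__meta_ec2_tag_Name": "web-01",
        "__meta_ec2_tag_Environment": "production",
    }
    mapped = labelmap(discovered, r"__meta_ec2_tag_(.+)")
    print(drop_internal(mapped))
```

Only the copied tags survive as target labels; the instance ID disappears unless a rule copies it explicitly.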

Consul Service Discovery

Consul is a service registry widely used in microservice architectures. Consul SD automatically discovers the services registered in Consul.

scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        datacenter: 'dc1'
        services: []    # [] = discover all services; list names to restrict
        tags: ['metrics']    # only services that carry this tag
    relabel_configs:
      # Expose the service name as the service label
      - source_labels: [__meta_consul_service]
        target_label: 'service'

      # Expose the datacenter as a label
      - source_labels: [__meta_consul_dc]
        target_label: 'datacenter'

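      # Derive environment from the tag list (for example ",metrics,prod,").
      # Relabel regexes are fully anchored, so allow other tags on both sides.
      - source_labels: [__meta_consul_tags]
        regex: '.*,(prod|staging|dev),.*'
        target_label: 'environment'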
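
Matching on __meta_consul_tags is easy to get wrong, because Consul SD joins the tags with the tag separator and also surrounds the list with it, while relabel regexes must match the entire value. The sketch below shows the difference; the helper names are hypothetical and the tag values are made up for illustration.

```python
import re


def consul_tags_label(tags, separator=","):
    """Build the value of __meta_consul_tags: tags joined and surrounded by the separator."""
    return separator + separator.join(tags) + separator


def extract_environment(tags_label, pattern):
    """Return capture group 1 if `pattern` matches the whole label value, else None."""
    match = re.fullmatch(pattern, tags_label)
    return match.group(1) if match else None


if __name__ == "__main__":
    label = consul_tags_label(["metrics", "prod"])
    print(label)
    # Matches only when the environment is the sole tag.
    print(extract_environment(label, r",(prod|staging|dev),"))
    # Tolerates other tags before and after the environment.
    print(extract_environment(label, r".*,(prod|staging|dev),.*"))
```

A pattern without the surrounding ".*" only works for services whose single tag is the environment, which cannot happen when the SD config also filters on another tag.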

DNS Service Discovery

DNS SD discovers targets from DNS records (SRV records in particular). It suits environments that already rely on DNS-based service discovery, for example Consul DNS or CoreDNS/Kubernetes DNS.

scrape_configs:
  - job_name: 'dns-srv-discovery'
    dns_sd_configs:
      - names:
          - '_metrics._tcp.services.consul'
          - '_prometheus._tcp.example.com'
        type: 'SRV'    # SRV, A, AAAA, MX
        refresh_interval: 30s

  - job_name: 'dns-a-record'
    dns_sd_configs:
      - names:
          - 'node-exporter.monitoring.svc.cluster.local'
        type: 'A'
        port: 9100

File-based Service Discovery

File SD reads the target list from JSON or YAML files that an external script or tool can update. It is the most flexible option and fits situations where no conventional service registry exists. Prometheus watches the files for changes and reloads them automatically, and re-reads them every refresh_interval as a fallback.

# prometheus.yml
scrape_configs:
  - job_name: 'file-based'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

# File /etc/prometheus/targets/web.json
[
  {
    "targets": ["web-01:9100", "web-02:9100"],
    "labels": {
      "env": "production",
      "role": "web",
      "region": "bangkok"
    }
  },
  {
    "targets": ["staging-web-01:9100"],
    "labels": {
      "env": "staging",
      "role": "web"
    }
  }
]

File SD works well as the last step of a pipeline that pulls data from an external source (such as a CMDB, an Ansible inventory, or Terraform state) and regenerates the target files from a cron job every 5-15 minutes for Prometheus to pick up.
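
A generator for such a pipeline might look like the following Python sketch. It assumes the inventory has already been fetched into a list of dicts with "name" and "labels" keys (a made-up shape, not any CMDB's real format), groups hosts by label set, and writes the file through a temporary file plus os.replace so Prometheus never reads a half-written file.

```python
import json
import os
import tempfile


def build_target_groups(hosts, port=9100):
    """Group inventory rows by their label set into file SD target groups."""
    groups = {}
    for host in hosts:
        key = tuple(sorted(host["labels"].items()))
        groups.setdefault(key, []).append(f'{host["name"]}:{port}')
    return [
        {"targets": sorted(targets), "labels": dict(key)}
        for key, targets in groups.items()
    ]


def write_targets_atomically(path, target_groups):
    """Write JSON next to `path`, then rename it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the partial file out of the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(target_groups, handle, indent=2)
        # mkstemp creates the file as 0600; make it readable for the Prometheus user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Hypothetical inventory rows.
    inventory = [
        {"name": "web-01", "labels": {"env": "production", "role": "web"}},
        {"name": "web-02", "labels": {"env": "production", "role": "web"}},
        {"name": "staging-web-01", "labels": {"env": "staging", "role": "web"}},
    ]
    print(json.dumps(build_target_groups(inventory), indent=2))
```

Calling write_targets_atomically("/etc/prometheus/targets/web.json", groups) from the cron job would then replace the file in one step.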

HTTP Service Discovery

HTTP SD discovers targets from an external HTTP endpoint that returns JSON in the same format as File SD. It suits integration with a custom service registry or an internal company API.

scrape_configs:
  - job_name: 'http-sd'
    http_sd_configs:
      - url: 'https://cmdb.internal.example.com/api/prometheus/targets'
        refresh_interval: 60s
        authorization:
          type: 'Bearer'
          credentials_file: '/etc/prometheus/api_token'

# Response format from the endpoint (JSON)
[
  {
    "targets": ["10.0.1.10:9100", "10.0.1.11:9100"],
    "labels": {
      "env": "production",
      "team": "backend"
    }
  }
]
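
For experimenting with HTTP SD, a tiny endpoint can be put together from the Python standard library. This is only a sketch: the hard-coded TARGET_GROUPS stands in for a real registry lookup, and the handler does no authentication. It follows the documented HTTP SD contract of answering with HTTP 200, a Content-Type of application/json, and a JSON list (an empty list when there are no targets).

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Stand-in data; a real service would query its registry here.
TARGET_GROUPS = [
    {
        "targets": ["10.0.1.10:9100", "10.0.1.11:9100"],
        "labels": {"env": "production", "team": "backend"},
    }
]


def render_body(target_groups):
    """Serialize target groups as the UTF-8 JSON list HTTP SD expects."""
    return json.dumps(target_groups).encode("utf-8")


class TargetsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/api/prometheus/targets":
            self.send_error(404)
            return
        body = render_body(TARGET_GROUPS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer((host, port), TargetsHandler)
```

Running make_server(port=8000).serve_forever() starts the endpoint on a port chosen here purely as an example; the url in http_sd_configs would then point at it.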

Other Service Discovery Providers

Beyond the mechanisms above, Prometheus supports many more SD providers for specific environments.

azure_sd_configs: discovers Azure VMs by subscription/resource group
gce_sd_configs: discovers Google Compute Engine instances
digitalocean_sd_configs: discovers Droplets in DigitalOcean
hetzner_sd_configs: discovers instances in Hetzner Cloud
linode_sd_configs: discovers Linode instances
scaleway_sd_configs: discovers instances in Scaleway
docker_sd_configs: discovers containers through the Docker Engine API
dockerswarm_sd_configs: discovers services/tasks in Docker Swarm
nomad_sd_configs: discovers job allocations in HashiCorp Nomad

Best Practices for Using Service Discovery

Choose a sensible refresh_interval: if the SD source is a rate-limited API, do not set it too short; 30-60 seconds is usually enough for most systems.
Filter targets with relabel_configs: do not scrape everything SD returns; filter with the keep/drop actions on the relevant labels.
Use an IAM role instead of access keys: for EC2 SD, rely on the IAM role of the instance running Prometheus rather than putting access_key/secret_key in the config.
Watch cardinality: __meta_* labels are discarded once relabeling finishes, so only the values you copy into target labels end up on every series. Avoid copying high-churn values such as pod IPs or instance IDs, and remove any that slip through with labeldrop.
Use the namespaces filter in Kubernetes SD: to monitor only some namespaces, namespaces.names creates less load than discovering everything and dropping it during relabeling.
Test SD before production: check the /service-discovery page for the expected targets and the /targets page for successful scrapes.
Handle authentication securely: use credentials_file instead of putting tokens directly in the config.
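
The keep/drop filtering recommended above can be illustrated with a short Python sketch. It assumes the documented semantics (keep discards targets whose source label does not fully match the regex, drop discards those that do); the function name apply_filter and the sample pods are hypothetical.

```python
import re


def apply_filter(targets, source_label, pattern, action="keep"):
    """Return the targets that survive a keep or drop relabel rule."""
    if action not in ("keep", "drop"):
        raise ValueError(f"unsupported action: {action}")
    regex = re.compile(pattern)
    survivors = []
    for labels in targets:
        # A missing label behaves like an empty string.
        matched = regex.fullmatch(labels.get(source_label, "")) is not None
        if (action == "keep") == matched:
            survivors.append(labels)
    return survivors


if __name__ == "__main__":
    scrape = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
    # Hypothetical discovered pods.
    pods = [
        {"__meta_kubernetes_pod_name": "api-1", scrape: "true"},
        {"__meta_kubernetes_pod_name": "batch-1", scrape: "false"},
        {"__meta_kubernetes_pod_name": "legacy-1"},
    ]
    for pod in apply_filter(pods, scrape, "true", action="keep"):
        print(pod["__meta_kubernetes_pod_name"])
```

With keep, pods that lack the annotation are filtered out together with those that set it to "false", which is why opt-in annotations are a safe default.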

Debugging Service Discovery

When SD does not behave as expected, there are several places to check. Start with the Prometheus web UI: the /targets page shows which targets were discovered and whether they are being scraped, and the /service-discovery page shows each target's labels before and after relabeling.

# Inspect the labels of active targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'

# Inspect targets dropped during relabeling (quote the URL so the shell leaves "?" alone)
curl -s 'http://localhost:9090/api/v1/targets?state=dropped' | jq '.data.droppedTargets'

# Inspect the loaded configuration, including the SD sections
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml'

# Follow discovery-related log lines (start Prometheus with --log.level=debug)
journalctl -u prometheus -f | grep -i 'discovery'
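
When the raw JSON gets too long to read, the same /api/v1/targets response can be condensed with a few lines of Python. This is a sketch that assumes Prometheus listens on localhost:9090 as in the commands above; it relies on the labels, health, lastError, and scrapeUrl fields of activeTargets and on the droppedTargets list, and the function names are made up for this example.

```python
import json
import urllib.request
from collections import Counter


def fetch_targets(base_url="http://localhost:9090"):
    """Download the targets API response as a dict."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as response:
        return json.load(response)


def summarize(payload):
    """Count active targets per (job, health) and collect scrape errors."""
    data = payload["data"]
    active = data.get("activeTargets", [])
    health = Counter((t["labels"].get("job", ""), t["health"]) for t in active)
    errors = {t["scrapeUrl"]: t["lastError"] for t in active if t.get("lastError")}
    return {
        "health": dict(health),
        "dropped": len(data.get("droppedTargets", [])),
        "errors": errors,
    }
```

Running summarize(fetch_targets()) against a live server returns a dict with the per-job health counts, the number of targets dropped during relabeling, and the last error per failing scrape URL.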

Summary

Service Discovery is what makes monitoring workable in cloud-native environments that change constantly: Prometheus adapts to the infrastructure automatically, with no target list to maintain by hand. Choosing the SD provider that fits each environment, and using relabel_configs to filter targets and shape labels, lets monitoring scale as the infrastructure grows.

When the infrastructure spans several layers (cloud + on-premise + containers), multiple SD types can run side by side as separate scrape_configs entries, or File SD and HTTP SD can serve as a unified layer that gathers targets from the various sources, which keeps things organized and easier to maintain.