Troubleshooting Prometheus: When Metrics Go Missing or Queries Slow Down

When a Prometheus-based monitoring stack starts misbehaving, the symptoms are usually familiar: some metrics vanish from graphs, dashboards take a long time to load, queries that used to be fast become unusually slow, or historical data cannot be retained. Most of these problems share a small set of recurring causes and can be resolved with a systematic check. Knowing where to look first, and which tool applies when, substantially reduces downtime for the observability stack.

This article collects common Prometheus troubleshooting techniques, organized into two main problem groups: metrics that are missing or inaccurate, and queries that are slow or memory-hungry. Each case comes with example commands and the metrics to check.

Step 1: Check target status on /targets

Open the Prometheus UI and go to Status > Targets. A target in the DOWN state means Prometheus cannot scrape that endpoint. Check the Error column, which usually states the cause clearly, for example connection refused, timeout, or context deadline exceeded. Common causes include:

The scraped service is not running, or its port has changed
A firewall or security group is blocking the Prometheus server
DNS resolution fails (affects static_configs that use hostnames)
The TLS certificate has expired (affects HTTPS endpoints)
A new service has not yet been registered in service discovery
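
The same check can be scripted against the HTTP API. The following is a minimal sketch, assuming a Prometheus server reachable at an address such as http://prometheus:9090 (an assumption, as are the helper names fetch_targets and down_targets). The sample payload is hand-written in the shape of the /api/v1/targets response and is not real output.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url: str) -> dict:
    """Fetch /api/v1/targets from a Prometheus server (base_url is an assumption)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def down_targets(payload: dict) -> list:
    """Return (job, scrape URL, last error) for every active target that is not up."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            rows.append((
                target.get("labels", {}).get("job", ""),
                target.get("scrapeUrl", ""),
                target.get("lastError", ""),
            ))
    return rows


# Hand-written sample shaped like the API response, not real output.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node"}, "scrapeUrl": "http://target-host:9100/metrics",
         "health": "down", "lastError": "connection refused"},
        {"labels": {"job": "api"}, "scrapeUrl": "http://api-host:8080/metrics",
         "health": "up", "lastError": ""},
    ]},
}

for job, url, error in down_targets(sample):
    print(f"{job}\t{url}\t{error}")
```

Against a live server, down_targets(fetch_targets("http://prometheus:9090")) would list the same information as the Error column in the UI.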

Step 2: Test the scrape endpoint directly

If the target is UP but some metrics are still missing, test the scrape endpoint with curl directly from the Prometheus server to confirm whether the exporter actually exposes those metrics.

curl -s http://target-host:9100/metrics | head -50
curl -s http://target-host:9100/metrics | grep -c "^# TYPE"
curl -s http://target-host:9100/metrics | grep "your_metric_name"

If the first command returns metrics but the one you need is absent, the exporter is not collecting that metric (you may need to enable an additional flag). If nothing is returned at all, the exporter itself has a problem; check the exporter's logs first.
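
When the same check has to run across many exporters, the grep step can be reproduced in a few lines of code. This is a rough sketch, assuming the exporter output has already been saved as text; metric_names is a hypothetical helper that only reads "# TYPE" lines, and the sample text is hand-written for illustration.

```python
def metric_names(exposition: str) -> set:
    """Collect metric names declared by '# TYPE <name> <type>' lines."""
    names = set()
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


# Hand-written sample in the Prometheus text exposition format.
sample_text = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 1.2e+09
"""

exposed = metric_names(sample_text)
for wanted in ("node_load1", "your_metric_name"):
    status = "exposed" if wanted in exposed else "MISSING from exporter output"
    print(f"{wanted}: {status}")
```

A metric reported as missing here never reached Prometheus at all, so the relabel rules in the next step are not the cause.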

Step 3: Check relabel_configs and metric_relabel_configs

A frequent reason metrics appear to be "missing" is a faulty relabel rule: the label a query depends on no longer exists, or a metric is dropped unintentionally. An example of relabel rules that often go wrong:

scrape_configs:
  - job_name: 'node'
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_memory_.*'
        action: keep     # drops every metric that is not a memory metric
      - source_labels: [instance]
        regex: '(.*):(.*)'
        target_label: host
        replacement: '${1}'
      - action: labeldrop
        regex: 'instance'   # removes the instance label entirely

Use the Prometheus endpoints /api/v1/targets/metadata and /api/v1/label/__name__/values to check whether the metric you expect actually exists in the system. If it does not, review both relabel sections carefully.
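
To see why the rules above make series "disappear", it can help to trace one series through them by hand. The sketch below is a simplified simulation written for this article, not Prometheus code: apply_metric_relabel is a hypothetical helper that supports only the keep, replace, and labeldrop actions, and it assumes fully anchored regex matching, which is how Prometheus treats relabel regexes.

```python
import re
from typing import Optional


def apply_metric_relabel(labels: dict, rules: list) -> Optional[dict]:
    """Apply a simplified subset of relabel actions; return None if the series is dropped."""
    labels = dict(labels)
    for rule in rules:
        action = rule.get("action", "replace")
        regex = re.compile(rule.get("regex", "(.*)"))
        if action == "labeldrop":
            # labeldrop matches the regex against label names, not values.
            labels = {k: v for k, v in labels.items() if not regex.fullmatch(k)}
            continue
        source = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        match = regex.fullmatch(source)
        if action == "keep":
            if match is None:
                return None
        elif action == "replace" and match is not None:
            # Prometheus writes capture groups as ${1}; Python's re uses \g<1>.
            template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", rule.get("replacement", "${1}"))
            labels[rule["target_label"]] = match.expand(template)
    return labels


# The three rules from the example config above.
rules = [
    {"source_labels": ["__name__"], "regex": "node_memory_.*", "action": "keep"},
    {"source_labels": ["instance"], "regex": "(.*):(.*)",
     "target_label": "host", "replacement": "${1}"},
    {"action": "labeldrop", "regex": "instance"},
]

memory_series = {"__name__": "node_memory_MemFree_bytes",
                 "instance": "target-host:9100", "job": "node"}
load_series = {"__name__": "node_load1", "instance": "target-host:9100", "job": "node"}

print(apply_metric_relabel(memory_series, rules))  # instance is gone, host is added
print(apply_metric_relabel(load_series, rules))    # dropped by the keep rule
```

Any dashboard query that filters on instance returns nothing for the memory series after these rules, and node_load1 never reaches storage at all.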

Step 4: Check scrape_duration_seconds and up{}

Let Prometheus monitor itself (self-monitoring) with the queries below to find targets that fail to scrape or take a long time to scrape:

# Find targets that are down
up == 0

# Find targets whose scrape takes more than 5 seconds (approaching the default 10s scrape_timeout)
scrape_duration_seconds > 5

# Find targets where metric relabeling dropped samples
scrape_samples_post_metric_relabeling / scrape_samples_scraped < 1

# Total number of series in the system
prometheus_tsdb_head_series
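
These expressions can also be run outside the UI through the instant-query endpoint, /api/v1/query. The sketch below shows one way to build the request URL and read a vector result; the base address, the helper names query_url and vector_rows, and the sample response are assumptions made for illustration.

```python
from urllib.parse import urlencode


def query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the given PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def vector_rows(payload: dict) -> list:
    """Turn an instant-query vector response into (labels, float value) pairs."""
    return [(row["metric"], float(row["value"][1]))
            for row in payload["data"]["result"]]


# Hand-written sample shaped like an instant-query response, not real output.
sample_response = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "scrape_duration_seconds", "job": "node",
                    "instance": "target-host:9100"},
         "value": [1700000000, "7.2"]},
    ]},
}

print(query_url("http://prometheus:9090", "scrape_duration_seconds > 5"))
for labels, seconds in vector_rows(sample_response):
    print(labels["job"], labels["instance"], seconds)
```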

Step 5: Check staleness and lookback

Prometheus has a concept of staleness. At query time it looks back at most 5 minutes (the default lookback delta) for the most recent sample of a series; if none is found, the series is treated as absent and the graph shows a gap. This shows up when the scrape interval is longer than 5 minutes or when scrapes fail frequently. The fix is to shorten scrape_interval or adjust --query.lookback-delta when starting Prometheus.
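
The effect of the lookback window can be illustrated with a toy model. This is a deliberately simplified sketch: value_at is a hypothetical function that returns the newest sample inside the lookback window, and it ignores the staleness markers Prometheus writes when a target or series disappears.

```python
def value_at(samples, eval_time, lookback=300.0):
    """Return the newest sample value in (eval_time - lookback, eval_time], or None.

    samples is a list of (timestamp_seconds, value) pairs sorted by timestamp.
    """
    in_window = [value for ts, value in samples
                 if eval_time - lookback < ts <= eval_time]
    return in_window[-1] if in_window else None


# A series scraped every 10 minutes (600 s), longer than the 5-minute lookback.
sparse = [(0, 1.0), (600, 2.0), (1200, 3.0)]

for t in (200, 400, 600, 1000):
    print(t, value_at(sparse, t))
```

Evaluation times that fall more than 300 seconds after the last scrape get no value, which is exactly the kind of gap that appears in a graph.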

Common symptoms of slow queries and high memory usage

Dashboards load slowly, or some panels time out
Queries that used to return in 1 second now take 10+ seconds
Prometheus memory keeps growing until it is nearly exhausted
Error messages such as "too many samples in query" or "query timeout"
The server's OOM killer frequently kills the Prometheus process

Root cause: cardinality that is too high

Cardinality is the number of unique time series Prometheus stores. Adding a label with many unique values (such as user_id, request_id, or session_id) makes cardinality grow very quickly. As a rough rule, a single Prometheus instance handles up to about 1-5 million active series comfortably; beyond that, queries slow down and memory usage climbs.
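
A quick back-of-the-envelope calculation shows how fast this grows. The sketch below multiplies the number of distinct values per label to get a worst-case series count for one metric; the label names and counts are made-up illustration values, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_value_counts: dict) -> int:
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single metric.
base = {"method": 5, "code": 10, "instance": 20}
with_user_id = {**base, "user_id": 10_000}

print(worst_case_series(base))
print(worst_case_series(with_user_id))
```

One extra label with 10,000 values turns a metric with at most a thousand series into one with up to ten million, which already exceeds the rough per-instance range mentioned above.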

Use these queries to find where cardinality is high:

# Total number of series
prometheus_tsdb_head_series

# Top 10 metrics with the most series
topk(10, count by (__name__)({__name__=~".+"}))

# Top 10 label/value pairs by series count (from /api/v1/status/tsdb)
# or use promtool tsdb analyze
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByLabelValuePair[:10]'
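
The same endpoint also reports how many distinct values each label name has, in the labelValueCountByLabelName field. The following sketch ranks entries from a saved response; top_entries is a hypothetical helper, and the sample payload is hand-written in the shape of the /api/v1/status/tsdb response rather than taken from a real server.

```python
def top_entries(tsdb_status: dict, field: str, n: int = 10) -> list:
    """Return the n largest (name, value) entries from one field of the TSDB status."""
    entries = tsdb_status.get("data", {}).get(field, [])
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:n]]


# Hand-written sample, not real output.
sample_status = {
    "status": "success",
    "data": {
        "labelValueCountByLabelName": [
            {"name": "__name__", "value": 800},
            {"name": "user_id", "value": 120000},
            {"name": "instance", "value": 40},
        ],
    },
}

for name, count in top_entries(sample_status, "labelValueCountByLabelName", n=2):
    print(f"{name}: {count} distinct values")
```

A label such as user_id at the top of this list is usually the first candidate for removal.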

If a metric has abnormally high cardinality, use metric_relabel_configs to remove unnecessary labels, or stop scraping that metric altogether. Where possible, switch to a histogram instead of tracking each individual value as a unique series.

Analyzing TSDB data with promtool

promtool ships with Prometheus and analyzes the TSDB directly, which helps find problematic blocks or metrics that consume a lot of disk:

# Analyze the most recent block
promtool tsdb analyze /var/lib/prometheus/data

# The output reports, among other things:
# - Label names with highest cardinality
# - Label value pairs with most series
# - Series with highest churn rate (created and removed frequently)
# - Label pair value sizes

# List the blocks in the data directory, including older ones
promtool tsdb list /var/lib/prometheus/data

Query Optimization

A poorly written query can make Prometheus load a large number of series into memory for the computation, even when the final result is only a few values. Guidelines for writing good queries:

Always use specific label selectors: http_requests_total{job="api",code="500"} is better than a bare http_requests_total
Avoid {__name__=~".+"} because it matches every metric
Use recording rules for frequently used queries instead of recomputing them in the dashboard each time
Be careful with rate() over very long ranges: rate(metric[7d]) loads 7 days of samples into memory
Aggregate explicitly with sum by / avg by instead of leaving the aggregation unspecified
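
The cost of a long range selector is easy to estimate. The sketch below is rough arithmetic only, assuming one sample per scrape and a hypothetical selector that matches 1,000 series scraped every 15 seconds; samples_touched is an illustrative helper, not a Prometheus function.

```python
def samples_touched(series: int, range_seconds: int, scrape_interval_seconds: int) -> int:
    """Approximate samples read for one evaluation of a range selector."""
    return series * (range_seconds // scrape_interval_seconds)


FIVE_MINUTES = 5 * 60
SEVEN_DAYS = 7 * 24 * 60 * 60

print(samples_touched(1000, FIVE_MINUTES, 15))  # rate(metric[5m])
print(samples_touched(1000, SEVEN_DAYS, 15))    # rate(metric[7d])
```

Under these assumptions the 7-day range reads about two thousand times more samples per evaluation than the 5-minute range, and a graph panel repeats that work at every step.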

Example: recording rules that reduce query load

groups:
  - name: api_recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, code) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))

      - record: job:http_error_ratio:5m
        expr: |
          job:http_errors:rate5m
          /
          sum by (job) (job:http_requests:rate5m)

When a dashboard queries job:http_error_ratio:5m, Prometheus returns the precomputed result immediately instead of redoing the rate, sum, and division each time.

Memory Sizing Rule of Thumb

Prometheus needs roughly 1-3 KB of working memory per active time series. An instance holding 2 million series should therefore have about 2-6 GB of RAM at minimum, plus a further 30% buffer for queries that load additional data. Check memory usage with:

# Resident memory actually in use
process_resident_memory_bytes{job="prometheus"}

# Number of series
prometheus_tsdb_head_series

# Chunks held in memory
prometheus_tsdb_head_chunks

# Series creation rate per second (a high value means heavy churn)
rate(prometheus_tsdb_head_series_created_total[5m])
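
The sizing rule of thumb above can be written out as a small calculation. This sketch assumes 1 KB = 1024 bytes and applies the 30% buffer on top of the per-series estimate; memory_estimate_gib is an illustrative helper, and the result is a planning figure, not a measurement.

```python
def memory_estimate_gib(active_series: int, bytes_per_series: int,
                        headroom: float = 0.30) -> float:
    """Estimated RAM in GiB for the given series count, including query headroom."""
    return active_series * bytes_per_series * (1 + headroom) / 2**30


series = 2_000_000
low = memory_estimate_gib(series, 1 * 1024)   # 1 KB per series
high = memory_estimate_gib(series, 3 * 1024)  # 3 KB per series

print(f"plan for roughly {low:.1f} to {high:.1f} GiB")
```

Comparing this range with process_resident_memory_bytes shows whether the instance is running within the plan or already eating into its headroom.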

Disk and Retention

A full disk usually results from a long retention period, high cardinality, and a short scrape interval combined, which makes data accumulate very quickly. Estimate disk usage with this formula:

Disk = retention_days * active_series * samples_per_day * bytes_per_sample

Example:
  retention = 30 days
  active_series = 1,000,000
  scrape_interval = 15 seconds (5760 samples/day)
  bytes_per_sample = 1.3 bytes (after compression)

Disk ≈ 30 * 1,000,000 * 5760 * 1.3 = 225 GB
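
The same estimate can be reproduced in code so that other retention periods or scrape intervals are easy to try. This is a small sketch of the formula above; disk_bytes is an illustrative helper, and 1.3 bytes per sample is the article's assumed post-compression figure, which varies with the data.

```python
def disk_bytes(retention_days: float, active_series: int,
               scrape_interval_seconds: float, bytes_per_sample: float) -> float:
    """Estimated on-disk size in bytes for the given retention and ingest rate."""
    samples_per_day = 86_400 / scrape_interval_seconds
    return retention_days * active_series * samples_per_day * bytes_per_sample


estimate_gb = disk_bytes(30, 1_000_000, 15, 1.3) / 1e9
print(f"about {estimate_gb:.0f} GB")

# Halving the retention period halves the estimate.
print(f"about {disk_bytes(15, 1_000_000, 15, 1.3) / 1e9:.0f} GB")
```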

If disk usage reaches 90%, consider reducing the retention period, adding storage, or moving older data to remote storage such as Thanos, Cortex, or Mimir.

Alerts not firing or delivered late

Beyond metrics and query problems, alerts that should fire but do not, or that arrive late, are another common issue. Check in this order:

Check the alert rule at /alerts: the state must be firing, not pending (pending means the rule's for duration has not yet elapsed)
Check the Alertmanager UI to see whether it received the alert
Check the Alertmanager config to confirm the route leads to the right channel (Slack/Email/PagerDuty)
Check inhibit_rules for a rule that suppresses this alert
Check for active silences
Check the Alertmanager logs for errors during delivery
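
The first check in the list can be automated by reading the Prometheus /api/v1/alerts endpoint and grouping alerts by state. The sketch below works on an already-fetched response; alerts_by_state is a hypothetical helper, and the alert names and sample payload are made up for illustration.

```python
from collections import defaultdict


def alerts_by_state(payload: dict) -> dict:
    """Group alert names from an /api/v1/alerts response by their state."""
    grouped = defaultdict(list)
    for alert in payload.get("data", {}).get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "")
        grouped[alert.get("state", "unknown")].append(name)
    return dict(grouped)


# Hand-written sample shaped like the API response, not real output.
sample_alerts = {
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "HighErrorRatio", "job": "api"}, "state": "firing"},
        {"labels": {"alertname": "DiskAlmostFull", "job": "node"}, "state": "pending"},
    ]},
}

for state, names in alerts_by_state(sample_alerts).items():
    print(state, names)
```

An alert that stays in the pending group is still inside Prometheus and has not been sent yet, so the Alertmanager checks further down the list do not apply to it.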

Troubleshooting Checklist

Check /targets: every target should be UP
Query up == 0: the result should be empty
Query rate(prometheus_tsdb_head_series_created_total[5m]): should not exceed about 1000/s
Check prometheus_tsdb_head_series: compare against planned capacity
Run promtool tsdb analyze monthly to spot abnormal cardinality growth
Alert on prometheus_notifications_errors_total to catch Alertmanager delivery problems
Check process_resident_memory_bytes for Prometheus: keep at least 30% headroom

Summary

Most Prometheus problems can be diagnosed by having Prometheus monitor itself (self-monitoring), without logging in to the server. A dashboard that tracks target status, scrape duration, series count, and Prometheus's own memory helps the team see problems before users report them.

For systems where cardinality is growing, consider moving to remote storage such as Thanos or Mimir to separate the query layer from the storage layer, which scales more easily than growing a single Prometheus instance.