Grafana Alerting: Designing Alert Rules for a Monitoring System

Grafana Alerting is the unified alerting system introduced in Grafana 8. It brings alert rule creation, notification, and escalation together in one place, replacing the legacy dashboard alerts that were limited to the panel level. The new system is rule-based, and a single rule can combine queries from several data sources to decide whether to send an alert.

This article covers the architecture of Grafana Alerting, the steps for creating an alert rule, setting up contact points and notification policies, and best practices for designing alerts that do not spam the on-call team in a real production system.

Grafana Alerting Architecture

Grafana's alerting system has four main components that link together into a pipeline:

Alert Rule: a condition that is evaluated periodically, such as "CPU > 80% for 5 consecutive minutes"
Contact Point: a channel for sending notifications, such as Email, Slack, Discord, PagerDuty, or LINE Notify
Notification Policy: rules that determine which contact point each type of alert is sent to
Silence: temporarily suppresses alerts, for example during a maintenance window

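Grafana evaluates each rule at its configured interval (typically 1 minute). When the condition is first met, the alert instance enters the Pending state. Once the condition has stayed true for the full for duration, the instance moves to Firing and the alert is sent through the notification policy that matches the rule's labels.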
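The evaluation cycle described above can be sketched as a small state machine. This is a simplified model, not Grafana's implementation: the class name `AlertInstance`, the helper `simulate`, and the timestamps in seconds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertInstance:
    """Hypothetical model of one alert instance across evaluations."""
    for_seconds: int
    state: str = "Normal"
    pending_since: Optional[int] = None

    def evaluate(self, now: int, condition_met: bool) -> str:
        if not condition_met:
            # Any evaluation where the condition is false resets the timer.
            self.state = "Normal"
            self.pending_since = None
        elif self.pending_since is None:
            # First evaluation where the condition is true.
            self.pending_since = now
            self.state = "Firing" if self.for_seconds == 0 else "Pending"
        elif now - self.pending_since >= self.for_seconds:
            # Condition has held for the whole "for" duration.
            self.state = "Firing"
        return self.state


def simulate(samples: List[float], interval: int = 60,
             for_seconds: int = 300, threshold: float = 80.0) -> List[str]:
    """Evaluate one sample per interval and return the state after each."""
    instance = AlertInstance(for_seconds)
    return [instance.evaluate(i * interval, value > threshold)
            for i, value in enumerate(samples)]
```

With a 1-minute interval and for: 5m, a single dip below the threshold sends the instance back to Normal and restarts the wait, which is what keeps short spikes from paging anyone.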

Creating Your First Alert Rule

Create a new alert rule in the UI under Alerting → Alert rules → New alert rule. There are five main parts:

Name: use a clear name, such as HighCPUUsage rather than alert1
Query: write PromQL, or a query in the language of the selected data source
Condition: set the threshold, such as IS ABOVE 80
Evaluation: set the interval and the for duration
Labels & Annotations: add metadata for routing and for the notification message

Example Alert Rule: High CPU Usage

apiVersion: 1
groups:
  - orgId: 1
    name: system
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: high-cpu-usage
        title: HighCPUUsage
        condition: C
        data:
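          - refId: A
            # Assumed 10-minute query window; adjust to your environment.
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prom-prod
            model:
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              # Instant query: the threshold expression needs a single value per series.
              instant: true
              refId: A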
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              conditions:
                - evaluator:
                    type: gt
                    params: [80]
              expression: A
              refId: C
        for: 5m
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $values.A }}%, above 80% for more than 5 minutes"

Key points of this rule: it applies rate() to node_cpu_seconds_total with mode="idle" and inverts the result into a usage percentage; the threshold is > 80; and the condition must hold continuously for 5 minutes (for: 5m), which filters out brief CPU spikes that are not real problems.
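The arithmetic behind the query can be checked by hand. The sketch below assumes one idle counter per CPU core; the function name `cpu_usage_percent` and the sample numbers are illustrative only, and it ignores details such as counter resets that PromQL's rate() handles.

```python
from typing import List


def cpu_usage_percent(idle_deltas: List[float], window_seconds: float) -> float:
    """Approximate 100 - avg(rate(idle)) * 100 for one instance.

    idle_deltas: increase of each core's idle counter (in seconds)
    over the window, one entry per core.
    """
    # Per-core idle rate: idle seconds accumulated per second of wall time.
    idle_rates = [delta / window_seconds for delta in idle_deltas]
    # avg by(instance) collapses the per-core rates into one value.
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100
```

For example, two cores that were idle for 30 s and 60 s out of a 300 s window average 15% idle, which the rule reports as 85% usage and therefore above the threshold of 80.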

Labels and Annotations

Labels and annotations are what make an alert easy to understand and route to the right people. The two play clearly different roles:

Type	Purpose	Examples
Labels	Used to route and group alerts (changing a value changes the routing path)	`severity`, `team`, `environment`
Annotations	Information for humans to read (not used for routing)	`summary`, `description`, `runbook_url`

Good practice: define severity as a clear, fixed set of values such as critical, warning, and info, then use it in the notification policy to route to different contact points: critical goes to PagerDuty, warning goes to a Slack channel.

Templates in Annotations

Annotations support Go template syntax, so the message changes with the actual values of the alert:

annotations:
  summary: "Disk {{ $labels.mountpoint }} is {{ $values.A }}% full"
  description: |
    Host: {{ $labels.instance }}
    Mount: {{ $labels.mountpoint }}
    Free space: {{ humanize1024 $values.B.Value }}
    Runbook: https://wiki/runbook/disk-full

$labels.X is the value of label X on the time series, and $values.A is the value returned by the query or expression with refId A. Together they give the alert message enough detail to start investigating without opening a dashboard.
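To give a feel for what a function like humanize1024 does to a raw byte count, here is a minimal Python sketch. It is modeled on the Prometheus template function of the same name as an assumption; Grafana's exact output formatting may differ.

```python
import math


def humanize1024(value: float) -> str:
    """Format a number with binary (base-1024) unit prefixes."""
    # Small and non-finite values are printed without a prefix.
    if abs(value) <= 1 or math.isnan(value) or math.isinf(value):
        return "%.4g" % value
    prefix = ""
    for candidate in ("ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"):
        if abs(value) < 1024:
            break
        prefix = candidate
        value /= 1024
    return "%.4g%s" % (value, prefix)
```

A free-space value of 5368709120 bytes would then appear in the notification as a short figure in Gi rather than a ten-digit number.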

Contact Points

A contact point is the destination of an alert. Grafana supports many integration types:

Email: uses the SMTP settings configured in grafana.ini
Slack: uses an Incoming Webhook URL
Discord: also webhook-based
PagerDuty: uses an integration key for on-call
Microsoft Teams: a webhook URL from a connector
Webhook: HTTP POST to an endpoint you define; the most flexible option
LINE Notify: suits teams that use LINE as their main communication channel (requires a webhook relay)
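For the generic Webhook option, the receiving endpoint has to parse the JSON body that Grafana posts. The sketch below shows only the parsing step and assumes an Alertmanager-style payload with a top-level alerts list whose entries carry status, labels, and annotations; the function name `summarize_webhook` is hypothetical, and the payload shape should be verified against your Grafana version.

```python
import json
from typing import List


def summarize_webhook(body: str) -> List[str]:
    """Turn a webhook request body into one summary line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{}] {} ({}): {}".format(
                alert.get("status", "unknown"),
                labels.get("alertname", "unnamed"),
                labels.get("severity", "none"),
                annotations.get("summary", ""),
            )
        )
    return lines
```

A relay to a chat service would call a function like this inside its HTTP handler and forward the resulting lines.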

Example Slack Contact Point

apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-infra-alerts
    receivers:
      - uid: slack-infra
        type: slack
        settings:
          url: https://hooks.slack.com/services/XXX/YYY/ZZZ
          recipient: '#infra-alerts'
          title: '{{ template "slack.default.title" . }}'
          text: '{{ template "slack.default.text" . }}'
          mentionGroups: 'S00000000'
          mentionChannel: 'here'

Use mentionChannel: 'here' to notify only people who are currently online, or 'channel' if everyone needs to see it. Be careful, though: overusing @channel numbs the team and they start ignoring alerts.

Notification Policy

A notification policy is a tree structure that decides where each alert's label set is sent. It is evaluated top-down: among sibling routes, the first one whose matchers all match is used, and evaluation stops there unless continue: true is set.

apiVersion: 1
policies:
  - orgId: 1
    receiver: default-email
    group_by: ['alertname', 'severity']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: pagerduty-oncall
        matchers:
          - severity = critical
        group_wait: 10s
        continue: false
      - receiver: slack-infra-alerts
        matchers:
          - team = infra
          - severity =~ warning|info
        continue: false
      - receiver: slack-app-team
        matchers:
          - team = app
        continue: false

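Parameters worth knowing:

group_by: combines alerts that share the same values for these labels into a single notification
group_wait: how long to wait for other alerts to join the group before sending the first notification (reduces noise)
group_interval: how long to wait before sending a new notification for an existing group after alerts are added to it
repeat_interval: how long to wait before re-sending a notification for an alert that is still unresolved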
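To see what group_by does to a burst of alerts, the following sketch buckets alerts by the values of the grouping labels. It models only the grouping key, not the timers, and the function name `group_alerts` is an assumption for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_alerts(alerts: List[Dict[str, str]],
                 group_by: List[str]) -> Dict[Tuple[str, ...], List[Dict[str, str]]]:
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with identical values for every group_by label share a key.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

With group_by set to alertname and severity, three hosts firing the same warning collapse into one notification, while a critical alert with the same name forms its own group.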

Silences and Mute Timings

A silence temporarily suppresses alerts, for example during a new deployment when you do not want alerts flooding in:

Create one under Alerting → Silences → New silence
Add matchers that match the labels of the alerts you want to silence, such as alertname = HighCPUUsage
Set the start and end time; keeping it under 2 hours is recommended so that it is not forgotten

A mute timing differs from a silence in that it is a recurring rule, such as "do not send this type of alert outside working hours". It suits alerts that should not wake anyone in the middle of the night.
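The recurring nature of a mute timing can be illustrated with a small check. The intervals below encode an assumed working week of Monday to Friday, 09:00 to 18:00, in local time; the names `OFF_HOURS` and `is_muted` are hypothetical and this is not how Grafana stores mute timings.

```python
from datetime import datetime

# Each entry: (weekdays, start_minute, end_minute); Monday is 0 and
# end_minute is exclusive. Together they cover everything outside
# the assumed working hours.
OFF_HOURS = [
    ({0, 1, 2, 3, 4}, 0, 9 * 60),         # weekdays before 09:00
    ({0, 1, 2, 3, 4}, 18 * 60, 24 * 60),  # weekdays from 18:00
    ({5, 6}, 0, 24 * 60),                 # all of Saturday and Sunday
]


def is_muted(now: datetime, intervals=OFF_HOURS) -> bool:
    """Return True if notifications should be held back at this time."""
    minute_of_day = now.hour * 60 + now.minute
    return any(
        now.weekday() in days and start <= minute_of_day < end
        for days, start, end in intervals
    )
```

Unlike a silence, nothing here has an expiry date: the same windows apply every week until the rule is changed.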

Multi-dimensional Alerts

A query in Grafana Alerting that returns multiple time series creates a separate alert instance for each label combination. This is called multi-dimensional alerting:

expr: (node_filesystem_free_bytes / node_filesystem_size_bytes) * 100 < 10
# With 5 hosts x 3 mountpoints = 15 series, this yields up to 15 alert instances (one per series below 10%)

The benefit: if server A's disk fills up, server B's alert is unaffected. Each instance has its own state and is resolved independently, and there is no need to create a separate rule for every host.
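The independence of instances can be sketched by keying state on the full label set. This is an illustrative model only: the function name `alert_instances`, the state names, and the sample hosts are assumptions, and the threshold is applied in code rather than in the query.

```python
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], float]


def alert_instances(series: List[Series],
                    threshold: float = 10.0) -> Dict[tuple, str]:
    """Track one state per label combination from a single rule."""
    states = {}
    for labels, free_percent in series:
        # The sorted label pairs identify the instance.
        key = tuple(sorted(labels.items()))
        states[key] = "Alerting" if free_percent < threshold else "Normal"
    return states
```

One rule evaluated over fifteen series produces fifteen independent states, so a full disk on one mountpoint changes exactly one of them.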

Always Use a for Duration

An alert without for fires as soon as the condition is true, even for a single scrape interval, which produces false positives from brief spikes. Set for to at least twice the scrape interval.

Define Severity Clearly

Severity	Urgency	Response
critical	System down, users impacted	Page on-call immediately (PagerDuty)
warning	Worrying trend	Notify Slack during working hours
info	Good to know, not urgent	Log to an email summary

Always Include a Runbook Link

The runbook_url annotation should point to a document that explains how to fix the problem. An on-call engineer woken at 3 a.m. has no time to work out a fix from scratch, and a good runbook greatly reduces MTTR.

Avoid Alert Fatigue

Do not alert on every metric; alert only on what is truly actionable
Use SLO-based alerts instead of threshold-based ones where possible
Review alerts every month and delete those nobody responds to
Tune group_by and group_wait so that similar alerts are combined into one notification
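One common way to build an SLO-based alert is to alert on the error budget burn rate rather than on a raw threshold. The sketch below is an illustration under assumptions, not something the list above prescribes: the two-window check and the 14.4 paging threshold are taken from widely used SRE guidance, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1 means the budget lasts exactly the SLO period;
    higher values exhaust it proportionally sooner.
    """
    error_budget = 1 - slo_target
    return error_ratio / error_budget


def should_page(short_window_burn: float, long_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the short and long windows burn too fast."""
    return short_window_burn > threshold and long_window_burn > threshold
```

Requiring both windows to exceed the threshold means a brief error burst does not page, while a sustained one does, which ties the alert to user impact instead of to an arbitrary metric level.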

Testing and Debugging Alerts

Before deploying a new alert rule, test it:

Click Preview alerts in the rule editor to see which instances the current query returns
Use Test on the contact point to confirm that the webhook URL actually works
Check the rule's State history to see how often it has fired
Check the Alert rule details; if the rule is in the Error state, the cause is shown there

Managing Alert Rules with GitOps

A good production setup manages alert rules through provisioning files committed to Git, so that every change goes through code review:

# docker-compose.yml
services:
  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - ./provisioning:/etc/grafana/provisioning
      - ./alerting:/etc/grafana/provisioning/alerting
    environment:
      GF_ALERTING_ENABLED: 'false'  # legacy alerting must stay off; it cannot be enabled together with unified alerting
      GF_UNIFIED_ALERTING_ENABLED: 'true'

The alert rule, contact point, and notification policy files live in a separate folder. After an edit, restarting Grafana loads the new configuration automatically, with no further clicking in the UI.
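Once rules live in Git, a CI step can enforce the conventions from this article before a change is merged. The sketch below assumes the rule has already been parsed from YAML into a dictionary; the function name `lint_rule` and the specific checks are illustrative choices, not a Grafana feature.

```python
from typing import List

ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def lint_rule(rule: dict) -> List[str]:
    """Return a list of convention violations for one alert rule."""
    problems = []
    severity = rule.get("labels", {}).get("severity")
    if severity not in ALLOWED_SEVERITIES:
        problems.append("severity label must be one of critical, warning, info")
    if "runbook_url" not in rule.get("annotations", {}):
        problems.append("missing runbook_url annotation")
    if not rule.get("for"):
        problems.append("missing for duration")
    return problems
```

Run against the HighCPUUsage rule shown earlier, a check like this would flag the absent runbook_url while accepting its severity and for settings.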

Summary

Grafana Alerting is a powerful system when it is designed well. A good alert rule is actionable, has a clear severity, links to a runbook, and routes to the team that is actually responsible. Using provisioning instead of the UI makes alert configuration portable, reproducible, and auditable.

The biggest thing to watch for is alert fatigue: too many alerts make the team ignore the real ones when a problem occurs. Review the alert list regularly and delete rules that nobody acts on.