Infrastructure Monitoring: Monitoring VPS Health, Status, and Uptime

Infrastructure monitoring means watching VPS instances and servers at both the hardware-resource level and the service-availability level, so that the team knows in advance when a system is about to fail or a resource is close to exhaustion. That gives the DevOps team time to scale or fix things before users are affected. Good monitoring covers health metrics (CPU, RAM, disk, network), status checks (process running, port listening), and uptime measurement (downtime tracking, SLA reporting).

This article describes a four-layer model for infrastructure monitoring, the tooling suited to each layer, how to set alerts that are not noisy, how to calculate uptime percentage, and the difference between active and passive checks, so that you can design a system that fits the scale your team operates.

The 4 Layers of Infrastructure Monitoring

Splitting monitoring into layers helps you pick the right tool for each one and avoid overlap.

Hardware Layer: CPU, RAM, disk I/O, disk space, network throughput, temperature, fan speed
OS Layer: load average, context switches, open files, process count, swap usage, zombie processes
Service Layer: systemd status, port listening, certificate expiry, log error rate, application-specific metrics
Uptime Layer: external ping, HTTP check, TCP check, response time from multiple regions

Tools Suited to Each Layer

No single tool does everything well. Choosing tools that fit both the layer and the team reduces maintenance overhead.

Layer	Popular tools	Best suited for
Hardware + OS	Node Exporter + Prometheus	Linux servers of any size; time-series collection
Hardware (agentless)	SNMP Exporter, IPMI Exporter	Network switches, BMC
Service	Blackbox Exporter, Telegraf	HTTP/TCP/ICMP probes, application metrics
Uptime	Uptime Kuma, UptimeRobot, Pingdom	External checks, status pages
All-in-one	Zabbix, Nagios, Icinga, Checkmk	Teams managing large server fleets

Active Check vs Passive Check

A monitoring system collects data in two main ways, each with its own trade-offs.

Active (Pull): the monitoring system queries the target over HTTP/TCP/SNMP, for example a Prometheus scrape. The advantage is that you know the target still responds; the drawback is that the monitoring host must be able to reach the target.
Passive (Push): the target sends data to the monitor, for example Telegraf → InfluxDB. The advantage is that it works through firewalls and NAT; the drawback is that a dead target goes unnoticed unless you add a separate “heartbeat”.

In general, use active checks for regular metrics and passive checks for event-driven data such as application traces or logs.
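As an illustration of the active style, here is a minimal Python sketch of a TCP "port listening" check. The function name `tcp_port_open` and the default timeout are assumptions made for this example; real setups would normally delegate this to a probe such as Blackbox Exporter.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Active check: return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A call such as `tcp_port_open("127.0.0.1", 22)` only confirms that something accepts connections on the port. It says nothing about whether the service behind it is healthy, which is why the service layer also needs HTTP or application-level checks.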

Key Hardware Metrics

These are the baseline metrics to track on every VPS, with recommended thresholds drawn from experience running production systems.

Metric	Warning	Critical	Notes
CPU usage (avg 5m)	> 70%	> 90%	Also check CPU steal on a VPS
Load average (1m/core)	> 1.5	> 3.0	Divide by core count before comparing
Memory available	< 20%	< 10%	"available" and "free" differ; use available
Swap usage	> 30%	> 60%	Skip this metric if no swap is configured
Disk usage	> 80%	> 90%	Track / and /var/log separately
Disk I/O wait	> 20%	> 40%	Indicates a disk bottleneck
Inode usage	> 80%	> 90%	Grows quickly on mail servers
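The table can be read as a simple classification rule. The sketch below shows one way to express it in Python; `classify` and `load_per_core` are hypothetical helper names, and the threshold values are the ones from the table, not universal constants.

```python
from typing import Literal

Severity = Literal["ok", "warning", "critical"]


def classify(value: float, warning: float, critical: float,
             lower_is_worse: bool = False) -> Severity:
    """Map a metric value to a severity using a warning/critical pair."""
    if lower_is_worse:
        # For metrics such as available memory, smaller values are more severe.
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
        return "ok"
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def load_per_core(load_1m: float, cores: int) -> float:
    """Normalize the 1-minute load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return load_1m / cores


# CPU usage of 75% against the 70/90 thresholds.
print(classify(75.0, warning=70, critical=90))
# 15% available memory against the 20/10 thresholds (lower is worse).
print(classify(15.0, warning=20, critical=10, lower_is_worse=True))
# A load of 8.0 on a 4-core VPS is 2.0 per core.
print(classify(load_per_core(8.0, 4), warning=1.5, critical=3.0))
```

Normalizing load by core count before comparing is what keeps one threshold pair usable across VPS sizes.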

Setting Alert Rules in Prometheus

Example rules for detecting common problems on a VPS:

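groups:
- name: infra.rules
  rules:
  # Fires when Prometheus has been unable to scrape the target for 2 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} is down"

  # More than 90% of memory in use, i.e. less than 10% available,
  # which is the critical level in the threshold table.
  - alert: HighMemoryUsage
    expr: |
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
    for: 5m
    labels:
      severity: critical

  # Extrapolates the last hour of free space 4 hours ahead.
  - alert: DiskFillingUp
    expr: |
      predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[1h], 4 * 3600) < 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} disk {{ $labels.mountpoint }} is predicted to fill up within 4h"

  # Average share of CPU time spent waiting on I/O, across all cores, above 20%.
  - alert: HighDiskIOWait
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.2
    for: 10m
    labels:
      severity: warning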

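predict_linear is smarter than a static threshold because it uses the recent trend to estimate when the disk will fill up. The DiskFillingUp rule fits a line to the last hour of data and fires if free space is projected to drop below zero within 4 hours, which leaves time to expand the volume or delete old files before the problem occurs.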
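To make the idea concrete, the following Python sketch does a least-squares linear extrapolation over (timestamp, value) samples. It is a simplified illustration of the technique, not Prometheus's actual implementation; the function name `predict_linear_sketch` and the sample data are made up for the example.

```python
from typing import Sequence, Tuple


def predict_linear_sketch(samples: Sequence[Tuple[float, float]],
                          horizon_s: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    return the predicted value horizon_s seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t  # change in value per second
    last_t = samples[-1][0]
    # The fitted line passes through (mean_t, mean_v).
    return mean_v + slope * (last_t + horizon_s - mean_t)


# One hour of samples, 15 minutes apart: free space starts at 10 GB
# and shrinks by 1 MB per second (hypothetical numbers).
samples = [(float(t), 10_000_000_000.0 - 1_000_000.0 * t)
           for t in range(0, 3601, 900)]
predicted = predict_linear_sketch(samples, 4 * 3600)
print("alert" if predicted < 0 else "ok")
```

In this made-up series the disk still has 6.4 GB free at the last sample, so a static "less than 10% free" threshold might not fire yet, while the extrapolated value is already negative.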

Uptime Measurement

Uptime percentage measures the share of total time that a system is available. It is commonly reported in “nines”, for example 99.9% = “three nines”.

Uptime	Downtime/year	Downtime/month
99% (two nines)	3.65 days	7.31 hours
99.9% (three nines)	8.77 hours	43.8 min
99.99% (four nines)	52.6 min	4.38 min
99.999% (five nines)	5.26 min	26.3 sec

Whether a system really achieves 99.9% uptime has to be measured with an external check, not from the server itself: while a server is down or rebooting, it cannot observe or record its own downtime.
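The downtime figures in the table follow directly from the percentage. The Python sketch below reproduces the arithmetic, assuming a year of 365.25 days and a month of one twelfth of that, and shows how an uptime percentage could be derived from external check results. The function names and the check counts are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600      # assumption: average year length
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12  # assumption: average month length


def allowed_downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Downtime budget for a target uptime percentage over a period."""
    return period_seconds * (1 - uptime_percent / 100)


def measured_uptime_percent(total_checks: int, failed_checks: int) -> float:
    """Uptime as the share of external checks that succeeded."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100 * (1 - failed_checks / total_checks)


# Three nines: about 8.77 hours per year and 43.8 minutes per month.
print(round(allowed_downtime_seconds(99.9, SECONDS_PER_YEAR) / 3600, 2))
print(round(allowed_downtime_seconds(99.9, SECONDS_PER_MONTH) / 60, 1))

# One check per minute for 30 days (43,200 checks) with 43 failures.
print(measured_uptime_percent(43_200, 43) >= 99.9)
```

With one check per minute, a 30-day window at three nines tolerates roughly 43 failed checks, which shows how little room the budget leaves.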

Multi-Region External Check

A check from a single location can produce false positives when the network between the monitor and the target has problems. Run probes from several regions to tell “the server is down” apart from “the network path is down”.

# blackbox.yml - Prometheus Blackbox Exporter
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      # Blackbox Exporter compares against the protocol string, which is "HTTP/2.0" for HTTP/2.
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 301, 302]
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: false

  tcp_connect:
    prober: tcp
    timeout: 3s

  icmp:
    prober: icmp
    timeout: 3s

Deploy Blackbox Exporter on VPS instances in different regions (Singapore, Japan, Europe), then scrape all of their metrics into a single Prometheus so the results can be compared.
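One way to compare the regional results is sketched below in Python: the target is treated as down only when every region fails, and a partial failure is flagged as a likely network-path issue. The function name, the state labels, and the region keys are assumptions for this example; in practice the same logic could live in an alert expression.

```python
from typing import Mapping


def target_state(results: Mapping[str, bool]) -> str:
    """Combine per-region probe results (region -> probe succeeded)."""
    if not results:
        raise ValueError("at least one region result is required")
    failures = [region for region, ok in results.items() if not ok]
    if not failures:
        return "up"
    if len(failures) == len(results):
        # Every vantage point fails: most likely the server itself is down.
        return "down"
    # Only some regions fail: more likely a problem on the network path.
    return "network-issue"


print(target_state({"singapore": True, "japan": True, "europe": True}))
print(target_state({"singapore": False, "japan": True, "europe": True}))
print(target_state({"singapore": False, "japan": False, "europe": False}))
```

Paging only on the "down" state and treating "network-issue" as a lower-severity notification is one way to keep single-region blips from waking the on-call engineer.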

Dashboards and Visualization

A good infrastructure dashboard is laid out so that it can be read on a single screen. In Grafana you can import community dashboards such as Node Exporter Full (ID: 1860) and Blackbox Exporter (ID: 7587), which offer broad metric coverage and variable filters by instance.

The essential panels are an overview (health status summary), resource usage (CPU/RAM/disk), network throughput, error rate, and a 30-day uptime trend. Threshold markers on gauges show how close a value is to critical.

Best Practices

Baseline before alerting: observe normal metrics for 1-2 weeks before setting thresholds, so that they reflect real patterns.
Group by role: keep web, db, and cache servers in separate groups because their patterns differ.
Use SLOs rather than static thresholds: “response time p99 < 500ms” is a better target than “CPU < 80%”.
Retention policy: keep raw data for 15 days, 5m-resolution data for 30 days, and 1h-resolution data for 1 year to save disk space.
Heartbeat alert: with passive checks you need a “dead man's switch” that alerts when an agent goes silent.
Separate monitoring server: never run the monitoring server on the same machine as production, because if production goes down it cannot send alerts.
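The heartbeat item can be sketched as a small staleness tracker. This Python example is a minimal illustration, assuming agents report by calling `beat` and that silence longer than `max_silence_s` should raise an alert. The class and agent names are hypothetical.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Track the last heartbeat per agent and report agents that went silent."""

    def __init__(self, max_silence_s: float) -> None:
        self.max_silence_s = max_silence_s
        self._last_seen: Dict[str, float] = {}

    def beat(self, agent: str, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected for testing."""
        self._last_seen[agent] = time.time() if now is None else now

    def silent_agents(self, now: Optional[float] = None) -> List[str]:
        """Return agents whose last heartbeat is older than max_silence_s."""
        current = time.time() if now is None else now
        return sorted(agent for agent, seen in self._last_seen.items()
                      if current - seen > self.max_silence_s)


monitor = HeartbeatMonitor(max_silence_s=120)
monitor.beat("web-1", now=1000.0)
monitor.beat("db-1", now=1100.0)
# At t=1150, web-1 has been silent for 150 s and db-1 for 50 s.
print(monitor.silent_agents(now=1150.0))
```

The tracker itself should run on the separate monitoring server, otherwise it goes silent together with the agents it is supposed to watch.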

Summary

Good infrastructure monitoring covers all four layers: hardware, OS, service, and uptime. Choosing a tool suited to each layer reduces complexity and maintenance overhead. Using active checks for regular metrics and passive checks for events gives the system both visibility and reliability.

The key indicator is uptime percentage measured by external probes in multiple regions, not by the server itself. A sound setup keeps the monitoring server separate from production and bases alert rules on a real baseline rather than fixed thresholds, which reduces false positives and lets the on-call team focus on the problems that matter.