Synthetic Monitoring: Simulating User Interactions for Proactive Monitoring

In a world of complex web applications and APIs, waiting for real users to hit a problem before fixing it is no longer a viable option: every minute of downtime can mean lost revenue and eroded customer trust. Synthetic monitoring is a technique for detecting problems before users do, by simulating user interactions on a regular schedule from multiple locations around the world.

This article explains the principles of synthetic monitoring, how it differs from Real User Monitoring (RUM), the types of synthetic checks (from a simple HTTP ping to end-to-end browser automation), and popular tools such as Playwright, Selenium, Datadog Synthetics, and Checkly. It also includes example scripts for simulating a login flow, a checkout journey, and API contract testing, plus guidance on setting alert thresholds that reduce false positives.

What Is Synthetic Monitoring?

Synthetic monitoring uses automated scripts to simulate user actions, such as visiting a website, logging in, placing an order, or calling an API, without any real user being involved. The scripts run on a fixed schedule (for example every 1, 5, or 15 minutes) from several regions operated by the monitoring provider and record the results, including response time, status code, DOM elements, and screenshots, so that the SRE/DevOps team knows right away when a service is down or unusually slow.

The core idea is proactive detection. Instead of waiting for real users to hit a problem and report it, we create our own "synthetic users" that exercise the system's critical paths 24 hours a day. Problems can then be detected within about 1-5 minutes, depending on the check interval, and the team can be alerted before users are widely affected.

Synthetic Monitoring vs Real User Monitoring (RUM)

Both techniques are essential parts of an observability stack, but they answer different questions: synthetic monitoring emphasizes consistency and proactive detection, while RUM emphasizes real-world experience and reactive insight.

Aspect	Synthetic Monitoring	Real User Monitoring (RUM)
Data source	Simulated scripts run from monitoring nodes	Real users' browsers/apps
Detection timing	Proactive: known before users are affected	Reactive: known after users are affected
Coverage	Only the predefined paths	Every path that users actually take
Data volume	Low and controllable	High, depends on traffic
Used to verify	Availability, SLA, baseline performance	User behavior, device/browser diversity
Cost model	Per check per month	Per session per month

Use both together: synthetic monitoring for critical business flows (login, checkout, payment) and RUM for understanding real user experience across dimensions such as geography, device, and page performance.

Types of Synthetic Checks

1. Uptime / Availability Check

The most basic type of check: send an HTTP request to a URL and verify that it responds with 200 OK within a set time limit. It suits homepages, landing pages, and health endpoints such as /healthz or /api/ping.

# Simple check with curl (-f makes curl exit non-zero on HTTP errors >= 400)
curl -f -s -o /dev/null -w "%{http_code} %{time_total}\n" \
  https://de.co.th/healthz

# If the exit code != 0 or time_total > 2 seconds → fail
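
The pass/fail rule in the comment above can be made explicit in code. The following is a minimal sketch, not part of any monitoring tool: `evaluateCurlLine` and `UptimeResult` are hypothetical names, and the function only interprets one line printed by the `-w "%{http_code} %{time_total}\n"` format used above, applying the 200 OK and 2-second rule.

```typescript
// Hypothetical helper: interprets one line of curl output in the
// "%{http_code} %{time_total}" format and applies the rule described above.
interface UptimeResult {
  ok: boolean;
  reason: string;
}

function evaluateCurlLine(line: string, maxSeconds = 2): UptimeResult {
  const [codeText, timeText] = line.trim().split(/\s+/);
  const status = Number(codeText);
  const seconds = Number(timeText);

  // Reject anything that is not "<integer> <number>".
  if (!Number.isInteger(status) || !Number.isFinite(seconds)) {
    return { ok: false, reason: `unparseable output: "${line.trim()}"` };
  }
  // The check expects exactly 200 OK.
  if (status !== 200) {
    return { ok: false, reason: `unexpected status ${status}` };
  }
  // Too slow counts as a failure even when the status is fine.
  if (seconds > maxSeconds) {
    return { ok: false, reason: `took ${seconds}s, limit is ${maxSeconds}s` };
  }
  return { ok: true, reason: "healthy" };
}
```

A scheduler could pipe the curl output into this function and raise an alert whenever `ok` is false; the `reason` string is meant for the alert message.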

2. API / Contract Check

Tests an API endpoint by sending a request with a payload and verifying that the response status, response time, and JSON schema match what the contract specifies. It can also test a chain of API calls in sequence, such as login → get token → query data.

{
  "name": "Login API",
  "request": {
    "method": "POST",
    "url": "https://api.example.com/login",
    "headers": { "Content-Type": "application/json" },
    "body": { "email": "[email protected]", "password": "***" }
  },
  "assertions": [
    { "source": "status_code", "comparison": "equals", "value": 200 },
    { "source": "response_time", "comparison": "less_than", "value": 800 },
    { "source": "json_body", "property": "$.token", "comparison": "not_empty" }
  ]
}
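
To make the assertion list above concrete, here is a minimal sketch of how such assertions could be evaluated against an observed response. It is an illustration under stated assumptions, not the engine of any particular tool: the type and function names are hypothetical, only the three comparisons used above are supported, and `resolvePath` handles plain dotted paths such as `$.token` rather than full JSONPath.

```typescript
type Comparison = "equals" | "less_than" | "not_empty";

interface Assertion {
  source: "status_code" | "response_time" | "json_body";
  comparison: Comparison;
  value?: number | string;
  property?: string; // simple dotted path such as "$.token"
}

interface ObservedResponse {
  statusCode: number;
  responseTimeMs: number;
  jsonBody: unknown;
}

// Resolves plain dotted paths ("$.a.b") only; not a full JSONPath implementation.
function resolvePath(body: unknown, path: string): unknown {
  const keys = path.replace(/^\$\.?/, "").split(".").filter((k) => k.length > 0);
  let current: unknown = body;
  for (const key of keys) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Picks the observed value that an assertion refers to.
function observedValue(a: Assertion, res: ObservedResponse): unknown {
  if (a.source === "status_code") return res.statusCode;
  if (a.source === "response_time") return res.responseTimeMs;
  return resolvePath(res.jsonBody, a.property ?? "$");
}

function assertionPasses(a: Assertion, res: ObservedResponse): boolean {
  const actual = observedValue(a, res);
  if (a.comparison === "equals") return actual === a.value;
  if (a.comparison === "less_than") {
    return typeof actual === "number" && typeof a.value === "number" && actual < a.value;
  }
  // Remaining case: "not_empty"
  return actual !== undefined && actual !== null && actual !== "";
}

// Returns the assertions that failed; an empty array means the check passed.
function failedAssertions(assertions: Assertion[], res: ObservedResponse): Assertion[] {
  return assertions.filter((a) => !assertionPasses(a, res));
}
```

Returning the failed assertions rather than a single boolean lets the alert say which part of the contract broke (status, latency, or body).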

3. Browser / End-to-End Check

Runs a script in a headless browser (Chromium, Firefox) that simulates a real user: click, type, navigate, wait for element. It suits critical user journeys such as signup, checkout, and booking. It can verify visual output, JavaScript errors, and network requests, and can capture a screenshot or video on failure.

4. Transaction / Multi-Step Check

Combines several actions to simulate an entire business flow, such as search → add to cart → checkout → payment → confirmation. It is usually built on a browser check and measures timing per step so the team can see where the bottleneck is.
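
Per-step timing can be sketched with a small helper that is called after each step of the flow. This is an illustrative sketch only: `StepTimer` is a hypothetical class, not an API of Playwright or any monitoring product, and the clock is injectable so the logic can be exercised without a browser.

```typescript
interface StepTiming {
  name: string;
  durationMs: number;
}

class StepTimer {
  private readonly now: () => number;
  private last: number;
  private readonly timings: StepTiming[] = [];

  // `now` is injectable so the timer can be driven by a fake clock.
  constructor(now: () => number = Date.now) {
    this.now = now;
    this.last = now();
  }

  // Call right after a step finishes; records the time since the previous mark.
  mark(name: string): void {
    const current = this.now();
    this.timings.push({ name, durationMs: current - this.last });
    this.last = current;
  }

  // The step that took longest, i.e. the likely bottleneck.
  slowest(): StepTiming | undefined {
    return this.timings.reduce<StepTiming | undefined>(
      (worst, t) => (worst === undefined || t.durationMs > worst.durationMs ? t : worst),
      undefined,
    );
  }

  report(): StepTiming[] {
    return [...this.timings];
  }
}
```

In a browser script, `mark('search')`, `mark('add to cart')`, and so on would follow the corresponding awaited actions, and `report()` could be attached to the check result.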

5. SSL / Certificate Check

Checks an SSL/TLS certificate's validity period, issuer, cipher suite, and expiry date so that a certificate never expires unnoticed. This check is often forgotten but very important.
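
The expiry part of this check comes down to date arithmetic. The sketch below assumes the certificate's expiry date has already been obtained (in Node.js, for example, from the `valid_to` field returned by `getPeerCertificate()` on a TLS socket); the function names and the 30-day and 7-day thresholds are illustrative assumptions, not values from the article.

```typescript
type CertStatus = "ok" | "warning" | "critical" | "expired";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate's notAfter date.
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// Thresholds are assumptions: warn under 30 days, critical under 7 days.
function classifyCertificate(
  notAfter: Date,
  now: Date,
  warnDays = 30,
  criticalDays = 7,
): CertStatus {
  if (notAfter.getTime() <= now.getTime()) return "expired";
  const days = daysUntilExpiry(notAfter, now);
  if (days < criticalDays) return "critical";
  if (days < warnDays) return "warning";
  return "ok";
}
```

Alerting at the warning level leaves time to renew during working hours instead of during an outage.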

Example: Browser Check with Playwright

Playwright is a highly popular browser automation framework. It supports Chromium, Firefox, and WebKit, and has an easy-to-write API with good async/await support. The example below is a script that simulates the login flow of an e-commerce website.

// synthetic-login.spec.ts
import { test, expect } from '@playwright/test';

test('login flow critical path', async ({ page }) => {
  const start = Date.now();

  // Step 1: open the login page
  await page.goto('https://shop.example.com/login');
  await expect(page).toHaveTitle(/Login/);

  // Step 2: fill in the credentials
  await page.fill('input[name="email"]', process.env.SYN_EMAIL!);
  await page.fill('input[name="password"]', process.env.SYN_PASS!);

  // Step 3: click the Login button
  await page.click('button[type="submit"]');

  // Step 4: wait for the redirect to the dashboard
  await page.waitForURL('**/dashboard', { timeout: 10000 });

  // Step 5: verify that the user info loaded successfully
  const userName = await page.textContent('[data-testid="user-name"]');
  expect(userName).toBeTruthy();

  // Measure the total duration
  const duration = Date.now() - start;
  console.log(`Login flow completed in ${duration}ms`);

  // Capture a screenshot as evidence
  await page.screenshot({ path: 'login-success.png', fullPage: true });
});

This script can run headless in CI/CD or on a cron schedule every 5 minutes. On failure, Playwright can keep a screenshot, video, and trace file (when these are enabled in the Playwright configuration), which helps the team debug quickly.
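
Failure artifacts are not produced by the test file on its own; they are controlled by the Playwright configuration. A minimal configuration sketch, assuming a standard `playwright.config.ts` at the project root, might look like this:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Keep artifacts only for failing runs to limit storage use.
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
    trace: 'retain-on-failure',
  },
});
```

Keeping artifacts only on failure is a trade-off: passing runs leave nothing behind, which suits a check that runs every few minutes.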

Example: API Check with Node.js

// synthetic-api-check.js
// Requires Node.js 18+, where fetch is available globally (no imports needed).

async function checkApi() {
  const start = Date.now();
  const res = await fetch('https://api.example.com/v1/products?limit=10', {
    headers: { 'Authorization': `Bearer ${process.env.API_TOKEN}` }
  });
  const duration = Date.now() - start;

  if (res.status !== 200) {
    throw new Error(`Expected 200 got ${res.status}`);
  }
  if (duration > 500) {
    throw new Error(`Response time ${duration}ms exceeded 500ms threshold`);
  }

  const body = await res.json();
  if (!Array.isArray(body.data) || body.data.length === 0) {
    throw new Error('Expected non-empty product list');
  }

  console.log(`OK — ${duration}ms, ${body.data.length} items`);
}

checkApi().catch(err => {
  console.error('Synthetic check failed:', err.message);
  process.exit(1);
});

Popular Synthetic Monitoring Tools

Tool	Strengths	Best for
Datadog Synthetics	Deep integration with APM/RUM, easy-to-use UI, recorder for building tests	Organizations already using Datadog
Checkly	Monitoring as Code with Playwright/TypeScript, Terraform provider	DevOps teams that like to version-control everything
New Relic Synthetic	Integrates with APM/RUM, Scripted Browser (Node.js)	Organizations using New Relic
Grafana Synthetic Monitoring	Open source + SaaS, k6 scripting	Teams using the Grafana stack
Pingdom	Simple, quick to start, uptime + transaction checks	Small teams that need basic monitoring
Uptime Kuma	Self-hosted, open source, free	Organizations that need an on-prem solution
Playwright + CI	DIY with GitHub Actions or cron + alerting	Teams with their own infrastructure

Choosing Locations and Frequency

The right settings depend on the nature of the application and its target users. General guidelines:

Location: choose at least 3 regions that cover your main users, such as Singapore, Tokyo, and Frankfurt for a business serving Asia and Europe, so that region-specific problems are caught.
Frequency: run critical flows such as login and checkout every 1-5 minutes; run secondary flows such as search and contact forms every 10-15 minutes.
Retry policy: when a check fails, retry at least 2 times before alerting, to reduce false positives from transient network glitches.
Alert threshold: alert when the check fails in >= 2 of 3 regions, to distinguish a localized issue from a global outage.
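
The retry and multi-region rules above combine into a simple decision function. The sketch below is one possible reading of those guidelines, with hypothetical names (`RegionRun`, `shouldAlertOnQuorum`): a region counts as failing only if the initial run and all of its retries failed, and an alert fires only when enough regions are failing.

```typescript
interface RegionRun {
  region: string;
  // Outcome of each attempt in order: the initial run, then any retries (true = passed).
  attempts: boolean[];
}

// A region is failing only if every attempt, including retries, failed.
function regionIsFailing(run: RegionRun): boolean {
  return run.attempts.length > 0 && run.attempts.every((passed) => !passed);
}

function failingRegions(runs: RegionRun[]): string[] {
  return runs.filter(regionIsFailing).map((r) => r.region);
}

// Alert only when at least `minFailingRegions` regions are failing
// (2 of 3 in the guideline above).
function shouldAlertOnQuorum(runs: RegionRun[], minFailingRegions = 2): boolean {
  return failingRegions(runs).length >= minFailingRegions;
}
```

A single failing region is still worth recording, for example as a low-priority notification, since it may point to a localized network or CDN issue.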

Synthetic Monitoring as Code

Managing checks through a UI can be fast at first, but as the number of checks and the size of the team grow, writing them as code under version control makes them easier to audit, review, and roll back. The example below uses the Checkly CLI.

// __checks__/checkout.check.ts
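import { BrowserCheck, Frequency, RetryStrategyBuilder } from 'checkly/constructs';
import * as path from 'path';
// Assumption: slackSre and pagerdutyCritical are AlertChannel constructs
// (e.g. SlackAlertChannel, PagerdutyAlertChannel) defined in a local module.
import { slackSre, pagerdutyCritical } from './alert-channels';

new BrowserCheck('checkout-flow', {
  name: 'Checkout Flow Critical Path',
  frequency: Frequency.EVERY_5M,
  locations: ['ap-southeast-1', 'ap-northeast-1', 'eu-central-1'],
  code: {
    entrypoint: path.join(__dirname, 'checkout.spec.ts')
  },
  // alertChannels takes AlertChannel construct instances, not plain strings
  alertChannels: [slackSre, pagerdutyCritical],
  runtimeId: '2024.02',
  // Retry strategies are built with RetryStrategyBuilder
  retryStrategy: RetryStrategyBuilder.linearStrategy({ baseBackoffSeconds: 30, maxRetries: 2 })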
});

This way, checks live in the same git repo as the application code: they can be reviewed in PRs, run through CI/CD, and deployed together with feature releases.

Managing Test Data and Credentials

Synthetic checks need real credentials and data to run, which calls for care around security and data integrity.

Create a dedicated account for synthetic checks (e.g. [email protected]), separate from real users.
Flag the account/orders so they do not trigger downstream systems, e.g. no emails sent and no real payment gateway calls.
Store credentials in a secret manager (AWS Secrets Manager, HashiCorp Vault); do not hardcode them in scripts.
Use the gateway's test payment methods (Stripe test cards, PayPal sandbox) wherever checkout must be tested.
Periodically clean up data created by synthetic checks (e.g. orders, tickets) so it does not pollute production data.
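
The flagging and clean-up points above can be sketched as a selection step that a periodic job might run. Everything here is hypothetical: the `OrderRecord` shape, the `isSynthetic` flag, and the 24-hour retention default are assumptions for illustration, and a real system would apply the same filter in its own data store.

```typescript
interface OrderRecord {
  id: string;
  isSynthetic: boolean; // assumed flag set when the synthetic account creates the order
  createdAt: Date;
}

const MS_PER_HOUR = 60 * 60 * 1000;

// Returns IDs of synthetic orders older than the retention period.
// Real customer orders are never selected, whatever their age.
function selectSyntheticForCleanup(
  orders: OrderRecord[],
  now: Date,
  maxAgeHours = 24,
): string[] {
  const cutoff = now.getTime() - maxAgeHours * MS_PER_HOUR;
  return orders
    .filter((o) => o.isSynthetic && o.createdAt.getTime() < cutoff)
    .map((o) => o.id);
}
```

Keeping recent synthetic records for a while is deliberate in this sketch: they are often needed when debugging a failed check.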

Reducing False Positives

False positives are the main enemy of synthetic monitoring: alerts that do not reflect a real problem wear the team down until they start ignoring all alerts (alert fatigue). There are several ways to reduce them.

Retry before alerting: do not alert on a single failure; rerun 2-3 times and alert only if the check keeps failing.
Multi-region validation: require failures from >= 2 regions before treating it as a global outage.
Context-aware thresholds: keep thresholds flexible rather than overly strict, e.g. alert on a response-time percentile instead of an absolute ms value from a single run.
Maintenance window: mute alerts during deploys or scheduled maintenance.
Dependency-aware alerts: if an upstream service is down, do not fire an alert for every check that depends on it at the same time.
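
Two of these ideas, percentile thresholds and maintenance windows, are easy to illustrate together. The sketch below uses the nearest-rank percentile definition and hypothetical names (`percentile`, `inMaintenance`, `shouldAlertOnLatency`); the window shape and the default of the 95th percentile are assumptions.

```typescript
interface MaintenanceWindow {
  start: Date;
  end: Date;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p * sorted.length) / 100)));
  return sorted[rank - 1] as number;
}

// True when `at` falls inside any window (start inclusive, end exclusive).
function inMaintenance(at: Date, windows: MaintenanceWindow[]): boolean {
  const t = at.getTime();
  return windows.some((w) => t >= w.start.getTime() && t < w.end.getTime());
}

// Alert on the percentile of recent runs, never on one slow run,
// and stay silent during maintenance.
function shouldAlertOnLatency(
  recentMs: number[],
  thresholdMs: number,
  at: Date,
  windows: MaintenanceWindow[] = [],
  p = 95,
): boolean {
  if (inMaintenance(at, windows)) return false;
  return percentile(recentMs, p) > thresholdMs;
}
```

With 20 recent samples, a single 2500 ms outlier does not move the 95th percentile, while two such outliers do, which is the behavior the percentile rule is meant to give.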

Measurement and SLOs

Synthetic checks are well suited to measuring availability SLOs directly because they run at a consistent rate. Example definitions:

Availability SLO: synthetic uptime >= 99.9% per month (at most about 43 minutes of failure in a 30-day month)
Latency SLO: 95% of synthetic checks must complete within 2 seconds
Error budget: use synthetic results to calculate the budget remaining each month
MTTR: measure the average time from the first synthetic failure until the check turns green again
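
The arithmetic behind these definitions is short enough to sketch. The function names and the `CheckSummary` and `Incident` shapes below are illustrative assumptions; the 43-minute figure follows from 30 days × 24 hours × 60 minutes × 0.1% = 43.2 minutes.

```typescript
// Downtime allowed by an availability SLO over a period, in minutes.
function allowedDowntimeMinutes(sloPercent: number, daysInPeriod = 30): number {
  return (daysInPeriod * 24 * 60 * (100 - sloPercent)) / 100;
}

interface CheckSummary {
  totalRuns: number;
  failedRuns: number;
}

function availabilityPercent(s: CheckSummary): number {
  if (s.totalRuns === 0) throw new Error("no runs recorded");
  return ((s.totalRuns - s.failedRuns) / s.totalRuns) * 100;
}

// Failed runs still permitted before the SLO is breached; negative means overspent.
function errorBudgetRemaining(s: CheckSummary, sloPercent: number): number {
  const allowedFailures = (s.totalRuns * (100 - sloPercent)) / 100;
  return allowedFailures - s.failedRuns;
}

interface Incident {
  firstFailure: Date; // first failing synthetic run
  recovered: Date;    // first passing run afterwards
}

// Mean time to recovery across incidents, in minutes.
function mttrMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.recovered.getTime() - i.firstFailure.getTime()),
    0,
  );
  return totalMs / incidents.length / 60000;
}
```

For a check that runs every 5 minutes, a 30-day month has 8640 runs, so a 99.9% SLO leaves room for fewer than 9 failed runs.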

Common Pitfalls

Testing only the happy path: most checks cover only the normal case and skip edge cases such as an expired credit card or a sold-out product.
Brittle selectors: using CSS classes that change with each build (e.g. .btn-a7f3) breaks the script whenever the frontend updates; use data-testid instead.
Incomplete coverage: focusing only on the homepage and not testing the real critical business flows.
Stale scripts: when the UI changes and the script is not updated, the check fails and ends up being ignored.
Flaky tests: scripts that sometimes pass and sometimes fail; fix the root cause rather than just adding retries.
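
Before fixing a flaky check, it helps to tell flakiness apart from a real outage. One rough heuristic, sketched below with hypothetical names and an arbitrary 0.3 threshold, is the flip rate: how often consecutive runs switch between pass and fail. A real outage produces a long run of failures with few flips, while a flaky script flips often.

```typescript
// Fraction of consecutive run pairs whose outcome changed (true = passed).
function flipRate(history: boolean[]): number {
  if (history.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }
  return flips / (history.length - 1);
}

// The threshold is an assumption; tune it against your own check history.
function looksFlaky(history: boolean[], threshold = 0.3): boolean {
  return flipRate(history) > threshold;
}
```

This only flags candidates for investigation; the root cause (timing assumptions, shared test data, brittle selectors) still has to be found in the script itself.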

Summary

Synthetic monitoring is the first line of defense in an observability system: it lets the team learn about problems before real users do, by running scripts that simulate user journeys on a regular schedule from multiple regions. It is easy to start with an uptime check and then expand to API checks and end-to-end browser checks for critical business flows.

Use synthetic monitoring together with RUM to get both proactive detection and real-world insight, and prioritize reducing false positives through multi-region validation, retry policies, and maintenance windows. For larger teams, write checks as code and keep them in version control for quality control and easy rollback.