Skills › observability-monitoring-monitor-setup

observability-monitoring-monitor-setup

Safe

Set up comprehensive monitoring and observability

Implementing monitoring from scratch is complex and error-prone. This skill provides proven patterns for metrics, tracing, and logging that reduce MTTR and give full system visibility.

Supports: Claude Code (CC)
🥉 74 Bronze
1. Download the skill ZIP
2. Upload it to Claude: go to Settings → Capabilities → Skills → Upload skill
3. Enable it and start using

Try It

Using "observability-monitoring-monitor-setup". Set up Prometheus scraping for a Kubernetes cluster with automatic pod discovery

Expected result:

  • Prometheus configuration with kubernetes_sd_configs for auto-discovery
  • Pod annotations required for scrape targeting
  • Relabel rules to filter and tag discovered targets
  • Verification steps to confirm scraping is working
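The expected output above can be sketched as a minimal `prometheus.yml` fragment. The `prometheus.io/*` annotation keys are the common community convention rather than part of this skill, and the job name is made up for illustration:

```yaml
# prometheus.yml (fragment) — scrape pods that opt in via annotations
scrape_configs:
  - job_name: kubernetes-pods          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                      # discover every pod in the cluster
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Let pods override the metrics path via prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Tag discovered targets with namespace and pod name
      - source_labels: [__meta_kubernetes_pod_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

To verify, check Status → Targets in the Prometheus UI and confirm annotated pods appear with state UP.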

Using "observability-monitoring-monitor-setup". Create an alert for memory usage exceeding 90%

Expected result:

  • PromQL expression using container_memory_working_set_bytes
  • Alert rule with appropriate thresholds and duration
  • Runbook steps for investigating memory pressure
  • Grafana panel query to visualize memory trends

Security Audit

Safe
v1 • 2/24/2026

This skill contains documentation and code samples for monitoring setup. All static-analysis findings are false positives: backticks are Markdown code-block delimiters, not shell execution; URLs are internal service endpoints; and environment-variable usage follows standard configuration patterns. No malicious patterns detected.

  • 2 files analyzed
  • 557 lines analyzed
  • 0 findings
  • 1 audit total

No security issues found
Audited by: claude

Quality Score

  • Architecture: 38
  • Maintainability: 100
  • Content: 87
  • Community: 50
  • Security: 100
  • Spec compliance: 91

What You Can Build

Greenfield Service Monitoring

Set up complete observability stack for a new microservice from day one with metrics, tracing, and logging.

Production Incident Response

Create actionable dashboards and alerts to reduce MTTR and enable proactive issue detection.

SLO Definition and Tracking

Define service level objectives with error budgets and implement burn rate monitoring for reliability engineering.
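As a sketch of burn-rate monitoring for a 99.9% target over 30 days: a 14.4x burn rate exhausts the error budget in roughly two days, and pairing a long window with a short one reduces flapping. The `http_requests_total` metric and its `status` label are assumptions about your instrumentation:

```yaml
# slo-burn-rate.yml — fast-burn alert for a 99.9% availability SLO
groups:
  - name: slo-burn-rate                 # hypothetical group name
    rules:
      - alert: ErrorBudgetFastBurn
        # Fire only when both the 1h and the 5m error ratios exceed
        # 14.4x the 0.1% budget, i.e. a ratio of 0.0144.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
          )
        labels:
          severity: critical
```

Slower-burn tiers (e.g. 6x over 6h, 1x over 3d) can be added as warning-level rules following the same pattern.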

Try These Prompts

Basic Metrics Setup
Help me add Prometheus metrics to my Node.js API. I need request count, error rate, and latency tracking. Show me the prom-client setup and how to expose a /metrics endpoint.
Grafana Dashboard Creation
Create a Grafana dashboard JSON for my payment service showing the four golden signals. Include panels for request rate, error rate, p95/p99 latency, and saturation metrics.
Alert Configuration
I need alerting rules for high error rate (>5% for 5 minutes) and slow response time (p95 >1s for 10 minutes). Configure Alertmanager to route critical alerts to PagerDuty and warnings to Slack.
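The routing half of the prompt above might look like this Alertmanager sketch; the integration key and webhook URL are placeholders, and the receiver names are made up:

```yaml
# alertmanager.yml (fragment) — critical → PagerDuty, everything else → Slack
route:
  receiver: slack-warnings              # default receiver
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-critical
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-warnings
    slack_configs:
      - api_url: <slack-webhook-url>               # placeholder
        channel: '#alerts'
```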
SLO Implementation
Define SLOs for my API with 99.9% availability target over 30 days. Show me how to calculate error budget, set up multi-window burn rate alerts, and create Grafana panels for SLO tracking.

Best Practices

  • Use histogram buckets aligned with your SLO targets for accurate percentile calculation
  • Add consistent labels (service, environment, version) to all metrics for effective filtering
  • Test alerts against historical data to minimize false positives before enabling notifications
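The first bullet can be illustrated with a recording rule: `histogram_quantile` only reports an accurate p95 near a threshold if a bucket boundary sits at that threshold. The 1s target and the metric name here are assumptions:

```yaml
# recording-rules.yml — precompute p95 latency per service
# Accurate near the SLO only if the histogram has a bucket boundary
# at the 1s target (e.g. buckets: [0.1, 0.25, 0.5, 1, 2.5, 5]).
groups:
  - name: latency                       # hypothetical group name
    rules:
      - record: service:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (service, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
```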

Avoid

  • Monitoring everything without clear ownership leads to alert fatigue and ignored pages
  • Using average latency instead of percentiles hides tail latency problems affecting users
  • Setting up dashboards before defining what questions they need to answer wastes effort

Frequently Asked Questions

How do I choose the right scrape interval for my metrics?
Start with 15s for most services. Use 5s for latency-sensitive systems or when debugging. Avoid intervals below 5s as they increase Prometheus load without proportional benefit.
Should I trace every request or sample?
Sample in production. Use head-based sampling (e.g., 10% of requests) for high-traffic services. Trace 100% in staging. Always trace errors regardless of sampling rate.
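For head-based sampling in practice, one option is the OpenTelemetry Collector's probabilistic sampler. This fragment assumes an OTLP pipeline and should be checked against the Collector version you run:

```yaml
# otel-collector.yml (fragment) — head-based 10% trace sampling
processors:
  probabilistic_sampler:
    sampling_percentage: 10
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [otlp]
```

Note that keeping 100% of errors regardless of sampling rate generally requires tail-based sampling (e.g. the Collector's `tail_sampling` processor), since a head-based sampler decides before the outcome of the request is known.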
What is the difference between RED and USE monitoring?
RED (Rate, Errors, Duration) is for user-facing services. USE (Utilization, Saturation, Errors) is for infrastructure resources. Use RED for application monitoring, USE for nodes and databases.
How do I set meaningful SLO targets?
Base targets on user expectations and business requirements, not current performance. Start conservative (99%) and tighten as reliability improves. Measure over 28-30 day windows.
Do I need all three pillars (metrics, logs, traces) from day one?
Start with metrics: they are the cheapest pillar and answer 'what is broken'. Add logging for 'why it broke'. Add tracing for distributed systems when debugging cross-service issues becomes difficult.
How long should I retain monitoring data?
Keep high-resolution metrics (raw samples) for 15-30 days for debugging. Use downsampling or recording rules for long-term trends. Store logs based on compliance requirements, typically 90 days minimum.

Developer Details

File structure