slo-implementation

Implement SLOs and Error Budgets for SRE

Also available from: wshobson

Balance reliability with feature velocity using measurable SLO targets and error budgets. Stop debating reliability and start measuring it with industry-standard SRE practices.

Supports: Claude Codex Code(CC)
🥉 75 Bronze
1. Download the skill ZIP
2. Upload it in Claude: go to Settings → Capabilities → Skills → Upload skill
3. Activate it and start using it

Try it

Using "slo-implementation". Help me define SLIs and SLOs for my API service. I need availability and latency targets.

Expected output:

Here are recommended SLIs for your API:

Availability SLI:
- Successful requests / Total requests
- Target: 99.9% (allows 43.2 min of downtime per 30-day month)
- PromQL: sum(rate(http_requests_total{status!~'5..'}[28d])) / sum(rate(http_requests_total[28d]))

Latency SLI (requests served within 500 ms):
- Requests below threshold / Total requests
- Target: 99% (1% of requests can be slow)
- PromQL: sum(rate(http_request_duration_seconds_bucket{le='0.5'}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

Error Budget:
- 0.1% budget = 43.2 minutes/month
- Implement burn rate alerts to detect when consuming budget too quickly
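The 43.2-minute figure follows from the target and the window length. A minimal Python sketch of that arithmetic is shown here; the function name `allowed_downtime_minutes` is hypothetical, and the sketch assumes the budget is spent only through full downtime.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based SLO tolerates over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    # The error budget is the share of the window that may be "bad".
    return (1.0 - slo_target) * window_minutes


# 99.9% over a 30-day month, then over the 28-day range used in the PromQL above.
print(round(allowed_downtime_minutes(0.999), 2))
print(round(allowed_downtime_minutes(0.999, window_days=28), 2))
```

Over a 28-day window the same 99.9% target allows 40.32 minutes, so the quoted budget should match whichever window the queries actually use.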

Using "slo-implementation". Create Prometheus alerting rules for multi-window burn rate detection.

Expected output:

Multi-Window Burn Rate Alert Rules:

Fast Burn Alert (Critical):
- Window: 1 hour, confirmed by a 5-minute window
- Threshold: 14.4x burn rate
- Triggers when: Consuming 2% of error budget in 1 hour
- Action: Page on-call engineer immediately

Slow Burn Alert (Warning):
- Window: 6 hours, confirmed by a 30-minute window
- Threshold: 6x burn rate
- Triggers when: Consuming 5% of error budget in 6 hours
- Action: Create ticket for next business day

Error Budget Exhausted (Critical):
- Trigger: Error budget remaining < 0%
- Duration: 5 minutes sustained
- Action: Feature freeze, all-hands on reliability
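The 14.4x and 6x thresholds can be derived from the share of budget each alert is meant to catch. The sketch below assumes a 30-day SLO window, which is what makes the numbers above come out; `burn_rate_threshold` is a hypothetical helper name.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window.

    A burn rate of 1 spends exactly the whole budget over the full SLO window.
    """
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


fast = burn_rate_threshold(0.02, 1)  # 2% of the budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # 5% of the budget in 6 hours
print(round(fast, 1), round(slow, 1))
```

With a different SLO window (for example 28 days) the same budget fractions give slightly different thresholds, so the multipliers are worth recomputing rather than copying.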

Using "slo-implementation". Design an error budget policy for a 99.9% SLO target.

Expected output:

Error Budget Policy for 99.9% SLO:

Budget: 0.1% = 43.2 minutes per month

At 100% budget remaining:
- Normal development velocity
- All features allowed to ship

At 50% budget remaining:
- Postpone non-essential changes
- Require extra testing for releases
- Increase monitoring coverage

At 10% budget remaining:
- Freeze non-critical features
- Only security and hotfix releases
- Daily reliability standups

At 0% budget remaining:
- Complete feature freeze
- Focus 100% on reliability improvements
- No releases without SRE approval
- Root cause analysis required
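A policy like this can be encoded as a small lookup so that tooling and people read the same rules. The sketch below is one possible encoding, not part of the skill itself: the names are hypothetical, and treating each listed level as the upper bound of its tier is an assumption.

```python
# Tiers copied from the policy above. Each number is treated as the upper
# bound (in % of budget remaining) of its tier; that reading is an assumption.
POLICY_TIERS = [
    (0.0, ["Complete feature freeze",
           "Focus 100% on reliability improvements",
           "No releases without SRE approval",
           "Root cause analysis required"]),
    (10.0, ["Freeze non-critical features",
            "Only security and hotfix releases",
            "Daily reliability standups"]),
    (50.0, ["Postpone non-essential changes",
            "Require extra testing for releases",
            "Increase monitoring coverage"]),
]
NORMAL = ["Normal development velocity", "All features allowed to ship"]


def policy_actions(budget_remaining_pct: float) -> list:
    """Return the actions for the strictest tier the remaining budget falls into."""
    for upper_bound, actions in POLICY_TIERS:
        if budget_remaining_pct <= upper_bound:
            return actions
    return NORMAL


print(policy_actions(7.5))
```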

Security audit

Safe
v1 • 2/25/2026

Static analysis detected 35 potential security issues, all of which are false positives. The flagged 'Ruby/shell backtick execution' patterns are Markdown code formatting (backticks) used for PromQL queries and YAML examples. The 'weak cryptographic algorithm' flags are documentation text and annotations, not actual encryption code. This skill contains only documentation with no executable code, network operations, or security vulnerabilities.

  • Files scanned: 1
  • Lines analyzed: 344
  • Findings: 2
  • Total audits: 1

Low-risk issues (2)
False Positive: Code Block Formatting
Static analyzer flagged Markdown code blocks (using backticks) as 'Ruby/shell backtick execution'. These are documentation code examples for PromQL queries and YAML configurations, not executable shell commands.
False Positive: Documentation Text
Static analyzer flagged 'weak cryptographic algorithm' at lines 3, 215, 229, 239. These are plain text descriptions and YAML comments in documentation, not actual cryptographic implementations.
Audited by: claude

Quality score

  • Architecture: 38
  • Maintainability: 100
  • Content: 87
  • Community: 50
  • Security: 100
  • Spec compliance: 100

What you can build

Establish Reliability Baseline

Define initial SLIs and SLOs for a new microservice to set measurable reliability targets and create alerting that catches actual problems without false alarm fatigue.

Implement Error Budget Governance

Create error budget policies that automatically freeze risky deployments when reliability degrades, helping balance feature velocity with stability requirements.

Reduce Alert Fatigue

Replace brittle threshold alerts with multi-window burn rate alerts that only trigger on significant reliability degradation, cutting notification noise by 80%.

Try these prompts

Define Basic SLOs
Help me define SLIs and SLOs for my API service. I need availability and latency targets.
Create Error Budget Policy
Design an error budget policy for a 99.9% SLO target. Define actions at 100%, 50%, 10%, and 0% remaining budget.
Build SLO Alerts
Create Prometheus alerting rules for multi-window burn rate detection. Use fast burn (1h/5m) and slow burn (6h/30m) windows.
Review SLO Compliance
Analyze my current SLO compliance data. Show error budget remaining, burn rate trends, and recommend whether to freeze feature releases.

Best practices

  • Start with user-facing SLIs that directly measure customer experience rather than backend metrics
  • Set achievable SLOs slightly below current performance to allow for normal variance and prevent constant alerting
  • Use multi-window burn rate alerts (combine short and long windows) to eliminate false positives from transient blips
  • Review SLOs quarterly to ensure they still reflect business priorities and actual user needs

Avoid

  • Setting SLO targets at 100% availability, which eliminates all error budget and prevents any feature development
  • Creating alerts on raw metric thresholds instead of burn rates, causing alert fatigue from normal fluctuations
  • Defining too many SLIs, which dilutes focus and makes it impossible to prioritize reliability improvements
  • Implementing SLOs without executive buy-in for error budget policies, rendering the governance unenforceable

Frequently asked questions

What is the difference between SLI, SLO, and SLA?
An SLI (Service Level Indicator) is a measured metric, such as the percentage of successful requests. An SLO (Service Level Objective) is your internal target for that metric, such as 99.9% availability. An SLA (Service Level Agreement) is the external commitment you make to customers; it should be less strict than your internal SLO so that you have a buffer.
Why should I not target 100% reliability?
A 100% reliability target leaves zero error budget, so any incident immediately violates your SLO. It also blocks feature development, because every change carries some risk. A 99.9% target allows about 43 minutes of downtime per 30-day month for maintenance and experimentation while still delivering an excellent user experience.
How do I choose the right SLO percentage?
Analyze your current performance over 30 days, then set the SLO slightly below that baseline. Consider user expectations, competitor benchmarks, and business impact. Start conservative (99%) and tighten as you build confidence. The goal is achievable targets that catch real problems, not perfection.
What is multi-window burn rate alerting?
Multi-window alerts require both a long window (like 1 hour) and a short window (like 5 minutes) to exceed the burn rate threshold at the same time. The long window filters out brief spikes, and the short window confirms that the problem is still happening, so the alert clears quickly after recovery. For example, alert only if the burn rate exceeds 14.4x in both the 1-hour and 5-minute windows.
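As a rough illustration, the two-window condition can be written as a single PromQL expression joined with `and`. The Python sketch below only builds that expression as a string; it assumes the `http_requests_total` metric and `status` label from the earlier example, and both function names are hypothetical.

```python
def error_ratio(window: str) -> str:
    """PromQL for the share of requests returning 5xx over `window`."""
    return (f'sum(rate(http_requests_total{{status=~"5.."}}[{window}]))'
            f' / sum(rate(http_requests_total[{window}]))')


def multiwindow_burn_expr(slo_target: float, burn_rate: float,
                          long_window: str, short_window: str) -> str:
    """Expression that is true only while BOTH windows exceed the burn rate."""
    # Burning at N times the sustainable rate means an error ratio of N * budget.
    threshold = burn_rate * (1.0 - slo_target)
    return (f'({error_ratio(long_window)} > {threshold:.6g})'
            f' and ({error_ratio(short_window)} > {threshold:.6g})')


print(multiwindow_burn_expr(0.999, 14.4, "1h", "5m"))
```

The resulting string could be used as the `expr` of a Prometheus alerting rule; in practice the per-window ratios are often precomputed with recording rules so the alert itself stays cheap to evaluate.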
How does error budget governance work?
Error budgets translate SLOs into actionable development policies. When you have budget remaining, ship features normally. As budget depletes, freeze risky changes. At 0% budget, halt all features until reliability improves. This creates an automatic feedback loop balancing innovation and stability.
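The feedback loop depends on knowing how much budget is left. A minimal sketch of that calculation for a request-based SLO follows; the function name and the example counts are made up for illustration.

```python
def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests with 400 failures against a 99.9% SLO.
remaining = budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of the error budget remains")
```

A value at or below zero corresponds to the "0% budget" state described above, where feature work stops until reliability recovers.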
What tools do I need to implement SLOs?
You need a metrics system (Prometheus recommended), visualization (Grafana), and alerting (Alertmanager). This skill provides the PromQL queries, recording rules, and alert configurations. Deploy these to your existing monitoring stack, then build dashboards to track compliance.

Developer details

File structure

📄 SKILL.md