
slo-implementation

Safe

Implement SLOs and Error Budgets for SRE

Also available from: wshobson

Balance reliability with feature velocity using measurable SLO targets and error budgets. Stop debating reliability and start measuring it with industry-standard SRE practices.

Supports: Claude Codex Code(CC)
🥉 75 Bronze
1. Download the skill ZIP

2. Upload to Claude

   Go to Settings → Capabilities → Skills → Upload skill

3. Turn it on and get started

Try it

Using "slo-implementation". Help me define SLIs and SLOs for my API service. I need availability and latency targets.

Expected result:

Here are recommended SLIs for your API:

Availability SLI:
- Successful requests / Total requests
- Target: 99.9% (allows 43.2 minutes of downtime per 30-day month)
- PromQL: sum(rate(http_requests_total{status!~'5..'}[28d])) / sum(rate(http_requests_total[28d]))

Latency SLI (requests served in under 500ms):
- Requests below threshold / Total requests
- Target: 99% (1% of requests can be slow)
- PromQL: sum(rate(http_request_duration_seconds_bucket{le='0.5'}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

Error Budget:
- 0.1% budget = 43.2 minutes per 30-day month
- Implement burn rate alerts to detect when consuming budget too quickly
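
The arithmetic behind these numbers can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names are hypothetical, and the 30-day default window is an assumption chosen to match the 43.2-minute figure (a 28-day window, as used in the PromQL above, gives about 40.3 minutes).

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def availability_sli(successful: int, total: int) -> float:
    """Successful requests / total requests (treated as 1.0 with no traffic)."""
    return successful / total if total else 1.0


def budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    if not 0.0 < slo_target < 1.0:
        # A 100% target has no error budget to divide by.
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1.0 - slo_target
    return 1.0 - (1.0 - sli) / budget


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(0.999):.1f} min per 30 days")
    sli = availability_sli(successful=999_500, total=1_000_000)
    print(f"SLI {sli:.4%}, budget remaining {budget_remaining(sli, 0.999):.0%}")
```

With a measured SLI of 99.95% against a 99.9% target, half of the error budget is still available.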

Using "slo-implementation". Create Prometheus alerting rules for multi-window burn rate detection.

Expected result:

Multi-Window Burn Rate Alert Rules:

Fast Burn Alert (Critical):
- Window: 1 hour and 5 minute confirmation
- Threshold: 14.4x burn rate
- Triggers when: Consuming 2% of error budget in 1 hour
- Action: Page on-call engineer immediately

Slow Burn Alert (Warning):
- Window: 6 hours and 30 minute confirmation
- Threshold: 6x burn rate
- Triggers when: Consuming 5% of error budget in 6 hours
- Action: Create ticket for next business day

Error Budget Exhausted (Critical):
- Trigger: Error budget remaining < 0%
- Duration: 5 minutes sustained
- Action: Feature freeze, all-hands on reliability
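
The thresholds above follow from simple arithmetic: a burn rate of 1x spends exactly the whole budget over the SLO period, so the share of budget spent in a window is the burn rate times the window's share of the period. The sketch below assumes a 30-day SLO period, which is what makes 14.4x equal 2% per hour; the function names are hypothetical.

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    slo_period_days: int = 30) -> float:
    """Fraction of the error budget spent when burning at `burn_rate` for the window."""
    return burn_rate * window_hours / (slo_period_days * 24)


def burn_rate_for(budget_fraction: float, window_hours: float,
                  slo_period_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the window."""
    return budget_fraction * slo_period_days * 24 / window_hours


def error_rate_threshold(burn_rate: float, slo_target: float) -> float:
    """Error ratio that corresponds to a given burn rate for this SLO."""
    return burn_rate * (1.0 - slo_target)


if __name__ == "__main__":
    print(f"fast burn: {budget_consumed(14.4, 1):.1%} of budget in 1 hour")
    print(f"slow burn: {budget_consumed(6, 6):.1%} of budget in 6 hours")
    print(f"error ratio at 14.4x for 99.9%: {error_rate_threshold(14.4, 0.999):.2%}")
```

For a 99.9% SLO, a 14.4x burn rate corresponds to roughly 1.44% of requests failing.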

Using "slo-implementation". Design an error budget policy for a 99.9% SLO target.

Expected result:

Error Budget Policy for 99.9% SLO:

Budget: 0.1% = 43.2 minutes per month

At 100% budget remaining:
- Normal development velocity
- All features allowed to ship

At 50% budget remaining:
- Postpone non-essential changes
- Require extra testing for releases
- Increase monitoring coverage

At 10% budget remaining:
- Freeze non-critical features
- Only security and hotfix releases
- Daily reliability standups

At 0% budget remaining:
- Complete feature freeze
- Focus 100% on reliability improvements
- No releases without SRE approval
- Root cause analysis required
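
A policy like this can be encoded as a lookup from remaining budget to actions, for example to gate a deployment pipeline. The sketch below is illustrative only: the policy lists four points (100%, 50%, 10%, 0%) without saying what applies between them, so the tier boundaries and tier names here are assumptions.

```python
# Each tier applies when the remaining budget is at or below its ceiling.
# Boundaries between the four stated points are an assumption of this sketch.
POLICY_TIERS = [
    (0.00, "exhausted", [
        "Complete feature freeze",
        "Focus 100% on reliability improvements",
        "No releases without SRE approval",
        "Root cause analysis required",
    ]),
    (0.10, "critical", [
        "Freeze non-critical features",
        "Only security and hotfix releases",
        "Daily reliability standups",
    ]),
    (0.50, "caution", [
        "Postpone non-essential changes",
        "Require extra testing for releases",
        "Increase monitoring coverage",
    ]),
]
NORMAL = ("normal", [
    "Normal development velocity",
    "All features allowed to ship",
])


def policy_for(budget_remaining: float) -> tuple[str, list[str]]:
    """Return (tier name, actions) for a remaining-budget fraction such as 0.35."""
    for ceiling, name, actions in POLICY_TIERS:  # strictest tier is checked first
        if budget_remaining <= ceiling:
            return name, actions
    return NORMAL


if __name__ == "__main__":
    tier, actions = policy_for(0.08)
    print(tier)
    for action in actions:
        print(" -", action)
```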

Security audit

Safe
v1 • 2/25/2026

Static analysis detected 35 potential security issues, all of which are false positives. The flagged 'Ruby/shell backtick execution' patterns are Markdown code formatting (backticks) used for PromQL queries and YAML examples. The 'weak cryptographic algorithm' flags are documentation text and annotations, not actual encryption code. This skill contains only documentation with no executable code, network operations, or security vulnerabilities.

Files scanned: 1
Lines analyzed: 344
Findings: 2
Total audits: 1
Low-risk issues (2)
False Positive: Code Block Formatting
Static analyzer flagged Markdown code blocks (using backticks) as 'Ruby/shell backtick execution'. These are documentation code examples for PromQL queries and YAML configurations, not executable shell commands.
False Positive: Documentation Text
Static analyzer flagged 'weak cryptographic algorithm' at lines 3, 215, 229, 239. These are plain text descriptions and YAML comments in documentation, not actual cryptographic implementations.
Audited by: claude

Quality score

Architecture: 38
Maintainability: 100
Content: 87
Community: 50
Security: 100
Specification compliance: 100

What you can build

Establish Reliability Baseline

Define initial SLIs and SLOs for a new microservice to set measurable reliability targets and create alerting that catches actual problems without false alarm fatigue.

Implement Error Budget Governance

Create error budget policies that automatically freeze risky deployments when reliability degrades, helping balance feature velocity with stability requirements.

Reduce Alert Fatigue

Replace brittle threshold alerts with multi-window burn rate alerts that only trigger on significant reliability degradation, cutting notification noise by 80%.

Try these prompts

Define Basic SLOs
Help me define SLIs and SLOs for my API service. I need availability and latency targets.
Create Error Budget Policy
Design an error budget policy for a 99.9% SLO target. Define actions at 100%, 50%, 10%, and 0% remaining budget.
Build SLO Alerts
Create Prometheus alerting rules for multi-window burn rate detection. Use fast burn (1h/5m) and slow burn (6h/30m) windows.
Review SLO Compliance
Analyze my current SLO compliance data. Show error budget remaining, burn rate trends, and recommend whether to freeze feature releases.

Best practices

  • Start with user-facing SLIs that directly measure customer experience rather than backend metrics
  • Set achievable SLOs slightly below current performance to allow for normal variance and prevent constant alerting
  • Use multi-window burn rate alerts (combine short and long windows) to eliminate false positives from transient blips
  • Review SLOs quarterly to ensure they still reflect business priorities and actual user needs

Avoid

  • Setting SLO targets at 100% availability, which eliminates all error budget and prevents any feature development
  • Creating alerts on raw metric thresholds instead of burn rates, causing alert fatigue from normal fluctuations
  • Defining too many SLIs, which dilutes focus and makes it impossible to prioritize reliability improvements
  • Implementing SLOs without executive buy-in for error budget policies, rendering the governance unenforceable

Frequently asked questions

What is the difference between SLI, SLO, and SLA?
SLI (Service Level Indicator) is a measured metric like availability percentage. SLO (Service Level Objective) is your internal target for that metric, like 99.9% availability. SLA (Service Level Agreement) is the external commitment you make to customers, which should be less strict than your internal SLO (for example, a 99.5% SLA behind a 99.9% SLO) to provide a buffer.
Why should I not target 100% reliability?
100% reliability leaves zero error budget, meaning any incident immediately violates your SLO. This prevents all feature development since you cannot take any risk. A 99.9% target allows 43 minutes of downtime per month for maintenance and experimentation while maintaining excellent user experience.
How do I choose the right SLO percentage?
Analyze your current performance over 30 days, then set the SLO slightly below that baseline. Consider user expectations, competitor benchmarks, and business impact. Start conservative (99%) and tighten as you build confidence. The goal is achievable targets that catch real problems, not perfection.
What is multi-window burn rate alerting?
Multi-window alerts require both a long window (like 1 hour) and a short confirmation window (like 5 minutes) to exceed the same burn rate threshold simultaneously. The long window filters out brief spikes, and the short window confirms the problem is still happening, so the alert clears soon after recovery. For example, alert only if burn rate exceeds 14.4x in both the 1-hour and 5-minute windows.
How does error budget governance work?
Error budgets translate SLOs into actionable development policies. When you have budget remaining, ship features normally. As budget depletes, freeze risky changes. At 0% budget, halt all features until reliability improves. This creates an automatic feedback loop balancing innovation and stability.
What tools do I need to implement SLOs?
You need a metrics system (Prometheus recommended), visualization (Grafana), and alerting (Alertmanager). This skill provides the PromQL queries, recording rules, and alert configurations. Deploy these to your existing monitoring stack, then build dashboards to track compliance.
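
The multi-window condition described in the FAQ can be sketched as a small check over error ratios measured in each window. This is a simplified illustration under stated assumptions: the `BurnRateRule` type, the function names, and the dictionary of per-window error ratios are all hypothetical, and in practice the same logic would live in Prometheus alerting rules rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    name: str
    long_window: str    # e.g. "1h"
    short_window: str   # e.g. "5m", the confirmation window
    threshold: float    # burn rate both windows must exceed


FAST_BURN = BurnRateRule("fast burn", "1h", "5m", 14.4)
SLOW_BURN = BurnRateRule("slow burn", "6h", "30m", 6.0)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_alert(rule: BurnRateRule, error_ratios: dict[str, float],
                 slo_target: float) -> bool:
    """Fire only when BOTH the long and the short window exceed the threshold."""
    long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
    short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
    return long_burn > rule.threshold and short_burn > rule.threshold


if __name__ == "__main__":
    # Sustained problem: 2% errors over the hour, 3% in the last 5 minutes.
    print(should_alert(FAST_BURN, {"1h": 0.02, "5m": 0.03}, 0.999))
    # Spike already over: the hour still looks bad, the last 5 minutes do not.
    print(should_alert(FAST_BURN, {"1h": 0.02, "5m": 0.001}, 0.999))
```

Requiring both windows means a spike that has already ended stops alerting once the short window recovers, even though the long window still shows an elevated burn rate.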

Developer details

File structure

📄 SKILL.md