slo-implementation

Safe

Implement SLOs and Error Budgets for SRE

Also available from: wshobson

Balance reliability with feature velocity using measurable SLO targets and error budgets. Stop debating reliability and start measuring it with industry-standard SRE practices.

Supports: Claude, Codex, Code (CC)
🥉 75 Bronze
1. Download the skill ZIP

2. Upload it in Claude: go to Settings → Capabilities → Skills → Upload skill

3. Enable it and start using it

Try it

Using "slo-implementation". Help me define SLIs and SLOs for my API service. I need availability and latency targets.

Expected result:

Here are recommended SLIs for your API:

Availability SLI:
- Successful requests / Total requests
- Target: 99.9% (allows 43.2 min/month downtime)
- PromQL: sum(rate(http_requests_total{status!~'5..'}[28d])) / sum(rate(http_requests_total[28d]))

Latency SLI (requests served in under 500 ms):
- Requests below threshold / Total requests
- Target: 99% (1% of requests can be slow)
- PromQL: sum(rate(http_request_duration_seconds_bucket{le='0.5'}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

Error Budget:
- 0.1% budget = 43.2 minutes/month
- Implement burn rate alerts to detect when consuming budget too quickly
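
The arithmetic behind these numbers can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names are hypothetical, and the 43.2-minute figure assumes a 30-day month (the PromQL above uses a 28-day window, which would give a slightly smaller budget).

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Successful requests / total requests; treated as 1.0 when there is no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window (30 days assumed)."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent; negative once the budget is overspent."""
    if slo_target >= 1.0:
        raise ValueError("a 100% target leaves no error budget to measure against")
    allowed_failure = 1.0 - slo_target
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


if __name__ == "__main__":
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(round(error_budget_minutes(0.999), 1))   # budget for a 99.9% target
    print(round(budget_remaining(sli, 0.999), 2))  # half of the budget is left
```

With 999,500 good requests out of 1,000,000, the measured SLI is 99.95%, so half of the 0.1% budget has been spent.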

Using "slo-implementation". Create Prometheus alerting rules for multi-window burn rate detection.

Expected result:

Multi-Window Burn Rate Alert Rules:

Fast Burn Alert (Critical):
- Window: 1 hour, confirmed by a 5-minute window
- Threshold: 14.4x burn rate
- Triggers when: Consuming 2% of error budget in 1 hour
- Action: Page on-call engineer immediately

Slow Burn Alert (Warning):
- Window: 6 hours, confirmed by a 30-minute window
- Threshold: 6x burn rate
- Triggers when: Consuming 5% of error budget in 6 hours
- Action: Create ticket for next business day

Error Budget Exhausted (Critical):
- Trigger: Error budget remaining < 0%
- Duration: 5 minutes sustained
- Action: Feature freeze, all-hands on reliability
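
The 14.4x and 6x thresholds follow from the share of budget each alert is meant to catch. A minimal sketch of that derivation, assuming a 30-day SLO period (the function name is hypothetical):

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window_hours`.

    A burn rate of 1.0 spends the budget exactly over the whole period,
    so the threshold is the budget fraction scaled by period / window.
    """
    period_hours = period_days * 24
    return budget_fraction * period_hours / window_hours


if __name__ == "__main__":
    fast = burn_rate_threshold(budget_fraction=0.02, window_hours=1)
    slow = burn_rate_threshold(budget_fraction=0.05, window_hours=6)
    print(round(fast, 2))  # 14.4
    print(round(slow, 2))  # 6.0
    # For a 99.9% SLO, the fast-burn threshold corresponds to this error ratio:
    print(round(fast * (1 - 0.999), 4))  # 0.0144
```

In other words, for a 99.9% target the fast-burn alert fires when about 1.44% of requests are failing, and the slow-burn alert when about 0.6% are.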

Using "slo-implementation". Design an error budget policy for a 99.9% SLO target.

Expected result:

Error Budget Policy for 99.9% SLO:

Budget: 0.1% = 43.2 minutes per month

At 100% budget remaining:
- Normal development velocity
- All features allowed to ship

At 50% budget remaining:
- Postpone non-essential changes
- Require extra testing for releases
- Increase monitoring coverage

At 10% budget remaining:
- Freeze non-critical features
- Only security and hotfix releases
- Daily reliability standups

At 0% budget remaining:
- Complete feature freeze
- Focus 100% on reliability improvements
- No releases without SRE approval
- Root cause analysis required
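
A policy like this can be encoded as a simple lookup so that tooling (a deploy gate, a dashboard) reports the current tier. The sketch below is one possible encoding, not the skill's own implementation. It reads each "at N% remaining" level as "N% or less remaining", which is an assumption, and the names are hypothetical.

```python
from typing import List, Tuple

# (upper bound on remaining budget, actions), checked from most to least severe.
# Assumption: "at N% remaining" means "N% or less remaining".
POLICY_TIERS: List[Tuple[float, List[str]]] = [
    (0.0, ["Complete feature freeze",
           "Focus 100% on reliability improvements",
           "No releases without SRE approval",
           "Root cause analysis required"]),
    (0.10, ["Freeze non-critical features",
            "Only security and hotfix releases",
            "Daily reliability standups"]),
    (0.50, ["Postpone non-essential changes",
            "Require extra testing for releases",
            "Increase monitoring coverage"]),
]

NORMAL_ACTIONS = ["Normal development velocity", "All features allowed to ship"]


def policy_actions(budget_remaining: float) -> List[str]:
    """Return the actions for the most severe tier the remaining budget falls into."""
    for upper_bound, actions in POLICY_TIERS:
        if budget_remaining <= upper_bound:
            return actions
    return NORMAL_ACTIONS


if __name__ == "__main__":
    for remaining in (1.0, 0.35, 0.07, -0.2):
        print(f"{remaining:.0%}: {policy_actions(remaining)[0]}")
```

Keeping the tiers in a data structure rather than in branching code makes it easier to review the policy with stakeholders and to adjust the thresholds later.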

Security audit

Safe
v1 • 2/25/2026

Static analysis detected 35 potential security issues, all of which are false positives. The flagged 'Ruby/shell backtick execution' patterns are Markdown code formatting (backticks) used for PromQL queries and YAML examples. The 'weak cryptographic algorithm' flags are documentation text and annotations, not actual encryption code. This skill contains only documentation with no executable code, network operations, or security vulnerabilities.

Files scanned: 1
Lines analyzed: 344
Findings: 2
Total audits: 1
Low-risk issues (2)
False Positive: Code Block Formatting
Static analyzer flagged Markdown code blocks (using backticks) as 'Ruby/shell backtick execution'. These are documentation code examples for PromQL queries and YAML configurations, not executable shell commands.
False Positive: Documentation Text
Static analyzer flagged 'weak cryptographic algorithm' at lines 3, 215, 229, 239. These are plain text descriptions and YAML comments in documentation, not actual cryptographic implementations.
Audited by: claude

Quality score

Architecture: 38
Maintainability: 100
Content: 87
Community: 50
Security: 100
Spec compliance: 100

What you can build

Establish Reliability Baseline

Define initial SLIs and SLOs for a new microservice to set measurable reliability targets and create alerting that catches actual problems without false alarm fatigue.

Implement Error Budget Governance

Create error budget policies that automatically freeze risky deployments when reliability degrades, helping balance feature velocity with stability requirements.

Reduce Alert Fatigue

Replace brittle threshold alerts with multi-window burn rate alerts that only trigger on significant reliability degradation, cutting notification noise by 80%.

Try these prompts

Define Basic SLOs
Help me define SLIs and SLOs for my API service. I need availability and latency targets.
Create Error Budget Policy
Design an error budget policy for a 99.9% SLO target. Define actions at 100%, 50%, 10%, and 0% remaining budget.
Build SLO Alerts
Create Prometheus alerting rules for multi-window burn rate detection. Use fast burn (1h/5m) and slow burn (6h/30m) windows.
Review SLO Compliance
Analyze my current SLO compliance data. Show error budget remaining, burn rate trends, and recommend whether to freeze feature releases.

Best practices

  • Start with user-facing SLIs that directly measure customer experience rather than backend metrics
  • Set achievable SLOs slightly below current performance to allow for normal variance and prevent constant alerting
  • Use multi-window burn rate alerts (combine short and long windows) to eliminate false positives from transient blips
  • Review SLOs quarterly to ensure they still reflect business priorities and actual user needs

Avoid

  • Setting SLO targets at 100% availability, which eliminates all error budget and prevents any feature development
  • Creating alerts on raw metric thresholds instead of burn rates, causing alert fatigue from normal fluctuations
  • Defining too many SLIs, which dilutes focus and makes it impossible to prioritize reliability improvements
  • Implementing SLOs without executive buy-in for error budget policies, rendering the governance unenforceable

Frequently asked questions

What is the difference between SLI, SLO, and SLA?
SLI (Service Level Indicator) is a measured metric, such as availability percentage. SLO (Service Level Objective) is your internal target for that metric, such as 99.9% availability. SLA (Service Level Agreement) is the external commitment you make to customers; it should be less strict than your internal SLO so that you have a buffer before a contractual breach.
Why should I not target 100% reliability?
100% reliability leaves zero error budget, meaning any incident immediately violates your SLO. This prevents all feature development since you cannot take any risk. A 99.9% target allows 43 minutes of downtime per month for maintenance and experimentation while maintaining excellent user experience.
How do I choose the right SLO percentage?
Analyze your current performance over 30 days, then set the SLO slightly below that baseline. Consider user expectations, competitor benchmarks, and business impact. Start conservative (99%) and tighten as you build confidence. The goal is achievable targets that catch real problems, not perfection.
What is multi-window burn rate alerting?
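Multi-window alerts require both a long window (like 1 hour) and a short confirmation window (like 5 minutes) to exceed the same burn rate threshold simultaneously. The long window filters out brief spikes, while the short window confirms the problem is still happening, so the alert clears soon after recovery. For example, the fast burn alert fires only if the burn rate exceeds 14.4x in both the 1-hour and 5-minute windows; the slow burn alert pairs a 6-hour window with a 30-minute window at 6x.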
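
The 14.4x example can be sketched as a two-window check. This is an illustrative sketch with hypothetical function names; in practice the same condition would be written as a PromQL alert expression.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def should_alert(long_window_burn: float, short_window_burn: float,
                 threshold: float) -> bool:
    """Fire only when both windows exceed the threshold at the same time."""
    return long_window_burn > threshold and short_window_burn > threshold


if __name__ == "__main__":
    slo = 0.999
    # Sustained incident: 2% of requests failing over both the 1h and 5m windows.
    print(should_alert(burn_rate(0.02, slo), burn_rate(0.02, slo), 14.4))    # True
    # Already recovered: the 1h window is still elevated, the last 5m are healthy.
    print(should_alert(burn_rate(0.02, slo), burn_rate(0.0005, slo), 14.4))  # False
```

The second case shows why the short window matters: without it, the alert would keep firing for up to an hour after the service had recovered.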
How does error budget governance work?
Error budgets translate SLOs into actionable development policies. When you have budget remaining, ship features normally. As budget depletes, freeze risky changes. At 0% budget, halt all features until reliability improves. This creates an automatic feedback loop balancing innovation and stability.
What tools do I need to implement SLOs?
You need a metrics system (Prometheus recommended), visualization (Grafana), and alerting (Alertmanager). This skill provides the PromQL queries, recording rules, and alert configurations. Deploy these to your existing monitoring stack, then build dashboards to track compliance.

Developer details

Author

sickn33

License

MIT

Ref

main

File structure

📄 SKILL.md