observability-monitoring-slo-implement

Safe

Implement SLOs and Error Budgets

Design and implement Service Level Objectives with SLIs and error budgets to measure and improve system reliability while balancing feature velocity.

Supports: Claude, Codex, Code (CC)
🥉 72 Bronze
1. Download the skill ZIP
2. Import into Claude: go to Settings → Capabilities → Skills → Import a skill
3. Enable it and start using it

Try it

Using "observability-monitoring-slo-implement". Design SLOs for a new e-commerce checkout service

Expected result:

A comprehensive SLO framework including: tier classification (critical), availability target (99.95%), latency SLI (p95 < 500ms), error rate SLI (< 0.1%), error budget calculation (about 21.6 minutes per 30-day window, equivalent to 4.38 hours per year), and burn rate alert thresholds.
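
The error budget arithmetic behind such a framework can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own code: the function name `error_budget_minutes` and the 30-day default window are assumptions. For a 99.95% target it yields roughly 21.6 minutes per 30 days, which is 4.38 hours over a 365-day year.

```python
def error_budget_minutes(slo_target: float, window_days: float = 30.0) -> float:
    """Allowed downtime, in minutes, for an availability target over a window.

    `slo_target` is a fraction (0.9995 for 99.95%). The name and the 30-day
    default are illustrative assumptions, not taken from the skill.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


monthly_minutes = error_budget_minutes(0.9995)          # 30-day window
yearly_hours = error_budget_minutes(0.9995, 365) / 60   # 365-day window
print(f"{monthly_minutes:.1f} min per 30 days, {yearly_hours:.2f} h per year")
```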

Using "observability-monitoring-slo-implement". Create Prometheus recording rules for SLO tracking

Expected result:

YAML configuration with recording rules for request rate, success rate at multiple time windows (5m, 30m, 1h), latency percentiles (p50, p95, p99), and error budget burn rate calculations.
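
To make the shape of that output concrete, the sketch below builds success-ratio and burn-rate recording rules as a Python structure mirroring a Prometheus rule file (`groups` → `rules` → `record`/`expr`). The metric name `http_requests_total`, its `code` label, the `slo:...` rule names, and the helper functions are all assumptions for illustration; latency percentile rules are left out, and writing the structure to a YAML file is not shown.

```python
import json

WINDOWS = ["5m", "30m", "1h"]  # the windows named in the expected result


def success_ratio_rule(window: str, metric: str = "http_requests_total") -> dict:
    """Recording rule: share of non-5xx requests over `window` (assumed metric/label)."""
    expr = (
        f'sum(rate({metric}{{code!~"5.."}}[{window}]))'
        f" / sum(rate({metric}[{window}]))"
    )
    return {"record": f"slo:success_ratio:rate{window}", "expr": expr}


def burn_rate_rule(window: str, slo_target: float) -> dict:
    """Recording rule: observed error ratio divided by the allowed error ratio."""
    allowed = 1.0 - slo_target
    expr = f"(1 - slo:success_ratio:rate{window}) / {allowed:g}"
    return {"record": f"slo:burn_rate:rate{window}", "expr": expr}


def build_rule_group(slo_target: float) -> dict:
    """Assemble one rule group covering every window in WINDOWS."""
    rules = [success_ratio_rule(w) for w in WINDOWS]
    rules += [burn_rate_rule(w, slo_target) for w in WINDOWS]
    return {"groups": [{"name": "slo_recording_rules", "rules": rules}]}


print(json.dumps(build_rule_group(0.999), indent=2))
```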

Security audit

Safe
v1 • 2/24/2026

Static analysis detected 57 potential issues, but manual review confirms that all findings are false positives. The skill contains documentation with Python code examples for SLO implementation: no actual executable code, no network calls, and no cryptographic operations. The placeholder URLs use the example.com domain. This is a legitimate DevOps reliability skill.

Files scanned: 2
Lines scanned: 1,124
Findings: 5
Total audits: 1
Medium-risk issues (2)
External Commands Detection in Documentation
Static scanner detected 'external_commands' pattern in markdown documentation. This is a false positive - the skill contains Python code examples in markdown blocks, not executable shell commands. The backtick syntax detected is part of Python f-strings and dictionary literals in documentation examples.
Hardcoded URLs in Example Configuration
Static scanner detected placeholder URLs in YAML configuration examples. These are example.com domain URLs used as placeholders in documentation, not actual network endpoints.
Low-risk issues (3)
Numeric Pattern False Positives
Static scanner detected 'weak cryptographic algorithm' patterns at multiple locations. These are false positives - the numeric values detected (99.9%, 0.001, 14.4) are SLO availability targets and burn rate multipliers, not cryptographic algorithms.
Documentation Language False Positive
Static scanner detected 'system reconnaissance' patterns. This is a false positive - words like 'analyze', 'assess', 'identify' are used in the legitimate context of service analysis for SLO design, not reconnaissance.
Code Block Bracket Pattern
Static scanner detected 'obfuscation' pattern with multiple bracket chains. This is a false positive - the pattern detected is legitimate markdown code block formatting with Python dictionary and f-string syntax.
Audited by: claude

Quality score

Architecture: 38
Maintainability: 100
Content: 87
Community: 50
Security: 89
Spec compliance: 91

What you can build

Define SLOs for a new API service

Create availability, latency, and error rate SLOs with appropriate targets based on service criticality

Set up error budget alerting

Configure multi-window burn rate alerts to detect fast and slow error budget consumption
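
A multi-window check can be sketched as follows. The burn rate is the observed error ratio divided by the error ratio the SLO allows, and an alert fires only when both its long and short windows exceed the threshold. The thresholds and window pairs below (14.4x over 1h/5m to page, 1x over 3d/6h to ticket) follow commonly used values for a 30-day SLO and are assumptions here, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateAlert:
    name: str
    long_window: str
    short_window: str
    threshold: float  # burn-rate multiplier that must be exceeded
    action: str       # "page" or "ticket"


# Illustrative policy: 14.4x for 1h spends 2% of a 30-day budget (14.4 * 1 / 720).
ALERTS = [
    BurnRateAlert("fast_burn", "1h", "5m", 14.4, "page"),
    BurnRateAlert("slow_burn", "3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def firing_alerts(error_ratios: dict[str, float], slo_target: float) -> list[BurnRateAlert]:
    """Return alerts whose long AND short windows both exceed the threshold."""
    firing = []
    for alert in ALERTS:
        long_rate = burn_rate(error_ratios[alert.long_window], slo_target)
        short_rate = burn_rate(error_ratios[alert.short_window], slo_target)
        if long_rate > alert.threshold and short_rate > alert.threshold:
            firing.append(alert)
    return firing


# Hypothetical observed error ratios per window for a 99.9% SLO.
observed = {"5m": 0.02, "1h": 0.018, "6h": 0.004, "3d": 0.0008}
for alert in firing_alerts(observed, slo_target=0.999):
    print(f"{alert.name}: {alert.action}")
```

Requiring the short window as well keeps an alert from lingering long after the error rate has already recovered.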

Establish SLO review process

Create weekly SLO review templates and governance processes for engineering teams

Try these prompts

Basic SLO Design
Help me design SLOs for my payment processing service. It handles 10,000 requests per minute and requires high reliability. What availability target should I set and how do I define the SLIs?
SLI Implementation
I need to implement SLIs for a REST API service using Prometheus. Show me how to create availability and latency SLI queries that track the percentage of successful requests and requests under 500ms.
Error Budget Alerts
Configure error budget burn rate alerts for my service with a 99.9% SLO target. I need both fast burn (page immediately) and slow burn (create ticket) alert rules.
SLO Governance
Establish an SLO governance framework for my team with roles and responsibilities, weekly review templates, and stakeholder communication processes.

Best practices

  • Start with conservative SLO targets and tighten them based on actual service performance data
  • Use multiple time windows for burn rate alerts to catch both fast and slow budget consumption
  • Align SLO targets with business priorities and user expectations, not technical convenience

Avoid

  • Setting SLO targets too tight initially, leading to constant alerts and alert fatigue
  • Using only availability SLIs without considering latency or quality metrics
  • Creating SLOs without stakeholder alignment or business context

Frequently asked questions

What is the difference between an SLO and an SLA?
An SLO (Service Level Objective) is an internal target that engineering teams commit to. An SLA (Service Level Agreement) is a contractual commitment to customers with financial consequences if violated.
How do I choose the right SLO availability target?
Start by analyzing historical availability, understanding user expectations, and considering business impact. Critical services typically need 99.95%+ while standard services may target 99.5%.
What time window should I use for SLO measurements?
Common windows are 30 days for rolling availability or calendar months for billing periods. Longer windows provide stability but slower feedback on issues.
How do I handle scheduled maintenance in SLO calculations?
Exclude planned maintenance windows from SLO measurements or use availability formulas that account for expected downtime. Document your approach clearly.
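
One way to exclude planned maintenance is to remove it from both the measured time and the counted downtime. The sketch below assumes intervals are given as (start, end) minutes from the start of the window, and that maintenance windows do not overlap each other (likewise for outages); the function names are hypothetical.

```python
Interval = tuple[float, float]  # (start_minute, end_minute) within the SLO window


def overlap(a: Interval, b: Interval) -> float:
    """Length in minutes of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def availability_excluding_maintenance(
    window_minutes: float,
    outages: list[Interval],
    maintenance: list[Interval],
) -> float:
    """Availability over the window with planned maintenance left out.

    Maintenance time is removed from the denominator, and any part of an
    outage that falls inside a maintenance window is not counted as downtime.
    """
    planned = sum(end - start for start, end in maintenance)
    measured = window_minutes - planned
    if measured <= 0:
        raise ValueError("maintenance covers the whole window")
    unplanned_down = sum(
        (outage[1] - outage[0]) - sum(overlap(outage, m) for m in maintenance)
        for outage in outages
    )
    return 1.0 - unplanned_down / measured


# Hypothetical 30-day window: one 2-hour maintenance slot, two outages,
# the first of which partly overlaps the maintenance slot.
result = availability_excluding_maintenance(
    window_minutes=30 * 24 * 60,
    outages=[(1100, 1130), (20000, 20015)],
    maintenance=[(1000, 1120)],
)
print(f"{result:.5%}")
```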
What should I do when error budget is depleted?
Pause feature development, focus on reliability improvements, and communicate status to stakeholders. Use the error budget policy to guide release decisions.
How many SLOs should a service have?
Start with 2-4 SLOs covering the most important user-facing aspects: availability, latency, and error rate. Add more as needed but avoid alert fatigue.

Developer details

File structure

📁 resources/

📄 implementation-playbook.md

📄 SKILL.md