incident-responder
Respond to Production Incidents
Get expert guidance on managing production incidents with structured response protocols, observability-driven investigation techniques, and blameless post-mortem processes to minimize MTTR and improve system reliability.
Download the skill ZIP
Upload in Claude
Go to Settings → Capabilities → Skills → Upload skill
Enable it and start using it
Test it
Using "incident-responder". We have a production incident. The payment service is returning 500 errors to 30% of users.
Expected result:
## Immediate Assessment
**Severity: P1** - Major functionality degraded, significant user impact
### 1. Establish Incident Command
- **Incident Commander**: [Assign now]
- **Technical Lead**: Start investigation
- **Communication Lead**: Prepare stakeholder update
### 2. Quick Impact Assessment
- User impact: ~30% of payment requests failing
- Business impact: Revenue impact likely, SLA violation possible
- Scope: Payment service only, check downstream effects
### 3. Immediate Stabilization Options
- Check for recent deployments (last 30 min)
- Enable feature flag to disable payments if needed
- Scale payment service pods if resource-constrained
- Verify database connection pool health
### 4. First Communication
- Update status page within 15 minutes
- Internal: Create incident channel, ping on-call
- Set expectation: Hourly updates for P1
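The first stabilization option in the sample above, checking for deployments in the last 30 minutes, can be sketched as a simple time-window filter. This is a minimal illustration, not part of the skill: the `Deployment` record, the service names, and the timestamps are all hypothetical, and a real check would read from your own deployment history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    service: str           # hypothetical service name
    deployed_at: datetime  # when the rollout finished


def recent_deployments(deployments, incident_start, window=timedelta(minutes=30)):
    """Return deployments inside `window` before the incident began, newest first."""
    cutoff = incident_start - window
    suspects = [d for d in deployments if cutoff <= d.deployed_at <= incident_start]
    return sorted(suspects, key=lambda d: d.deployed_at, reverse=True)


# Hypothetical data for illustration only.
incident_start = datetime(2024, 1, 15, 14, 0)
history = [
    Deployment("payment-service", datetime(2024, 1, 15, 13, 48)),
    Deployment("auth-service", datetime(2024, 1, 15, 11, 5)),
    Deployment("payment-gateway", datetime(2024, 1, 15, 13, 35)),
]

for d in recent_deployments(history, incident_start):
    print(d.service, d.deployed_at.isoformat())
```

Sorting newest first puts the most likely rollback candidate at the top of the list; the 30-minute window is a parameter so it can be widened if nothing turns up.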
Using "incident-responder". Help me write a post-mortem for a database outage that caused 2 hours of downtime.
Expected result:
## Post-Mortem Template
### Incident Summary
- **Duration**: [Start] to [End] - 2 hours
- **Impact**: Complete payment service outage
- **Users Affected**: 100% of payment users
- **SEV Level**: P0
### Timeline (Local Time)
- 14:00 - Alerts fire for elevated error rates
- 14:05 - On-call acknowledges, starts investigation
- 14:15 - Root cause identified as DB connection pool exhaustion
- 14:30 - Decision to restart service
- 15:45 - Service fully recovered
- 16:00 - Incident closed
### Root Cause Analysis (Five Whys)
1. Why did payment service fail? Database connection pool exhausted
2. Why was pool exhausted? Application leaked connections during retry storm
3. Why did retry storm occur? Missing circuit breaker on downstream API
4. Why was circuit breaker missing? Feature not implemented in redesign
5. Why was it left out? Gap in reliability requirements during migration
### Action Items
- [ ] Implement circuit breaker pattern (Owner: Team A, Due: 2 weeks)
- [ ] Add connection pool monitoring alerts (Owner: Team B, Due: 1 week)
- [ ] Update migration checklist to include reliability requirements (Owner: Team C, Due: 1 week)
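The first action item in the sample names the circuit breaker pattern. As a rough sketch of the idea (not the skill's output and not a production implementation), a breaker counts consecutive failures, rejects calls immediately while open, and allows a single trial call after a cooldown. The class name, the threshold, and the timeout below are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky_downstream():
    # Hypothetical stand-in for the failing downstream API.
    raise ConnectionError("downstream API unavailable")


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
for attempt in range(5):
    try:
        breaker.call(flaky_downstream)
    except CircuitOpenError:
        print(attempt, "rejected without calling downstream")
    except ConnectionError:
        print(attempt, "downstream call failed")
```

In this run the first three attempts reach the downstream function and fail, and the last two are rejected without calling it. That fail-fast behavior is what keeps retries from piling up against a dependency that is already struggling.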
Security Audit
Safe: Prompt-only skill containing educational content about SRE incident management practices. Static analysis scanned 0 files (0 lines) and detected 0 security issues. The skill provides guidance on incident response procedures, observability practices, and post-incident analysis. No executable code, no network calls, no file operations, and no prompt injection attempts detected. This is a safe, informational skill for incident response education.
What you can build
Active Production Incident Response
Use during active incidents to follow structured response protocols, assess severity, establish incident command, and coordinate communication with stakeholders.
Post-Incident Analysis and Learning
Facilitate blameless post-mortems with guidance on building the timeline, running root cause analysis with the Five Whys technique, and identifying actionable improvements.
SRE Practice and Training
Learn incident management best practices, modern observability techniques, and reliability patterns for building more resilient systems.
Try these prompts
We have a production incident. The service [service name] is experiencing [symptoms]. Help me assess the severity, establish incident command, and identify immediate stabilization steps.
We have a [P1/P2] incident affecting [service]. Initial investigation shows [observed symptoms]. Guide me through an observability-driven investigation to identify root cause.
We are in the middle of a [P0/P1] incident. I need to draft updates for [executives/customers/support team]. What should I communicate and how often?
Help me conduct a blameless post-mortem for an incident where [brief description]. Guide me through creating timeline, root cause analysis, and identifying action items.
Best practices
- Establish the incident command structure immediately - unclear ownership delays resolution
- Communicate proactively and often - stakeholders prefer updates to silence
- During active incidents, focus on restoring service first and root cause analysis second
- Document everything in real time - timelines and decisions are harder to reconstruct afterward
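One way to follow the real-time documentation practice above is to append timestamped notes to a small log as the incident unfolds, then render the post-mortem timeline and derive durations from it. The sketch below is an assumption-laden illustration: `IncidentLog` and its methods are hypothetical names, the entries reuse three times from the sample post-mortem timeline, and the calendar date is made up.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentLog:
    entries: list = field(default_factory=list)  # (timestamp, note) tuples

    def record(self, note, at=None):
        """Append a timestamped note; defaults to the current time."""
        self.entries.append((at or datetime.now(), note))

    def minutes_between(self, first_note, second_note):
        """Elapsed minutes between the first entries matching the two notes."""
        # Iterating in reverse lets the earliest matching entry win.
        times = {note: at for at, note in reversed(self.entries)}
        return (times[second_note] - times[first_note]).total_seconds() / 60

    def to_markdown(self):
        """Render entries as a chronological markdown list."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(f"- {at:%H:%M} - {note}" for at, note in ordered)


# Times taken from the sample post-mortem timeline; the date is hypothetical.
log = IncidentLog()
log.record("Alerts fire for elevated error rates", at=datetime(2024, 1, 15, 14, 0))
log.record("On-call acknowledges, starts investigation", at=datetime(2024, 1, 15, 14, 5))
log.record("Service fully recovered", at=datetime(2024, 1, 15, 15, 45))

print(log.to_markdown())
print("Minutes to acknowledge:", log.minutes_between(
    "Alerts fire for elevated error rates",
    "On-call acknowledges, starts investigation"))
print("Minutes to recover:", log.minutes_between(
    "Alerts fire for elevated error rates",
    "Service fully recovered"))
```

Capturing notes at the moment they happen means figures such as time to acknowledge and time to recover fall out of simple subtraction later, instead of being reconstructed from memory and chat history.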
Avoid
- Blaming individuals in post-mortems - focus on systems and processes instead
- Skipping incident command in favor of "everyone responds" - it causes coordination chaos
- Delaying communication until you have complete information - stakeholders need timely updates
- Implementing complex fixes during active incidents - prefer minimal viable fixes