incident-responder
Respond to Production Incidents
Get expert guidance on managing production incidents with structured response protocols, observability-driven investigation techniques, and blameless post-mortem processes to minimize MTTR and improve system reliability.
Download the skill ZIP
Upload to Claude
Go to Settings → Capabilities → Skills → Upload skill
Turn it on and get started
Test it
Using "incident-responder". We have a production incident. The payment service is returning 500 errors to 30% of users.
Expected result:
## Immediate Assessment
**Severity: P1** - Major functionality degraded, significant user impact
### 1. Establish Incident Command
- **Incident Commander**: [Assign now]
- **Technical Lead**: Start investigation
- **Communication Lead**: Prepare stakeholder update
### 2. Quick Impact Assessment
- User impact: ~30% of payment requests failing
- Business impact: Revenue impact likely, SLA violation possible
- Scope: Payment service only, check downstream effects
### 3. Immediate Stabilization Options
- Check for recent deployments (last 30 min)
- Enable feature flag to disable payments if needed
- Scale payment service pods if resource-constrained
- Verify database connection pool health
### 4. First Communication
- Update status page within 15 minutes
- Internal: Create incident channel, ping on-call
- Set expectation: Hourly updates for P1
Using "incident-responder". Help me write a post-mortem for a database outage that caused 2 hours of downtime.
Expected result:
## Post-Mortem Template
### Incident Summary
- **Duration**: [Start] to [End] - 2 hours
- **Impact**: Complete payment service outage
- **Users Affected**: 100% of payment users
- **SEV Level**: P0
### Timeline (Local Time)
- 14:00 - Alerts fire for elevated error rates
- 14:05 - On-call acknowledges, starts investigation
- 14:15 - Root cause identified as DB connection pool exhaustion
- 14:30 - Decision to restart service
- 15:45 - Service fully recovered
- 16:00 - Incident closed
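The timestamps in this sample timeline can be turned into response metrics with simple arithmetic. The following is a minimal sketch, assuming all events fall on the same day; the `timeline` keys and the helper names `minutes_between` and `timeline_metrics` are hypothetical and are not part of the skill.

```python
from datetime import datetime

# Timestamps copied from the sample timeline; the key names are illustrative.
timeline = {
    "alert": "14:00",
    "acknowledged": "14:05",
    "root_cause_identified": "14:15",
    "recovered": "15:45",
    "closed": "16:00",
}


def minutes_between(start, end):
    """Return whole minutes between two HH:MM strings on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def timeline_metrics(t):
    """Derive per-incident response intervals, all measured from the first alert."""
    return {
        "time_to_acknowledge_min": minutes_between(t["alert"], t["acknowledged"]),
        "time_to_identify_min": minutes_between(t["alert"], t["root_cause_identified"]),
        "time_to_recover_min": minutes_between(t["alert"], t["recovered"]),
        "total_duration_min": minutes_between(t["alert"], t["closed"]),
    }


if __name__ == "__main__":
    for name, value in timeline_metrics(timeline).items():
        print(f"{name}: {value}")
```

For this timeline the sketch gives 5 minutes to acknowledge, 15 to identify the root cause, 105 to recover, and 120 minutes in total, which matches the stated 2-hour duration.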
### Root Cause Analysis (Five Whys)
1. Why did payment service fail? Database connection pool exhausted
2. Why was pool exhausted? Application leaked connections during retry storm
3. Why did retry storm occur? Missing circuit breaker on downstream API
4. Why was circuit breaker missing? Feature not implemented in redesign
5. Why was it left out of the redesign? Gap in reliability requirements during migration
### Action Items
- [ ] Implement circuit breaker pattern (Owner: Team A, Due: 2 weeks)
- [ ] Add connection pool monitoring alerts (Owner: Team B, Due: 1 week)
- [ ] Update migration checklist to include reliability requirements (Owner: Team C, Due: 1 week)
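The first action item names the circuit breaker pattern. As an illustration only, here is a minimal sketch of that pattern in Python; the class name, the default threshold and timeout, and the injectable `clock` parameter are assumptions made for this example, not something the skill or the sample post-mortem prescribes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Fail fast after repeated failures instead of retrying a broken dependency."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to the downstream API as `breaker.call(fetch, request)` (where `fetch` stands for whatever client function is in use) would reject calls immediately while the circuit is open, which limits the kind of retry storm described in the five whys. This sketch is not thread-safe; a production service would more likely rely on a maintained library.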
Security audit
Safe
Prompt-only skill containing educational content about SRE incident management practices. Static analysis scanned 0 files (0 lines) and detected 0 security issues. The skill provides guidance on incident response procedures, observability practices, and post-incident analysis. No executable code, no network calls, no file operations, and no prompt injection attempts detected. This is a safe, informational skill for incident response education.
Quality rating
What you can build
Active Production Incident Response
Use during live incidents to follow structured response protocols, assess severity, establish incident command, and coordinate communication with stakeholders.
Post-Incident Analysis and Learning
Facilitate blameless post-mortems by guiding timeline creation, root cause analysis using the five whys technique, and identification of actionable improvements.
SRE Practice and Training
Learn incident management best practices, modern observability techniques, and reliability patterns for building more resilient systems.
Try these prompts
We have a production incident. The service [service name] is experiencing [symptoms]. Help me assess the severity, establish incident command, and identify immediate stabilization steps.
We have a [P1/P2] incident affecting [service]. Initial investigation shows [observed symptoms]. Guide me through an observability-driven investigation to identify root cause.
We are in the middle of a [P0/P1] incident. I need to draft updates for [executives/customers/support team]. What should I communicate and how often?
Help me conduct a blameless post-mortem for an incident where [brief description]. Guide me through creating a timeline, analyzing the root cause, and identifying action items.
Best practices
- Establish incident command structure immediately - unclear ownership delays resolution
- Communicate proactively and frequently - stakeholders prefer updates over silence
- Focus on service restoration first, root cause analysis second during active incidents
- Document everything in real-time - timelines and decisions are harder to reconstruct later
Avoid
- Blaming individuals in post-mortems - focus on systems and processes instead
- Skipping incident command in favor of an 'everyone responds' approach - causes coordination chaos
- Delaying communication until information is complete - stakeholders need timely updates
- Implementing complex fixes during active incidents - prefer minimal viable fixes
Frequently asked questions
How quickly should I respond to a P0 incident?
What is the difference between incident commander and technical lead?
How often should I send incident updates?
When should I declare an incident resolved?
How do I conduct a blameless post-mortem?
Can this skill execute actual remediation commands?
Developer details
Author
sickn33
License
MIT
Repository
https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/incident-responder
Ref
main
File structure
📄 SKILL.md