
🚨

incident-responder

Name: incident-responder
Author: sickn33

Safe

Respond to Production Incidents

Get expert guidance on managing production incidents with structured response protocols, observability-driven investigation techniques, and blameless post-mortem processes to minimize MTTR and improve system reliability.

Supports: Claude, Codex, Code (CC)

🥉 74 Bronze

Download the skill ZIP

Upload to Claude

Go to Settings → Capabilities → Skills → Upload skill

Turn it on and get started

Try it

Using "incident-responder". We have a production incident. The payment service is returning 500 errors to 30% of users.

Expected output:

## Immediate Assessment

**Severity: P1** - Major functionality degraded, significant user impact

### 1. Establish Incident Command
- **Incident Commander**: [Assign now]
- **Technical Lead**: Start investigation
- **Communication Lead**: Prepare stakeholder update

### 2. Quick Impact Assessment
- User impact: ~30% of payment requests failing
- Business impact: Revenue impact likely, SLA violation possible
- Scope: Payment service only, check downstream effects

### 3. Immediate Stabilization Options
- Check for recent deployments (last 30 min)
- Enable feature flag to disable payments if needed
- Scale payment service pods if resource-constrained
- Verify database connection pool health

### 4. First Communication
- Update status page within 15 minutes
- Internal: Create incident channel, ping on-call
- Set expectation: Updates every 15 minutes for P1

Using "incident-responder". Help me write a post-mortem for a database outage that caused 2 hours of downtime.

Expected output:

## Post-Mortem Template

### Incident Summary
- **Duration**: [Start] to [End] - 2 hours
- **Impact**: Complete payment service outage
- **Users Affected**: 100% of payment users
- **SEV Level**: P0

### Timeline (Local Time)
- 14:00 - Alerts fire for elevated error rates
- 14:05 - On-call acknowledges, starts investigation
- 14:15 - Root cause identified as DB connection pool exhaustion
- 14:30 - Decision to restart service
- 15:45 - Service fully recovered
- 16:00 - Incident closed

### Root Cause Analysis (Five Whys)
1. Why did payment service fail? Database connection pool exhausted
2. Why was pool exhausted? Application leaked connections during retry storm
3. Why did retry storm occur? Missing circuit breaker on downstream API
4. Why was circuit breaker missing? Feature not implemented in redesign
5. Why? Gap in reliability requirements during migration

### Action Items
- [ ] Implement circuit breaker pattern (Owner: Team A, Due: 2 weeks)
- [ ] Add connection pool monitoring alerts (Owner: Team B, Due: 1 week)
- [ ] Update migration checklist to include reliability requirements (Owner: Team C, Due: 1 week)
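
The first action item in the sample post-mortem above calls for a circuit breaker. As a rough illustration of that pattern (not part of the skill itself), here is a minimal Python sketch. The class names `CircuitBreaker` and `CircuitOpenError`, the `failure_threshold` and `reset_timeout` parameters, and their default values are all hypothetical choices made for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Minimal circuit breaker; the default thresholds are illustrative assumptions."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling retries onto a struggling dependency.
                raise CircuitOpenError("downstream call rejected: circuit is open")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures while closed, opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the downstream API call in `breaker.call(...)` means that once the threshold is reached, further calls fail immediately until `reset_timeout` has passed, which is what stops a retry storm from holding database connections open.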

Security Audit

Safe

v1 • 2/25/2026

Prompt-only skill containing educational content about SRE incident management practices. Static analysis scanned 0 files (0 lines) and detected 0 security issues. The skill provides guidance on incident response procedures, observability practices, and post-incident analysis. No executable code, no network calls, no file operations, and no prompt injection attempts detected. This is a safe, informational skill for incident response education.

Files scanned

Lines analyzed

Findings

Total audits

No security issues found

Audited by: claude

Quality Score

Architecture

100

Maintainability

Content

Community

100

Security

Spec Compliance

What You Can Build

Active Production Incident Response

Use during live incidents to follow structured response protocols, assess severity, establish incident command, and coordinate communication with stakeholders.

Post-Incident Analysis and Learning

Facilitate blameless post-mortems by guiding timeline creation, root cause analysis using the five whys technique, and identification of actionable improvements.

SRE Practice and Training

Learn incident management best practices, modern observability techniques, and reliability patterns for building more resilient systems.

Try These Prompts

Initial Incident Assessment

We have a production incident. The service [service name] is experiencing [symptoms]. Help me assess the severity, establish incident command, and identify immediate stabilization steps.

Investigation and Triage

We have a [P1/P2] incident affecting [service]. Initial investigation shows [observed symptoms]. Guide me through an observability-driven investigation to identify root cause.

Stakeholder Communication

We are in the middle of a [P0/P1] incident. I need to draft updates for [executives/customers/support team]. What should I communicate and how often?

Post-Mortem Facilitation

Help me conduct a blameless post-mortem for an incident where [brief description]. Guide me through creating a timeline, performing root cause analysis, and identifying action items.

Best Practices

Establish incident command structure immediately - unclear ownership delays resolution
Communicate proactively and frequently - stakeholders prefer updates over silence
Focus on service restoration first, root cause analysis second during active incidents
Document everything in real-time - timelines and decisions are harder to reconstruct later

Avoid

Blaming individuals in post-mortems - focus on systems and processes instead
Skipping incident command in favor of an 'everyone responds' approach - causes coordination chaos
Delaying communication until you have complete information - stakeholders need timely updates
Implementing complex fixes during active incidents - prefer minimal viable fixes

Frequently Asked Questions

How quickly should I respond to a P0 incident?

P0 (critical) incidents require acknowledgment within 15 minutes and resolution within 1 hour. Immediate escalation and incident command establishment are critical.
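
As a small sketch of how the two P0 targets in this answer could be checked mechanically, the Python below compares timestamps against the 15-minute and 1-hour limits. The function name `p0_breaches` and its parameters are hypothetical. Only the two durations come from the answer above.

```python
from datetime import datetime, timedelta

# Targets taken from the answer above; other severities are deliberately omitted.
P0_ACK_TARGET = timedelta(minutes=15)
P0_RESOLUTION_TARGET = timedelta(hours=1)


def p0_breaches(declared_at, now, acknowledged_at=None, resolved_at=None):
    """Return the P0 targets that have been missed as of `now`."""
    breaches = []
    # An incident that is still unacknowledged is measured against the current time.
    ack_time = now if acknowledged_at is None else acknowledged_at
    if ack_time - declared_at > P0_ACK_TARGET:
        breaches.append("acknowledgment")
    # Likewise, an unresolved incident is measured against the current time.
    resolve_time = now if resolved_at is None else resolved_at
    if resolve_time - declared_at > P0_RESOLUTION_TARGET:
        breaches.append("resolution")
    return breaches
```

For an incident declared at 14:00 that is still unacknowledged at 14:20, this returns `["acknowledgment"]`. If it was acknowledged at 14:05 but remains unresolved at 15:30, it returns `["resolution"]`.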

What is the difference between incident commander and technical lead?

The Incident Commander makes decisions, coordinates the response, and manages communication. The Technical Lead investigates the technical root cause and implements fixes. Keeping the roles separate prevents cognitive overload.

How often should I send incident updates?

For active incidents: every 15 minutes for P0/P1, hourly for P2. Updates should include current status, actions taken, next steps, and ETA if known.
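
The cadence and the required fields in this answer can be captured in a small data structure. The following Python sketch assumes a hypothetical `IncidentUpdate` class and `UPDATE_INTERVAL` table. The intervals and the four fields come from the answer above, while the rendered message layout is only one possible format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Cadence from the answer above: every 15 minutes for P0/P1, hourly for P2.
UPDATE_INTERVAL = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}


@dataclass
class IncidentUpdate:
    severity: str
    sent_at: datetime
    current_status: str
    actions_taken: str
    next_steps: str
    eta: Optional[str] = None  # ETA is optional: include it only if known

    def next_update_due(self) -> datetime:
        """Time by which the following update should be sent."""
        return self.sent_at + UPDATE_INTERVAL[self.severity]

    def render(self) -> str:
        """Format the update as plain text (layout is an illustrative choice)."""
        lines = [
            f"[{self.severity}] Update at {self.sent_at:%H:%M}",
            f"Status: {self.current_status}",
            f"Actions taken: {self.actions_taken}",
            f"Next steps: {self.next_steps}",
            f"ETA: {self.eta or 'unknown'}",
            f"Next update by {self.next_update_due():%H:%M}",
        ]
        return "\n".join(lines)
```

Stating the next update time in every message lets stakeholders know when to expect news, which fits the advice elsewhere on this page to communicate proactively rather than wait for complete information.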

When should I declare an incident resolved?

Declare resolution when all SLIs return to normal thresholds, user experience is validated, and capacity headroom is confirmed. Continue enhanced monitoring for 24 hours post-resolution.
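
The three resolution criteria and the 24-hour monitoring window can be expressed as a simple gate. This Python sketch uses hypothetical names (`Sli`, `can_declare_resolved`, `enhanced_monitoring_until`) and assumes each SLI is a single value compared against a single threshold, which is a simplification of real SLI evaluation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the answer above: keep enhanced monitoring for 24 hours after resolution.
ENHANCED_MONITORING_WINDOW = timedelta(hours=24)


@dataclass
class Sli:
    name: str
    value: float
    threshold: float
    higher_is_better: bool = True  # e.g. availability; set False for latency or error rate

    def healthy(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def can_declare_resolved(slis, user_experience_validated, capacity_headroom_confirmed):
    """All three criteria must hold; an empty SLI list never counts as healthy."""
    return (
        len(slis) > 0
        and all(sli.healthy() for sli in slis)
        and user_experience_validated
        and capacity_headroom_confirmed
    )


def enhanced_monitoring_until(resolved_at: datetime) -> datetime:
    """End of the post-resolution enhanced monitoring period."""
    return resolved_at + ENHANCED_MONITORING_WINDOW
```

Treating an empty SLI list as "not resolved" is a deliberate design choice in this sketch: without measurements there is no evidence that the service has recovered.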

How do I conduct a blameless post-mortem?

Focus on what happened and why, not who made mistakes. Use techniques like five whys or fishbone diagrams. Identify systemic factors, not human errors. Share learnings openly.

Can this skill execute actual remediation commands?

No. This skill provides guidance and recommendations only. It cannot access your systems, execute commands, or make decisions. Always verify guidance against your specific environment.

Developer Details

Author

sickn33

License

MIT

Repository

https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/incident-responder

Ref

main

File Structure

📄 SKILL.md