
🚨

incident-responder

Name: incident-responder
Author: sickn33

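Safe

Production Incident Response

Provides expert guidance on production incident management, using structured response protocols, observability-driven investigation techniques, and blameless post-mortem processes to minimize MTTR and improve system reliability.

Supports: Claude, Codex, Code (CC)

📊 70 Adequate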

Download the skill ZIP

Upload it in Claude

Go to Settings → Capabilities → Skills → Upload skill

Turn on the toggle and start using it

Try it out

Using "incident-responder". We have a production incident. The payment service is returning 500 errors to 30% of users.

Expected result:

## Immediate Assessment

**Severity: P1** - Major functionality degraded, significant user impact

### 1. Establish Incident Command
- **Incident Commander**: [Assign now]
- **Technical Lead**: Start investigation
- **Communication Lead**: Prepare stakeholder update

### 2. Quick Impact Assessment
- User impact: ~30% of payment requests failing
- Business impact: Revenue impact likely, SLA violation possible
- Scope: Payment service only, check downstream effects

### 3. Immediate Stabilization Options
- Check for recent deployments (last 30 min)
- Enable feature flag to disable payments if needed
- Scale payment service pods if resource-constrained
- Verify database connection pool health

### 4. First Communication
- Update status page within 15 minutes
- Internal: Create incident channel, ping on-call
- Set expectation: Hourly updates for P1

Using "incident-responder". Help me write a post-mortem for a database outage that caused 2 hours of downtime.

Expected result:

## Post-Mortem Template

### Incident Summary
- **Duration**: [Start] to [End] - 2 hours
- **Impact**: Complete payment service outage
- **Users Affected**: 100% of payment users
- **SEV Level**: P0

### Timeline (Local Time)
- 14:00 - Alerts fire for elevated error rates
- 14:05 - On-call acknowledges, starts investigation
- 14:15 - Root cause identified as DB connection pool exhaustion
- 14:30 - Decision to restart service
- 15:45 - Service fully recovered
- 16:00 - Incident closed

### Root Cause Analysis (Five Whys)
1. Why did payment service fail? Database connection pool exhausted
2. Why was pool exhausted? Application leaked connections during retry storm
3. Why did retry storm occur? Missing circuit breaker on downstream API
4. Why was circuit breaker missing? Feature not implemented in redesign
5. Why was the feature left out of the redesign? Gap in reliability requirements during migration

### Action Items
- [ ] Implement circuit breaker pattern (Owner: Team A, Due: 2 weeks)
- [ ] Add connection pool monitoring alerts (Owner: Team B, Due: 1 week)
- [ ] Update migration checklist to include reliability requirements (Owner: Team C, Due: 1 week)
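
The timeline in the sample post-mortem above is the raw material for response metrics such as time to acknowledge and time to recover, which feed into MTTR. The following is a minimal sketch of that arithmetic using the sample entries. The function names and the assumption that entries are ordered as alert, acknowledgment, intermediate steps, recovery, closure (all on one day) are illustrative, not part of the skill.

```python
from datetime import datetime

# Timeline entries copied from the sample post-mortem (24-hour local time).
TIMELINE = [
    ("14:00", "Alerts fire for elevated error rates"),
    ("14:05", "On-call acknowledges, starts investigation"),
    ("14:15", "Root cause identified as DB connection pool exhaustion"),
    ("14:30", "Decision to restart service"),
    ("15:45", "Service fully recovered"),
    ("16:00", "Incident closed"),
]


def minutes_between(start, end):
    """Minutes from one HH:MM stamp to a later one on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def timeline_metrics(timeline):
    """Derive simple durations from an ordered timeline.

    Assumed order (hypothetical convention): first entry is the alert,
    second is the acknowledgment, second-to-last is recovery, last is closure.
    """
    times = [stamp for stamp, _ in timeline]
    alert, ack, recovered, closed = times[0], times[1], times[-2], times[-1]
    return {
        "time_to_acknowledge_min": minutes_between(alert, ack),
        "time_to_recover_min": minutes_between(alert, recovered),
        "incident_duration_min": minutes_between(alert, closed),
    }


print(timeline_metrics(TIMELINE))
# {'time_to_acknowledge_min': 5, 'time_to_recover_min': 105, 'incident_duration_min': 120}
```

The 120-minute total matches the 2 hours stated in the incident summary, which is a quick consistency check worth running on any real timeline.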

Security Audit

Safe

v1 • 2/25/2026

Prompt-only skill containing educational content about SRE incident management practices. Static analysis scanned 0 files (0 lines) and detected 0 security issues. The skill provides guidance on incident response procedures, observability practices, and post-incident analysis. No executable code, no network calls, no file operations, and no prompt injection attempts detected. This is a safe, informational skill for incident response education.


No security issues found

Audited by: claude

Quality Score

Architecture

100

Maintainability

Content

Community

100

Security

Spec Compliance

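What you can build

Active production incident response

Use it during a live incident to follow structured response protocols, assess severity, establish incident command, and coordinate stakeholder communication.

Post-incident analysis and learning

Supports blameless post-mortems by guiding timeline construction, root cause analysis with the Five Whys technique, and identification of actionable improvements.

SRE practice and education

Learn incident management best practices, modern observability techniques, and reliability patterns for building more resilient systems.

Try these prompts

Initial incident assessment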

We have a production incident. The service [service name] is experiencing [symptoms]. Help me assess the severity, establish incident command, and identify immediate stabilization steps.

Investigation and triage

We have a [P1/P2] incident affecting [service]. Initial investigation shows [observed symptoms]. Guide me through an observability-driven investigation to identify root cause.

Stakeholder communication

We are in the middle of a [P0/P1] incident. I need to draft updates for [executives/customers/support team]. What should I communicate and how often?

Post-mortem facilitation

Help me conduct a blameless post-mortem for an incident where [brief description]. Guide me through creating timeline, root cause analysis, and identifying action items.

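Best practices

Establish incident command immediately - unclear ownership delays resolution
Communicate proactively and often - stakeholders prefer updates to silence
During an active incident, restore service first; root cause analysis comes second
Document everything in real time - timelines and decisions are hard to reconstruct later

Avoid

Blaming individuals in post-mortems - focus on systems and processes instead
Skipping incident command because "everyone is responding" - it causes coordination chaos
Delaying communication until you have complete information - stakeholders need prompt updates
Implementing complex fixes during an active incident - apply the minimum viable fix first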

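Frequently asked questions

How quickly should we respond to a P0 incident?

P0 (critical) incidents require acknowledgment within 15 minutes and resolution within 1 hour. Immediate escalation and establishing incident command are essential.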
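
The two P0 targets in the answer above (acknowledge within 15 minutes, resolve within 1 hour) can be turned into concrete deadlines as soon as an incident is detected. The following is a minimal sketch. The function names and the idea of passing `now` explicitly are assumptions made for illustration, and the targets should be replaced with whatever your own SLAs specify.

```python
from datetime import datetime, timedelta

# Targets taken from the FAQ answer; adapt them to your own SLAs.
P0_ACK_TARGET = timedelta(minutes=15)
P0_RESOLVE_TARGET = timedelta(hours=1)


def p0_deadlines(detected_at):
    """Compute the acknowledgment and resolution deadlines for a P0."""
    return {
        "acknowledge_by": detected_at + P0_ACK_TARGET,
        "resolve_by": detected_at + P0_RESOLVE_TARGET,
    }


def p0_breaches(detected_at, now, acknowledged_at=None, resolved_at=None):
    """List which P0 targets have been missed as of `now`."""
    deadlines = p0_deadlines(detected_at)
    breaches = []
    # A target is missed if the event happened late, or if it has not
    # happened yet and the deadline has already passed.
    if (acknowledged_at or now) > deadlines["acknowledge_by"]:
        breaches.append("acknowledge")
    if (resolved_at or now) > deadlines["resolve_by"]:
        breaches.append("resolve")
    return breaches


detected = datetime(2026, 1, 1, 14, 0)
print(p0_deadlines(detected)["resolve_by"].isoformat())  # 2026-01-01T15:00:00
```

Passing `now` as an argument rather than reading the clock inside the function keeps the check easy to test and to replay against a recorded timeline.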

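What is the difference between the Incident Commander and the Technical Lead?

The Incident Commander makes decisions, coordinates the response, and manages communication. The Technical Lead investigates the technical root cause and implements fixes. Separating the roles prevents cognitive overload.

How often should incident updates be sent?

For active incidents: every 15 minutes for P0/P1, hourly for P2. Each update should include the current status, actions taken, next steps, and an ETA if known.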
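
The cadence and the four update fields described above are simple enough to encode, which helps when a reminder bot or checklist drives communication. The following is a minimal sketch. The names `next_update_due` and `format_update` are hypothetical, and severities other than P0, P1, and P2 raise an error because the answer above does not define a cadence for them.

```python
from datetime import datetime, timedelta

# Update cadence from the FAQ answer, in minutes.
UPDATE_INTERVAL_MIN = {"P0": 15, "P1": 15, "P2": 60}


def next_update_due(severity, last_update_at):
    """Return when the next stakeholder update is due."""
    if severity not in UPDATE_INTERVAL_MIN:
        raise ValueError(f"no update cadence defined for {severity!r}")
    return last_update_at + timedelta(minutes=UPDATE_INTERVAL_MIN[severity])


def format_update(status, actions_taken, next_steps, eta=None):
    """Render the fields an update should carry; ETA is added only when known."""
    lines = [
        f"Current status: {status}",
        "Actions taken: " + "; ".join(actions_taken),
        "Next steps: " + "; ".join(next_steps),
    ]
    if eta is not None:
        lines.append(f"ETA: {eta}")
    return "\n".join(lines)


print(next_update_due("P1", datetime(2026, 1, 1, 14, 0)).isoformat())
# 2026-01-01T14:15:00
print(format_update(
    status="Payment errors ongoing",
    actions_taken=["Rolled back latest deploy"],
    next_steps=["Verify connection pool health"],
))
```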

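When should an incident be declared resolved?

Declare resolution once all SLIs have returned to normal thresholds, the user experience has been verified, and capacity headroom has been confirmed. Continue enhanced monitoring for 24 hours after resolution.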
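
The three resolution criteria above work well as an explicit gate, so nobody closes the incident while one condition is still open. The following is a minimal sketch. The `Sli` record and its "healthy when at or below threshold" convention are assumptions made for illustration, and real SLIs will come from your monitoring system in a different shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sli:
    # Hypothetical shape: `value` is healthy when it is at or below
    # `threshold` (e.g. error rate, p99 latency). Availability-style SLIs,
    # where higher is better, would need the comparison inverted.
    name: str
    value: float
    threshold: float


def resolution_blockers(slis, user_experience_verified, capacity_headroom_confirmed):
    """Return reasons the incident cannot be declared resolved (empty means ready)."""
    blockers = [f"SLI out of range: {s.name}" for s in slis if s.value > s.threshold]
    if not user_experience_verified:
        blockers.append("user experience not verified")
    if not capacity_headroom_confirmed:
        blockers.append("capacity headroom not confirmed")
    return blockers


def enhanced_monitoring_until(resolved_at):
    """Enhanced monitoring continues for 24 hours after resolution."""
    return resolved_at + timedelta(hours=24)


slis = [Sli("error_rate", 0.002, 0.01), Sli("p99_latency_ms", 850.0, 500.0)]
print(resolution_blockers(slis, user_experience_verified=True, capacity_headroom_confirmed=False))
# ['SLI out of range: p99_latency_ms', 'capacity headroom not confirmed']
```

Returning the list of blockers instead of a bare boolean gives the Incident Commander something concrete to read out before declaring resolution.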

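How do you run a blameless post-mortem?

Focus on what happened and why, not on who made a mistake. Use techniques such as the Five Whys or fishbone diagrams. Identify systemic factors rather than human error, and share the lessons learned openly.

Can this skill execute actual recovery commands?

No. This skill only provides guidance and recommendations. It cannot access systems, run commands, or make decisions. Always validate the guidance against your specific environment.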

Developer details

Author

sickn33

License

MIT

Repository

https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/incident-responder

Ref

main

File structure

📄 SKILL.md