🤖

aiops

Low risk ⚙️ External commands 🌐 Network access

Implement intelligent AIOps strategies to optimize infrastructure

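The AIOps skill helps you build automated incident response, anomaly detection, and self-healing infrastructure. These patterns work with any monitoring platform to shorten MTTR and improve reliability.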

Supports: Claude Codex Code(CC)
⚠️ 65 Poor
1. Download the skill ZIP
2. Upload to Claude: go to Settings → Capabilities → Skills → Upload skill
3. Turn it on and get started

Try it

Using "aiops". Help me analyze a high severity alert about high memory usage on our primary database service

Expected result:

  • Alert Classification: resource_exhaustion - Memory pressure detected
  • Recommended Actions: 1) Scale replicas by 20%, 2) Notify devops channel, 3) Check for memory leaks in recent deployments
  • Risk Assessment: Impact is high, automatic remediation is safe
  • Follow-up: Schedule capacity review if trend continues

Using "aiops". Create an automated remediation policy for a service that keeps failing health checks

Expected result:

  • Policy Name: health_check_remediation
  • Trigger Condition: 3 consecutive health check failures within 2 minutes
  • Action Sequence: 1) Restart deployment, 2) Verify pods are running, 3) Send notification if restart fails
  • Safety Gates: Check cluster CPU < 80%, verify no ongoing deployments
  • Escalation: Alert on-call if remediation fails after 2 attempts
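The policy above could be expressed in code roughly as follows. This is a minimal sketch, not the skill's own implementation: the `HealthCheckPolicy` class, its field names, and the `restart`/`verify`/`escalate` callables are hypothetical names introduced for illustration. Only the thresholds (3 consecutive failures within 2 minutes, CPU below 80%, escalation after 2 attempts) come from the expected result.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HealthCheckPolicy:
    """Hypothetical policy object mirroring the expected result above."""
    failure_threshold: int = 3       # consecutive failures that trigger remediation
    window_seconds: float = 120.0    # the failures must fall within this window
    max_attempts: int = 2            # remediation attempts before escalating
    cpu_limit_percent: float = 80.0  # safety gate: cluster CPU must be below this
    _failures: List[float] = field(default_factory=list, init=False, repr=False)

    def record(self, healthy: bool, now: float) -> bool:
        """Record one health check; return True when the trigger condition is met."""
        if healthy:
            self._failures.clear()  # a success breaks the consecutive run
            return False
        self._failures.append(now)
        recent = self._failures[-self.failure_threshold:]
        return (len(recent) == self.failure_threshold
                and recent[-1] - recent[0] <= self.window_seconds)

    def gates_pass(self, cluster_cpu_percent: float, deployment_in_progress: bool) -> bool:
        """Safety gates: enough CPU headroom and no rollout in flight."""
        return cluster_cpu_percent < self.cpu_limit_percent and not deployment_in_progress

    def remediate(self,
                  restart: Callable[[], None],
                  verify: Callable[[], bool],
                  escalate: Callable[[], None]) -> str:
        """Restart and verify up to max_attempts times, then escalate to on-call."""
        for _ in range(self.max_attempts):
            restart()
            if verify():
                return "remediated"
        escalate()
        return "escalated"
```

In practice `restart` might wrap a deployment restart command and `verify` a pod status check; both are left as injected callables here so the policy logic can be exercised without a cluster.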

Using "aiops". How do I set up anomaly detection for my API response times

Expected result:

  • Data Source: Configure PrometheusDataSource to query http_request_duration_seconds histogram
  • Model Type: Isolation Forest for multi-dimensional anomaly detection
  • Training Window: Use last 7 days of historical data
  • Alert Threshold: Flag anomalies with z-score > 3.0
  • Actions: Trigger investigation workflow when anomaly detected
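To make the alert threshold concrete, here is a small sketch of a z-score check over a single latency series. It is a plain statistical baseline, not an Isolation Forest, which would typically come from a library such as scikit-learn. The function name `zscore_anomalies` and the sample values are assumptions for illustration; only the 3.0 threshold comes from the expected result.

```python
import statistics
from typing import List, Sequence


def zscore_anomalies(history: Sequence[float],
                     recent: Sequence[float],
                     threshold: float = 3.0) -> List[int]:
    """Return indices in `recent` whose z-score against `history` exceeds the threshold.

    `history` stands in for the training window (for example, the last 7 days
    of response times) and needs at least two points.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat baseline: any deviation at all is treated as anomalous.
        return [i for i, value in enumerate(recent) if value != mean]
    return [i for i, value in enumerate(recent)
            if abs(value - mean) / stdev > threshold]


# Hypothetical response times in seconds.
baseline = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.10, 0.12]
latest = [0.11, 0.50, 0.10]
flagged = zscore_anomalies(baseline, latest)  # only the 0.50 s sample stands out
```

A z-score works on one metric at a time; the multi-dimensional case named above is where a model like Isolation Forest would be the better fit.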

Security audit

Low risk
v5 • 1/16/2026

This is a documentation-only skill containing code patterns for AIOps implementation. Static findings are false positives: C2 keywords are standard DevOps terminology, weak crypto flags are incorrect, subprocess patterns are legitimate kubectl automation, and API key references are placeholder parameter names. Prior AI audit confirmed low risk with no file system access beyond normal execution.

Files scanned: 2
Lines analyzed: 2,077
Findings: 2
Total audits: 5

Risk factors

⚙️ External commands (1)
🌐 Network access (1)
Audited by: claude

Quality score

Architecture: 38
Maintainability: 100
Content: 87
Community: 21
Security: 90
Specification compliance: 70

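What you can build

Automate incident response

Build automated runbooks that detect, classify, and remediate issues without manual intervention

Deploy self-healing systems

Create policies that automatically restart services, scale resources, or isolate failures

Implement unified observability

Build dashboards and alerts across metrics, logs, and traces from multiple sources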

Try these prompts

Alert triage
Help me analyze this alert. The severity is [critical|high|medium|low], the service is [name], and the message is: [alert message]. What should our response plan include?
Capacity planning
Our [CPU|memory|storage] usage has been [describe trend] over the past [time period]. Using the AIOps patterns, help me create a capacity prediction model and scaling recommendations.
Build automation
I need to create an automated remediation action for [specific failure type]. Following AIOps best practices, what should the action sequence, safety checks, and rollback plan include?
Root cause analysis
We experienced an incident affecting [service name]. Help me correlate metrics, logs, and traces to identify the root cause using the observability patterns from the AIOps skill.

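Best practices

  • Always implement a dry-run mode for automated actions before enabling automatic execution in production
  • Use circuit breakers and timeout limits for all automated remediation actions
  • Maintain a human escalation path, even for low-severity automated responses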
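The first two practices can be combined in a thin wrapper around each remediation action. The sketch below is an illustration under stated assumptions: `CircuitBreaker` and `run_action` are hypothetical names, not part of the skill, and the default of 3 failures and a 60-second reset is arbitrary. Timeouts are not shown.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, resets after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if an action may run right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Track the outcome of an action and open the breaker if needed."""
        if success:
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def run_action(action: Callable[[], bool], breaker: CircuitBreaker,
               dry_run: bool = True) -> str:
    """Run `action` unless the breaker is open; in dry-run mode only report intent."""
    if not breaker.allow():
        return "blocked"
    if dry_run:
        return "dry-run"  # the action is never called
    try:
        ok = action()
    except Exception:
        ok = False
    breaker.record(ok)
    return "ok" if ok else "failed"
```

Defaulting `dry_run` to `True` means a caller has to opt in to real execution, which matches the advice to start in dry-run mode. A "blocked" result is a natural point to hand off to the human escalation path.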

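Avoid

  • Never run automation without condition checks and pre-approved safeguards
  • Avoid hardcoded credentials; use secrets management and environment variables
  • Do not enable auto-remediation for critical services without thorough testing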
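As a small illustration of the credentials point, the sketch below reads an API key from the environment instead of embedding it in code. The variable name `MONITORING_API_KEY` and the helper `load_api_key` are assumptions made for this example, not names defined by the skill.

```python
import os


def load_api_key(var_name: str = "MONITORING_API_KEY") -> str:
    """Read a credential from the environment and fail loudly if it is missing."""
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(f"Set the {var_name} environment variable before running")
    return value
```

Failing at startup when the variable is unset tends to be easier to diagnose than an authentication error deep inside a remediation run.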

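Frequently asked questions

Which monitoring platforms are supported?
Integrations for Prometheus, Loki, Jaeger, and Datadog are included. Other data sources can be added through the abstract DataSource class.
What are the recommended retry and timeout settings?
The default is 3 retries with exponential backoff starting at 2 seconds. The default timeout is 300 seconds per operation.
How do I integrate with my existing tools?
Implement the abstract DataSource class for your platform. Use AutomationEngine handlers to connect to your runbook system.
Is my data safe when using these patterns?
Yes. The patterns run in your environment. No data is sent to external services other than the monitoring endpoints you configure.
What if automation causes unexpected changes?
All automated actions include safety mechanisms: preconditions, timeouts, rollback handlers, and execution history logging.
How does this compare with existing AIOps tools?
These are implementation patterns you can adapt. Unlike vendor tools, this gives you full control over ML models, automation logic, and integrations.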
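The retry defaults mentioned in the FAQ (3 retries, exponential backoff starting at 2 seconds, a 300-second timeout) could look roughly like this. It is a sketch under assumptions: `retry_with_backoff` is a hypothetical helper, "3 retries" is read as three retries after the first attempt, and the timeout is treated as a budget for the whole retry loop that does not interrupt a call already in progress.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       retries: int = 3,
                       base_delay: float = 2.0,
                       timeout: float = 300.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> T:
    """Call `operation`, retrying on any exception with doubling delays."""
    deadline = clock() + timeout
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            # Give up when retries are exhausted or the next wait would overrun the budget.
            if attempt == retries or clock() + delay > deadline:
                raise
            sleep(delay)
            delay *= 2  # 2 s, 4 s, 8 s, ...
    raise RuntimeError("unreachable when retries >= 0")
```

The `sleep` and `clock` parameters are injectable so the backoff schedule can be checked in tests without actually waiting.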

Developer details

File structure

📄 SKILL.md