Skill: data-engineering-data-pipeline


Name: data-engineering-data-pipeline
Author: sickn33

Low risk

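Build scalable data pipelines

Designing production-ready data pipelines is complex and error-prone. This skill provides proven architecture patterns and implementation guidance for ETL, streaming, and lakehouse systems.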

Supports: Claude Codex Code(CC)

⚠️ 67 Poor

Download the skill ZIP

Upload it in Claude

Go to Settings → Capabilities → Skills → Upload skill

Turn on the toggle and start using it

Try it out

Using "data-engineering-data-pipeline". Design a batch pipeline for daily customer data sync from MySQL to Snowflake

Expected result:

Architecture: ELT pattern with incremental loading. Components: (1) Extract using watermark column 'updated_at', (2) Load raw data to S3 staging, (3) Transform in Snowflake with dbt, (4) Validate with dbt tests, (5) Alert on failures via Slack. Key considerations: Handle late-arriving data, implement retry logic, monitor row count variance.
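
The watermark-based extract in step (1) can be sketched as follows. This is a minimal illustration, not the skill's own implementation: it uses an in-memory SQLite database as a stand-in for MySQL, and the `customers` table, its columns other than `updated_at`, and the `extract_incremental` helper are hypothetical names chosen for the example.

```python
import sqlite3


def extract_incremental(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark.

    Assumes updated_at holds ISO-8601 strings, which sort lexicographically.
    """
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only when new rows were actually seen.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# Hypothetical source table standing in for the MySQL customers table.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2024-01-01T00:00:00"),
        (2, "Grace", "2024-01-02T00:00:00"),
        (3, "Linus", "2024-01-03T00:00:00"),
    ],
)

changed_rows, watermark = extract_incremental(source, "2024-01-01T00:00:00")
```

The strict `>` comparison can skip rows that commit late with a timestamp equal to the stored watermark, so a real job would likely re-read a small overlap window and deduplicate downstream.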

Using "data-engineering-data-pipeline". How do I handle schema evolution in a streaming pipeline?

Expected result:

Strategy: Use schema registry with compatibility checks. For additive changes, use default values. For breaking changes, implement dual-write during migration. Tools: Confluent Schema Registry for Kafka, Delta Lake schema evolution with mergeSchema option. Always test backward compatibility before deployment.
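
To make the "additive changes need default values" rule concrete, here is a simplified backward-compatibility check. It is a sketch only: the dictionary schema format and the `is_backward_compatible` helper are assumptions made for illustration, and the logic is far narrower than what Confluent Schema Registry or Delta Lake actually enforce.

```python
def is_backward_compatible(old_schema, new_schema):
    """Return a list of problems; an empty list means the change looks safe.

    Simplified rule set (an assumption, not a registry's real algorithm):
    - a field added in new_schema must declare a default, so records
      written with old_schema can still be read;
    - a field present in both schemas must keep the same type.
    """
    problems = []
    for name, spec in new_schema.items():
        if name not in old_schema:
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif old_schema[name]["type"] != spec["type"]:
            problems.append(
                f"field '{name}' changed type from "
                f"{old_schema[name]['type']} to {spec['type']}"
            )
    return problems


v1 = {"id": {"type": "long"}, "email": {"type": "string"}}
v2_additive = {**v1, "country": {"type": "string", "default": "unknown"}}
v2_breaking = {"id": {"type": "string"}, "email": {"type": "string"}}

additive_problems = is_backward_compatible(v1, v2_additive)
breaking_problems = is_backward_compatible(v1, v2_breaking)
```

Running a check like this in CI before deployment is one way to act on the "always test backward compatibility" advice; a breaking result would be the signal to plan a dual-write migration.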

Security audit

Low risk

v1 • 2/24/2026

All static analyzer findings are false positives. The skill is documentation-only, providing architectural guidance and educational code examples. No executable code, external commands, or security risks detected. Safe for publication.

Files scanned

204

Lines analyzed

Findings

Total audits

Low-risk issues (3)

SKILL.md:3 SKILL.md:28 SKILL.md:39 SKILL.md:42 SKILL.md:94 SKILL.md:167

Static Analyzer False Positives - Weak Cryptographic Algorithm

Static analyzer flagged lines 3, 28, 39, 42, 94, and 167 as containing weak cryptographic algorithms. Review confirms these are false positives - the flagged lines contain architectural terms (ETL/ELT, Lambda, Kappa) and documentation headers, not cryptographic code.

SKILL.md:124-159

Static Analyzer False Positive - External Command Execution

Static analyzer flagged line 124 as Ruby/shell backtick execution. Review confirms this is a Python code example showing batch ingestion patterns, not shell command execution.

SKILL.md:49 SKILL.md:116 SKILL.md:184

Static Analyzer False Positives - Reconnaissance Patterns

Static analyzer flagged lines 49, 116, and 184 as system/network reconnaissance. Review confirms these are data pipeline terminology (metadata tracking fields, partitioning strategies, monitoring alerts), not reconnaissance activity.

Audited by: claude

Quality score

Architecture

100

Maintainability

Content

Community

Security

Spec compliance

What you can build

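Greenfield pipeline architecture design

Design a complete data pipeline from scratch for a startup migrating from spreadsheets to a modern data stack.

Streaming migration strategy

Convert existing batch pipelines to a real-time streaming architecture using Kafka and a stream-processing framework.

Data quality framework implementation

Implement comprehensive data quality checks using Great Expectations and dbt tests with automated alerting.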

Try these prompts

Basic pipeline design

I need to build a data pipeline that extracts data from PostgreSQL daily, transforms it, and loads it to a data warehouse. What architecture should I use and what are the key components?

Streaming architecture selection

We have high-volume event data from our application and need near-real-time analytics. Compare Lambda vs Kappa architecture for our use case with 1M events per minute.

Data quality implementation

Show me how to implement data quality checks for our orders table using Great Expectations. We need to validate uniqueness of order IDs, non-null customer IDs, and positive order amounts.

Cost optimization review

Our monthly data pipeline costs have doubled. Review our architecture and provide specific recommendations to reduce costs while maintaining SLA. Current stack: Airflow, Spark, S3, Redshift.

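Best practices

Assess data sources, volume, latency requirements, and target systems before choosing an architecture pattern
Implement incremental processing with watermark columns to avoid reprocessing full datasets
Add data quality gates at every pipeline stage, with automatic alerts when validation fails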
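
A quality gate of the kind described in the last practice might look like the sketch below. It is written in plain Python rather than against the Great Expectations or dbt APIs, and the `run_quality_gate` function, the `alert` callback, and the order fields are hypothetical names used only for illustration.

```python
def run_quality_gate(rows, alert=print):
    """Validate a batch of order rows; return True only if every check passes.

    Checks (illustrative): unique order_id, non-null customer_id,
    strictly positive amount. Each failure is passed to the alert callback.
    """
    failures = []

    order_ids = [row["order_id"] for row in rows]
    if len(order_ids) != len(set(order_ids)):
        failures.append("order_id is not unique")

    if any(row["customer_id"] is None for row in rows):
        failures.append("customer_id contains nulls")

    if any(row["amount"] <= 0 for row in rows):
        failures.append("amount must be positive")

    for message in failures:
        alert(message)  # e.g. post to a chat channel in a real pipeline
    return not failures


clean_batch = [
    {"order_id": "o-1", "customer_id": "c-1", "amount": 19.99},
    {"order_id": "o-2", "customer_id": "c-2", "amount": 5.00},
]
gate_passed = run_quality_gate(clean_batch, alert=lambda message: None)
```

A stage would typically stop or quarantine the batch when the gate returns `False`, so bad rows never reach downstream tables.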

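Avoid

Copying production patterns directly without adapting them to your specific data volume and velocity requirements
Choosing an architecture based on trends rather than business needs and team capabilities
Prioritizing features over monitoring, observability, and runbooks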

Frequently asked questions

Should I use Lambda or Kappa architecture for real-time analytics?

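Choose Lambda when you need both batch-level accuracy and low-latency views with complex aggregations. Choose Kappa when simple stream processing is enough and replay capability covers your reprocessing needs. Kappa reduces operational complexity but requires robust stream-processing infrastructure.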

How do I handle late-arriving data in streaming pipelines?

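Use event-time processing with watermarks to define a lateness threshold. Implement side outputs for late data so it can be reprocessed. For critical data, maintain a periodic batch correction job that repairs any missed records.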
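
The watermark and side-output idea can be illustrated without any streaming framework. The sketch below is an assumption-laden toy: `route_by_watermark` and the integer `event_time` field are hypothetical, and real engines such as Flink or Spark Structured Streaming track watermarks per partition and per window rather than in a simple loop.

```python
def route_by_watermark(events, allowed_lateness):
    """Split events into on-time and late lists using an event-time watermark.

    The watermark trails the largest event time seen so far by
    allowed_lateness. Events older than the watermark go to the late
    list, which plays the role of a side output for later reprocessing.
    """
    max_event_time = float("-inf")
    on_time, late = [], []
    for event in events:
        max_event_time = max(max_event_time, event["event_time"])
        watermark = max_event_time - allowed_lateness
        if event["event_time"] >= watermark:
            on_time.append(event)
        else:
            late.append(event)
    return on_time, late


# Event times are plain integers (seconds) to keep the example small.
stream = [
    {"id": "a", "event_time": 100},
    {"id": "b", "event_time": 105},
    {"id": "c", "event_time": 98},   # arrives after the watermark passed 100
    {"id": "d", "event_time": 103},
    {"id": "e", "event_time": 110},
]
on_time_events, late_events = route_by_watermark(stream, allowed_lateness=5)
```

Here event `c` lands in the late list because the watermark had already advanced to 100 when it arrived; a batch correction job could pick up such records from wherever the side output is stored.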

What file format should I use for data lake storage?

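Use Parquet for columnar analytical workloads that benefit from compression and predicate pushdown. Delta Lake and Iceberg add ACID transactions, schema evolution, and time travel on top of Parquet. Choose based on your transaction and metadata-management requirements.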

When should I use dbt versus Spark for transformations?

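Use dbt for SQL-based transformations inside your data warehouse, with built-in testing and documentation. Use Spark for large-scale data, for complex transformations that need Python or Scala, or for processing data in the lake before loading it into the warehouse.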

How do I achieve exactly-once processing in streaming?

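Combine idempotent sinks with transactional processing. Use Kafka transactions for atomic writes, checkpointed state for recovery, and design operations to be idempotent. For databases, use upsert operations backed by unique constraints to prevent duplicates.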
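
The database half of that answer, an upsert guarded by a unique constraint, can be sketched with SQLite. The `orders` table and `upsert_orders` helper are hypothetical, SQLite stands in for whatever sink the pipeline really uses, and the `ON CONFLICT` clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def upsert_orders(conn, orders):
    """Write orders idempotently: replaying the same batch changes nothing."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
            orders,
        )


sink = sqlite3.connect(":memory:")
# The primary key is the unique constraint that makes the upsert safe.
sink.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

batch = [("o-1", 10.0), ("o-2", 25.5)]
upsert_orders(sink, batch)
upsert_orders(sink, batch)  # simulated redelivery after a retry

row_count = sink.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the second delivery only overwrites existing rows with identical values, at-least-once delivery upstream still yields exactly one row per order in the sink.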

What monitoring metrics are essential for data pipelines?

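Track records processed and failed per stage, end-to-end latency, data freshness, pipeline success rate, and resource utilization. Set alerts for SLA violations, error-rate spikes, and data quality failures. Monitor trends to identify capacity issues before they cause outages.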
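
Two of these metrics, error rate and data freshness, are easy to turn into alert conditions. The following is a minimal sketch; the `check_pipeline_health` function, the `stats` dictionary keys, and the thresholds are all hypothetical and would be replaced by whatever the team's monitoring stack provides.

```python
from datetime import datetime, timedelta, timezone


def check_pipeline_health(stats, now, max_staleness, max_error_rate):
    """Return alert messages for error-rate spikes and stale data."""
    alerts = []

    total = stats["processed"] + stats["failed"]
    error_rate = stats["failed"] / total if total else 0.0
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")

    staleness = now - stats["last_loaded_at"]
    if staleness > max_staleness:
        alerts.append(f"data is stale by {staleness}")

    return alerts


run_stats = {
    "processed": 950,
    "failed": 50,
    "last_loaded_at": datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
}
health_alerts = check_pipeline_health(
    run_stats,
    now=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
    max_staleness=timedelta(hours=2),
    max_error_rate=0.01,
)
```

In this example both conditions fire: 50 failures out of 1,000 records is a 5% error rate, and the last load is three hours old against a two-hour freshness target.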

Developer details

Author

sickn33

License

MIT

Repository

https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/data-engineering-data-pipeline

Ref

main

File structure

📄 SKILL.md