spark-optimization

Name: spark-optimization
Author: wshobson

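Safe 🌐 Network access ⚙️ External commands

Improve Spark performance for large pipelines

Also available from: sickn33

Slow Spark jobs waste cluster time and delay analysis. This skill provides proven tuning patterns for partitioning, caching, joins, and memory to improve performance.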

Supports: Claude, Codex, Code (CC)

📊 69 Adequate

Download the skill ZIP

Upload it in Claude

Go to Settings → Capabilities → Skills → Upload skill

Open it and start using it

Try it

Using "spark-optimization". Suggest Spark optimizations for a slow join and high shuffle spill.

Expected result:

Enable AQE and skew join handling to split hot partitions.
Broadcast the small dimension table to avoid shuffle on that side.
Lower shuffle partition size and enable compression to reduce spill.
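
The three suggestions above map onto a handful of Spark settings. The following is a minimal sketch, not the skill's own output: the helper names `skew_join_conf` and `apply_conf` and the 64 MB broadcast threshold are assumptions chosen for illustration, while the configuration keys themselves are standard Spark properties.

```python
def skew_join_conf(broadcast_threshold_mb=64):
    """Build settings for a skewed join that spills during shuffle.

    The 64 MB default is an illustrative assumption; size it to the
    dimension table you actually want broadcast.
    """
    return {
        # Adaptive Query Execution re-plans stages using runtime statistics.
        "spark.sql.adaptive.enabled": "true",
        # Split oversized (hot) shuffle partitions during sort-merge joins.
        "spark.sql.adaptive.skewJoin.enabled": "true",
        # Merge tiny shuffle partitions after the shuffle completes.
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Tables below this size (in bytes) are broadcast instead of shuffled.
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_mb * 1024 * 1024),
        # Compress shuffle output and spilled data to reduce disk and network I/O.
        "spark.shuffle.compress": "true",
        "spark.shuffle.spill.compress": "true",
    }


def apply_conf(builder, conf):
    """Apply each setting to any object exposing .config(key, value),
    such as SparkSession.builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder, skew_join_conf()).getOrCreate()` would start a session with these settings. Treat the values as a starting point and confirm the effect in the Spark UI.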

Using "spark-optimization". How do I optimize partitioning for 500GB of daily data?

Expected result:

Target 128-256 MB per partition for balanced parallelism.
For 500GB, use roughly 2000-4000 partitions initially.
Adjust based on actual task duration in Spark UI.
Consider partitionBy on date columns for pruneable reads.
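
The 2000-4000 figure follows directly from dividing the data volume by the target partition size. A small sketch of that arithmetic, where `partition_range` is a hypothetical helper name and 1 GB is taken as 1024 MB:

```python
import math


def partition_range(total_gb, min_partition_mb=128, max_partition_mb=256):
    """Return (fewest, most) partitions that keep each partition
    between min_partition_mb and max_partition_mb."""
    total_mb = total_gb * 1024
    fewest = math.ceil(total_mb / max_partition_mb)  # largest partitions
    most = math.ceil(total_mb / min_partition_mb)    # smallest partitions
    return fewest, most


print(partition_range(500))  # (2000, 4000)
```

The result is only an initial estimate. Compressed Parquet usually expands in memory, so task durations in the Spark UI remain the deciding signal.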

Using "spark-optimization". My cache is not improving performance. What am I doing wrong?

Expected result:

Ensure you call count() or an action to materialize the cache.
Check that you are reusing the DataFrame multiple times.
Verify the DataFrame fits in memory without excessive spilling.
Use unpersist() when done to free memory for other operations.
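
The checklist above describes a cache lifecycle: mark, materialize, reuse, release. A minimal sketch of that pattern follows. `run_with_cache` is a hypothetical helper, and it assumes `df` is a PySpark DataFrame; it only relies on the `cache()`, `count()`, and `unpersist()` methods.

```python
def run_with_cache(df, *jobs):
    """Cache df, materialize it once, run every job against the cached
    data, and always release the cache afterwards."""
    cached = df.cache()   # lazy: only marks the DataFrame for caching
    cached.count()        # an action forces the cache to be populated
    try:
        # Each job reuses the cached data instead of recomputing the lineage.
        return [job(cached) for job in jobs]
    finally:
        cached.unpersist()  # free executor memory for later stages
```

Caching pays off only when at least two jobs reuse the DataFrame. With a single job, the extra `count()` adds a full pass over the data for no benefit.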

Security audit

Safe

v4 • 1/17/2026

Pure documentation skill containing only markdown content with Apache Spark tuning guidance. No executable code, credential access, network calls, or malicious patterns detected. All 43 static findings are false positives triggered by misidentified Spark terminology.

Files scanned

590

Lines analyzed

Findings

Total audits

Audited by: claude

Quality score

Architecture

100

Maintainability

Content

Community

100

Security

Spec compliance

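What you can build

Reduce nightly job runtime

Analyze slow batch pipelines and get tuning steps for partitions, joins, and caching.

Fix skewed joins

Apply AQE and salting guidance to eliminate long-running tasks.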
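
To make the salting idea concrete, here is a plain-Python model of it rather than Spark code. The large side gets a random salt appended to each key, the small side is replicated once per salt value, and the join runs on the (key, salt) pair. All function names and the sample data shape are assumptions for illustration. In PySpark the same steps are commonly written with `rand()` on the large side and `explode()` over an array of salt values on the small side.

```python
import random
from collections import Counter


def salt_large_side(rows, num_salts, seed=0):
    """Spread each key of the large table across num_salts sub-keys."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def replicate_small_side(rows, num_salts):
    """Copy every row of the small table once per salt value so that
    each salted key on the large side still finds its match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


def largest_group(rows):
    """Row count of the most frequent key, a stand-in for the slowest task."""
    return max(Counter(key for key, _ in rows).values())
```

The join output is unchanged once the salt is dropped from the key, while the largest key group shrinks roughly by a factor of `num_salts`. The cost is that the small side grows by the same factor, so salting suits cases where that side is modest in size.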

Standardize Spark configuration

Establish a baseline executor and shuffle configuration for new clusters.

Try these prompts

Speed up my job

My Spark job takes 2 hours and uses groupBy on large tables. Suggest quick wins for partitions, caching, and joins.

Partition sizing

I process 1 TB of parquet data daily. Recommend partition counts and file sizes, and explain how to adjust shuffle partitions.

Skew diagnosis

A join on customer_id has a few hot keys and long tasks. Provide AQE settings and a manual salting approach.

Memory tuning

We use 8g executors and see frequent spills. Propose memory, overhead, and shuffle settings with rationale.
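
For a prompt like this one, it helps to see how little of an 8g heap is actually available for shuffles and joins. The sketch below estimates the split under Spark's unified memory model. It assumes JVM workloads and the documented defaults: 300 MB reserved, `spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, and an overhead of max(384 MB, 10% of executor memory). The helper name `executor_memory_breakdown` is hypothetical.

```python
RESERVED_MB = 300       # fixed reservation inside the executor heap
MIN_OVERHEAD_MB = 384   # floor for spark.executor.memoryOverhead


def executor_memory_breakdown(executor_memory_mb, memory_fraction=0.6,
                              storage_fraction=0.5, overhead_factor=0.10):
    """Estimate how an executor's memory is divided, in whole MB."""
    unified = int((executor_memory_mb - RESERVED_MB) * memory_fraction)
    storage = int(unified * storage_fraction)   # cache space protected from eviction
    execution = unified - storage               # shuffles, joins, sorts, aggregations
    overhead = max(MIN_OVERHEAD_MB, int(executor_memory_mb * overhead_factor))
    return {
        "unified_mb": unified,
        "storage_mb": storage,
        "execution_mb": execution,
        "overhead_mb": overhead,
        "container_mb": executor_memory_mb + overhead,
    }


print(executor_memory_breakdown(8 * 1024))
```

For an 8g executor this gives roughly 2.3 GB of guaranteed execution memory shared by all concurrent tasks, which is why frequent spills often respond to fewer cores per executor or more, smaller shuffle partitions before they respond to a larger heap.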

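Best practices

Use AQE and monitor the Spark UI to detect skew and spill.
Target partition sizes of 128 to 256 MB for balanced parallelism.
Prefer built-in functions over UDFs for better optimization.

Avoid

Collecting large datasets to the driver.
Caching multiple large DataFrames without calling unpersist.
Using wide shuffles for simple aggregations without pre-aggregating.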
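
The last point can be illustrated with a plain-Python model of pre-aggregation (not Spark code; the function names are assumptions). Each partition first combines its own rows per key, so only one partial result per key per partition crosses the shuffle instead of every raw row. RDD `reduceByKey` and the DataFrame built-in aggregates combine on the map side in this way, whereas `groupByKey` followed by a manual sum does not.

```python
from collections import Counter


def naive_shuffle_records(partitions):
    """Rows shuffled when every raw row is sent to its reducer."""
    return sum(len(partition) for partition in partitions)


def preaggregate(partition):
    """Combine values per key inside one partition (map-side combine)."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


def two_stage_sum(partitions):
    """Sum per key using partial aggregates.

    Returns the final totals and the number of records shuffled."""
    partials = [preaggregate(partition) for partition in partitions]
    shuffled = sum(len(partial) for partial in partials)
    final = Counter()
    for partial in partials:
        final.update(partial)  # adds the partial sums key by key
    return dict(final), shuffled
```

The saving grows with the number of repeated keys per partition. With mostly unique keys, pre-aggregation moves almost as much data as the naive approach.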

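FAQ

Is this compatible with PySpark and Spark SQL?

Yes. The guidance covers PySpark DataFrames and Spark SQL configuration.

What are the limitations of the recommendations?

They are general patterns and must be validated against your data size and cluster constraints.

Can it be used with Databricks or EMR?

Yes. You can apply the same Spark configuration and optimization steps on those platforms.

Does it access my data or cluster?

No. It only provides guidance and does not connect to your systems.

What if performance does not improve?

Provide Spark UI metrics, query plans, and data sizes to refine the recommendations.

How is this different from general tuning advice?

It focuses on Spark-specific execution stages, shuffles, and memory behavior, and provides concrete configuration examples.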

Developer details

Author

wshobson

License

MIT

Repository

https://github.com/wshobson/agents/tree/main/plugins/data-engineering/skills/spark-optimization

Ref

main

File structure

📄 SKILL.md
