spark-optimization
Improve Spark Performance for Large Pipelines
Slow Spark jobs waste cluster time and delay analytics. This skill provides proven tuning patterns for partitioning, caching, joins, and memory settings.
Download the skill ZIP
Upload in Claude
Go to Settings → Capabilities → Skills → Upload skill
Toggle on and start using
Test it
Using "spark-optimization". Suggest Spark optimizations for a slow join and high shuffle spill.
Expected outcome:
- Enable AQE and skew join handling to split hot partitions.
- Broadcast the small dimension table to avoid shuffle on that side.
- Reduce per-partition shuffle data (more shuffle partitions) and enable compression to reduce spill.
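The outcomes above map to a handful of Spark SQL properties. As a minimal sketch, they can be collected into a config map (the 50MB broadcast threshold is an illustrative assumption, not a value from this skill):

```python
# Sketch of AQE / skew-join / spill settings as they would be passed to
# spark.conf.set(...). Threshold values are illustrative assumptions.
aqe_conf = {
    # Adaptive Query Execution: coalesces small partitions and splits skewed ones
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Broadcast the small dimension side instead of shuffling it
    "spark.sql.autoBroadcastJoinThreshold": "50MB",
    # Compress shuffle output and spill files to cut disk volume
    "spark.shuffle.compress": "true",
    "spark.shuffle.spill.compress": "true",
}

# On a live session:
# for k, v in aqe_conf.items():
#     spark.conf.set(k, v)
print(len(aqe_conf))  # -> 5
```

For an explicit broadcast regardless of the threshold, PySpark also offers the `broadcast()` hint from `pyspark.sql.functions`.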
Using "spark-optimization". How do I optimize partitioning for 500GB of daily data?
Expected outcome:
- Target 128-256 MB per partition for balanced parallelism.
- For 500GB, use roughly 2000-4000 partitions initially.
- Adjust based on actual task duration in Spark UI.
- Consider partitionBy on date columns for pruneable reads.
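The sizing rule above can be checked with simple arithmetic. A sketch (the `target_mb=192` default is just the midpoint of the 128-256 MB guidance, an assumption for illustration):

```python
import math

def initial_partition_count(total_bytes: int, target_mb: int = 192) -> int:
    """Rough starting partition count for a target per-partition size.

    192 MB is the midpoint of the 128-256 MB guidance; refine later
    against actual task durations in the Spark UI.
    """
    return math.ceil(total_bytes / (target_mb * 1024 * 1024))

# 500 GB of daily data lands inside the suggested 2000-4000 range
print(initial_partition_count(500 * 1024**3))  # -> 2667

# On a live session, the result would feed something like:
# df.repartition(2667).write.partitionBy("event_date").parquet(path)
```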
Using "spark-optimization". My cache is not improving performance. What am I doing wrong?
Expected outcome:
- Ensure you trigger an action such as count() to materialize the cache.
- Check that you are reusing the DataFrame multiple times.
- Verify the DataFrame fits in memory without excessive spilling.
- Use unpersist() when done to free memory for other operations.
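The checklist above corresponds to this usage pattern. A sketch assuming a local PySpark runtime is available (the data is synthetic and only illustrates the cache/action/reuse/unpersist cycle):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

cached = df.cache()   # lazy: nothing is stored yet
cached.count()        # an action materializes the cache

# Reuse the SAME DataFrame object so later jobs hit the cached data
by_bucket = cached.groupBy("bucket").count()
totals = cached.agg(F.sum("id")).collect()

cached.unpersist()    # free executor memory when done
spark.stop()
```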
Security Audit
Safe
Pure documentation skill containing only markdown content with Apache Spark tuning guidance. No executable code, credential access, network calls, or malicious patterns detected. All 43 static findings are false positives triggered by misidentified Spark terminology.
Risk Factors
🌐 Network access (4)
⚙️ External commands (23)
What You Can Build
Reduce nightly job time
Analyze a slow batch pipeline and get tuning steps for partitions, joins, and caching.
Fix skewed joins
Apply AQE and salting guidance to remove long-running tasks.
Standardize Spark configs
Create a baseline executor and shuffle configuration for new clusters.
Try These Prompts
My Spark job takes 2 hours and uses groupBy on large tables. Suggest quick wins for partitions, caching, and joins.
I process 1 TB of parquet data daily. Recommend partition counts and file sizes, and explain how to adjust shuffle partitions.
A join on customer_id has a few hot keys and long tasks. Provide AQE settings and a manual salting approach.
We use 8g executors and see frequent spills. Propose memory, overhead, and shuffle settings with rationale.
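A baseline along the lines of the last prompt might look like the following spark-submit fragment. This is a sketch: the values are assumptions chosen to illustrate which knobs matter for spills, not tuned recommendations from this skill.

```shell
# Hypothetical baseline for 8g executors that spill frequently.
# memoryOverhead adds headroom for off-heap and shuffle buffers;
# more shuffle partitions shrink per-task data and reduce spill.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.sql.shuffle.partitions=800 \
  --conf spark.shuffle.compress=true \
  your_job.py
```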
Best Practices
- Use AQE and monitor Spark UI for skew and spills.
- Target 128 to 256 MB partition sizes for balanced parallelism.
- Prefer built-in functions over UDFs for better optimization.
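The built-in-over-UDF point can be illustrated as follows. A sketch assuming a local PySpark runtime: Catalyst can optimize and code-generate the built-in column expression, while the Python UDF is an opaque function that forces row serialization to a Python worker.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Opaque to the optimizer: each row round-trips through a Python worker
upper_udf = udf(lambda s: s.upper(), StringType())
slow = df.select(upper_udf("name").alias("name"))

# Optimizable: stays inside the JVM and benefits from codegen
fast = df.select(F.upper("name").alias("name"))

spark.stop()
```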
Avoid
- Collecting large datasets to the driver.
- Over-caching multiple large DataFrames without unpersisting.
- Using wide shuffles for simple aggregates instead of pre-aggregating.
Frequently Asked Questions
Is this compatible with PySpark and Spark SQL?
What are the limits of the recommendations?
Can it integrate with Databricks or EMR?
Does it access my data or cluster?
What if performance does not improve?
How does it compare to generic tuning advice?
Developer Details
Author
wshobson
License
MIT
Repository
https://github.com/wshobson/agents/tree/main/plugins/data-engineering/skills/spark-optimization
Ref
main
File structure
📄 SKILL.md