spark-optimization
Improve Spark Performance for Large Pipelines
Slow Spark jobs waste cluster time and delay analytics. This skill provides proven tuning patterns for partitioning, caching, joins, and memory settings.
Download the skill ZIP
Upload in Claude
Go to Settings → Capabilities → Skills → Upload skill
Toggle on and start using
Test it
Using "spark-optimization". Suggest Spark optimizations for a slow join and high shuffle spill.
Expected outcome:
- Enable AQE and skew join handling to split hot partitions.
- Broadcast the small dimension table to avoid shuffle on that side.
- Reduce per-partition shuffle data (more shuffle partitions) and enable compression to reduce spill.
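The outcomes above map to a handful of Spark SQL properties. As a minimal sketch, they can be collected into a config map (the 50MB broadcast threshold is an illustrative assumption, not a value from this skill):

```python
# Sketch of AQE / skew-join / spill settings as they would be passed to
# spark.conf.set(...). Threshold values are illustrative assumptions.
aqe_conf = {
    # Adaptive Query Execution: coalesces small partitions and splits skewed ones
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Broadcast the small dimension side instead of shuffling it
    "spark.sql.autoBroadcastJoinThreshold": "50MB",
    # Compress shuffle output and spill files to cut disk volume
    "spark.shuffle.compress": "true",
    "spark.shuffle.spill.compress": "true",
}

# On a live session:
# for k, v in aqe_conf.items():
#     spark.conf.set(k, v)
print(len(aqe_conf))  # -> 5
```

For an explicit broadcast regardless of the threshold, PySpark also offers the `broadcast()` hint from `pyspark.sql.functions`.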
Using "spark-optimization". How do I optimize partitioning for 500GB of daily data?
Expected outcome:
- Target 128-256 MB per partition for balanced parallelism.
- For 500GB, use roughly 2000-4000 partitions initially.
- Adjust based on actual task duration in Spark UI.
- Consider partitionBy on date columns for pruneable reads.
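The sizing rule above can be checked with simple arithmetic. A sketch (the `target_mb=192` default is just the midpoint of the 128-256 MB guidance, an assumption for illustration):

```python
import math

def initial_partition_count(total_bytes: int, target_mb: int = 192) -> int:
    """Rough starting partition count for a target per-partition size.

    192 MB is the midpoint of the 128-256 MB guidance; refine later
    against actual task durations in the Spark UI.
    """
    return math.ceil(total_bytes / (target_mb * 1024 * 1024))

# 500 GB of daily data lands inside the suggested 2000-4000 range
print(initial_partition_count(500 * 1024**3))  # -> 2667

# On a live session, the result would feed something like:
# df.repartition(2667).write.partitionBy("event_date").parquet(path)
```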
Using "spark-optimization". My cache is not improving performance. What am I doing wrong?
Expected outcome:
- Ensure you trigger an action such as count() to materialize the cache.
- Check that you are reusing the DataFrame multiple times.
- Verify the DataFrame fits in memory without excessive spilling.
- Use unpersist() when done to free memory for other operations.
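The checklist above corresponds to this usage pattern. A sketch assuming a local PySpark runtime is available (the data is synthetic and only illustrates the cache/action/reuse/unpersist cycle):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

cached = df.cache()   # lazy: nothing is stored yet
cached.count()        # an action materializes the cache

# Reuse the SAME DataFrame object so later jobs hit the cached data
by_bucket = cached.groupBy("bucket").count()
totals = cached.agg(F.sum("id")).collect()

cached.unpersist()    # free executor memory when done
spark.stop()
```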
Security Audit
Safe
Pure documentation skill containing only markdown content with Apache Spark tuning guidance. No executable code, credential access, network calls, or malicious patterns detected. All 43 static findings are false positives triggered by misidentified Spark terminology.
Risk Factors
🌐 Network access (4)
⚙️ External commands (23)
What You Can Build
Reduce nightly job time
Analyze a slow batch pipeline and get tuning steps for partitions, joins, and caching.
Fix skewed joins
Apply AQE and salting guidance to remove long-running tasks.
Standardize Spark configs
Create a baseline executor and shuffle configuration for new clusters.
Try These Prompts
My Spark job takes 2 hours and uses groupBy on large tables. Suggest quick wins for partitions, caching, and joins.
I process 1 TB of parquet data daily. Recommend partition counts and file sizes, and explain how to adjust shuffle partitions.
A join on customer_id has a few hot keys and long tasks. Provide AQE settings and a manual salting approach.
We use 8g executors and see frequent spills. Propose memory, overhead, and shuffle settings with rationale.
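A baseline along the lines of the last prompt might look like the following spark-submit fragment. This is a sketch: the values are assumptions chosen to illustrate which knobs matter for spills, not tuned recommendations from this skill.

```shell
# Hypothetical baseline for 8g executors that spill frequently.
# memoryOverhead adds headroom for off-heap and shuffle buffers;
# more shuffle partitions shrink per-task data and reduce spill.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.sql.shuffle.partitions=800 \
  --conf spark.shuffle.compress=true \
  your_job.py
```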
Best Practices
- Use AQE and monitor Spark UI for skew and spills.
- Target 128 to 256 MB partition sizes for balanced parallelism.
- Prefer built-in functions over UDFs for better optimization.
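The built-in-over-UDF point can be illustrated as follows. A sketch assuming a local PySpark runtime: Catalyst can optimize and code-generate the built-in column expression, while the Python UDF is an opaque function that forces row serialization to a Python worker.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Opaque to the optimizer: each row round-trips through a Python worker
upper_udf = udf(lambda s: s.upper(), StringType())
slow = df.select(upper_udf("name").alias("name"))

# Optimizable: stays inside the JVM and benefits from codegen
fast = df.select(F.upper("name").alias("name"))

spark.stop()
```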
Avoid
- Collecting large datasets to the driver.
- Over-caching multiple large DataFrames without unpersisting.
- Using wide shuffles for simple aggregates instead of pre-aggregating.
Frequently Asked Questions
Is this compatible with PySpark and Spark SQL?
What are the limits of the recommendations?
Can it integrate with Databricks or EMR?
Does it access my data or cluster?
What if performance does not improve?
How does it compare to generic tuning advice?
Developer Details
Author
wshobson
License
MIT
Repository
https://github.com/wshobson/agents/tree/main/plugins/data-engineering/skills/spark-optimization
Ref
main
File structure
📄 SKILL.md