vaex
Analyze massive datasets with Vaex
Also available from: davila7
Processing tabular datasets larger than RAM requires specialized tools. Vaex provides out-of-core DataFrame operations with lazy evaluation, processing up to a billion rows per second on data that never fully loads into memory. Well suited to astronomical data, financial time series, and large-scale scientific analysis.
Download the skill ZIP
Upload in Claude
Go to Settings → Capabilities → Skills → Upload skill
Toggle on and start using
Test it
Using "vaex". Load my parquet file and show statistics
Expected outcome:
- DataFrame shape: 10,000,000 rows × 15 columns
- Column types: int64 (5), float64 (7), string (3)
- Memory usage: 0.5 GB (virtual columns)
- Mean age: 34.2 | Std income: 45200.5
Using "vaex". Filter and group data
Expected outcome:
- Filtered to 2.3 million rows (age > 25)
- Group by category results:
  - Electronics: 450K rows, mean $52,000
  - Clothing: 890K rows, mean $31,000
  - Home: 960K rows, mean $42,000
Using "vaex". Convert CSV to HDF5 for performance
Expected outcome:
- Original CSV: 15 GB, 45 minutes to load
- Converted HDF5: 8 GB, instant loading
- Memory-mapped access: exploration needs almost no RAM
Security Audit
Safe
This is a pure documentation skill for the Vaex Python library. All 498 static findings are false positives caused by markdown code-block formatting: the scanner misinterpreted backticks in code examples as Ruby/shell commands, flagged memory-mapping as filesystem access, and misidentified DataFrame inspection methods as reconnaissance. No executable code, credential handling, or malicious patterns are present.
Risk Factors
⚙️ External commands (7)
📁 Filesystem access (3)
🌐 Network access (2)
What You Can Build
Explore billion-row datasets
Analyze massive CSV/HDF5 datasets interactively without memory constraints or preprocessing.
Process astronomical data
Work with terabyte-scale scientific datasets using out-of-core computation and lazy evaluation.
Build scalable pipelines
Create feature engineering and ML workflows that handle datasets exceeding available RAM.
Try These Prompts
Use Vaex to open my HDF5 file at data/large_dataset.hdf5 and show its structure, column types, and row count.
Filter the dataset for records where age > 25 and calculate the mean and standard deviation of income grouped by category.
Create a heatmap showing the relationship between x and y coordinates with 100 bins on each axis.
Use Vaex ML to create a StandardScaler for features age and income, then apply PCA for dimensionality reduction.
Best Practices
- Use HDF5 or Apache Arrow formats for instant memory-mapped loading instead of CSV
- Leverage virtual columns and expressions for computations without materializing data
- Batch operations with delay=True when performing multiple aggregations for efficiency
Avoid
- Avoid loading entire datasets into RAM - use vaex.open() for memory-mapped access
- Do not convert large datasets to pandas - use Vaex operations throughout the pipeline
- Avoid multiple small exports - batch writes and use efficient formats like HDF5