dask
Scale pandas and NumPy beyond memory with Dask
Also available from: K-Dense-AI
Processing large datasets that exceed available RAM causes memory errors and slow performance. Dask provides parallel computing abstractions that scale pandas and NumPy operations to handle terabyte-scale data on laptops or clusters.
Download the skill ZIP
Upload it in Claude
Go to Settings → Features → Skills → Upload skill
Turn the toggle on and start using it
Try it out
"dask" 사용 중입니다. Cargar un archivo CSV de 50GB que excede la RAM y calcular el promedio de ventas por región
예상 결과:
- Created Dask DataFrame with 500 partitions (~100MB each)
- Filtered to valid records (status='completed'): 45.2M rows
- Grouped by region and computed mean('sale_amount')
- Result: 8 regions with averages ranging from $127 to $892
- Peak memory usage: 12GB (vs 50GB total data)
- Processing time: 4 minutes on 8 cores
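A minimal sketch of this scenario; the file name and the status, region, and sale_amount columns are assumptions taken from the expected output above:

```python
import dask.dataframe as dd

# Hypothetical file name and column names from the scenario above.
df = dd.read_csv("sales_50gb.csv", blocksize="100MB")  # ~100MB partitions

# Everything here is lazy; no data is read yet.
valid = df[df["status"] == "completed"]
avg_by_region = valid.groupby("region")["sale_amount"].mean()

# A single compute() triggers the parallel, out-of-core execution.
print(avg_by_region.compute())
```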
"dask" 사용 중입니다. Procesar 1 millón de archivos de logs JSON con seguimiento de errores
예상 결과:
- Loaded logs using Dask Bag with 1000 partitions
- Filtered to error-level entries: 127,453 records
- Extracted error patterns and counted occurrences
- Top errors: timeout (34%), disk full (22%), network (18%)
- Saved aggregation to Parquet for dashboard display
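A minimal Dask Bag sketch of this workflow; the logs/*.json glob and the level and message fields are hypothetical:

```python
import json
import dask.bag as db

# Hypothetical glob and record fields for illustration.
logs = db.read_text("logs/*.json").map(json.loads)

# Keep only error-level entries and count occurrences per message.
errors = logs.filter(lambda rec: rec.get("level") == "error")
top_errors = errors.pluck("message").frequencies(sort=True).compute()
```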
"dask" 사용 중입니다. Ejecutar multiplicación de matrices en un conjunto de datos numérico de 500GB
예상 결과:
- Created Dask Array from HDF5 with 500 chunks (~1GB each)
- Applied SVD decomposition across 16 cores
- Reduced 500GB to 50GB of principal components
- Processing completed in 23 minutes with memory peak at 64GB
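A minimal sketch of the array workflow, assuming a hypothetical HDF5 file and dataset name; svd_compressed is used here as Dask's scalable, randomized SVD variant for arrays too large for an exact decomposition:

```python
import h5py
import dask.array as da

# Hypothetical HDF5 file and dataset names.
f = h5py.File("matrix_500gb.h5", "r")
x = da.from_array(f["data"], chunks=(10_000, 10_000))  # ~0.8GB float64 chunks

# Approximate (randomized) SVD scales to arrays plain SVD cannot handle.
u, s, v = da.linalg.svd_compressed(x, k=100)

# One compute() call evaluates all three factors together.
u, s, v = da.compute(u, s, v)
```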
Security audit
Safe
This is a pure documentation skill containing only markdown files with example Python code in code blocks. No executable code, network calls, file system access, or external commands were detected. The static analyzer produced 449 false positives by flagging markdown code-block backticks as shell commands, cryptographic mentions in documentation as weak algorithms, and technical terms like 'process', 'command', and 'control' as security keywords. All flagged content is benign documentation about usage of the Dask parallel computing library.
Risk factors
⚙️ External commands (403)
📁 File system access (1)
🌐 Network access (1)
What you can build
Large-scale dataset analysis
Analyze datasets too large for pandas' memory limits on a single machine
Distributed preprocessing
Preprocess training data across multiple cores or cluster nodes
Scientific computing at scale
Run numerical simulations and process large scientific datasets
Try these prompts
Use Dask to read all CSV files matching 'data/year=2024/month=*/day=*.csv' into a single DataFrame, filter for records where status='valid', and compute the mean and sum of the 'amount' column grouped by category.
Create a Dask array from a Zarr file containing 100GB of image data, normalize each chunk by subtracting the mean and dividing by standard deviation, then save the normalized result back to Zarr format.
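A minimal sketch of what this prompt asks for, assuming hypothetical Zarr store paths; map_blocks standardizes each chunk independently, as the prompt describes:

```python
import dask.array as da

# Hypothetical Zarr store path.
x = da.from_zarr("images.zarr")

# Per-chunk normalization: each block is standardized independently.
def standardize(block):
    return (block - block.mean()) / block.std()

normalized = x.map_blocks(standardize)

# Writing to Zarr triggers the computation, chunk by chunk.
normalized.to_zarr("images_normalized.zarr", overwrite=True)
```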
Set up a local Dask distributed client and submit 100 independent parameter sweep tasks where each task runs a simulation function with different parameters, then gather results as they complete.
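A minimal sketch of this parameter-sweep pattern; simulate is a placeholder for the real simulation function:

```python
from dask.distributed import Client, as_completed

def simulate(param):
    # Placeholder for a real simulation function.
    return param ** 2

if __name__ == "__main__":
    client = Client()  # local cluster; workers default to the machine's cores

    # 100 independent tasks, one per parameter value.
    futures = [client.submit(simulate, p) for p in range(100)]

    # Consume results in completion order instead of waiting for all of them.
    for future in as_completed(futures):
        print(future.result())

    client.close()
```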
Read all JSON log files from 'logs/*.json', filter for error entries, extract the 'message' and 'timestamp' fields, convert to a DataFrame, and save aggregated error counts by message pattern to Parquet.
Best practices
- Let Dask load the data directly rather than loading into pandas first and then converting
- Use a single compute() call for multiple outputs to enable shared, parallel execution (see the sketch after this list)
- Aim for chunk sizes of roughly 100MB per partition to balance parallelism and overhead
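For instance, a minimal sketch of batching two aggregations into one compute() call; the file and the amount column are hypothetical:

```python
import dask
import dask.dataframe as dd

df = dd.read_csv("data.csv")  # hypothetical file

total = df["amount"].sum()
average = df["amount"].mean()

# One shared graph execution: the CSV is read once for both results.
total, average = dask.compute(total, average)
```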
Avoid
- Loading the entire dataset into pandas before converting to a Dask DataFrame
- Calling compute() inside loops instead of batching operations
- Creating millions of tiny tasks by not using map_partitions for batched operations (see the sketch below)
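A minimal sketch of the map_partitions pattern, with a hypothetical file and a hypothetical add_fee transformation:

```python
import dask.dataframe as dd

df = dd.read_csv("data.csv")  # hypothetical file and columns

# One task per partition: the function receives a full pandas DataFrame,
# avoiding the per-row task explosion of element-wise apply.
def add_fee(pdf):
    pdf = pdf.copy()
    pdf["amount_with_fee"] = pdf["amount"] * 1.05
    return pdf

result = df.map_partitions(add_fee).compute()
```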