ML Engineer
Build Production ML Systems with Expert Guidance
Deploying machine learning models to production requires expertise in serving, monitoring, and infrastructure that many teams lack. This skill provides battle-tested patterns for building reliable, scalable ML systems using modern frameworks like PyTorch 2.x and TensorFlow.
Download the Skill ZIP
Upload it to Claude
Go to Settings → Capabilities → Skills → Upload skill
Toggle the skill on and start using it
Try It Out
Using "ML Engineer": Design a model serving architecture for image classification with a 50ms latency SLA
Expected results:
- Recommended architecture using TorchServe with GPU instances
- Request batching configuration for throughput optimization
- Redis layer for prediction caching on repeated inputs
- Auto-scaling policy based on queue depth and latency metrics
- Circuit breaker pattern for graceful degradation during failures
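The graceful-degradation item can be sketched as a small circuit breaker that routes traffic to a fallback (for example, a cached prediction or a simpler model) after repeated failures. This is a minimal illustration, not the skill's own implementation; the thresholds and the `fallback` callable are assumptions:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, serve the fallback while open,
    and allow a trial call after a cooldown (half-open state)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback, *args, **kwargs):
        # While open, skip the primary entirely until the cooldown elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback(*args, **kwargs)
            self.opened_at = None  # half-open: permit one trial call
            self.failures = 0
        try:
            result = primary(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback(*args, **kwargs)
        self.failures = 0  # success resets the failure count
        return result
```

In a serving stack the `primary` would be the GPU-backed model call and the `fallback` something cheap and always available, so the SLA degrades gracefully instead of timing out.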
Using "ML Engineer": How do I implement A/B testing for model comparison?
Expected results:
- Traffic splitting strategy with sticky sessions for user consistency
- Statistical power calculation for detecting 2% improvement
- Guardrail metrics to monitor for negative side effects
- Sequential testing approach with early stopping criteria
- Sample size estimation based on baseline conversion rate
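The sample-size and power items above reduce to the standard two-proportion calculation. A stdlib-only sketch (the 10% baseline and 80% power in the test values are assumptions for illustration):

```python
import math
from statistics import NormalDist

def samples_per_arm(baseline, rel_improvement, alpha=0.05, power=0.80):
    """Per-arm sample size (normal approximation) to detect a relative
    improvement over a baseline conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + rel_improvement)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For a 10% baseline and a 2% relative improvement this lands in the hundreds of thousands of users per arm, which is exactly why the sequential-testing approach with early stopping is worth the extra complexity.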
Security Audit
Safe — Prompt-only skill with no executable code. Static analysis found 0 files with executable content and computed a risk score of 0/100. The SKILL.md file contains only Markdown documentation and AI assistant instructions for ML engineering tasks. No security concerns identified.
What You Can Build
Real-time Recommendation System
Design a high-throughput recommendation engine handling 100K predictions per second with Redis caching and model serving via TorchServe.
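The caching half of that design follows the cache-aside pattern: hash the request features into a key, hit the cache first, and only invoke the model on a miss. A hedged sketch where `cache` is any client exposing `get`/`setex` (redis-py does); the key scheme is an assumption:

```python
import hashlib
import json

def cached_predict(cache, model_fn, features, ttl=300):
    """Cache-aside prediction: return a cached result for repeated
    inputs, otherwise compute and store it with a TTL."""
    key = "pred:" + hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()
    ).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip the model entirely
    prediction = model_fn(features)
    cache.setex(key, ttl, json.dumps(prediction))  # expire stale entries
    return prediction
```

The TTL bounds staleness after a model update; at 100K predictions per second, even a modest hit rate on repeated inputs removes substantial load from the serving tier.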
ML Pipeline Automation
Build end-to-end ML pipelines with Apache Airflow or Kubeflow that automate data processing, training, validation, and deployment.
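Whatever the orchestrator, the pipeline is a chain of gated stages where each task's output feeds the next and a failed gate halts deployment. A toy stand-in for that task graph (the validation threshold is made up for illustration):

```python
def run_pipeline(raw_data):
    """Sequential sketch of the stages an Airflow or Kubeflow DAG would
    orchestrate: process -> train -> validate -> deploy."""
    cleaned = [r for r in raw_data if r is not None]   # data processing
    model = {"mean": sum(cleaned) / len(cleaned)}      # stand-in "training"
    if abs(model["mean"]) > 1e6:                       # validation gate
        raise ValueError("validation failed: implausible model")
    return {"status": "deployed", "model": model}      # deployment step
```

In a real DAG each stage becomes a task with retries and its own resources; the point is that deployment is unreachable unless validation passes.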
Model Performance Monitoring
Implement comprehensive monitoring with Prometheus and Grafana to track data drift, prediction latency, and business metrics in production.
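Drift tracking often starts with the Population Stability Index, which compares the live feature distribution against the training one; the resulting scalar is easy to export as a Prometheus gauge. A self-contained sketch (the 0.2 alert threshold in the test is a common rule of thumb, not a Prometheus feature):

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a training (expected) and live (actual) sample of one
    feature; values above ~0.2 commonly warrant a drift investigation."""
    lo, hi = min(expected), max(expected)

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Bucket by the training range; clip out-of-range values.
            idx = int((v - lo) / (hi - lo) * bins) if hi > lo else 0
            counts[min(max(idx, 0), bins - 1)] += 1
        # Small epsilon avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computed per feature on a schedule and alerted on via threshold, this catches silent input drift long before business metrics move.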
Try These Prompts
I have a trained PyTorch model saved as model.pth. Guide me through deploying it as a REST API using FastAPI and Docker. Include health checks, input validation, and basic logging.
Design a feature store architecture for our e-commerce recommendation system. We need both batch features (user purchase history) and real-time features (session activity). Compare Feast vs Tecton for our use case.
We need to train a 2B parameter transformer model on 8xA100 GPUs. Recommend a distributed training strategy using PyTorch FSDP or DeepSpeed. Include gradient checkpointing, mixed precision, and communication optimization.
Design a comprehensive monitoring system for our fraud detection model serving 10K requests/second. Include data drift detection, model performance tracking, alerting thresholds, and automated rollback triggers.
Best Practices
- Always implement comprehensive input validation and data quality checks before model inference to catch drift early
- Use infrastructure as code (Terraform, CloudFormation) for reproducible ML infrastructure deployments
- Design for graceful degradation with fallback models and circuit breakers to maintain service during failures
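The input-validation practice above can be as simple as checking each feature against the type and value range observed during training, rejecting the request before it ever reaches the model. A minimal sketch; the `schema` structure is hypothetical:

```python
def validate_inference_input(payload, schema):
    """Return a list of validation errors for an inference request.
    schema maps feature name -> (type, min_seen_in_training, max_seen)."""
    errors = []
    for name, (ftype, lo, hi) in schema.items():
        if name not in payload:
            errors.append(f"missing feature: {name}")
            continue
        value = payload[name]
        if not isinstance(value, ftype):
            errors.append(f"{name}: expected {ftype.__name__}")
        elif not (lo <= value <= hi):
            # Out-of-range values are an early data-drift tripwire.
            errors.append(f"{name}: {value} outside training range [{lo}, {hi}]")
    return errors
```

Logging the rejection rate alongside the model's metrics makes creeping drift visible as a rising stream of range violations.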
What to Avoid
- Deploying models without monitoring for data drift or performance degradation leads to silent failures
- Hardcoding model paths or hyperparameters in application code instead of using model registries
- Running training and inference on the same infrastructure causes resource contention and unpredictable latency