Data Analytics
Scaling a machine learning pipeline for growing traffic
DataCorp is a leading analytics company specializing in real-time data processing for the e-commerce sector. Its platform processes data from over 500 online stores and delivers real-time insights.
The company was struggling with a rapid surge in traffic - from 100 requests/s to a need to handle 10,000 req/s within 6 months. The existing architecture could not scale at that pace.
Current Architecture Limitations:
- Single VM deployment: monolithic deployment on a single machine
- No horizontal scaling: no way to add instances
- Memory bottlenecks: 32 GB RAM limit per VM
- Cold start latency: 5-10 s startup time for new models
- Manual scaling: resources managed by hand
| Metric | Value | Problem |
|---------|---------|---------|
| Max throughput | 100 req/s | CPU bottleneck |
| Response time p95 | 2.5s | Too slow for real-time |
| Memory usage | 95%+ | Constant swapping |
| Error rate | 5-15% | OOM kills |
| Availability | 94% | Frequent downtime |
# Dockerfile for the ML model service
# Builder stage: install dependencies and model artifacts
FROM python:3.9-slim AS builder
WORKDIR /app
# Dependency installation with layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Model artifacts
COPY models/ ./models/
COPY src/ ./src/
# Multi-stage build: slim runtime stage keeps the final image small
FROM python:3.9-slim AS runtime
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
# Also copy installed console scripts (gunicorn) so they are on PATH
COPY --from=builder /usr/local/bin /usr/local/bin
COPY --from=builder /app /app
WORKDIR /app
EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "4", "app:app"]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: ml-service
          image: datacorp/ml-inference:v1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          env:
            - name: REDIS_URL
              value: "redis://redis-cluster:6379"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
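The liveness and readiness probes above, together with the gunicorn app:app entrypoint in the Dockerfile, assume a small WSGI module that exposes /health and /ready next to the prediction endpoint. The case study does not show that module; a minimal Flask sketch of the assumption (module and flag names are illustrative):
# Hypothetical app.py exposing the probe endpoints referenced above
from flask import Flask, jsonify

app = Flask(__name__)
model_loaded = False  # set to True once model artifacts are in memory

@app.route('/health')
def health():
    # Liveness: the process is up and can answer HTTP
    return jsonify({'status': 'ok'})

@app.route('/ready')
def ready():
    # Readiness: only receive traffic after the model has loaded
    if model_loaded:
        return jsonify({'status': 'ready'})
    return jsonify({'status': 'loading'}), 503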
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "200"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
# Cluster Autoscaler for node scaling (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
          name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/datacorp-ml-cluster
            - --balance-similar-node-groups
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
# Redis Cluster for distributed caching
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-cluster-config
data:
  redis.conf: |
    cluster-enabled yes
    cluster-require-full-coverage no
    cluster-node-timeout 15000
    cluster-config-file nodes.conf
    cluster-migration-barrier 1
    appendonly yes
    maxmemory 2gb
    maxmemory-policy allkeys-lru
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
        - name: redis
          image: redis:6.2-alpine
          # Start redis-server with the mounted cluster configuration
          command: ["redis-server", "/usr/local/etc/redis/redis.conf"]
          ports:
            - containerPort: 6379
            - containerPort: 16379
          volumeMounts:
            - name: conf
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
            - name: data
              mountPath: /data
      volumes:
        - name: conf
          configMap:
            name: redis-cluster-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi   # illustrative size
# Smart caching for ML predictions
import os
import hashlib
import pickle
from functools import wraps

import redis

class MLModelCache:
    def __init__(self, redis_url, ttl=3600):
        self.redis_client = redis.from_url(redis_url)
        self.ttl = ttl

    def cache_prediction(self, model_version):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                # Create cache key from input hash
                input_hash = hashlib.md5(
                    pickle.dumps((args, kwargs))
                ).hexdigest()
                cache_key = f"ml_pred:{model_version}:{input_hash}"
                # Try to get from cache
                cached_result = self.redis_client.get(cache_key)
                if cached_result:
                    return pickle.loads(cached_result)
                # Compute and cache result
                result = func(*args, **kwargs)
                self.redis_client.setex(
                    cache_key,
                    self.ttl,
                    pickle.dumps(result)
                )
                return result
            return wrapper
        return decorator

# Usage in the ML service
cache = MLModelCache(os.getenv('REDIS_URL'))

@cache.cache_prediction("v2.1")
def predict_user_behavior(user_features, session_data):
    return model.predict([user_features, session_data])
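One client-side detail worth noting: redis.from_url creates a single-node client, while the StatefulSet above runs a 6-node Redis Cluster. With redis-py 4.x the cluster-aware client would typically be used instead; a minimal sketch under that assumption:
# Hypothetical cluster-aware connection (assumes redis-py >= 4.1)
from redis.cluster import RedisCluster

# The client discovers the remaining cluster nodes from this single entry point
redis_client = RedisCluster(host="redis-cluster", port=6379)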
# Custom metrics for the ML pipeline
import time

from flask import jsonify, request
from prometheus_client import Counter, Histogram, Gauge

# `app` and `make_prediction` come from the service module

# Request metrics
prediction_requests = Counter(
    'ml_prediction_requests_total',
    'Total ML prediction requests',
    ['model_version', 'status']
)

prediction_latency = Histogram(
    'ml_prediction_duration_seconds',
    'ML prediction latency',
    ['model_version']
)

model_cache_hits = Counter(
    'ml_cache_hits_total',
    'Total cache hits',
    ['model_version']
)

active_models = Gauge(
    'ml_active_models',
    'Number of active ML models'
)

# Usage in the prediction endpoint
@app.route('/predict', methods=['POST'])
def predict():
    start_time = time.time()
    try:
        result = make_prediction(request.json)
        prediction_requests.labels(
            model_version='v2.1',
            status='success'
        ).inc()
        return jsonify(result)
    except Exception as e:
        prediction_requests.labels(
            model_version='v2.1',
            status='error'
        ).inc()
        return jsonify({'error': str(e)}), 500
    finally:
        prediction_latency.labels(
            model_version='v2.1'
        ).observe(time.time() - start_time)
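The counters and histograms above only help once Prometheus can scrape them; the case study does not show how the metrics are served. A minimal sketch, assuming the same Flask app object and the standard prometheus_client helpers:
# Hypothetical /metrics endpoint for Prometheus scraping
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.route('/metrics')
def metrics():
    # Render every registered metric in the Prometheus text exposition format
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}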
{
  "dashboard": {
    "title": "ML Pipeline Performance",
    "panels": [
      {
        "title": "Requests per Second",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(ml_prediction_requests_total[5m])",
            "legendFormat": "{{model_version}} - {{status}}"
          }
        ]
      },
      {
        "title": "Prediction Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(ml_prediction_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(ml_prediction_duration_seconds_bucket[5m]))",
            "legendFormat": "Median"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(ml_cache_hits_total[5m]) / rate(ml_prediction_requests_total[5m]) * 100",
            "legendFormat": "Cache Hit %"
          }
        ]
      }
    ]
  }
}
| Metric | Before | After | Improvement |
|---------|-------|----|---------:|
| Max throughput | 100 req/s | 10,000 req/s | 100x |
| Avg latency | 800ms | 45ms | -94% |
| P95 latency | 2.5s | 120ms | -95% |
| P99 latency | 8s | 250ms | -97% |
| Metric | Before | After | Improvement |
|---------|-------|----|---------:|
| Uptime | 94% | 99.9% | +5.9 pp |
| Error rate | 8% | 0.1% | -99% |
| MTTR | 2h | 5min | -96% |
| Cache hit rate | 0% | 85% | +85 pp |
Resource Efficiency:
CPU Usage:
- Single VM: 95% (constant stress)
- Kubernetes: 60% average (elastic scaling)
Memory Usage:
- Single VM: 95%+ (frequent OOM)
- Kubernetes: 70% average (better allocation)
Cost per Request:
- Before: $0.008/request
- After: $0.002/request (-75%)
# Example auto-scaling event
Event Type: Normal
Reason: SuccessfulRescale
Message: New size: 15; reason: cpu resource utilization (percentage of request) above target
Time: 2024-01-25 14:30:15
# Scaling effectiveness
Average scale-up time: 45 seconds
Average scale-down time: 5 minutes (deliberate delay)
Max pods reached: 45 (during traffic spike)
Min pods maintained: 3 (baseline)
Observability:
Metrics:
- Prometheus: Metrics collection
- Grafana: Visualization
- AlertManager: Alerting
Logging:
- Fluentd: Log aggregation
- Elasticsearch: Log storage
- Kibana: Log analysis
Tracing:
- Jaeger: Distributed tracing
- OpenTelemetry: Instrumentation
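The tracing layer is only listed above, not shown. A minimal sketch of how the prediction path could be instrumented with the OpenTelemetry Python SDK (exporter choice and span names are assumptions; a production setup would export to Jaeger rather than the console):
# Hypothetical OpenTelemetry instrumentation of the prediction path
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("ml-inference")

def traced_predict(features):
    # Each prediction becomes a span, so slow requests are visible end to end
    with tracer.start_as_current_span("ml.predict"):
        return make_prediction(features)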
Problem: New pods needed about 30s to load the models
Solution:
# Model pre-loading via an init container
initContainers:
  - name: model-loader
    image: datacorp/model-loader:v1.0
    volumeMounts:
      - name: model-cache
        mountPath: /models
    command: ["python", "preload_models.py"]
Problem: TensorFlow models consumed an unpredictable amount of RAM
Solution:
# TensorFlow memory growth configuration
import tensorflow as tf

# Allocate GPU memory on demand instead of pre-allocating it all at startup
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
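Memory growth only stops TensorFlow from grabbing the whole GPU up front; if a hard per-process cap is also wanted, a logical device configuration can enforce one. A sketch of that variant, assuming TF 2.x and an illustrative 4096 MB limit (apply it before the GPU is first used, and do not combine it with set_memory_growth on the same device):
# Hypothetical hard cap on GPU memory for this process (illustrative 4096 MB)
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]
    )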
Problem: Inconsistent cache entries across different model versions
Solution:
# Version-aware caching
cache_key = f"ml_pred:{model_version}:{feature_hash}:{timestamp//3600}"
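The trailing component buckets time by the hour, so a cached prediction is never served for more than an hour, and the version prefix means a model rollout automatically misses old entries. A small sketch of building such a key, reusing the hashing approach from the caching decorator above (helper name is illustrative):
# Hypothetical key builder: model version + feature hash + hourly time bucket
import hashlib
import pickle
import time

def build_cache_key(model_version, features):
    feature_hash = hashlib.md5(pickle.dumps(features)).hexdigest()
    hour_bucket = int(time.time()) // 3600  # changes once per hour -> bounded staleness
    return f"ml_pred:{model_version}:{feature_hash}:{hour_bucket}"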
Monthly Costs:
Before:
- Single VM (32 CPU, 64GB): $800/month
- Manual operations: $4,000/month (DevOps time)
- Downtime cost: $10,000/month (SLA breaches)
Total: $14,800/month
After:
- Kubernetes cluster: $1,200/month (dynamic scaling)
- Automated operations: $500/month (monitoring)
- Downtime cost: $100/month (99.9% uptime)
Total: $1,800/month
Monthly savings: $13,000 (88% reduction)
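The 88% figure follows directly from the two totals; a quick arithmetic check:
# Verify the monthly savings claimed above
before = 800 + 4_000 + 10_000   # VM + manual operations + downtime
after = 1_200 + 500 + 100       # cluster + automated operations + downtime
savings = before - after        # 13,000
print(savings, f"{savings / before:.0%} reduction")  # 13000, 88% reduction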
"Big bang migrations są przepisem na katastrofę"
Migrowaliśmy po jednej usłudze, utrzymując fallback na starą architekturę.
Kompletny monitoring stack był kluczowy dla debugging performance issues.
Properly designed cache keys z model versioning zapobiegły stale data issues.
HPA + VPA + Cluster Autoscaler wymagają careful tuning żeby uniknąć thrashing.
"EffiLab nie tylko pomógł nam przeskalować nasze ML pipeline ze 100 do 10,000 req/s, ale też nauczył nasz zespół best practices Kubernetes. Ich approach oparto o metryki i gradual migration pozwolił nam uniknąć downtime podczas transformacji. System jest teraz 100x bardziej skalowalny i 75% tańszy w utrzymaniu."
Anna Nowak, Head of Engineering, DataCorp
Need help scaling your ML infrastructure? Book a consultation and find out how we can speed up your models.