DataCorp

Data Analytics

Scaling an ML pipeline from 100 to 10,000 requests/s

The need to scale a machine learning pipeline for growing traffic

January 20, 2024
4 weeks
50-100 people
Data & Analytics
Kubernetes, ML Pipeline, Redis

Project results

100x
Scaling
4 weeks
Implementation
99.9%
Availability

Technologies

Kubernetes, Redis, TensorFlow, Prometheus, Grafana

Case Study: DataCorp - Scaling the ML Pipeline

About the client

DataCorp is a leading analytics company specializing in real-time data processing for the e-commerce sector. Its platform processes data from more than 500 online stores, delivering insights in real time.

Business challenge

The company was facing rapid traffic growth: from 100 requests/s to a projected 10,000 req/s within 6 months. The existing architecture could not scale at that pace.

Starting point

Technical problems

Current Architecture Limitations:
  - Single VM deployment: monolithic deployment on a single machine
  - No horizontal scaling: no way to add instances
  - Memory bottlenecks: 32GB RAM limit on the VM
  - Cold start latency: 5-10s startup time for new models
  - Manual scaling: resources managed by hand

Architecture before scaling

  • ML Models: TensorFlow models on a single VM (32GB RAM, 16 CPU)
  • Data processing: Pandas + NumPy on a single instance
  • Cache: local file-based cache
  • Load balancing: Nginx on the same machine
  • Monitoring: basic system logs

Performance problems

| Metric | Value | Problem |
|---------|---------|---------|
| Max throughput | 100 req/s | CPU bottleneck |
| Response time p95 | 2.5s | Too slow for real-time |
| Memory usage | 95%+ | Constant swapping |
| Error rate | 5-15% | OOM kills |
| Availability | 94% | Frequent downtime |

Implementation process

Week 1: Containerization & Kubernetes Setup

1. Model Containerization

# Dockerfile for the ML model service
# Multi-stage build for smaller images: the build stage installs dependencies
FROM python:3.9-slim AS builder

WORKDIR /app

# Dependency installation with layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model artifacts and application code
COPY models/ ./models/
COPY src/ ./src/

# Runtime stage: copy only the installed packages and app files
FROM python:3.9-slim AS runtime
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
COPY --from=builder /usr/local/bin/gunicorn /usr/local/bin/gunicorn
COPY --from=builder /app /app

EXPOSE 8080
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "4", "app:app"]

2. Kubernetes Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: ml-service
        image: datacorp/ml-inference:v1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        env:
        - name: REDIS_URL
          value: "redis://redis-cluster:6379"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
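
The Deployment only creates pods; to reach them from the ingress layer a ClusterIP Service is needed in front. A minimal sketch, assuming the Service name ml-inference and an 80 → 8080 port mapping (both are assumptions, not taken from the original configuration):

# Service exposing the ML inference pods inside the cluster (illustrative)
apiVersion: v1
kind: Service
metadata:
  name: ml-inference
spec:
  selector:
    app: ml-inference
  ports:
  - name: http
    port: 80
    targetPort: 8080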

Week 2: Auto-scaling Implementation

1. Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "200"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
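
The requests_per_second Pods metric is not built into Kubernetes; it has to be served through the custom metrics API, typically via prometheus-adapter. A possible adapter rule, assuming the ml_prediction_requests_total counter defined in Week 4 is labelled by pod and namespace (the rule itself is a sketch, not taken from the original setup):

# prometheus-adapter rule mapping the Prometheus counter to a per-pod
# requests_per_second metric that the HPA can consume
rules:
- seriesQuery: 'ml_prediction_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^ml_prediction_requests_total$"
    as: "requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'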

2. Cluster Autoscaler Setup

# Cluster Autoscaler for node scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/datacorp-ml-cluster
        - --balance-similar-node-groups
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m

Week 3: Caching & Performance Optimization

1. Redis Cluster Implementation

# Redis Cluster for distributed caching
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-cluster-config
data:
  redis.conf: |
    cluster-enabled yes
    cluster-require-full-coverage no
    cluster-node-timeout 15000
    cluster-config-file nodes.conf
    cluster-migration-barrier 1
    appendonly yes
    maxmemory 2gb
    maxmemory-policy allkeys-lru
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
      - name: redis
        image: redis:6.2-alpine
        command: ["redis-server", "/usr/local/etc/redis/redis.conf"]
        ports:
        - containerPort: 6379
        - containerPort: 16379
        volumeMounts:
        - name: conf
          mountPath: /usr/local/etc/redis/redis.conf
          subPath: redis.conf
        - name: data
          mountPath: /data
      volumes:
      - name: conf
        configMap:
          name: redis-cluster-config
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi  # illustrative size
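
Creating the StatefulSet pods is not enough on its own; the six instances still have to be joined into one cluster. One possible bootstrap, assuming the app=redis-cluster label from the manifest above and that the pods are running (the invocation is a sketch, not from the original runbook):

# One-off cluster bootstrap: 3 masters + 3 replicas across the 6 pods
kubectl exec -it redis-cluster-0 -- redis-cli --cluster create \
  $(kubectl get pods -l app=redis-cluster -o jsonpath='{range .items[*]}{.status.podIP}:6379 {end}') \
  --cluster-replicas 1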

2. Model Caching Strategy

# Smart caching for ML predictions
import os
import redis
import pickle
import hashlib
from functools import wraps

class MLModelCache:
    def __init__(self, redis_url, ttl=3600):
        self.redis_client = redis.from_url(redis_url)
        self.ttl = ttl
    
    def cache_prediction(self, model_version):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                # Create cache key from input hash
                input_hash = hashlib.md5(
                    pickle.dumps((args, kwargs))
                ).hexdigest()
                cache_key = f"ml_pred:{model_version}:{input_hash}"
                
                # Try to get from cache
                cached_result = self.redis_client.get(cache_key)
                if cached_result:
                    return pickle.loads(cached_result)
                
                # Compute and cache result
                result = func(*args, **kwargs)
                self.redis_client.setex(
                    cache_key, 
                    self.ttl, 
                    pickle.dumps(result)
                )
                return result
            return wrapper
        return decorator

# Usage in ML service
cache = MLModelCache(os.getenv('REDIS_URL'))

@cache.cache_prediction("v2.1")
def predict_user_behavior(user_features, session_data):
    return model.predict([user_features, session_data])

Week 4: Monitoring & Observability

1. Prometheus Metrics

# Custom metrics for the ML pipeline
import time

from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)

# Request metrics
prediction_requests = Counter(
    'ml_prediction_requests_total',
    'Total ML prediction requests',
    ['model_version', 'status']
)

prediction_latency = Histogram(
    'ml_prediction_duration_seconds',
    'ML prediction latency',
    ['model_version']
)

model_cache_hits = Counter(
    'ml_cache_hits_total',
    'Total cache hits',
    ['model_version']
)

active_models = Gauge(
    'ml_active_models',
    'Number of active ML models'
)

# Usage in prediction endpoint
@app.route('/predict', methods=['POST'])
def predict():
    start_time = time.time()
    
    try:
        result = make_prediction(request.json)
        prediction_requests.labels(
            model_version='v2.1', 
            status='success'
        ).inc()
        return jsonify(result)
    
    except Exception as e:
        prediction_requests.labels(
            model_version='v2.1', 
            status='error'
        ).inc()
        return jsonify({'error': str(e)}), 500
    
    finally:
        prediction_latency.labels(
            model_version='v2.1'
        ).observe(time.time() - start_time)

2. Grafana Dashboards

{
  "dashboard": {
    "title": "ML Pipeline Performance",
    "panels": [
      {
        "title": "Requests per Second",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(ml_prediction_requests_total[5m])",
            "legendFormat": "{{model_version}} - {{status}}"
          }
        ]
      },
      {
        "title": "Prediction Latency",
        "type": "graph", 
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(ml_prediction_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(ml_prediction_duration_seconds_bucket[5m]))",
            "legendFormat": "Median"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(ml_cache_hits_total[5m]) / rate(ml_prediction_requests_total[5m]) * 100",
            "legendFormat": "Cache Hit %"
          }
        ]
      }
    ]
  }
}

Results after implementation

Performance metrics

Throughput & Latency

| Metric | Before | After | Improvement |
|---------|-------|----|---------:|
| Max throughput | 100 req/s | 10,000 req/s | 100x |
| Avg latency | 800ms | 45ms | -94% |
| P95 latency | 2.5s | 120ms | -95% |
| P99 latency | 8s | 250ms | -97% |

Reliability & Availability

| Metric | Before | After | Improvement |
|---------|-------|----|---------:|
| Uptime | 94% | 99.9% | +5.9 pp |
| Error rate | 8% | 0.1% | -99% |
| MTTR | 2h | 5min | -96% |
| Cache hit rate | 0% | 85% | +85 pp |

Resource Utilization

Resource Efficiency:
  CPU Usage:
    - Single VM: 95% (constant stress)
    - Kubernetes: 60% average (elastic scaling)
  
  Memory Usage:
    - Single VM: 95%+ (frequent OOM)
    - Kubernetes: 70% average (better allocation)
  
  Cost per Request:
    - Before: $0.008/request
    - After: $0.002/request (-75%)

Auto-scaling Performance

Scale-up Events

# Example auto-scaling event
Event Type: Normal
Reason: SuccessfulRescale
Message: New size: 15; reason: cpu resource utilization (percentage of request) above target
Time: 2024-01-25 14:30:15

# Scaling effectiveness
Average scale-up time: 45 seconds
Average scale-down time: 5 minutes (deliberate delay)
Max pods reached: 45 (during traffic spike)
Min pods maintained: 3 (baseline)
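
Events like the one above can be pulled straight from the cluster with standard kubectl commands; the HPA name matches the Week 2 manifest:

# Inspect current HPA status and recent scaling events
kubectl describe hpa ml-inference-hpa

# List rescale events recorded in the namespace
kubectl get events --field-selector reason=SuccessfulRescale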

Technologies used

Container Orchestration

  • Kubernetes 1.21: Container orchestration
  • Docker: Containerization platform
  • Helm: Package management
  • Kustomize: Configuration management

Performance & Caching

  • Redis Cluster: Distributed caching (6 nodes)
  • Nginx Ingress: Load balancing & SSL termination (see the sketch after this list)
  • TensorFlow Serving: Optimized model serving
  • Gunicorn: WSGI server with multiple workers
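
A minimal sketch of the Nginx Ingress handling load balancing and TLS termination in front of the inference Service; the hostname, TLS secret name, and backend Service name are assumptions for illustration:

# Ingress terminating TLS and routing traffic to the inference Service (illustrative names)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-inference-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - ml.datacorp.example.com
    secretName: ml-inference-tls
  rules:
  - host: ml.datacorp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ml-inference
            port:
              number: 80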

Monitoring Stack

Observability:
  Metrics:
    - Prometheus: Metrics collection
    - Grafana: Visualization
    - AlertManager: Alerting (example rules below)
  
  Logging:
    - Fluentd: Log aggregation
    - Elasticsearch: Log storage
    - Kibana: Log analysis
  
  Tracing:
    - Jaeger: Distributed tracing
    - OpenTelemetry: Instrumentation
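
AlertManager receives alerts from Prometheus rules evaluated against the custom metrics defined in Week 4. A sketch of what such a rule group could look like; the thresholds are illustrative, chosen to roughly track the post-migration targets (120 ms p95, 0.1% error rate):

# Example Prometheus alerting rules for the ML pipeline (illustrative thresholds)
groups:
- name: ml-pipeline-alerts
  rules:
  - alert: HighPredictionLatency
    expr: histogram_quantile(0.95, rate(ml_prediction_duration_seconds_bucket[5m])) > 0.2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "p95 prediction latency above 200ms"
  - alert: HighPredictionErrorRate
    expr: sum(rate(ml_prediction_requests_total{status="error"}[5m])) / sum(rate(ml_prediction_requests_total[5m])) > 0.01
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "ML prediction error rate above 1%"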

Challenges & Solutions

1. Cold Start Problem

Problem: New pods needed about 30 seconds to load the models

Solution:

# Model pre-loading via init containers
initContainers:
- name: model-loader
  image: datacorp/model-loader:v1.0
  volumeMounts:
  - name: model-cache
    mountPath: /models
  command: ["python", "preload_models.py"]

2. Memory Management

Problem: TensorFlow models consumed an unpredictable amount of RAM

Solution:

# TensorFlow memory configuration
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        # Allocate GPU memory incrementally instead of grabbing it all at startup
        tf.config.experimental.set_memory_growth(gpu, True)

# Alternative: instead of memory growth, cap per-process GPU memory with an
# explicit limit (the two settings are mutually exclusive for a given device):
#   tf.config.set_logical_device_configuration(
#       gpus[0],
#       [tf.config.LogicalDeviceConfiguration(memory_limit=2048)])

3. Cache Consistency

Problem: Inconsistent cache across different model versions

Solution:

# Version-aware caching
cache_key = f"ml_pred:{model_version}:{feature_hash}:{timestamp//3600}"

Business Impact

Performance ROI

  • Revenue increase: +40% thanks to real-time personalization
  • Customer satisfaction: response times below 100ms
  • Operational efficiency: 90% reduction in manual intervention

Cost Optimization

Monthly Costs:
  Before:
    - Single VM (32 CPU, 64GB): $800/month
    - Manual operations: $4,000/month (DevOps time)
    - Downtime cost: $10,000/month (SLA breaches)
    Total: $14,800/month
  
  After:
    - Kubernetes cluster: $1,200/month (dynamic scaling)
    - Automated operations: $500/month (monitoring)
    - Downtime cost: $100/month (99.9% uptime)
    Total: $1,800/month
  
  Monthly savings: $13,000 (88% reduction)

Lessons Learned

1. Gradual Migration Strategy

"Big bang migrations są przepisem na katastrofę"

Migrowaliśmy po jednej usłudze, utrzymując fallback na starą architekturę.
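
One possible mechanism for such a weighted cutover is canary routing at the Nginx Ingress layer, shifting a small share of production traffic to the new Kubernetes-backed Service while the rest stays on the existing path. A sketch, assuming a primary Ingress for the same host already exists; the names and the 10% weight are illustrative:

# Canary Ingress sending 10% of traffic to the new service (illustrative)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-inference-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  ingressClassName: nginx
  rules:
  - host: ml.datacorp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ml-inference
            port:
              number: 80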

2. Observability First

A complete monitoring stack was crucial for debugging performance issues.

3. Cache Invalidation Strategy

Properly designed cache keys with model versioning prevented stale-data issues.

4. Resource Planning

HPA, VPA, and the Cluster Autoscaler require careful tuning to avoid thrashing; see the VPA sketch below.
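
VPA is mentioned here but not shown earlier; a sketch of how it can be run in recommendation-only mode so it does not fight the HPA over the same pods, assuming the VPA components are installed in the cluster (the manifest is an assumption, not part of the original setup):

# VPA in recommendation-only mode: reports right-sized requests without evicting
# pods, so it does not conflict with the CPU/memory-driven HPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ml-inference-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-service
  updatePolicy:
    updateMode: "Off"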

Future Improvements

Short-term (1-3 months)

  1. GPU acceleration for more complex models
  2. Model versioning with an A/B testing framework
  3. Cross-region deployment for global availability

Long-term (6-12 months)

  1. MLOps pipeline with automated model training
  2. Feature store for consistent feature engineering
  3. Multi-cloud strategy for vendor diversification

Testimonial

"EffiLab nie tylko pomógł nam przeskalować nasze ML pipeline ze 100 do 10,000 req/s, ale też nauczył nasz zespół best practices Kubernetes. Ich approach oparto o metryki i gradual migration pozwolił nam uniknąć downtime podczas transformacji. System jest teraz 100x bardziej skalowalny i 75% tańszy w utrzymaniu."

Anna Nowak, Head of Engineering, DataCorp


Need help scaling your ML infrastructure? Book a consultation and find out how we can speed up your models.