Operational Runbooks
Table of Contents
- Purpose
- Runbook 1: Deploy New Version to Production
- Runbook 2: Rollback Failed Deployment
- Runbook 3: Scale Application Under High Load
- Runbook 4: Restart Crashed Pods
- Runbook 5: Rotate Secrets and Credentials
- Runbook 6: Renew TLS Certificates
- Runbook 7: Add New Microservice to AKS
- Runbook 8: Investigate High Memory Usage
Purpose
This document provides step-by-step procedures (runbooks) for common operational tasks on the Algesta platform. Each runbook includes prerequisites, steps, verification, and rollback procedures.
Runbook 1: Deploy New Version to Production
Objective: Deploy a new version of a microservice to the production namespace
Prerequisites:
- New Docker image pushed to ACR
- Access to AKS cluster
- kubectl configured
Steps:

1. Verify Image Exists in ACR:

```bash
az acr repository show-tags \
  --name acralgestaproduction \
  --repository api-gateway \
  --output table
```

2. Get Current Deployment State:

```bash
kubectl get deployment api-gateway-production -n production -o yaml > backup-deployment.yaml
kubectl rollout history deployment/api-gateway-production -n production
```

3. Update Deployment Image:

```bash
kubectl set image deployment/api-gateway-production \
  api-gateway=acralgestaproduction.azurecr.io/api-gateway:12345 \
  --namespace=production \
  --record
```

4. Monitor Rollout:

```bash
kubectl rollout status deployment/api-gateway-production -n production --timeout=5m
```

5. Verify New Pods Running:

```bash
kubectl get pods -n production -l app=api-gateway
kubectl logs -l app=api-gateway -n production --tail=50
```

6. Test Health Endpoint:

```bash
curl https://algesta-api-prod.3astronautas.com/health
```
Verification:
- All pods in Running state
- Health endpoint returns {"status":"ok"}
- No errors in pod logs
- Grafana shows normal metrics (CPU, memory, request rate)
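Beyond the single health call, a short loop can confirm the endpoint responds consistently (a quick sketch; adjust the URL and attempt count as needed):

```bash
# Hit the health endpoint 10 times and flag any non-200 response
for i in $(seq 1 10); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://algesta-api-prod.3astronautas.com/health)
  [ "$code" = "200" ] || echo "attempt $i returned HTTP $code"
done
```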
Rollback (if deployment fails):

```bash
kubectl rollout undo deployment/api-gateway-production -n production
kubectl rollout status deployment/api-gateway-production -n production
```

Runbook 2: Rollback Failed Deployment
Objective: Roll back to the previous version after a failed deployment
Prerequisites:
- Deployment has rollout history
- Access to AKS cluster
Steps:

1. Check Rollout History:

```bash
kubectl rollout history deployment/api-gateway-production -n production
```

2. Identify Target Revision:

```bash
kubectl rollout history deployment/api-gateway-production -n production --revision=3
```

3. Rollback to Previous Version:

```bash
# Roll back to the previous revision
kubectl rollout undo deployment/api-gateway-production -n production
# OR roll back to a specific revision
kubectl rollout undo deployment/api-gateway-production -n production --to-revision=3
```

4. Monitor Rollback:

```bash
kubectl rollout status deployment/api-gateway-production -n production
```

5. Verify Pods Running:

```bash
kubectl get pods -n production -l app=api-gateway
```

6. Test Application:

```bash
curl https://algesta-api-prod.3astronautas.com/health
```
Verification:
- Pods running with previous image tag
- Health checks passing
- Error rate decreased in Grafana
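To confirm exactly which image the rolled-back deployment is now running (assuming a single container named api-gateway), a quick check:

```bash
# Prints the image (including tag) currently set on the deployment
kubectl get deployment api-gateway-production -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```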
Runbook 3: Scale Application Under High Load
Objective: Manually scale a deployment to handle a traffic spike
Prerequisites:
- Monitoring shows high CPU/memory usage
- Access to AKS cluster
Steps:

1. Check Current Scale:

```bash
kubectl get deployment api-gateway-production -n production
kubectl top pods -n production -l app=api-gateway
```

2. Check HPA Status:

```bash
kubectl get hpa api-gateway-production-hpa -n production
```

3. Temporarily Pin HPA Replicas (optional, so the HPA doesn't fight the manual scale):

```bash
kubectl patch hpa api-gateway-production-hpa -n production \
  --patch '{"spec":{"minReplicas":5,"maxReplicas":5}}'
```

4. Scale Deployment:

```bash
# Scale to 5 replicas
kubectl scale deployment/api-gateway-production --replicas=5 -n production
```

5. Verify Scaling:

```bash
kubectl get pods -n production -l app=api-gateway -w
```

6. Monitor Metrics:

```bash
kubectl top pods -n production -l app=api-gateway
```
Verification:
- 5 pods in Running state
- CPU/memory usage per pod decreased
- Request latency decreased in Grafana
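For reference, a minimal HPA matching the patch in step 3 might look like the following; this is a sketch assuming a CPU-utilization target, and the resource actually deployed in the cluster is authoritative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-production-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway-production
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target, not confirmed by this runbook
```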
Cleanup (after traffic subsides):

```bash
# Re-enable normal HPA bounds
kubectl patch hpa api-gateway-production-hpa -n production \
  --patch '{"spec":{"minReplicas":2,"maxReplicas":10}}'
```

Runbook 4: Restart Crashed Pods
Objective: Restart pods stuck in CrashLoopBackOff or Error state
Prerequisites:
- Pods in unhealthy state
- Access to AKS cluster
Steps:

1. Identify Crashed Pods:

```bash
kubectl get pods -n production --field-selector=status.phase!=Running
```

2. Check Pod Logs:

```bash
kubectl logs api-gateway-production-xxxxx-yyyyy -n production
kubectl logs api-gateway-production-xxxxx-yyyyy -n production --previous
```

3. Check Pod Events:

```bash
kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production
```

4. Delete Crashed Pod (Kubernetes will recreate it):

```bash
kubectl delete pod api-gateway-production-xxxxx-yyyyy -n production
```

OR restart the entire deployment:

```bash
kubectl rollout restart deployment/api-gateway-production -n production
```

5. Monitor New Pods:

```bash
kubectl get pods -n production -l app=api-gateway -w
```
Verification:
- All pods in Running state
- No crash loops
- Application responding to requests
If Pods Continue Crashing:
- Check for missing environment variables: `kubectl exec <pod> -n production -- env`
- Check database connectivity: `kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017`
- Check resource limits: `kubectl describe pod <pod> -n production | grep -A 5 Resources`
- Escalate to Incident Response
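To spot crash-looping pods across the whole namespace, sorting by restart count is a quick supplementary check:

```bash
# Pods with the highest restart counts appear last
kubectl get pods -n production \
  --sort-by='.status.containerStatuses[0].restartCount'
```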
Runbook 5: Rotate Secrets and Credentials
Objective: Rotate MongoDB credentials and update Kubernetes secrets
Prerequisites:
- New credentials generated in MongoDB Atlas
- Access to AKS cluster and Azure Key Vault
Steps:

1. Generate New MongoDB User in Atlas:
- Log in to MongoDB Atlas
- Navigate to Database Access → Add New Database User
- Create a new user with the same permissions
- Copy the connection string

2. Update Azure Key Vault:

```bash
az keyvault secret set \
  --name mongodb-uri \
  --vault-name akv-algesta-production \
  --value "mongodb+srv://newuser:newpassword@cluster.mongodb.net/algesta"
```

3. Update Kubernetes Secret:

```bash
kubectl create secret generic mongodb-credentials \
  --from-literal=uri="mongodb+srv://newuser:newpassword@cluster.mongodb.net/algesta" \
  --namespace=production \
  --dry-run=client -o yaml | kubectl apply -f -
```

4. Restart Deployments:

```bash
kubectl rollout restart deployment -n production
```

5. Verify Deployments:

```bash
kubectl get pods -n production
kubectl logs -l app=api-gateway -n production | grep "MongoDB connected"
```

6. Delete Old MongoDB User in Atlas:
- MongoDB Atlas → Database Access → Delete old user
Verification:
- All pods running
- No MongoDB connection errors in logs
- Application responding to requests
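To confirm the Secret now holds the new value without echoing the full credential, decode just the first characters (assumes the uri key shown in step 3):

```bash
# Print only the beginning of the decoded connection string
kubectl get secret mongodb-credentials -n production \
  -o jsonpath='{.data.uri}' | base64 -d | cut -c1-30
```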
Rollback (if issues occur):

```bash
kubectl create secret generic mongodb-credentials \
  --from-literal=uri="<old-connection-string>" \
  --namespace=production \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment -n production
```

Runbook 6: Renew TLS Certificates
Objective: Manually renew TLS certificates if auto-renewal fails
Prerequisites:
- cert-manager installed in cluster
- Access to AKS cluster
Steps:

1. Check Certificate Status:

```bash
kubectl get certificates -n production
kubectl describe certificate algesta-api-prod-tls -n production
```

2. Check cert-manager Logs:

```bash
kubectl logs -n cert-manager -l app=cert-manager --tail=100
```

3. Delete Certificate (triggers re-issuance):

```bash
kubectl delete certificate algesta-api-prod-tls -n production
```

4. Monitor Certificate Creation:

```bash
kubectl get certificate algesta-api-prod-tls -n production -w
```

5. Check CertificateRequests and Orders:

```bash
kubectl get certificaterequests -n production
kubectl get orders -n production
kubectl get challenges -n production
```

6. Verify Certificate Issued:

```bash
kubectl get certificate algesta-api-prod-tls -n production
# The READY column should show "True"
```

7. Test HTTPS:

```bash
curl -v https://algesta-api-prod.3astronautas.com/health
openssl s_client -connect algesta-api-prod.3astronautas.com:443 </dev/null 2>/dev/null | openssl x509 -noout -dates
```
Verification:
- Certificate status: Ready=True
- New certificate issued (check the expiration date)
- HTTPS connections successful
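If the cmctl CLI for cert-manager is installed, renewal can also be triggered without deleting the Certificate resource, as an alternative to step 3:

```bash
# Mark the certificate for immediate renewal
cmctl renew algesta-api-prod-tls -n production
# Check issuance status
cmctl status certificate algesta-api-prod-tls -n production
```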
If Auto-Renewal Continues to Fail:
- Check that DNS records point to the correct ingress IP
- Check that the ClusterIssuer is configured correctly
- Check that the ACME challenge can reach the ingress (HTTP-01 validation)
- Contact the DevOps team for manual certificate upload
Runbook 7: Add New Microservice to AKS
Objective: Deploy a new microservice to the production namespace
Prerequisites:
- Docker image built and pushed to ACR
- Kubernetes manifests prepared
- Database and secrets configured
Steps:

1. Create Kubernetes Manifests (health probes omitted here; see the sketch after the verification list):

ms-new-service-deployment.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ms-new-service-production
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ms-new-service
  template:
    metadata:
      labels:
        app: ms-new-service
    spec:
      containers:
        - name: ms-new-service
          image: acralgestaproduction.azurecr.io/ms-new-service:latest
          ports:
            - containerPort: 3004
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials
                  key: uri
---
apiVersion: v1
kind: Service
metadata:
  name: ms-new-service-production
  namespace: production
spec:
  selector:
    app: ms-new-service
  ports:
    - port: 80
      targetPort: 3004
```

2. Apply Manifests:

```bash
kubectl apply -f ms-new-service-deployment.yaml
```

3. Verify Deployment:

```bash
kubectl get deployments -n production
kubectl get pods -n production -l app=ms-new-service
kubectl logs -l app=ms-new-service -n production
```

4. Test Service Internally:

```bash
kubectl run -it --rm debug --image=curlimages/curl:latest --restart=Never -n production -- \
  curl http://ms-new-service-production.production.svc.cluster.local/health
```

5. Update Ingress (if external access needed):

```yaml
# Add to ingress-production.yaml
- path: /new-service
  pathType: Prefix
  backend:
    service:
      name: ms-new-service-production
      port:
        number: 80
```

```bash
kubectl apply -f ingress-production.yaml
```

6. Test External Access:

```bash
curl https://algesta-api-prod.3astronautas.com/new-service/health
```
Verification:
- Deployment shows 2/2 ready replicas
- Pods in Running state
- Service endpoints populated
- Health checks passing
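The manifest in step 1 omits health probes; in practice the container spec would likely include liveness and readiness probes along these lines (a sketch assuming the service exposes /health on port 3004):

```yaml
# Add under the ms-new-service container spec
livenessProbe:
  httpGet:
    path: /health   # assumed health endpoint
    port: 3004
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health
    port: 3004
  initialDelaySeconds: 5
  periodSeconds: 10
```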
Runbook 8: Investigate High Memory Usage
Objective: Diagnose and resolve high memory usage in pods
Prerequisites:
- Monitoring alert triggered
- Access to AKS cluster and Grafana
Steps:

1. Identify Pods with High Memory:

```bash
kubectl top pods -n production --sort-by=memory
```

2. Check Memory Limits:

```bash
kubectl describe deployment api-gateway-production -n production | grep -A 5 "Limits"
```

3. Check Pod Logs for Memory Errors:

```bash
kubectl logs api-gateway-production-xxxxx-yyyyy -n production | grep -i "memory\|oom\|heap"
```

4. Check for OOMKilled Events:

```bash
kubectl get pods -n production -l app=api-gateway -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
```

5. Spot-Check Node.js Memory (NestJS/Node.js):

```bash
# Note: this starts a fresh Node process in the container, so it reports that
# process's memory, not the app's; for a real heap snapshot see the sketch after these steps
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- \
  node --expose-gc -e "global.gc(); console.log(process.memoryUsage())"
```

6. Check Grafana for Memory Trends:
- Open Grafana → Application Performance Dashboard
- Query: `container_memory_usage_bytes{pod=~"api-gateway-production.*"}`
- Check for a gradual increase (memory leak) or a sudden spike
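If a true heap snapshot is needed (the spot-check in step 5 only reports aggregate numbers for a fresh process), one option is to run the app with Node's --heapsnapshot-signal flag and signal it on demand. This is a sketch that assumes the container was started with that flag, the app runs as PID 1, and the working directory (/app here) is writable:

```bash
# Trigger a heap snapshot inside the container (requires the app to have been
# started with: node --heapsnapshot-signal=SIGUSR2 main.js)
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- sh -c 'kill -USR2 1'
# List the directory to find the generated .heapsnapshot file (name includes a timestamp)
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- ls /app
# Copy it out for analysis in Chrome DevTools (replace <snapshot-file> with the real name)
kubectl cp production/api-gateway-production-xxxxx-yyyyy:/app/<snapshot-file> ./heap.heapsnapshot
```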
Resolution:
If Memory Leak:
- Review recent code changes for memory leaks
- Check for unclosed database connections
- Review logs for excessive object creation
- Restart pods as a temporary fix:

```bash
kubectl rollout restart deployment/api-gateway-production -n production
```

If Insufficient Memory:
- Increase memory limits:

```bash
kubectl set resources deployment api-gateway-production -n production \
  --limits=memory=2Gi \
  --requests=memory=1Gi
```
Verification:
- Memory usage stabilized
- No OOMKilled events
- Application responding normally
Related Documentation:
- Kubernetes Operations: kubectl reference
- Incident Response: Escalation procedures
- Backup & DR: Restore procedures
For Support:
- Escalate to the on-call engineer if the runbook doesn't resolve the issue
- Document findings in an incident report
- Update the runbook based on lessons learned