
Operational Runbooks

Table of Contents

  1. Purpose
  2. Runbook 1: Deploy New Version to Production
  3. Runbook 2: Rollback Failed Deployment
  4. Runbook 3: Scale Application Under High Load
  5. Runbook 4: Restart Crashed Pods
  6. Runbook 5: Rotate Secrets and Credentials
  7. Runbook 6: Renew TLS Certificates
  8. Runbook 7: Add New Microservice to AKS
  9. Runbook 8: Investigate High Memory Usage

Purpose

This document provides step-by-step procedures (runbooks) for common operational tasks on the Algesta platform. Each runbook includes prerequisites, steps, verification, and rollback procedures.


Runbook 1: Deploy New Version to Production

Objective: Deploy a new version of a microservice to the production namespace

Prerequisites:

  • New Docker image pushed to ACR
  • Access to AKS cluster
  • kubectl configured

Steps:

  1. Verify Image Exists in ACR:

    Terminal window
    az acr repository show-tags \
    --name acralgestaproduction \
    --repository api-gateway \
    --output table
  2. Get Current Deployment State:

    Terminal window
    kubectl get deployment api-gateway-production -n production -o yaml > backup-deployment.yaml
    kubectl rollout history deployment/api-gateway-production -n production
  3. Update Deployment Image:

    Terminal window
    kubectl set image deployment/api-gateway-production \
    api-gateway=acralgestaproduction.azurecr.io/api-gateway:12345 \
    --namespace=production
    kubectl annotate deployment/api-gateway-production -n production \
    kubernetes.io/change-cause="deploy api-gateway:12345"
  4. Monitor Rollout:

    Terminal window
    kubectl rollout status deployment/api-gateway-production -n production --timeout=5m
  5. Verify New Pods Running:

    Terminal window
    kubectl get pods -n production -l app=api-gateway
    kubectl logs -l app=api-gateway -n production --tail=50
  6. Test Health Endpoint:

    Terminal window
    curl https://algesta-api-prod.3astronautas.com/health
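A single curl in step 6 can fail transiently while old pods drain during the rollout. A small retry wrapper is a useful alternative; this is a sketch, and the `HEALTH_CMD` override plus the attempt/delay defaults are assumptions, not part of the platform:

```shell
#!/usr/bin/env bash
# Poll the health endpoint until it responds, or fail after N attempts.
# HEALTH_CMD can be overridden (e.g. for testing); by default it curls prod.
wait_for_health() {
  local attempts="${1:-10}" delay="${2:-5}" i
  local cmd="${HEALTH_CMD:-curl -fsS --max-time 5 https://algesta-api-prod.3astronautas.com/health}"
  for ((i = 1; i <= attempts; i++)); do
    if $cmd > /dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "unhealthy after ${attempts} attempts" >&2
  return 1
}
# Usage: wait_for_health 10 5
```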

Verification:

  • All pods in Running state
  • Health endpoint returns {"status":"ok"}
  • No errors in pod logs
  • Grafana shows normal metrics (CPU, memory, request rate)

Rollback (if deployment fails):

Terminal window
kubectl rollout undo deployment/api-gateway-production -n production
kubectl rollout status deployment/api-gateway-production -n production

Runbook 2: Rollback Failed Deployment

Objective: Roll back to the previous version after a failed deployment

Prerequisites:

  • Deployment has rollout history
  • Access to AKS cluster

Steps:

  1. Check Rollout History:

    Terminal window
    kubectl rollout history deployment/api-gateway-production -n production
  2. Identify Target Revision:

    Terminal window
    kubectl rollout history deployment/api-gateway-production -n production --revision=3
  3. Rollback to Previous Version:

    Terminal window
    # Rollback to previous revision
    kubectl rollout undo deployment/api-gateway-production -n production
    # OR rollback to specific revision
    kubectl rollout undo deployment/api-gateway-production -n production --to-revision=3
  4. Monitor Rollback:

    Terminal window
    kubectl rollout status deployment/api-gateway-production -n production
  5. Verify Pods Running:

    Terminal window
    kubectl get pods -n production -l app=api-gateway
  6. Test Application:

    Terminal window
    curl https://algesta-api-prod.3astronautas.com/health
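Picking the rollback target from step 1's history table can be scripted; a minimal sketch, assuming kubectl's default two-column history output (the function name is illustrative):

```shell
# Print the second-to-last revision number from `kubectl rollout history`
# output on stdin -- i.e. the revision to roll back to.
previous_revision() {
  awk 'NR > 1 && $1 ~ /^[0-9]+$/ { revs[n++] = $1 } END { if (n >= 2) print revs[n-2] }'
}
# Usage:
# kubectl rollout history deployment/api-gateway-production -n production \
#   | previous_revision
```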

Verification:

  • Pods running with previous image tag
  • Health checks passing
  • Error rate decreased in Grafana

Runbook 3: Scale Application Under High Load

Objective: Manually scale a deployment to handle a traffic spike

Prerequisites:

  • Monitoring shows high CPU/memory usage
  • Access to AKS cluster

Steps:

  1. Check Current Scale:

    Terminal window
    kubectl get deployment api-gateway-production -n production
    kubectl top pods -n production -l app=api-gateway
  2. Check HPA Status:

    Terminal window
    kubectl get hpa api-gateway-production-hpa -n production
  3. Pin HPA Replicas (optional; prevents the HPA from undoing manual scaling):

    Terminal window
    kubectl patch hpa api-gateway-production-hpa -n production \
    --patch '{"spec":{"minReplicas":5,"maxReplicas":5}}'
  4. Scale the Deployment:

    Terminal window
    # Scale to 5 replicas
    kubectl scale deployment/api-gateway-production --replicas=5 -n production
  5. Verify Scaling:

    Terminal window
    kubectl get pods -n production -l app=api-gateway -w
  6. Monitor Metrics:

    Terminal window
    kubectl top pods -n production -l app=api-gateway
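When choosing a replica count manually, the HPA's own formula is a useful guide: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A small sketch of that arithmetic (the function name is illustrative):

```shell
# Integer ceiling of current * current_cpu / target_cpu -- the same
# formula the HorizontalPodAutoscaler uses for its desired replica count.
desired_replicas() {
  local current=$1 current_cpu=$2 target_cpu=$3
  echo $(( (current * current_cpu + target_cpu - 1) / target_cpu ))
}
# Example: 2 replicas at 180% CPU with a 60% target suggests 6 replicas.
```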

Verification:

  • 5 pods in Running state
  • CPU/memory usage per pod decreased
  • Request latency decreased in Grafana

Cleanup (after traffic subsides):

Terminal window
# Re-enable HPA
kubectl patch hpa api-gateway-production-hpa -n production \
--patch '{"spec":{"minReplicas":2,"maxReplicas":10}}'

Runbook 4: Restart Crashed Pods

Objective: Restart pods stuck in CrashLoopBackOff or Error state

Prerequisites:

  • Pods in unhealthy state
  • Access to AKS cluster

Steps:

  1. Identify Crashed Pods:

    Terminal window
    kubectl get pods -n production --field-selector=status.phase!=Running
  2. Check Pod Logs:

    Terminal window
    kubectl logs api-gateway-production-xxxxx-yyyyy -n production
    kubectl logs api-gateway-production-xxxxx-yyyyy -n production --previous
  3. Check Pod Events:

    Terminal window
    kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production
  4. Delete Crashed Pod (Kubernetes will recreate):

    Terminal window
    kubectl delete pod api-gateway-production-xxxxx-yyyyy -n production
  5. OR Restart the Entire Deployment:

    Terminal window
    kubectl rollout restart deployment/api-gateway-production -n production
  6. Monitor New Pods:

    Terminal window
    kubectl get pods -n production -l app=api-gateway -w
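To act on many crashed pods at once, step 1's listing can be filtered; a sketch assuming kubectl's default column layout (STATUS is the third column):

```shell
# Print names of pods whose STATUS column reads CrashLoopBackOff,
# given `kubectl get pods` default output on stdin.
crashloop_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}
# Usage:
# kubectl get pods -n production | crashloop_pods \
#   | xargs -r -n1 kubectl delete pod -n production
```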

Verification:

  • All pods in Running state
  • No crash loops
  • Application responding to requests

If Pods Continue Crashing:

  • Check for missing environment variables: kubectl exec <pod> -n production -- env
  • Check database connectivity: kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017
  • Check resource limits: kubectl describe pod <pod> -n production | grep -A 5 Resources
  • Escalate to Incident Response

Runbook 5: Rotate Secrets and Credentials

Objective: Rotate MongoDB credentials and update Kubernetes secrets

Prerequisites:

  • New credentials generated in MongoDB Atlas
  • Access to AKS cluster and Azure Key Vault

Steps:

  1. Generate New MongoDB User in Atlas:

    • Login to MongoDB Atlas
    • Navigate to Database Access → Add New Database User
    • Create new user with same permissions
    • Copy connection string
  2. Update Azure Key Vault:

    Terminal window
    az keyvault secret set \
    --name mongodb-uri \
    --vault-name akv-algesta-production \
    --value "mongodb+srv://newuser:newpassword@cluster.mongodb.net/algesta"
  3. Update Kubernetes Secret:

    Terminal window
    kubectl create secret generic mongodb-credentials \
    --from-literal=uri="mongodb+srv://newuser:newpassword@cluster.mongodb.net/algesta" \
    --namespace=production \
    --dry-run=client -o yaml | kubectl apply -f -
  4. Restart Deployments:

    Terminal window
    kubectl rollout restart deployment -n production
  5. Verify Deployments:

    Terminal window
    kubectl get pods -n production
    kubectl logs -l app=api-gateway -n production | grep "MongoDB connected"
  6. Delete Old MongoDB User in Atlas:

    • MongoDB Atlas → Database Access → Delete old user
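Pasting a malformed connection string into the secret is an easy mistake in this runbook; a tiny pre-flight check (shape-only, it does not verify the credentials; the function name is illustrative) can catch it before step 3:

```shell
# Return 0 if the string looks like a MongoDB connection URI
# (scheme, credentials, host, database); a shape check only.
valid_mongo_uri() {
  case "$1" in
    mongodb+srv://*@*/*) return 0 ;;
    mongodb://*@*/*)     return 0 ;;
    *)                   return 1 ;;
  esac
}
# Usage: valid_mongo_uri "$NEW_URI" || { echo "malformed URI" >&2; exit 1; }
```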

Verification:

  • All pods running
  • No MongoDB connection errors in logs
  • Application responding to requests

Rollback (if issues occur):

Terminal window
kubectl create secret generic mongodb-credentials \
--from-literal=uri="<old-connection-string>" \
--namespace=production \
--dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment -n production

Runbook 6: Renew TLS Certificates

Objective: Manually renew TLS certificates if auto-renewal fails

Prerequisites:

  • cert-manager installed in cluster
  • Access to AKS cluster

Steps:

  1. Check Certificate Status:

    Terminal window
    kubectl get certificates -n production
    kubectl describe certificate algesta-api-prod-tls -n production
  2. Check cert-manager Logs:

    Terminal window
    kubectl logs -n cert-manager -l app=cert-manager --tail=100
  3. Delete Certificate (triggers re-issuance):

    Terminal window
    kubectl delete certificate algesta-api-prod-tls -n production
  4. Monitor Certificate Creation:

    Terminal window
    kubectl get certificate algesta-api-prod-tls -n production -w
  5. Check CertificateRequest and Orders:

    Terminal window
    kubectl get certificaterequests -n production
    kubectl get orders -n production
    kubectl get challenges -n production
  6. Verify Certificate Issued:

    Terminal window
    kubectl get certificate algesta-api-prod-tls -n production
    # STATUS should be "True"
  7. Test HTTPS:

    Terminal window
    curl -v https://algesta-api-prod.3astronautas.com/health
    openssl s_client -connect algesta-api-prod.3astronautas.com:443 < /dev/null 2>/dev/null | openssl x509 -noout -dates
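The `notAfter` date from step 7 can be turned into a days-remaining figure for alerting; a sketch assuming GNU `date` (the `-d` flag is not portable to BSD/macOS):

```shell
# Days until a certificate expires, given its notAfter date string,
# e.g. "Jun  1 12:00:00 2031 GMT" as printed by `openssl x509 -dates`.
days_until_expiry() {
  local not_after=$1 now exp
  now=$(date +%s)
  exp=$(date -d "$not_after" +%s)
  echo $(( (exp - now) / 86400 ))
}
# Usage:
# na=$(openssl s_client -connect algesta-api-prod.3astronautas.com:443 \
#   < /dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
# days_until_expiry "$na"
```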

Verification:

  • Certificate status: Ready=True
  • New certificate issued (check expiration date)
  • HTTPS connections successful

If Auto-Renewal Continues to Fail:

  • Check DNS records point to correct ingress IP
  • Check ClusterIssuer configured correctly
  • Check ACME challenge can reach ingress (HTTP-01 validation)
  • Contact DevOps team for manual certificate upload

Runbook 7: Add New Microservice to AKS

Objective: Deploy a new microservice to the production namespace

Prerequisites:

  • Docker image built and pushed to ACR
  • Kubernetes manifests prepared
  • Database and secrets configured

Steps:

  1. Create Kubernetes Manifests:

    ms-new-service-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ms-new-service-production
      namespace: production
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: ms-new-service
      template:
        metadata:
          labels:
            app: ms-new-service
        spec:
          containers:
            - name: ms-new-service
              image: acralgestaproduction.azurecr.io/ms-new-service:latest
              ports:
                - containerPort: 3004
              env:
                - name: MONGODB_URI
                  valueFrom:
                    secretKeyRef:
                      name: mongodb-credentials
                      key: uri
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ms-new-service-production
      namespace: production
    spec:
      selector:
        app: ms-new-service
      ports:
        - port: 80
          targetPort: 3004
  2. Apply Manifests:

    Terminal window
    kubectl apply -f ms-new-service-deployment.yaml
  3. Verify Deployment:

    Terminal window
    kubectl get deployments -n production
    kubectl get pods -n production -l app=ms-new-service
    kubectl logs -l app=ms-new-service -n production
  4. Test Service Internally:

    Terminal window
    kubectl run -it --rm debug --image=curlimages/curl:latest --restart=Never -n production -- \
    curl http://ms-new-service-production.production.svc.cluster.local/health
  5. Update Ingress (if external access needed):

    # Add to ingress-production.yaml
    - path: /new-service
      pathType: Prefix
      backend:
        service:
          name: ms-new-service-production
          port:
            number: 80
    Terminal window
    kubectl apply -f ingress-production.yaml
  6. Test External Access:

    Terminal window
    curl https://algesta-api-prod.3astronautas.com/new-service/health
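The Deployment in step 1 defines no health probes, so Kubernetes cannot restart a hung container or keep traffic away from a pod that is not ready yet. A sketch of probes to merge into the container spec, assuming the new service also serves /health on its containerPort (3004); tune the delays to the service's real startup time:

```yaml
# Hypothetical probe block for the ms-new-service container spec.
livenessProbe:
  httpGet:
    path: /health
    port: 3004
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health
    port: 3004
  initialDelaySeconds: 5
  periodSeconds: 10
```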

Verification:

  • Deployment shows 2/2 ready replicas
  • Pods in Running state
  • Service endpoints populated
  • Health checks passing

Runbook 8: Investigate High Memory Usage

Objective: Diagnose and resolve high memory usage in pods

Prerequisites:

  • Monitoring alert triggered
  • Access to AKS cluster and Grafana

Steps:

  1. Identify Pods with High Memory:

    Terminal window
    kubectl top pods -n production --sort-by=memory
  2. Check Memory Limits:

    Terminal window
    kubectl describe deployment api-gateway-production -n production | grep -A 5 "Limits"
  3. Check Pod Logs for Memory Errors:

    Terminal window
    kubectl logs api-gateway-production-xxxxx-yyyyy -n production | grep -i "memory\|oom\|heap"
  4. Check for OOMKilled Events:

    Terminal window
    kubectl get pods -n production -l app=api-gateway -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
  5. Check Node.js Process Memory (NestJS/Node.js):

    Terminal window
    # Note: this starts a new node process inside the container and reports
    # that process's usage; a true heap dump requires tooling inside the app.
    kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- \
    node --expose-gc -e "global.gc(); console.log(process.memoryUsage())"
  6. Check Grafana for Memory Trends:

    • Open Grafana → Application Performance Dashboard
    • Query: container_memory_usage_bytes{pod=~"api-gateway-production.*"}
    • Check for gradual increase (memory leak) or sudden spike
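Comparing step 1's live usage against step 2's limits is easier with both in bytes; a small converter for kubectl-style quantities (handles Ki/Mi/Gi only; anything else passes through unchanged):

```shell
# Convert a kubectl memory quantity (e.g. 512Mi, 2Gi) to bytes so that
# live usage can be compared numerically against the configured limit.
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}
```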

Resolution:

If Memory Leak:

  • Review recent code changes for memory leaks
  • Check for unclosed database connections
  • Review logs for excessive object creation
  • Restart pods as temporary fix:
    Terminal window
    kubectl rollout restart deployment/api-gateway-production -n production

If Insufficient Memory:

  • Increase memory limits:
    Terminal window
    kubectl set resources deployment api-gateway-production -n production \
    --limits=memory=2Gi \
    --requests=memory=1Gi

Verification:

  • Memory usage stabilized
  • No OOMKilled events
  • Application responding normally

For Support:

  • Escalate to on-call engineer if runbook doesn’t resolve issue
  • Document findings in the incident report
  • Update runbook based on lessons learned