Operational Runbooks
Table of Contents
- Purpose
- Runbook 1: Deploy New Version to Production
- Runbook 2: Rollback Failed Deployment
- Runbook 3: Scale Application Under High Load
- Runbook 4: Restart Crashed Pods
- Runbook 5: Rotate Secrets and Credentials
- Runbook 6: Renew TLS Certificates
- Runbook 7: Add New Microservice to AKS
- Runbook 8: Investigate High Memory Usage
Purpose
This document provides step-by-step procedures (runbooks) for common operational tasks on the Algesta platform. Each runbook includes prerequisites, steps, verification, and rollback procedures.
Runbook 1: Deploy New Version to Production
Objective: Deploy a new version of a microservice to the production namespace
Prerequisites:
- New Docker image pushed to ACR
- Access to AKS cluster
- kubectl configured
Steps:

1. Verify Image Exists in ACR:

```bash
az acr repository show-tags \
  --name acralgestaproduction \
  --repository api-gateway \
  --output table
```

2. Get Current Deployment State:

```bash
kubectl get deployment api-gateway-production -n production -o yaml > backup-deployment.yaml
kubectl rollout history deployment/api-gateway-production -n production
```

3. Update Deployment Image:

```bash
kubectl set image deployment/api-gateway-production \
  api-gateway=acralgestaproduction.azurecr.io/api-gateway:12345 \
  --namespace=production \
  --record
```

4. Monitor Rollout:

```bash
kubectl rollout status deployment/api-gateway-production -n production --timeout=5m
```

5. Verify New Pods Running:

```bash
kubectl get pods -n production -l app=api-gateway
kubectl logs -l app=api-gateway -n production --tail=50
```

6. Test Health Endpoint:

```bash
curl https://algesta-api-prod.3astronautas.com/health
```
Verification:
- All pods in Running state
- Health endpoint returns {"status":"ok"}
- No errors in pod logs
- Grafana shows normal metrics (CPU, memory, request rate)
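Beyond the single health call, a short loop can confirm the endpoint responds consistently (a quick sketch; adjust the URL and attempt count as needed):

```bash
# Hit the health endpoint 10 times and flag any non-200 response
for i in $(seq 1 10); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://algesta-api-prod.3astronautas.com/health)
  [ "$code" = "200" ] || echo "attempt $i returned HTTP $code"
done
```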
Rollback (if deployment fails):

```bash
kubectl rollout undo deployment/api-gateway-production -n production
kubectl rollout status deployment/api-gateway-production -n production
```

Runbook 2: Rollback Failed Deployment
Objective: Roll back to the previous version after a failed deployment
Prerequisites:
- Deployment has rollout history
- Access to AKS cluster
Steps:

1. Check Rollout History:

```bash
kubectl rollout history deployment/api-gateway-production -n production
```

2. Identify Target Revision:

```bash
kubectl rollout history deployment/api-gateway-production -n production --revision=3
```

3. Rollback to Previous Version:

```bash
# Roll back to the previous revision
kubectl rollout undo deployment/api-gateway-production -n production
# OR roll back to a specific revision
kubectl rollout undo deployment/api-gateway-production -n production --to-revision=3
```

4. Monitor Rollback:

```bash
kubectl rollout status deployment/api-gateway-production -n production
```

5. Verify Pods Running:

```bash
kubectl get pods -n production -l app=api-gateway
```

6. Test Application:

```bash
curl https://algesta-api-prod.3astronautas.com/health
```
Verification:
- Pods running with previous image tag
- Health checks passing
- Error rate decreased in Grafana
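To confirm exactly which image the rolled-back deployment is now running (assuming a single container named api-gateway), a quick check:

```bash
# Prints the image (including tag) currently set on the deployment
kubectl get deployment api-gateway-production -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```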
Runbook 3: Scale Application Under High Load
Objective: Manually scale a deployment to handle a traffic spike
Prerequisites:
- Monitoring shows high CPU/memory usage
- Access to AKS cluster
Steps:

1. Check Current Scale:

```bash
kubectl get deployment api-gateway-production -n production
kubectl top pods -n production -l app=api-gateway
```

2. Check HPA Status:

```bash
kubectl get hpa api-gateway-production-hpa -n production
```

3. Temporarily Pin HPA Replicas (optional, so the HPA doesn't fight the manual scale):

```bash
kubectl patch hpa api-gateway-production-hpa -n production \
  --patch '{"spec":{"minReplicas":5,"maxReplicas":5}}'
```

4. Scale Deployment:

```bash
# Scale to 5 replicas
kubectl scale deployment/api-gateway-production --replicas=5 -n production
```

5. Verify Scaling:

```bash
kubectl get pods -n production -l app=api-gateway -w
```

6. Monitor Metrics:

```bash
kubectl top pods -n production -l app=api-gateway
```
Verification:
- 5 pods in Running state
- CPU/memory usage per pod decreased
- Request latency decreased in Grafana
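For reference, a minimal HPA matching the patch in step 3 might look like the following; this is a sketch assuming a CPU-utilization target, and the resource actually deployed in the cluster is authoritative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-production-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway-production
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target, not confirmed by this runbook
```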
Cleanup (after traffic subsides):

```bash
# Re-enable normal HPA bounds
kubectl patch hpa api-gateway-production-hpa -n production \
  --patch '{"spec":{"minReplicas":2,"maxReplicas":10}}'
```

Runbook 4: Restart Crashed Pods
Objective: Restart pods stuck in CrashLoopBackOff or Error state
Prerequisites:
- Pods in unhealthy state
- Access to AKS cluster
Steps:

1. Identify Crashed Pods:

```bash
kubectl get pods -n production --field-selector=status.phase!=Running
```

2. Check Pod Logs:

```bash
kubectl logs api-gateway-production-xxxxx-yyyyy -n production
kubectl logs api-gateway-production-xxxxx-yyyyy -n production --previous
```

3. Check Pod Events:

```bash
kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production
```

4. Delete Crashed Pod (Kubernetes will recreate it):

```bash
kubectl delete pod api-gateway-production-xxxxx-yyyyy -n production
```

OR restart the entire deployment:

```bash
kubectl rollout restart deployment/api-gateway-production -n production
```

5. Monitor New Pods:

```bash
kubectl get pods -n production -l app=api-gateway -w
```
Verification:
- All pods in Running state
- No crash loops
- Application responding to requests
If Pods Continue Crashing:
- Check for missing environment variables: `kubectl exec <pod> -n production -- env`
- Check database connectivity: `kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017`
- Check resource limits: `kubectl describe pod <pod> -n production | grep -A 5 Resources`
- Escalate to Incident Response
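To spot crash-looping pods across the whole namespace, sorting by restart count is a quick supplementary check:

```bash
# Pods with the highest restart counts appear last
kubectl get pods -n production \
  --sort-by='.status.containerStatuses[0].restartCount'
```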
Runbook 5: Rotate Secrets and Credentials
Objective: Rotate MongoDB credentials and update Kubernetes secrets
Prerequisites:
- New credentials generated in MongoDB Atlas
- Access to AKS cluster and Azure Key Vault
Steps:

1. Generate New MongoDB User in Atlas:
- Log in to MongoDB Atlas
- Navigate to Database Access → Add New Database User
- Create a new user with the same permissions
- Copy the connection string

2. Update Azure Key Vault:

```bash
az keyvault secret set \
  --name mongodb-uri \
  --vault-name akv-algesta-production \
  --value "mongodb+srv://newuser:newpassword@cluster.mongodb.net/algesta"
```

3. Update Kubernetes Secret:

```bash
kubectl create secret generic mongodb-credentials \
  --from-literal=uri="mongodb+srv://newuser:newpassword@cluster.mongodb.net/algesta" \
  --namespace=production \
  --dry-run=client -o yaml | kubectl apply -f -
```

4. Restart Deployments:

```bash
kubectl rollout restart deployment -n production
```

5. Verify Deployments:

```bash
kubectl get pods -n production
kubectl logs -l app=api-gateway -n production | grep "MongoDB connected"
```

6. Delete Old MongoDB User in Atlas:
- MongoDB Atlas → Database Access → Delete old user
Verification:
- All pods running
- No MongoDB connection errors in logs
- Application responding to requests
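To confirm the Secret now holds the new value without echoing the full credential, decode just the first characters (assumes the uri key shown in step 3):

```bash
# Print only the beginning of the decoded connection string
kubectl get secret mongodb-credentials -n production \
  -o jsonpath='{.data.uri}' | base64 -d | cut -c1-30
```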
Rollback (if issues occur):

```bash
kubectl create secret generic mongodb-credentials \
  --from-literal=uri="<old-connection-string>" \
  --namespace=production \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment -n production
```

Runbook 6: Renew TLS Certificates
Objective: Manually renew TLS certificates if auto-renewal fails
Prerequisites:
- cert-manager installed in cluster
- Access to AKS cluster
Steps:

1. Check Certificate Status:

```bash
kubectl get certificates -n production
kubectl describe certificate algesta-api-prod-tls -n production
```

2. Check cert-manager Logs:

```bash
kubectl logs -n cert-manager -l app=cert-manager --tail=100
```

3. Delete Certificate (triggers re-issuance):

```bash
kubectl delete certificate algesta-api-prod-tls -n production
```

4. Monitor Certificate Creation:

```bash
kubectl get certificate algesta-api-prod-tls -n production -w
```

5. Check CertificateRequests and Orders:

```bash
kubectl get certificaterequests -n production
kubectl get orders -n production
kubectl get challenges -n production
```

6. Verify Certificate Issued:

```bash
kubectl get certificate algesta-api-prod-tls -n production
# The READY column should show "True"
```

7. Test HTTPS:

```bash
curl -v https://algesta-api-prod.3astronautas.com/health
openssl s_client -connect algesta-api-prod.3astronautas.com:443 </dev/null 2>/dev/null | openssl x509 -noout -dates
```
Verification:
- Certificate status: Ready=True
- New certificate issued (check the expiration date)
- HTTPS connections successful
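If the cmctl CLI for cert-manager is installed, renewal can also be triggered without deleting the Certificate resource, as an alternative to step 3:

```bash
# Mark the certificate for immediate renewal
cmctl renew algesta-api-prod-tls -n production
# Check issuance status
cmctl status certificate algesta-api-prod-tls -n production
```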
If Auto-Renewal Continues to Fail:
- Check that DNS records point to the correct ingress IP
- Check that the ClusterIssuer is configured correctly
- Check that the ACME challenge can reach the ingress (HTTP-01 validation)
- Contact the DevOps team for manual certificate upload
Runbook 7: Add New Microservice to AKS
Objective: Deploy a new microservice to the production namespace
Prerequisites:
- Docker image built and pushed to ACR
- Kubernetes manifests prepared
- Database and secrets configured
Steps:

1. Create Kubernetes Manifests (health probes omitted here; see the sketch after the verification list):

ms-new-service-deployment.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ms-new-service-production
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ms-new-service
  template:
    metadata:
      labels:
        app: ms-new-service
    spec:
      containers:
        - name: ms-new-service
          image: acralgestaproduction.azurecr.io/ms-new-service:latest
          ports:
            - containerPort: 3004
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials
                  key: uri
---
apiVersion: v1
kind: Service
metadata:
  name: ms-new-service-production
  namespace: production
spec:
  selector:
    app: ms-new-service
  ports:
    - port: 80
      targetPort: 3004
```

2. Apply Manifests:

```bash
kubectl apply -f ms-new-service-deployment.yaml
```

3. Verify Deployment:

```bash
kubectl get deployments -n production
kubectl get pods -n production -l app=ms-new-service
kubectl logs -l app=ms-new-service -n production
```

4. Test Service Internally:

```bash
kubectl run -it --rm debug --image=curlimages/curl:latest --restart=Never -n production -- \
  curl http://ms-new-service-production.production.svc.cluster.local/health
```

5. Update Ingress (if external access needed):

```yaml
# Add to ingress-production.yaml
- path: /new-service
  pathType: Prefix
  backend:
    service:
      name: ms-new-service-production
      port:
        number: 80
```

```bash
kubectl apply -f ingress-production.yaml
```

6. Test External Access:

```bash
curl https://algesta-api-prod.3astronautas.com/new-service/health
```
Verification:
- Deployment shows 2/2 ready replicas
- Pods in Running state
- Service endpoints populated
- Health checks passing
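The manifest in step 1 omits health probes; in practice the container spec would likely include liveness and readiness probes along these lines (a sketch assuming the service exposes /health on port 3004):

```yaml
# Add under the ms-new-service container spec
livenessProbe:
  httpGet:
    path: /health   # assumed health endpoint
    port: 3004
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health
    port: 3004
  initialDelaySeconds: 5
  periodSeconds: 10
```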
Runbook 8: Investigate High Memory Usage
Objective: Diagnose and resolve high memory usage in pods
Prerequisites:
- Monitoring alert triggered
- Access to AKS cluster and Grafana
Steps:

1. Identify Pods with High Memory:

```bash
kubectl top pods -n production --sort-by=memory
```

2. Check Memory Limits:

```bash
kubectl describe deployment api-gateway-production -n production | grep -A 5 "Limits"
```

3. Check Pod Logs for Memory Errors:

```bash
kubectl logs api-gateway-production-xxxxx-yyyyy -n production | grep -i "memory\|oom\|heap"
```

4. Check for OOMKilled Events:

```bash
kubectl get pods -n production -l app=api-gateway -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
```

5. Spot-Check Node.js Memory (NestJS/Node.js):

```bash
# Note: this starts a fresh Node process in the container, so it reports that
# process's memory, not the app's; for a real heap snapshot see the sketch after these steps
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- \
  node --expose-gc -e "global.gc(); console.log(process.memoryUsage())"
```

6. Check Grafana for Memory Trends:
- Open Grafana → Application Performance Dashboard
- Query: `container_memory_usage_bytes{pod=~"api-gateway-production.*"}`
- Check for a gradual increase (memory leak) or a sudden spike
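If a true heap snapshot is needed (the spot-check in step 5 only reports aggregate numbers for a fresh process), one option is to run the app with Node's --heapsnapshot-signal flag and signal it on demand. This is a sketch that assumes the container was started with that flag, the app runs as PID 1, and the working directory (/app here) is writable:

```bash
# Trigger a heap snapshot inside the container (requires the app to have been
# started with: node --heapsnapshot-signal=SIGUSR2 main.js)
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- sh -c 'kill -USR2 1'
# List the directory to find the generated .heapsnapshot file (name includes a timestamp)
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- ls /app
# Copy it out for analysis in Chrome DevTools (replace <snapshot-file> with the real name)
kubectl cp production/api-gateway-production-xxxxx-yyyyy:/app/<snapshot-file> ./heap.heapsnapshot
```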
Resolution:
If Memory Leak:
- Review recent code changes for memory leaks
- Check for unclosed database connections
- Review logs for excessive object creation
- Restart pods as a temporary fix:

```bash
kubectl rollout restart deployment/api-gateway-production -n production
```

If Insufficient Memory:
- Increase memory limits:

```bash
kubectl set resources deployment api-gateway-production -n production \
  --limits=memory=2Gi \
  --requests=memory=1Gi
```
Verification:
- Memory usage stabilized
- No OOMKilled events
- Application responding normally
Related Documentation:
- Kubernetes Operations: kubectl reference
- Incident Response: Escalation procedures
- Backup & DR: Restore procedures
For Support:
- Escalate to the on-call engineer if the runbook doesn't resolve the issue
- Document findings in an incident report
- Update the runbook based on lessons learned