Kubernetes Operations
Table of Contents
- Purpose
- Who is this for?
- AKS Cluster Overview
- Namespace Strategy
- Workload Architecture
- Ingress and TLS Configuration
- Secrets Management
- Common kubectl Operations
- Scaling and Resource Management
- Troubleshooting
Purpose
This document describes how the Kubernetes (AKS) cluster for the Algesta platform is operated. It covers cluster architecture, namespace organization, ingress configuration, secrets management, and common operational procedures.
After reading this guide, you will understand:
- AKS cluster design and node pool configuration
- Namespace-based environment isolation (development, testing, production, monitoring)
- Ingress routing with TLS certificates via cert-manager
- Kubernetes Secrets and Azure Key Vault integration
- Common kubectl commands for deployments, services, and troubleshooting
Who is this for?
This guide is for DevOps engineers managing the AKS cluster, SREs troubleshooting production issues, and platform engineers deploying applications. It assumes familiarity with Kubernetes, kubectl, and AKS.
AKS Cluster Overview
Cluster Details
| Property | Value | Notes |
|---|---|---|
| Cluster Name | aks-algesta-{environment} | Per-environment clusters (dev, production) |
| Azure Region | East US | Configurable in Terraform |
| Kubernetes Version | 1.28+ | Managed by Azure (auto-upgrade available) |
| SKU Tier | Free | Production should use Standard tier for SLA |
| Network Plugin | kubenet | Default AKS networking (consider Azure CNI for advanced features) |
| DNS Prefix | algesta-{environment} | Used for API server FQDN |
| Identity Type | System-assigned Managed Identity | For Azure resource access (ACR, Key Vault) |
Node Pools
System Node Pool (default):
- Purpose: Runs AKS system components (CoreDNS, metrics-server, tunnelfront)
- VM Size: Standard_B2s (2 vCPU, 4 GB RAM)
- Node Count: 1 (auto-scaling: min 1, max 1)
- OS: Linux (Ubuntu)
- Mode: System
- Taints: None (workloads can schedule here if needed)
User Node Pool (stdar{environment}):
- Purpose: Runs application workloads (microservices, monitoring)
- VM Size: Standard_B2s (2 vCPU, 4 GB RAM)
- Node Count: 1 (auto-scaling: min 1, max 3)
- OS: Linux (Ubuntu)
- Mode: User
- Taints: None
Scaling Behavior:
- Development: Scales down to 1 node during idle periods
- Production: Maintains min 1 node, scales up to 3 under load
- Metrics: CPU utilization > 80% triggers scale-up
Cluster Add-ons
| Add-on | Purpose | Configuration |
|---|---|---|
| Web App Routing | Managed ingress controller (nginx) | Enabled via Terraform (web_app_routing block) |
| Monitoring | Azure Monitor integration | Optional (currently using Prometheus/Grafana) |
| Azure Policy | Enforce security policies | Not enabled (consider for compliance) |
| Secrets Store CSI Driver | Azure Key Vault integration | Not enabled (future implementation) |
Namespace Strategy
Environment Isolation
Each environment has dedicated namespaces for isolation and RBAC:
| Namespace | Purpose | Ingress Host | Resource Quotas |
|---|---|---|---|
| development | Development deployments | algesta-api-dev.3astronautas.com | No limits (small cluster) |
| testing | QA and integration testing | algesta-api-test.3astronautas.com | No limits |
| production | Live customer traffic | algesta-api-prod.3astronautas.com | CPU: 4 cores, Memory: 8 GB (recommended) |
| monitoring | Grafana, Prometheus, Loki | algesta.grafana.3astronautas.com | CPU: 2 cores, Memory: 4 GB |
| cert-manager | Certificate management | N/A | Minimal resources |
| connect-devops | CI/CD service accounts | N/A | Minimal resources |
Namespace Conventions:
- Environment namespaces (development, testing, production) host the microservices
- Shared services (monitoring, cert-manager) run in dedicated namespaces
- The default namespace is not used (all resources live in named namespaces)
Creating Namespaces
Development Namespace:
    kubectl create namespace development

    # Add labels for organization
    kubectl label namespace development environment=dev team=backend
With Resource Quotas (production):
    kubectl create namespace production

    # Apply resource quota
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: production-quota
      namespace: production
    spec:
      hard:
        requests.cpu: "4"
        requests.memory: 8Gi
        limits.cpu: "8"
        limits.memory: 16Gi
        pods: "50"
    EOF
Workload Architecture
Microservices Deployment Pattern
Each microservice follows this standard deployment structure:
    {microservice-name}-{environment}/
    ├── Deployment
    ├── Service (ClusterIP)
    ├── HorizontalPodAutoscaler (HPA)
    └── ConfigMap / Secret (env vars)
Example: API Gateway Deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: api-gateway-production
      namespace: production
      labels:
        app: api-gateway
        environment: production
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: api-gateway
          environment: production
      template:
        metadata:
          labels:
            app: api-gateway
            environment: production
        spec:
          containers:
            - name: api-gateway
              image: acralgestaproduction.azurecr.io/api-gateway:latest
              imagePullPolicy: Always
              ports:
                - containerPort: 3000
                  name: http
              env:
                - name: NODE_ENV
                  value: "production"
                - name: PORT
                  value: "3000"
                - name: MONGODB_URI
                  valueFrom:
                    secretKeyRef:
                      name: mongodb-credentials
                      key: uri
                - name: JWT_SECRET
                  valueFrom:
                    secretKeyRef:
                      name: jwt-credentials
                      key: secret
              resources:
                requests:
                  cpu: 250m
                  memory: 512Mi
                limits:
                  cpu: 1000m
                  memory: 1Gi
              livenessProbe:
                httpGet:
                  path: /health
                  port: 3000
                initialDelaySeconds: 30
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /health
                  port: 3000
                initialDelaySeconds: 10
                periodSeconds: 5
                timeoutSeconds: 3
                failureThreshold: 2
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: api-gateway-production
      namespace: production
    spec:
      type: ClusterIP
      selector:
        app: api-gateway
        environment: production
      ports:
        - port: 80
          targetPort: 3000
          protocol: TCP
          name: http
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-gateway-production-hpa
      namespace: production
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-gateway-production
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80
Current Deployed Microservices
| Microservice | Namespace | Deployment Name | Service Name | Port |
|---|---|---|---|---|
| API Gateway | development, testing, production | api-gateway-{env} | api-gateway-{env} | 3000 |
| Orders Service | development, testing, production | ms-orders-{env} | ms-orders-{env} | 3001 |
| Notifications Service | development, testing, production | ms-notifications-{env} | ms-notifications-{env} | 3002 |
| Provider Service | development, testing, production | ms-provider-{env} | ms-provider-{env} | 3003 |
Viewing Deployed Workloads:
    # List all deployments in production namespace
    kubectl get deployments -n production

    # List all services
    kubectl get services -n production

    # List all pods with labels
    kubectl get pods -n production --show-labels
Ingress and TLS Configuration
Ingress Controller
Type: Azure Web App Routing (managed nginx ingress)
Features:
- Managed by AKS (automatic updates)
- Integrated with Azure DNS (optional)
- Supports cert-manager for TLS automation
Ingress Class:
    ingressClassName: webapprouting.kubernetes.azure.com
Alternative: Manual nginx-ingress deployment (more control, requires maintenance)
Ingress Resources
Development Environment (ops-algesta/resources-k8s/ingress-nginx/ingress-aks/ingress-development.yaml):
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: ingress-development
      namespace: development
      annotations:
        kubernetes.io/ingress.class: webapprouting.kubernetes.azure.com
        cert-manager.io/cluster-issuer: letsencrypt-prod-webapprouting
        nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-body-size: "300m"
    spec:
      tls:
        - hosts:
            - algesta-api-dev.3astronautas.com
          secretName: algesta-api-dev-tls
      ingressClassName: webapprouting.kubernetes.azure.com
      rules:
        - host: algesta-api-dev.3astronautas.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: api-gateway-development
                    port:
                      number: 80
Key Configuration:
- TLS Certificate: Automatically provisioned by cert-manager via Let’s Encrypt
- Timeout Settings: Extended to 600s for long-running requests (PDF generation)
- Body Size: 300MB limit for file uploads (development), 10MB (production/testing)
Testing Environment (ingress-testing.yaml):
- Host: algesta-api-test.3astronautas.com
- TLS Secret: algesta-api-test-tls
- Backend Service: api-gateway-testing
Production Environment (ingress-production.yaml):
- Host: algesta-api-prod.3astronautas.com
- TLS Secret: algesta-api-prod-tls
- Backend Service: api-gateway-production
Monitoring Ingress (ingress-monitoring.yaml):
- Host: algesta.grafana.3astronautas.com
- Backend Service: grafana-service (monitoring namespace)
TLS Certificate Management
cert-manager Configuration:
ClusterIssuer (ops-algesta/resources-k8s/cert-manager/Issuer.yaml):
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
      namespace: cert-manager
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: j.leon@tresastronautas.com
        privateKeySecretRef:
          name: letsencrypt-prod
        solvers:
          - http01:
              ingress:
                ingressClassName: nginx
ClusterIssuer for Web App Routing (ops-algesta/resources-k8s/cert-manager/Issuer-webapprouting.yaml):
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod-webapprouting
      namespace: cert-manager
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: j.leon@tresastronautas.com
        privateKeySecretRef:
          name: letsencrypt-prod-webapprouting
        solvers:
          - http01:
              ingress:
                ingressClassName: webapprouting.kubernetes.azure.com
Certificate Lifecycle:
- Ingress created with the cert-manager.io/cluster-issuer annotation
- cert-manager detects the annotation and creates a Certificate resource
- cert-manager initiates the ACME challenge (HTTP-01)
- Let’s Encrypt validates domain ownership
- Certificate issued and stored in a Secret (e.g., algesta-api-prod-tls)
- Ingress controller uses the Secret for TLS termination
- Auto-renewal 30 days before expiration
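The renewal step in the lifecycle above comes down to simple date arithmetic. The following is a minimal sketch of that check, not cert-manager's own logic: the function names and the RENEW_BEFORE_DAYS variable are hypothetical, and the 30-day window is taken from the list above.

```shell
#!/bin/sh
# Sketch only: decide whether a certificate has entered its renewal window.
# RENEW_BEFORE_DAYS is a hypothetical name; 30 mirrors the lifecycle list above.
RENEW_BEFORE_DAYS=30

# days_left NOT_AFTER_EPOCH NOW_EPOCH
# Prints the whole days remaining (integer division, truncated toward zero).
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# in_renewal_window NOT_AFTER_EPOCH NOW_EPOCH
# Succeeds (exit 0) when the remaining days are within the renewal window.
in_renewal_window() {
  [ "$(days_left "$1" "$2")" -le "$RENEW_BEFORE_DAYS" ]
}

# Example with fixed timestamps: a certificate expiring 20 days from "now".
now=1700000000
expiry=$(( now + 20 * 86400 ))
if in_renewal_window "$expiry" "$now"; then
  echo "renewal due"      # prints "renewal due" (20 <= 30)
else
  echo "not yet"
fi
```

Both arguments are Unix epoch seconds. On a host with GNU date, the expiry epoch could be derived from the notAfter value printed by openssl x509 -noout -enddate; that conversion is left out here because date parsing flags differ between platforms.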
Checking Certificate Status:
    # List certificates in namespace
    kubectl get certificates -n production

    # Check certificate details
    kubectl describe certificate algesta-api-prod-tls -n production

    # Check certificate expiration
    kubectl get secret algesta-api-prod-tls -n production -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
Manual Certificate Renewal (if auto-renewal fails):
    # Delete certificate to trigger re-issuance
    kubectl delete certificate algesta-api-prod-tls -n production

    # cert-manager will automatically recreate it
    kubectl get certificate algesta-api-prod-tls -n production -w
Secrets Management
Kubernetes Secrets
Current Approach: Manual secret creation via kubectl
Production MongoDB Secret:
    kubectl create secret generic mongodb-credentials \
      --from-literal=uri="mongodb+srv://admin:SecurePassword@cluster.mongodb.net/algesta?retryWrites=true&w=majority" \
      --namespace=production

    # Verify secret created
    kubectl get secret mongodb-credentials -n production
JWT Secret:
    kubectl create secret generic jwt-credentials \
      --from-literal=secret="supersecurejwtkey12345" \
      --namespace=production
Listing Secrets:
    # List all secrets in namespace
    kubectl get secrets -n production

    # View secret details (encoded)
    kubectl get secret mongodb-credentials -n production -o yaml

    # Decode secret value
    kubectl get secret mongodb-credentials -n production -o jsonpath='{.data.uri}' | base64 -d
Azure Key Vault Integration (Future)
Recommended: Use Azure Key Vault Provider for Secrets Store CSI Driver
Benefits:
- Centralized secret management in Azure Key Vault
- Automatic secret rotation
- Audit logging
- Integration with Azure RBAC
Implementation (future):
- Enable CSI Driver in AKS:
    az aks enable-addons --addons azure-keyvault-secrets-provider \
      --resource-group rg-algesta-production \
      --name aks-algesta-production
- Create SecretProviderClass:
    apiVersion: secrets-store.csi.x-k8s.io/v1
    kind: SecretProviderClass
    metadata:
      name: azure-keyvault-provider
      namespace: production
    spec:
      provider: azure
      parameters:
        usePodIdentity: "false"
        useVMManagedIdentity: "true"
        userAssignedIdentityID: "<managed-identity-client-id>"
        keyvaultName: "akv-algesta-production"
        objects: |
          array:
            - |
              objectName: mongodb-uri
              objectType: secret
              objectVersion: ""
            - |
              objectName: jwt-secret
              objectType: secret
              objectVersion: ""
        tenantId: "<azure-tenant-id>"
- Mount Secrets in Pods:
    spec:
      containers:
        - name: api-gateway
          volumeMounts:
            - name: secrets-store
              mountPath: "/mnt/secrets"
              readOnly: true
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "azure-keyvault-provider"
Common kubectl Operations
Cluster Access
Configure kubectl:
    # Get credentials for AKS cluster
    az aks get-credentials \
      --resource-group rg-algesta-production \
      --name aks-algesta-production \
      --overwrite-existing

    # Verify connection
    kubectl cluster-info
    kubectl get nodes
Set Default Namespace:
    # Set default namespace to production
    kubectl config set-context --current --namespace=production

    # Verify current namespace
    kubectl config view --minify | grep namespace:
Deployment Operations
Deploy New Version:
    # Update deployment image
    kubectl set image deployment/api-gateway-production \
      api-gateway=acralgestaproduction.azurecr.io/api-gateway:12345 \
      --namespace=production

    # Watch rollout progress
    kubectl rollout status deployment/api-gateway-production -n production

    # Check rollout history
    kubectl rollout history deployment/api-gateway-production -n production
Rollback Deployment:
    # Rollback to previous version
    kubectl rollout undo deployment/api-gateway-production -n production

    # Rollback to specific revision
    kubectl rollout undo deployment/api-gateway-production --to-revision=3 -n production

    # Verify rollback
    kubectl get pods -n production -l app=api-gateway
Scale Deployment:
    # Scale to 5 replicas
    kubectl scale deployment/api-gateway-production --replicas=5 -n production

    # Verify scaling
    kubectl get pods -n production -l app=api-gateway
Restart Deployment (forces pod recreation):
    kubectl rollout restart deployment/api-gateway-production -n production
Pod Operations
Viewing Pods:
    # List all pods in namespace
    kubectl get pods -n production

    # List pods with more details
    kubectl get pods -n production -o wide

    # Watch pods in real-time
    kubectl get pods -n production -w

    # Filter by label
    kubectl get pods -n production -l app=api-gateway
Pod Logs:
    # View logs for single pod
    kubectl logs api-gateway-production-xxxxx-yyyyy -n production

    # Follow logs (tail -f)
    kubectl logs -f api-gateway-production-xxxxx-yyyyy -n production

    # View logs from previous container (after crash)
    kubectl logs api-gateway-production-xxxxx-yyyyy -n production --previous

    # View logs from all pods in deployment
    kubectl logs -l app=api-gateway -n production --tail=100
Execute Commands in Pod:
    # Open shell in pod
    kubectl exec -it api-gateway-production-xxxxx-yyyyy -n production -- /bin/sh

    # Run single command
    kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- env | grep NODE_ENV

    # Check application health
    kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- curl localhost:3000/health
Debugging Pods:
    # Describe pod (shows events, status, volumes)
    kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production

    # Check resource usage
    kubectl top pod api-gateway-production-xxxxx-yyyyy -n production

    # Check pod IP and node assignment
    kubectl get pod api-gateway-production-xxxxx-yyyyy -n production -o jsonpath='{.status.podIP}{"\n"}{.spec.nodeName}{"\n"}'
Service Operations
Viewing Services:
    # List services
    kubectl get services -n production

    # Describe service
    kubectl describe service api-gateway-production -n production

    # Get service endpoints
    kubectl get endpoints api-gateway-production -n production
Testing Service Connectivity:
    # From within cluster (create debug pod)
    kubectl run -it --rm debug --image=curlimages/curl:latest --restart=Never -n production -- sh

    # Inside pod:
    curl http://api-gateway-production.production.svc.cluster.local/health
Ingress Operations
Viewing Ingress:
    # List ingress resources
    kubectl get ingress -n production

    # Describe ingress (shows rules, backends)
    kubectl describe ingress ingress-production -n production

    # Check ingress external IP
    kubectl get ingress ingress-production -n production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
Testing Ingress:
    # Test HTTP (should redirect to HTTPS)
    curl -v http://algesta-api-prod.3astronautas.com/health

    # Test HTTPS
    curl -v https://algesta-api-prod.3astronautas.com/health

    # Check TLS certificate
    openssl s_client -connect algesta-api-prod.3astronautas.com:443 -servername algesta-api-prod.3astronautas.com < /dev/null 2>/dev/null | openssl x509 -noout -dates
Scaling and Resource Management
Horizontal Pod Autoscaling (HPA)
Current HPA Configuration:
    # List HPAs
    kubectl get hpa -n production

    # Describe HPA
    kubectl describe hpa api-gateway-production-hpa -n production
HPA Metrics:
- CPU Target: 70% utilization
- Memory Target: 80% utilization
- Min Replicas: 2 (production), 1 (dev/test)
- Max Replicas: 10 (production), 3 (dev/test)
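To make the targets above concrete, the sketch below applies the replica formula documented for the Kubernetes HPA, desired = ceil(current × currentUtilization / targetUtilization), clamped to the min and max replica counts. The function name desired_replicas is hypothetical, and the sketch deliberately ignores the controller's tolerance band and stabilization windows, so treat it as an approximation for capacity reasoning rather than a reproduction of the controller.

```shell
#!/bin/sh
# Sketch only: approximate the HPA replica calculation for one metric.
# desired_replicas CURRENT CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
desired_replicas() {
  current=$1; util=$2; target=$3; min=$4; max=$5

  # Integer ceiling of (current * util / target).
  desired=$(( (current * util + target - 1) / target ))

  # Clamp to the configured replica bounds.
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi

  echo "$desired"
}

# Production values from the list above: CPU target 70%, min 2, max 10.
desired_replicas 2 90 70 2 10    # prints 3  (ceil(2 * 90 / 70) = 3)
desired_replicas 8 140 70 2 10   # prints 10 (16 is clamped to the max)
```

When several metrics are configured, as with CPU and memory here, the HPA evaluates each one and uses the largest resulting replica count, so the function would be called once per metric and the higher value taken.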
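Manually Disable HPA (for testing):
    # Pin the HPA to exactly 1 replica (min = max = 1), which effectively disables autoscaling
    # Restore the original minReplicas/maxReplicas values afterwards
    kubectl patch hpa api-gateway-production-hpa -n production --patch '{"spec":{"minReplicas":1,"maxReplicas":1}}'
Cluster Autoscaling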
Node Pool Autoscaling:
- Managed by Azure AKS (configured in Terraform)
- Trigger: Pods in Pending state due to insufficient resources
- Scale Up: Add nodes to user node pool (max 3)
- Scale Down: Remove idle nodes after 10 minutes
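Because Pending pods are what trigger a scale-up, it can help to see which namespaces they belong to. The sketch below is one possible way to summarize that from kubectl output; the function name pending_by_namespace is hypothetical, and it assumes the default column layout of kubectl get pods -A --no-headers (namespace in column 1, status in column 4).

```shell
#!/bin/sh
# Sketch only: count Pending pods per namespace.
# Reads the output of `kubectl get pods -A --no-headers` on stdin.
pending_by_namespace() {
  awk '$4 == "Pending" { count[$1]++ }
       END { for (ns in count) print ns, count[ns] }' | sort
}

# Example with sample (made-up) pod listing instead of a live cluster:
pending_by_namespace <<'EOF'
production   api-gateway-production-aaaaa-bbbbb   0/1   Pending   0   2m
production   ms-orders-production-ccccc-ddddd     1/1   Running   0   5h
monitoring   grafana-eeeee-fffff                  0/1   Pending   0   1m
production   ms-provider-production-ggggg-hhhhh   0/1   Pending   0   3m
EOF
# prints:
# monitoring 1
# production 2
```

Against a live cluster the input would come from kubectl get pods -A --no-headers piped into the function. A pod can also be Pending for reasons unrelated to capacity (for example an unbound volume), so kubectl describe pod remains the way to confirm the cause.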
Check Node Status:
    # List nodes
    kubectl get nodes

    # Check node resource usage
    kubectl top nodes

    # Check pods per node (custom-columns avoids counting the header row or misaligned columns)
    kubectl get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName | sort | uniq -c
Resource Quotas and Limits
Namespace Resource Quota (production):
    # View quota
    kubectl get resourcequota -n production

    # Check quota usage
    kubectl describe resourcequota production-quota -n production
Pod Resource Requests/Limits:
- Requests: Minimum resources guaranteed (used for scheduling)
- Limits: Maximum resources allowed (enforced by kubelet)
Recommended Settings:
| Workload | CPU Request | Memory Request | CPU Limit | Memory Limit |
|---|---|---|---|---|
| API Gateway | 250m | 512Mi | 1000m | 1Gi |
| Microservices | 100m | 256Mi | 500m | 512Mi |
| Monitoring (Grafana) | 500m | 1Gi | 1000m | 2Gi |
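The recommended requests can be checked against the production quota (requests.cpu "4", requests.memory 8Gi) with a little arithmetic. The sketch below is an illustration under stated assumptions: the helper names to_millicores, to_mib, and max_replicas are hypothetical, and only whole-core or "m" CPU values and Mi/Gi memory values are handled.

```shell
#!/bin/sh
# Sketch only: how many replicas of one workload fit in a namespace quota.

# to_millicores "250m" -> 250, "4" -> 4000 (fractional cores such as 0.5 are not handled)
to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# to_mib "512Mi" -> 512, "8Gi" -> 8192
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# max_replicas QUOTA_CPU QUOTA_MEM REQUEST_CPU REQUEST_MEM
# Prints the smaller of the CPU-bound and memory-bound replica counts.
max_replicas() {
  cpu_fit=$(( $(to_millicores "$1") / $(to_millicores "$3") ))
  mem_fit=$(( $(to_mib "$2") / $(to_mib "$4") ))
  if [ "$cpu_fit" -lt "$mem_fit" ]; then echo "$cpu_fit"; else echo "$mem_fit"; fi
}

# API Gateway requests (250m / 512Mi) against the production quota:
max_replicas 4 8Gi 250m 512Mi   # prints 16
# Generic microservice requests (100m / 256Mi):
max_replicas 4 8Gi 100m 256Mi   # prints 32 (memory is the tighter bound)
```

These figures assume a single workload consumes the whole quota. In practice the quota is shared by every deployment in the namespace, and the pods: "50" limit and the limits.cpu / limits.memory entries apply as well.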
Troubleshooting
Pod Not Starting (ImagePullBackOff)
Symptoms:
    kubectl get pods -n production
    # NAME                           READY   STATUS             RESTARTS   AGE
    # api-gateway-prod-xxxxx-yyyyy   0/1     ImagePullBackOff   0          2m
Diagnosis:
    kubectl describe pod api-gateway-prod-xxxxx-yyyyy -n production
    # Events:
    #   Failed to pull image "acralgestaproduction.azurecr.io/api-gateway:12345": rpc error: code = Unknown desc = Error response from daemon: unauthorized: authentication required
Solutions:
- Verify ACR Integration:
    az aks check-acr --name aks-algesta-production \
      --resource-group rg-algesta-production \
      --acr acralgestaproduction.azurecr.io
- Grant AcrPull Role to AKS:
    ACR_ID=$(az acr show --name acralgestaproduction --query id --output tsv)
    AKS_IDENTITY=$(az aks show --name aks-algesta-production --resource-group rg-algesta-production --query identityProfile.kubeletidentity.objectId --output tsv)

    az role assignment create --assignee "$AKS_IDENTITY" --role AcrPull --scope "$ACR_ID"
- Verify Image Exists:
    az acr repository show-tags --name acralgestaproduction --repository api-gateway
Pod Crashing (CrashLoopBackOff)
Symptoms:
    kubectl get pods -n production
    # NAME                           READY   STATUS             RESTARTS   AGE
    # api-gateway-prod-xxxxx-yyyyy   0/1     CrashLoopBackOff   5          10m
Diagnosis:
    # Check logs
    kubectl logs api-gateway-prod-xxxxx-yyyyy -n production

    # Check previous container logs (after crash)
    kubectl logs api-gateway-prod-xxxxx-yyyyy -n production --previous

    # Check events
    kubectl describe pod api-gateway-prod-xxxxx-yyyyy -n production
Common Causes:
- Missing Environment Variables:
    kubectl exec api-gateway-prod-xxxxx-yyyyy -n production -- env | grep -E "(MONGODB|JWT|NODE_ENV)"
- Database Connection Failure:
    # Test MongoDB connection from pod
    kubectl exec api-gateway-prod-xxxxx-yyyyy -n production -- nc -zv cluster.mongodb.net 27017
- Insufficient Resources:
    kubectl describe pod api-gateway-prod-xxxxx-yyyyy -n production | grep -A 5 "Resource"
Service Not Reachable
Symptoms:
    curl: (7) Failed to connect to algesta-api-prod.3astronautas.com port 443: Connection refused
Diagnosis:
- Check Ingress:
    kubectl get ingress ingress-production -n production
    # Ensure ADDRESS column shows IP
- Check Service:
    kubectl get service api-gateway-production -n production
    # Ensure CLUSTER-IP is assigned

    # Check endpoints (should list pod IPs)
    kubectl get endpoints api-gateway-production -n production
- Test Service from within Cluster:
    kubectl run -it --rm debug --image=curlimages/curl:latest --restart=Never -n production -- \
      curl http://api-gateway-production.production.svc.cluster.local/health
- Check DNS Resolution:
    # From Azure VM or local machine with VPN
    nslookup algesta-api-prod.3astronautas.com
    # Should resolve to ingress external IP
Related Documentation:
- Infrastructure as Code: AKS cluster provisioning
- CI/CD Pipelines: Automated deployment to AKS
- Monitoring & Logging: Prometheus, Grafana, Loki setup
- Security Operations: Secrets and RBAC management
- Runbooks: Common operational procedures
For Support:
- Check pod status: kubectl get pods -n production
- Review logs: kubectl logs -f <pod-name> -n production
- Describe resources: kubectl describe deployment/service/ingress <name> -n production
- Contact DevOps team for AKS access and credentials