Kubernetes Operations

Table of Contents

Purpose
Who Is This For?
AKS Cluster Overview
Namespace Strategy
Workload Architecture
Ingress and TLS Configuration
Secrets Management
Common kubectl Operations
Scaling and Resource Management
Troubleshooting

Purpose

This document describes how the Kubernetes (AKS) cluster for the Algesta platform is operated. It covers cluster architecture, namespace organization, ingress configuration, secrets management, and common operational procedures.

After reading this guide, you will understand:

AKS cluster design and node pool configuration
Namespace-based environment isolation (development, testing, production, monitoring)
Ingress routing with TLS certificates via cert-manager
Kubernetes Secrets and Azure Key Vault integration
Common kubectl commands for deployments, services, and troubleshooting

Who Is This For?

This guide is for DevOps engineers managing the AKS cluster, SREs troubleshooting production issues, and platform engineers deploying applications. It assumes familiarity with Kubernetes, kubectl, and AKS.

AKS Cluster Overview

Cluster Details

Property	Value	Notes
Cluster Name	`aks-algesta-{environment}`	Per-environment clusters (dev, production)
Azure Region	East US	Configurable in Terraform
Kubernetes Version	1.28+	Managed by Azure (auto-upgrade available)
SKU Tier	Free	Production should use Standard tier for SLA
Network Plugin	kubenet	Default AKS networking (consider Azure CNI for advanced features)
DNS Prefix	`algesta-{environment}`	Used for API server FQDN
Identity Type	System-assigned Managed Identity	For Azure resource access (ACR, Key Vault)

Node Pools

System Node Pool (default):

Purpose: Runs AKS system components (CoreDNS, metrics-server, tunnelfront)
VM Size: Standard_B2s (2 vCPU, 4 GB RAM)
Node Count: 1 (auto-scaling: min 1, max 1)
OS: Linux (Ubuntu)
Mode: System
Taints: None (workloads can schedule here if needed)

User Node Pool (stdar{environment}):

Purpose: Runs application workloads (microservices, monitoring)
VM Size: Standard_B2s (2 vCPU, 4 GB RAM)
Node Count: 1 (auto-scaling: min 1, max 3)
OS: Linux (Ubuntu)
Mode: User
Taints: None

Scaling Behavior:

Development: Scales down to 1 node during idle periods
Production: Maintains min 1 node, scales up to 3 under load
Metrics: CPU utilization > 80% triggers scale-up

Cluster Add-ons

Add-on	Purpose	Configuration
Web App Routing	Managed ingress controller (nginx)	Enabled via Terraform (`web_app_routing` block)
Monitoring	Azure Monitor integration	Optional (currently using Prometheus/Grafana)
Azure Policy	Enforce security policies	Not enabled (consider for compliance)
Secrets Store CSI Driver	Azure Key Vault integration	Not enabled (future implementation)

Namespace Strategy

Environment Isolation

Each environment has dedicated namespaces for isolation and RBAC:

Namespace	Purpose	Ingress Host	Resource Quotas
`development`	Development deployments	`algesta-api-dev.3astronautas.com`	No limits (small cluster)
`testing`	QA and integration testing	`algesta-api-test.3astronautas.com`	No limits
`production`	Live customer traffic	`algesta-api-prod.3astronautas.com`	CPU: 4 cores, Memory: 8 GB (recommended)
`monitoring`	Grafana, Prometheus, Loki	`algesta.grafana.3astronautas.com`	CPU: 2 cores, Memory: 4 GB
`cert-manager`	Certificate management	N/A	Minimal resources
`connect-devops`	CI/CD service accounts	N/A	Minimal resources

Namespace Conventions:

Environment namespaces (development, testing, production) host the microservices
Shared services (monitoring, cert-manager) run in dedicated namespaces
No default namespace usage (all resources in named namespaces)

Creating Namespaces

Development Namespace:

kubectl create namespace development

# Add labels for organization
kubectl label namespace development environment=dev team=backend

With Resource Quotas (production):

kubectl create namespace production

# Apply resource quota
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"
EOF
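
As a rough capacity check, the sketch below divides the quota values above by the per-pod requests and limits of the API Gateway manifest in the Workload Architecture section (250m / 512Mi requests, 1000m / 1Gi limits). The helper names `max_pods_by` and `min_of` are illustrative, and the calculation assumes the namespace holds only API Gateway pods, which will not be true in practice.

```shell
#!/bin/sh
# Capacity sketch: how many identical pods fit under production-quota.
# CPU values are in millicores, memory values in MiB.

# max_pods_by QUOTA PER_POD: pods that fit under one quota dimension (floor).
max_pods_by() {
  echo $(( $1 / $2 ))
}

# min_of A B ...: smallest of the integer arguments.
min_of() {
  smallest=$1
  for value in "$@"; do
    if [ "$value" -lt "$smallest" ]; then
      smallest=$value
    fi
  done
  echo "$smallest"
}

by_req_cpu=$(max_pods_by 4000 250)    # requests.cpu: "4"     vs 250m per pod
by_req_mem=$(max_pods_by 8192 512)    # requests.memory: 8Gi  vs 512Mi per pod
by_lim_cpu=$(max_pods_by 8000 1000)   # limits.cpu: "8"       vs 1000m per pod
by_lim_mem=$(max_pods_by 16384 1024)  # limits.memory: 16Gi   vs 1Gi per pod
pod_cap=50                            # pods: "50"

max_pods=$(min_of "$by_req_cpu" "$by_req_mem" "$by_lim_cpu" "$by_lim_mem" "$pod_cap")
echo "API Gateway pods that fit under production-quota: $max_pods"
```

Under these assumptions `limits.cpu` is the binding dimension (8 pods). That is lower than the HPA `maxReplicas` of 10 configured for API Gateway, so either the quota or the HPA ceiling may need adjusting before that ceiling can be reached.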

Workload Architecture

Microservices Deployment Pattern

Each microservice follows this standard deployment structure:

{microservice-name}-{environment}/
├── Deployment
├── Service (ClusterIP)
├── HorizontalPodAutoscaler (HPA)
└── ConfigMap / Secret (env vars)

Example: API Gateway Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway-production
  namespace: production
  labels:
    app: api-gateway
    environment: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-gateway
      environment: production
  template:
    metadata:
      labels:
        app: api-gateway
        environment: production
    spec:
      containers:
        - name: api-gateway
          image: acralgestaproduction.azurecr.io/api-gateway:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "3000"
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials
                  key: uri
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: jwt-credentials
                  key: secret
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-gateway-production
  namespace: production
spec:
  type: ClusterIP
  selector:
    app: api-gateway
    environment: production
  ports:
    - port: 80
      targetPort: 3000
      protocol: TCP
      name: http
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-production-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway-production
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Currently Deployed Microservices

Microservice	Namespace	Deployment Name	Service Name	Port
API Gateway	`development`, `testing`, `production`	`api-gateway-{env}`	`api-gateway-{env}`	3000
Orders Service	`development`, `testing`, `production`	`ms-orders-{env}`	`ms-orders-{env}`	3001
Notifications Service	`development`, `testing`, `production`	`ms-notifications-{env}`	`ms-notifications-{env}`	3002
Provider Service	`development`, `testing`, `production`	`ms-provider-{env}`	`ms-provider-{env}`	3003
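
The naming convention in the table can be expanded mechanically. The sketch below uses two hypothetical helpers (they are not part of the Algesta tooling) to build a resource name and the in-cluster DNS name of its Service. It assumes the `{env}` suffix equals the namespace name, as in `api-gateway-production`.

```shell
#!/bin/sh
# resource_name BASE ENV: Deployment/Service name following {base}-{env}.
resource_name() {
  echo "$1-$2"
}

# service_fqdn BASE ENV: cluster-internal DNS name of the Service,
# using the standard <service>.<namespace>.svc.cluster.local form.
service_fqdn() {
  echo "$(resource_name "$1" "$2").$2.svc.cluster.local"
}

resource_name ms-orders development
service_fqdn api-gateway production
```

The second call prints the same host name that the connectivity checks later in this guide pass to curl.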

Viewing Deployed Workloads:

# List all deployments in production namespace
kubectl get deployments -n production

# List all services
kubectl get services -n production

# List all pods with labels
kubectl get pods -n production --show-labels

Ingress and TLS Configuration

Ingress Controller

Type: Azure Web App Routing (managed nginx ingress)

Features:

Managed by AKS (automatic updates)
Integrated with Azure DNS (optional)
Supports cert-manager for TLS automation

Ingress Class:

ingressClassName: webapprouting.kubernetes.azure.com

Alternative: Manual nginx-ingress deployment (more control, requires maintenance)

Ingress Resources

Development Environment (ops-algesta/resources-k8s/ingress-nginx/ingress-aks/ingress-development.yaml):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-development
  namespace: development
  annotations:
    kubernetes.io/ingress.class: webapprouting.kubernetes.azure.com
    cert-manager.io/cluster-issuer: letsencrypt-prod-webapprouting
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-body-size: "300m"
spec:
  tls:
    - hosts:
        - algesta-api-dev.3astronautas.com
      secretName: algesta-api-dev-tls
  ingressClassName: webapprouting.kubernetes.azure.com
  rules:
    - host: algesta-api-dev.3astronautas.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway-development
                port:
                  number: 80

Key Configuration:

TLS Certificate: Automatically provisioned by cert-manager via Let’s Encrypt
Timeout Settings: Extended to 600s for long-running requests (PDF generation)
Body Size: 300MB limit for file uploads (development), 10MB (production/testing)

Testing Environment (ingress-testing.yaml):

Host: algesta-api-test.3astronautas.com
TLS Secret: algesta-api-test-tls
Backend Service: api-gateway-testing

Production Environment (ingress-production.yaml):

Host: algesta-api-prod.3astronautas.com
TLS Secret: algesta-api-prod-tls
Backend Service: api-gateway-production

Monitoring Ingress (ingress-monitoring.yaml):

Host: algesta.grafana.3astronautas.com
Backend Service: grafana-service (monitoring namespace)

TLS Certificate Management

cert-manager Configuration:

ClusterIssuer (ops-algesta/resources-k8s/cert-manager/Issuer.yaml):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
  namespace: cert-manager
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: j.leon@tresastronautas.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx

ClusterIssuer for Web App Routing (ops-algesta/resources-k8s/cert-manager/Issuer-webapprouting.yaml):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-webapprouting
  namespace: cert-manager
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: j.leon@tresastronautas.com
    privateKeySecretRef:
      name: letsencrypt-prod-webapprouting
    solvers:
      - http01:
          ingress:
            ingressClassName: webapprouting.kubernetes.azure.com

Certificate Lifecycle:

Ingress created with cert-manager.io/cluster-issuer annotation
cert-manager detects annotation, creates Certificate resource
cert-manager initiates ACME challenge (HTTP-01)
Let’s Encrypt validates domain ownership
Certificate issued and stored in Secret (e.g., algesta-api-prod-tls)
Ingress controller uses Secret for TLS termination
Auto-renewal 30 days before expiration

Checking Certificate Status:

# List certificates in namespace
kubectl get certificates -n production

# Check certificate details
kubectl describe certificate algesta-api-prod-tls -n production

# Check certificate expiration
kubectl get secret algesta-api-prod-tls -n production -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate

Manual Certificate Renewal (if auto-renewal fails):

# Delete certificate to trigger re-issuance
kubectl delete certificate algesta-api-prod-tls -n production

# cert-manager will automatically recreate it
kubectl get certificate algesta-api-prod-tls -n production -w

Secrets Management

Kubernetes Secrets

Current Approach: Manual secret creation via kubectl

Production MongoDB Secret:

kubectl create secret generic mongodb-credentials \
  --from-literal=uri="mongodb+srv://admin:SecurePassword@cluster.mongodb.net/algesta?retryWrites=true&w=majority" \
  --namespace=production

# Verify secret created
kubectl get secret mongodb-credentials -n production

JWT Secret:

kubectl create secret generic jwt-credentials \
  --from-literal=secret="supersecurejwtkey12345" \
  --namespace=production

Listing Secrets:

# List all secrets in namespace
kubectl get secrets -n production

# View secret details (encoded)
kubectl get secret mongodb-credentials -n production -o yaml

# Decode secret value
kubectl get secret mongodb-credentials -n production -o jsonpath='{.data.uri}' | base64 -d

Azure Key Vault Integration (Future)

Recommended: Use Azure Key Vault Provider for Secrets Store CSI Driver

Benefits:

Centralized secret management in Azure Key Vault
Automatic secret rotation
Audit logging
Integration with Azure RBAC

Implementation (future):

Enable CSI Driver in AKS:

az aks enable-addons --addons azure-keyvault-secrets-provider \
  --resource-group rg-algesta-production \
  --name aks-algesta-production

Create SecretProviderClass:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-keyvault-provider
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<managed-identity-client-id>"
    keyvaultName: "akv-algesta-production"
    objects: |
      array:
        - |
          objectName: mongodb-uri
          objectType: secret
          objectVersion: ""
        - |
          objectName: jwt-secret
          objectType: secret
          objectVersion: ""
    tenantId: "<azure-tenant-id>"

Mount Secrets in Pods:

spec:
  containers:
    - name: api-gateway
      volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets"
          readOnly: true
  volumes:
    - name: secrets-store
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: "azure-keyvault-provider"
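
With this mount in place, each Key Vault object would appear as a file under /mnt/secrets, named after its `objectName` unless an `objectAlias` is set. As a minimal sketch, a container entrypoint could load those files into the environment variables that the API Gateway manifest currently reads from Kubernetes Secrets. The `read_secret` helper and the entrypoint usage shown in the comments are assumptions for illustration, not existing Algesta code.

```shell
#!/bin/sh
# read_secret NAME [MOUNT_DIR]: print the content of one mounted secret file.
# MOUNT_DIR defaults to the mountPath used in the pod spec above.
read_secret() {
  secret_dir=${2:-/mnt/secrets}
  secret_file="$secret_dir/$1"
  if [ ! -r "$secret_file" ]; then
    echo "secret file not readable: $secret_file" >&2
    return 1
  fi
  cat "$secret_file"
}

# Hypothetical entrypoint usage inside the container:
#   MONGODB_URI=$(read_secret mongodb-uri) || exit 1
#   JWT_SECRET=$(read_secret jwt-secret) || exit 1
#   export MONGODB_URI JWT_SECRET
#   exec "$@"   # hand over to the container's original command
```

Failing fast when a file is missing keeps a misconfigured SecretProviderClass visible as a startup error instead of an application running with empty credentials.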

Common kubectl Operations

Cluster Access

Configure kubectl:

# Get credentials for AKS cluster
az aks get-credentials \
  --resource-group rg-algesta-production \
  --name aks-algesta-production \
  --overwrite-existing

# Verify connection
kubectl cluster-info
kubectl get nodes

Set Default Namespace:

# Set default namespace to production
kubectl config set-context --current --namespace=production

# Verify current namespace
kubectl config view --minify | grep namespace:

Deployment Operations

Deploy New Version:

# Update deployment image
kubectl set image deployment/api-gateway-production \
  api-gateway=acralgestaproduction.azurecr.io/api-gateway:12345 \
  --namespace=production

# Watch rollout progress
kubectl rollout status deployment/api-gateway-production -n production

# Check rollout history
kubectl rollout history deployment/api-gateway-production -n production

Rollback Deployment:

# Rollback to previous version
kubectl rollout undo deployment/api-gateway-production -n production

# Rollback to specific revision
kubectl rollout undo deployment/api-gateway-production --to-revision=3 -n production

# Verify rollback
kubectl get pods -n production -l app=api-gateway
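
The deploy and rollback commands above can be chained so that a failed rollout is reverted automatically. The following is a sketch only: `deploy_or_rollback` is a hypothetical helper, and the 180s default timeout is an assumed value that should be tuned to the service's startup time.

```shell
#!/bin/sh
# deploy_or_rollback DEPLOYMENT CONTAINER IMAGE NAMESPACE [TIMEOUT]
# Sets the new image, waits for the rollout, and undoes it if the wait fails.
deploy_or_rollback() {
  deployment=$1; container=$2; image=$3; namespace=$4; timeout=${5:-180s}

  # Trigger the rolling update.
  kubectl set image "deployment/$deployment" "$container=$image" -n "$namespace" || return 1

  # Block until the rollout completes or the timeout expires.
  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    echo "rollout of $image succeeded"
  else
    echo "rollout of $image failed, rolling back" >&2
    kubectl rollout undo "deployment/$deployment" -n "$namespace"
    return 1
  fi
}

# Example invocation (not executed here):
#   deploy_or_rollback api-gateway-production api-gateway \
#     acralgestaproduction.azurecr.io/api-gateway:12345 production
```

The function returns non-zero after a rollback, so a calling pipeline step still fails and the broken image tag is not silently accepted.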

Scale Deployment:

# Scale to 5 replicas
kubectl scale deployment/api-gateway-production --replicas=5 -n production

# Verify scaling
kubectl get pods -n production -l app=api-gateway

Restart Deployment (forces pod recreation):

kubectl rollout restart deployment/api-gateway-production -n production

Pod Operations

Viewing Pods:

# List all pods in namespace
kubectl get pods -n production

# List pods with more details
kubectl get pods -n production -o wide

# Watch pods in real-time
kubectl get pods -n production -w

# Filter by label
kubectl get pods -n production -l app=api-gateway

Pod Logs:

# View logs for single pod
kubectl logs api-gateway-production-xxxxx-yyyyy -n production

# Follow logs (tail -f)
kubectl logs -f api-gateway-production-xxxxx-yyyyy -n production

# View logs from previous container (after crash)
kubectl logs api-gateway-production-xxxxx-yyyyy -n production --previous

# View logs from all pods in deployment
kubectl logs -l app=api-gateway -n production --tail=100

Execute Commands in Pod:

# Open shell in pod
kubectl exec -it api-gateway-production-xxxxx-yyyyy -n production -- /bin/sh

# Run single command
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- env | grep NODE_ENV

# Check application health
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- curl localhost:3000/health

Debugging Pods:

# Describe pod (shows events, status, volumes)
kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production

# Check resource usage
kubectl top pod api-gateway-production-xxxxx-yyyyy -n production

# Check pod IP and node assignment
kubectl get pod api-gateway-production-xxxxx-yyyyy -n production -o jsonpath='{.status.podIP}{"\n"}{.spec.nodeName}{"\n"}'

Service Operations

Viewing Services:

# List services
kubectl get services -n production

# Describe service
kubectl describe service api-gateway-production -n production

# Get service endpoints
kubectl get endpoints api-gateway-production -n production

Testing Service Connectivity:

# From within cluster (create debug pod)
kubectl run -it --rm debug --image=curlimages/curl:latest --restart=Never -n production -- sh
# Inside pod:
curl http://api-gateway-production.production.svc.cluster.local/health

Ingress Operations

Viewing Ingress:

# List ingress resources
kubectl get ingress -n production

# Describe ingress (shows rules, backends)
kubectl describe ingress ingress-production -n production

# Check ingress external IP
kubectl get ingress ingress-production -n production -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Testing Ingress:

# Test HTTP (should redirect to HTTPS)
curl -v http://algesta-api-prod.3astronautas.com/health

# Test HTTPS
curl -v https://algesta-api-prod.3astronautas.com/health

# Check TLS certificate
openssl s_client -connect algesta-api-prod.3astronautas.com:443 -servername algesta-api-prod.3astronautas.com < /dev/null 2>/dev/null | openssl x509 -noout -dates

Scaling and Resource Management

Horizontal Pod Autoscaling (HPA)

Current HPA Configuration:

# List HPAs
kubectl get hpa -n production

# Describe HPA
kubectl describe hpa api-gateway-production-hpa -n production

HPA Metrics:

CPU Target: 70% utilization
Memory Target: 80% utilization
Min Replicas: 2 (production), 1 (dev/test)
Max Replicas: 10 (production), 3 (dev/test)
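
For reference, the HPA controller derives the replica count from the ratio of observed to target utilization: desired = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min/max bounds. When several metrics are configured, the largest result is used. The sketch below reproduces that arithmetic with the production values above. It deliberately ignores the controller's tolerance band and stabilization windows, and `hpa_desired` is an illustrative name.

```shell
#!/bin/sh
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Utilization values are integer percentages of the pod's resource request.
hpa_desired() {
  replicas=$1; util=$2; target=$3; min=$4; max=$5
  # Integer ceiling of replicas * util / target.
  desired=$(( (replicas * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# 2 replicas at 90% CPU against the 70% target: ceil(180 / 70) = 3
cpu_based=$(hpa_desired 2 90 70 2 10)
# 2 replicas at 60% memory against the 80% target: ceil(120 / 80) = 2
mem_based=$(hpa_desired 2 60 80 2 10)

# With multiple metrics, the larger proposal wins.
if [ "$cpu_based" -gt "$mem_based" ]; then
  echo "desired replicas: $cpu_based"
else
  echo "desired replicas: $mem_based"
fi
```

Because utilization is measured against each pod's request, changing the CPU or memory requests in the Deployment also shifts the point at which the HPA scales.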

Manually Disable HPA (for testing):

# Pin the HPA to a single replica (min = max = 1), which effectively disables autoscaling
kubectl patch hpa api-gateway-production-hpa -n production --patch '{"spec":{"minReplicas":1,"maxReplicas":1}}'

Cluster Autoscaling

Node Pool Autoscaling:

Managed by Azure AKS (configured in Terraform)
Trigger: Pods in Pending state due to insufficient resources
Scale Up: Add nodes to user node pool (max 3)
Scale Down: Remove idle nodes after 10 minutes

Check Node Status:

# List nodes
kubectl get nodes

# Check node resource usage
kubectl top nodes

# Count pods per node (reads the node name directly instead of relying on column positions)
kubectl get pods -A -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c

Resource Quotas and Limits

Namespace Resource Quota (production):

# View quota
kubectl get resourcequota -n production

# Check quota usage
kubectl describe resourcequota production-quota -n production

Pod Resource Requests/Limits:

Requests: Minimum resources guaranteed (used for scheduling)
Limits: Maximum resources allowed (enforced by kubelet)

Recommended Settings:

Workload	CPU Request	Memory Request	CPU Limit	Memory Limit
API Gateway	250m	512Mi	1000m	1Gi
Microservices	100m	256Mi	500m	512Mi
Monitoring (Grafana)	500m	1Gi	1000m	2Gi

Troubleshooting

Pod Not Starting (ImagePullBackOff)

Symptoms:

kubectl get pods -n production
# NAME                                 READY   STATUS             RESTARTS   AGE
# api-gateway-production-xxxxx-yyyyy   0/1     ImagePullBackOff   0          2m

Diagnosis:

kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production
# Events:
#   Failed to pull image "acralgestaproduction.azurecr.io/api-gateway:12345": rpc error: code = Unknown desc = Error response from daemon: unauthorized: authentication required

Solutions:

Verify ACR Integration:

az aks check-acr --name aks-algesta-production \
  --resource-group rg-algesta-production \
  --acr acralgestaproduction.azurecr.io

Grant AcrPull Role to AKS:

ACR_ID=$(az acr show --name acralgestaproduction --query id --output tsv)
AKS_IDENTITY=$(az aks show --name aks-algesta-production --resource-group rg-algesta-production --query identityProfile.kubeletidentity.objectId --output tsv)

az role assignment create --assignee $AKS_IDENTITY --role AcrPull --scope $ACR_ID

Verify Image Exists:

az acr repository show-tags --name acralgestaproduction --repository api-gateway

Pod Crashing (CrashLoopBackOff)

Symptoms:

kubectl get pods -n production
# NAME                                 READY   STATUS             RESTARTS   AGE
# api-gateway-production-xxxxx-yyyyy   0/1     CrashLoopBackOff   5          10m

Diagnosis:

# Check logs
kubectl logs api-gateway-production-xxxxx-yyyyy -n production

# Check previous container logs (after crash)
kubectl logs api-gateway-production-xxxxx-yyyyy -n production --previous

# Check events
kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production

Common Causes:

Missing Environment Variables:

kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- env | grep -E "(MONGODB|JWT|NODE_ENV)"

Database Connection Failure:

# Test MongoDB connection from pod
kubectl exec api-gateway-production-xxxxx-yyyyy -n production -- nc -zv cluster.mongodb.net 27017

Insufficient Resources:

# Look for OOMKilled under "Last State" and compare it with the configured requests and limits
kubectl describe pod api-gateway-production-xxxxx-yyyyy -n production | grep -E -A 3 "Last State|Limits|Requests"

Service Not Reachable

Symptoms:

curl: (7) Failed to connect to algesta-api-prod.3astronautas.com port 443: Connection refused

Diagnosis:

Check Ingress:

kubectl get ingress ingress-production -n production
# Ensure ADDRESS column shows IP

Check Service:

kubectl get service api-gateway-production -n production
# Ensure CLUSTER-IP is assigned

# Check endpoints (should list pod IPs)
kubectl get endpoints api-gateway-production -n production

Test Service from within Cluster:

kubectl run -it --rm debug --image=curlimages/curl:latest --restart=Never -n production -- \
  curl http://api-gateway-production.production.svc.cluster.local/health

Check DNS Resolution:

# From Azure VM or local machine with VPN
nslookup algesta-api-prod.3astronautas.com
# Should resolve to ingress external IP

Related Documentation:

Infrastructure as Code: AKS cluster provisioning
CI/CD Pipelines: Automated deployment to AKS
Monitoring & Logging: Prometheus, Grafana, Loki setup
Security Operations: Secrets and RBAC management
Runbooks: Common operational procedures

For Support:

Check pod status: kubectl get pods -n production
Review logs: kubectl logs -f <pod-name> -n production
Describe resources: kubectl describe deployment/service/ingress <name> -n production
Contact DevOps team for AKS access and credentials