API Gateway Resilience and Error Handling
Overview
In a microservices architecture, resilience is critical to prevent cascading failures and to ensure graceful degradation when downstream services experience problems. The API Gateway implements multiple resilience patterns to maintain service availability and provide consistent error handling.
Implemented Patterns:
- Circuit Breaker: prevent cascading failures and fail fast when services are unavailable
- Health Checks: monitor the state of the gateway and downstream services
- Global Error Handling: standardized error responses with request tracking
- Request Tracking: a unique traceId for end-to-end request correlation
- Retry Mechanisms: automatic connection retries for Redis/Kafka
Goals:
- Prevent cascading failures across microservices
- Provide fast failure detection and recovery
- Enable graceful degradation when services are unavailable
- Maintain observability through logging and monitoring
- Ensure consistent error responses for clients
Circuit Breaker Pattern
The Circuit Breaker pattern prevents repeated calls to failing services, giving them time to recover while protecting the gateway from cascading failures.
Circuit Breaker States
stateDiagram-v2
[*] --> CLOSED
CLOSED --> OPEN: Failure threshold exceeded<br/>(5 consecutive failures)
OPEN --> HALF_OPEN: Recovery timeout elapsed<br/>(60 seconds)
HALF_OPEN --> CLOSED: Successful request
HALF_OPEN --> OPEN: Failed request
note right of CLOSED
Normal operation
All requests pass through
Track failure count
end note
note right of OPEN
Service unavailable
Reject requests immediately
No calls to microservice
end note
note right of HALF_OPEN
Testing recovery
Allow limited requests
Evaluate service health
end note
State Descriptions
CLOSED (Normal Operation):
- All requests pass through to the microservices
- Failures are tracked and counted
- Transitions to OPEN when the failure threshold is exceeded
OPEN (Circuit Open):
- Requests fail immediately without calling the microservice
- Fast-fail behavior protects downstream services
- Fallback mechanisms provide degraded functionality
- Transitions automatically to HALF_OPEN after the recovery timeout
HALF_OPEN (Testing Recovery):
- A limited number of requests are allowed through to probe service health
- Success → transition to CLOSED (service recovered)
- Failure → transition back to OPEN (service still failing)
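These transitions map to a small state machine. The following is a minimal, illustrative sketch of how a class such as CircuitBreakerService might implement them; the actual implementation in the repository may differ in structure and naming.

```typescript
// Minimal sketch of the CLOSED -> OPEN -> HALF_OPEN cycle (illustrative, not the real service)
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

export class SimpleCircuitBreaker {
  private state: CircuitState = "CLOSED";
  private failureCount = 0;
  private lastFailureTime = 0;

  constructor(
    private readonly failureThreshold = 5, // CIRCUIT_BREAKER_FAILURE_THRESHOLD
    private readonly recoveryTimeoutMs = 60000 // CIRCUIT_BREAKER_RECOVERY_TIMEOUT
  ) {}

  async execute<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      // Fail fast until the recovery timeout has elapsed, then allow a probe request
      if (Date.now() - this.lastFailureTime < this.recoveryTimeoutMs) {
        return fallback();
      }
      this.state = "HALF_OPEN";
    }

    try {
      const result = await primary();
      // A successful request (including a HALF_OPEN probe) closes the circuit
      this.state = "CLOSED";
      this.failureCount = 0;
      return result;
    } catch (error) {
      this.failureCount++;
      this.lastFailureTime = Date.now();
      // Open after the threshold, or immediately when a HALF_OPEN probe fails
      if (this.state === "HALF_OPEN" || this.failureCount >= this.failureThreshold) {
        this.state = "OPEN";
      }
      return fallback();
    }
  }
}
```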
Configuration
Circuit breaker behavior is controlled through environment variables:
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| CIRCUIT_BREAKER_FAILURE_THRESHOLD | Consecutive failures before opening | 5 | 3-10 depending on criticality |
| CIRCUIT_BREAKER_RECOVERY_TIMEOUT | Time before attempting recovery (ms) | 60000 | 30000-120000 (30s-2min) |
| CIRCUIT_BREAKER_MONITORING_PERIOD | Window for tracking failures (ms) | 300000 | 300000-600000 (5-10min) |
Configuration Example:
```bash
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=60000
CIRCUIT_BREAKER_MONITORING_PERIOD=300000
```
Usage Example
```typescript
import { CircuitBreakerService } from "@/infrastructure/messaging/circuit-breaker.service";

@Injectable()
export class GetOrdersQueryHandler {
  constructor(
    @Inject("MS_ORDERS") private msOrdersClient: ClientProxy,
    private circuitBreaker: CircuitBreakerService
  ) {}

  async execute(query: GetOrdersQuery): Promise<Order[]> {
    return await this.circuitBreaker.execute(
      // Primary function: call the microservice
      async () => {
        return await this.msOrdersClient
          .send("orders.getAll", query)
          .toPromise();
      },
      // Fallback function: return a cached or degraded response
      async () => {
        console.warn("Circuit breaker open, using fallback");
        return {
          status: "service_unavailable",
          message: "Order service temporarily unavailable",
          cached: true,
          data: [], // Return an empty array or cached data
        };
      }
    );
  }
}
```
Fallback Mechanisms
Graceful Degradation Strategies:
1. Return Cached Data:
```typescript
async () => {
  const cachedOrders = await this.cacheService.get("orders");
  return cachedOrders || [];
};
```
2. Return Partial Data:
```typescript
async () => {
  return {
    status: "degraded",
    message: "Some features unavailable",
    availableData: {
      /* partial data */
    },
  };
};
```
3. Return an Error with User Guidance:
```typescript
async () => {
  throw new ServiceUnavailableException(
    "Order service is temporarily unavailable. Please try again later."
  );
};
```
4. Use an Alternative Service:
```typescript
async () => {
  // Fall back to a read replica or alternative data source
  return await this.readReplicaClient.send("orders.getAll", query).toPromise();
};
```
Manual Circuit Breaker Control
```typescript
// Get circuit breaker statistics
const stats = this.circuitBreaker.getStats();
console.log(stats);
// {
//   state: 'OPEN',
//   failureCount: 7,
//   lastFailureTime: '2025-01-15T10:30:00Z',
//   nextRetryTime: '2025-01-15T10:31:00Z'
// }

// Manually reset the circuit breaker (operational control)
this.circuitBreaker.reset();
```
Use Cases for Manual Reset:
- After deploying a fix to the downstream service
- During maintenance windows
- When monitoring confirms service recovery
- For testing purposes
Reference File: algesta-api-gateway-nestjs/src/infrastructure/messaging/circuit-breaker.service.ts
Health Checks
Health checks provide visibility into the operational state of the gateway and the health of connected microservices.
Gateway Health Endpoint
GET /health (no authentication required)
Returns the health status of the gateway itself for load balancers and monitoring tools.
Response Format:
{ "status": "ok", "service": "api-gateway", "timestamp": "2025-01-15T10:30:00Z", "uptime": 3600, "version": "1.0.0", "environment": "production"}Campos de Respuesta:
| Campo | Tipo | Descripción |
|---|---|---|
| status | string | "ok" cuando está saludable; estados de error dependientes de implementación |
| service | string | Nombre del servicio (api-gateway) |
| timestamp | string | Timestamp UTC actual |
| uptime | number | Tiempo de actividad del proceso en segundos |
| version | string | Versión de la aplicación |
| environment | string | Entorno (development, production, etc.) |
Nota: El campo status está configurado como "ok" por la implementación actual (HealthService o handler inline en main.ts). Este Endpoint no usa el wrapper de respuesta estándar.
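For reference, a minimal handler producing this payload could look like the sketch below. It only mirrors the documented response fields; the actual wiring (HealthService, the inline handler in main.ts, and the registration before the global /api prefix) may differ.

```typescript
// Illustrative /health handler sketch; field values mirror the documented response format.
// Note: in the real gateway this route is registered before the global /api prefix (see main.ts).
import { Controller, Get } from "@nestjs/common";

@Controller("health")
export class HealthController {
  @Get()
  getHealth() {
    return {
      status: "ok",
      service: "api-gateway",
      timestamp: new Date().toISOString(),
      uptime: process.uptime(),
      version: process.env.SERVICE_VERSION || "1.0.0",
      environment: process.env.NODE_ENV || "development",
    };
  }
}
```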
Load Balancer Integration:
```yaml
# Kubernetes liveness probe example
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
Note: This endpoint is registered before the global /api prefix for easy access by infrastructure tooling.
Reference Files:
- algesta-api-gateway-nestjs/src/main.ts (endpoint registration)
- algesta-api-gateway-nestjs/src/shared/health/health.service.ts
Downstream Services Health Check
GET /api/health/services (authentication required)
Checks the health of all connected microservices in parallel and provides an aggregated status.
Response Format:
{ "status": "degraded", "timestamp": "2025-01-15T10:30:00Z", "services": [ { "service": "ms-auth", "status": "healthy", "responseTime": 45, "url": "http://localhost:3001/health" }, { "service": "ms-patient", "status": "healthy", "responseTime": 32, "url": "http://localhost:3003/health" }, { "service": "ms-ai-integrator", "status": "timeout", "responseTime": 5100, "error": "Connection timeout after 5000ms", "url": "http://localhost:3002/health" } ], "summary": { "total": 3, "healthy": 2, "unhealthy": 0, "degraded": 1 }}Valores de Estado (basado en ServicesHealthService y ServiceHealthEstado):
- Estado de nivel superior: Uno de
'healthy','degraded', o'unhealthy' - Estado por servicio: Uno de
'healthy','unhealthy', o'timeout'
Aggregate Status Logic:
- healthy: all services are healthy
- degraded: some services are healthy (a mix of healthy/timeout/unhealthy)
- unhealthy: no service is healthy (all timeout or unhealthy)
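This aggregation rule maps naturally onto a small pure function. A possible sketch, assuming the per-service and top-level status types described above:

```typescript
type ServiceStatus = "healthy" | "unhealthy" | "timeout";
type AggregateStatus = "healthy" | "degraded" | "unhealthy";

// Derive the top-level status from the individual service checks
function aggregateStatus(services: { status: ServiceStatus }[]): AggregateStatus {
  const healthy = services.filter((s) => s.status === "healthy").length;
  if (healthy === services.length) return "healthy"; // all services healthy
  if (healthy > 0) return "degraded"; // mixed results
  return "unhealthy"; // no service healthy
}
```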
Currently Configured Services (from ServicesHealthService.getConfiguredServices()):
- ms-auth (default URL: http://localhost:3001)
- ms-patient (default URL: http://localhost:3003)
- ms-ai-integrator (default URL: http://localhost:3002)
Service URLs can be overridden through the MS_AUTH_URL, MS_PATIENT_URL, and MS_AI_INTEGRATOR_URL environment variables.
Configuration:
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| HEALTH_CHECK_TIMEOUT | Timeout for each service check (ms) | 5000 | 3000-10000 |
| MS_AUTH_URL | Auth service health endpoint | http://localhost:3001 | Production URL |
| MS_PATIENT_URL | Patient service health endpoint | http://localhost:3003 | Production URL |
| MS_AI_INTEGRATOR_URL | AI Integrator health endpoint | http://localhost:3002 | Production URL |
Implementation Details:
- Parallel Checks: all services are checked concurrently for better performance
- Timeout Handling: each check has an independent timeout
- Error Handling: failed checks do not break the endpoint
- Response Time Tracking: round-trip time is measured for each service
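A sketch of how these checks could run concurrently with independent timeouts is shown below. It is illustrative only (ServicesHealthService may use a different HTTP client and structure) and assumes Node 18+ for the global fetch and AbortController.

```typescript
// Illustrative parallel health check with a per-service timeout
type ServiceCheck = {
  service: string;
  status: "healthy" | "unhealthy" | "timeout";
  responseTime: number;
  url: string;
  error?: string;
};

async function checkService(service: string, baseUrl: string, timeoutMs = 5000): Promise<ServiceCheck> {
  const url = `${baseUrl}/health`;
  const started = Date.now();
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return { service, status: res.ok ? "healthy" : "unhealthy", responseTime: Date.now() - started, url };
  } catch (error: any) {
    const timedOut = error?.name === "AbortError";
    return {
      service,
      status: timedOut ? "timeout" : "unhealthy",
      responseTime: Date.now() - started,
      url,
      error: timedOut ? `Connection timeout after ${timeoutMs}ms` : error?.message,
    };
  } finally {
    clearTimeout(timer);
  }
}

async function checkAllServices(): Promise<ServiceCheck[]> {
  const targets: [string, string][] = [
    ["ms-auth", process.env.MS_AUTH_URL || "http://localhost:3001"],
    ["ms-patient", process.env.MS_PATIENT_URL || "http://localhost:3003"],
    ["ms-ai-integrator", process.env.MS_AI_INTEGRATOR_URL || "http://localhost:3002"],
  ];
  // All checks run concurrently; a failing check resolves to an "unhealthy"/"timeout"
  // entry instead of rejecting, so one bad service never breaks the endpoint.
  return Promise.all(targets.map(([service, url]) => checkService(service, url)));
}
```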
Testing the Health Checks:
```bash
# Check gateway health
curl http://localhost:3000/health

# Check all services health (requires authentication)
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"
```
Reference Files:
- algesta-api-gateway-nestjs/src/shared/health/services-health.service.ts
- algesta-api-gateway-nestjs/src/shared/health/health.service.ts
Error Handling
The gateway implements global error handling to ensure consistent error responses and proper propagation of errors from the microservices.
Global Exception Filter
HttpExceptionFilter catches all exceptions and formats them consistently.
Implementation:
```typescript
@Catch()
export class HttpExceptionFilter implements ExceptionFilter {
  catch(exception: unknown, host: ArgumentsHost) {
    const ctx = host.switchToHttp();
    const response = ctx.getResponse();
    const request = ctx.getRequest();

    const status =
      exception instanceof HttpException ? exception.getStatus() : 500;

    const message =
      exception instanceof HttpException
        ? exception.message
        : "Internal server error";

    const errorResponse = {
      statusCode: status,
      message,
      error:
        exception instanceof HttpException
          ? exception.name
          : "InternalServerError",
      timestamp: new Date().toISOString(),
      path: request.url,
      traceId: request.traceId || uuidv4(),
    };

    // Log the error with Winston
    this.logger.error(message, {
      ...errorResponse,
      stack: exception instanceof Error ? exception.stack : undefined,
    });

    response.status(status).json(errorResponse);
  }
}
```
Error Response Format
Standard Error Response:
```json
{
  "statusCode": 400,
  "message": "Validation failed",
  "error": "BadRequestException",
  "timestamp": "2025-01-15T10:30:00Z",
  "path": "/api/orders",
  "traceId": "550e8400-e29b-41d4-a716-446655440000"
}
```
Response Fields:
| Field | Type | Description |
|---|---|---|
| statusCode | number | HTTP status code (400, 401, 404, 500, etc.) |
| message | string | Human-readable error message |
| error | string | Error type/name |
| timestamp | string | UTC timestamp when the error occurred |
| path | string | Request path that caused the error |
| traceId | string | Unique request identifier for tracking |
Exception Types
The gateway handles several exception types:
Client Errors (4xx):
| Exception | Status Code | Usage |
|---|---|---|
| BadRequestException | 400 | Invalid request data, validation errors |
| UnauthorizedException | 401 | Missing or invalid authentication token |
| ForbiddenException | 403 | Insufficient permissions, unverified email |
| NotFoundException | 404 | Resource not found |
| ConflictException | 409 | Resource already exists |
| UnprocessableEntityException | 422 | Business logic validation failed |
Server Errors (5xx):
| Exception | Status Code | Usage |
|---|---|---|
| InternalServerErrorException | 500 | Unexpected errors |
| ServiceUnavailableException | 503 | Microservice unavailable, circuit breaker open |
| GatewayTimeoutException | 504 | Microservice timeout |
Validation Errors
Class-validator integration provides detailed validation errors:
Invalid Request:
```bash
curl -X POST http://localhost:3000/api/orders \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "service": "",
    "priority": "INVALID"
  }'
```
Validation Error Response:
```json
{
  "statusCode": 400,
  "message": [
    "service should not be empty",
    "priority must be one of: LOW, MEDIUM, HIGH, URGENT"
  ],
  "error": "BadRequestException",
  "timestamp": "2025-01-15T10:30:00Z",
  "path": "/api/orders",
  "traceId": "uuid-here"
}
```
Microservice Error Propagation
When microservices return errors, the gateway wraps and propagates them:
Microservice Error Flow:
sequenceDiagram
participant Client
participant Gateway
participant Handler
participant MS as Microservice
Client->>Gateway: POST /api/orders
Gateway->>Handler: CreateOrderCommand
Handler->>MS: Send 'orders.create' message
MS-->>Handler: Error: "Service not available"
Handler->>Handler: Wrap error
Handler-->>Gateway: throw ServiceUnavailableException
Gateway->>Gateway: HttpExceptionFilter
Gateway-->>Client: 503 Service Unavailable<br/>+ traceId
Error Wrapping Example:
```typescript
try {
  const result = await this.msOrdersClient
    .send("orders.create", createOrderDto)
    .toPromise();
  return result;
} catch (error) {
  // Preserve the original error context
  if (error.statusCode === 503) {
    throw new ServiceUnavailableException(
      `Orders service unavailable: ${error.message}`
    );
  }

  if (error.statusCode === 400) {
    throw new BadRequestException(error.message);
  }

  // Unexpected errors
  throw new InternalServerErrorException(
    "Failed to create order. Please try again later."
  );
}
```
Error Context Preservation:
- The original error message is included in the gateway error
- The HTTP status code is mapped appropriately
- The traceId is attached for correlation
- The error is logged with the full stack trace
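The snippet above does not show the traceId explicitly. A hedged variant that carries it into the wrapped error could look like the following sketch; the helper name and the assumption that the traceId has already been attached to the request are illustrative only.

```typescript
import { ServiceUnavailableException } from "@nestjs/common";

// Illustrative helper: wrap a downstream error and keep the traceId for correlation
function wrapServiceError(error: any, traceId: string): never {
  throw new ServiceUnavailableException({
    statusCode: 503,
    message: `Orders service unavailable: ${error?.message ?? "unknown error"}`,
    traceId, // lets the HttpExceptionFilter and the logs correlate this failure with the request
  });
}
```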
Reference File: algesta-api-gateway-nestjs/src/infrastructure/rest/filters/http-exception.filter.ts
Request Tracking
Each request receives a unique traceId for end-to-end tracking and debugging.
ResponseInterceptor
Implementation:
```typescript
@Injectable()
export class ResponseInterceptor implements NestInterceptor {
  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const response = context.switchToHttp().getResponse();

    // Generate or reuse an existing traceId
    const traceId = request.traceId || uuidv4();
    request.traceId = traceId;

    return next.handle().pipe(
      map((data) => ({
        statusCode: response.statusCode,
        message: this.getSuccessMessage(context),
        timestamp: new Date().toISOString(),
        path: request.url,
        method: request.method,
        data,
        traceId,
      }))
    );
  }
}
```
Successful Response with TraceId
Example Request:
```bash
curl http://localhost:3000/api/orders/order-123 \
  -H "Authorization: Bearer <token>"
```
Response:
```json
{
  "statusCode": 200,
  "message": "Order retrieved successfully",
  "timestamp": "2025-01-15T10:30:00Z",
  "path": "/api/orders/order-123",
  "method": "GET",
  "data": {
    "orderId": "order-123",
    "status": "IN_PROGRESS"
    // ... order data
  },
  "traceId": "550e8400-e29b-41d4-a716-446655440000"
}
```
TraceId Benefits
End-to-End Request Tracking:
- Unique identifier for each request
- Included in logs, responses, and error messages
- Enables correlation across distributed systems
Debugging and Troubleshooting:
- Search the logs by traceId to see the full request lifecycle
- Identify performance bottlenecks
- Trace errors across microservices
Performance Analysis:
- Measure end-to-end request duration
- Identify slow microservice calls
- Detect timeout issues
Example: Searching Logs by TraceId
```bash
# Search logs for a specific request
grep "550e8400-e29b-41d4-a716-446655440000" /var/log/api-gateway/*.log

# Example log entries with traceId
# 2025-01-15 10:30:00 [INFO] Request received { traceId: "550e8400...", method: "GET", path: "/api/orders/order-123" }
# 2025-01-15 10:30:00 [DEBUG] Calling MS_ORDERS { traceId: "550e8400...", pattern: "orders.getById" }
# 2025-01-15 10:30:01 [INFO] Response sent { traceId: "550e8400...", statusCode: 200, duration: 1050ms }
```
User Guidance:
- Include the traceId when reporting issues
- Use the traceId to track request status
- Reference the traceId in support tickets
Reference File: algesta-api-gateway-nestjs/src/infrastructure/rest/interceptors/response.interceptor.ts
Retry Mechanisms
The gateway implements automatic retry mechanisms for transient failures on Redis and Kafka connections.
Redis Retry Configuration
Configuration:
```typescript
{
  host: process.env.REDIS_HOST,
  port: parseInt(process.env.REDIS_PORT),
  password: process.env.REDIS_PASSWORD,
  db: parseInt(process.env.REDIS_DB),
  retryStrategy: (times) => {
    const maxRetries = parseInt(process.env.REDIS_RETRY_ATTEMPTS) || 5;
    const retryDelay = parseInt(process.env.REDIS_RETRY_DELAY) || 3000;

    if (times > maxRetries) {
      return null; // Stop retrying
    }

    // Linearly increasing backoff: base delay * attempt number
    return retryDelay * times;
  },
  connectTimeout: parseInt(process.env.REDIS_CONNECT_TIMEOUT) || 60000,
  lazyConnect: true, // Connect on first use
}
```
Environment Variables:
| Variable | Description | Default | Example |
|---|---|---|---|
| REDIS_RETRY_ATTEMPTS | Maximum number of retry attempts | 5 | 10 |
| REDIS_RETRY_DELAY | Base delay between retries (ms) | 3000 | 5000 |
| REDIS_CONNECT_TIMEOUT | Connection timeout (ms) | 60000 | 30000 |
Retry Behavior:
- 1st retry: wait 3 seconds (1 × 3000ms)
- 2nd retry: wait 6 seconds (2 × 3000ms)
- 3rd retry: wait 9 seconds (3 × 3000ms)
- 4th retry: wait 12 seconds (4 × 3000ms)
- 5th retry: wait 15 seconds (5 × 3000ms)
- After 5 attempts: stop retrying and throw an error
Kafka Retry Configuration
Configuration:
```typescript
{
  client: {
    clientId: process.env.KAFKA_CLIENT_ID,
    brokers: process.env.KAFKA_BROKERS?.split(','),
    retry: {
      initialRetryTime: 3000,
      retries: 5,
      multiplier: 2, // Exponential backoff
      maxRetryTime: 60000,
    },
  },
  consumer: {
    groupId: process.env.KAFKA_GROUP_ID,
    allowAutoTopicCreation: true,
    sessionTimeout: 30000,
    retry: {
      retries: 5,
    },
  },
}
```
Exponential Backoff:
- 1st retry: wait 3 seconds
- 2nd retry: wait 6 seconds (3s × 2)
- 3rd retry: wait 12 seconds (6s × 2)
- 4th retry: wait 24 seconds (12s × 2)
- 5th retry: wait 48 seconds (24s × 2)
- Maximum wait: capped at 60 seconds
Lazy Connection Strategy
Benefits:
- The gateway can start even if Redis/Kafka is temporarily unavailable
- Reduces startup time
- Services can start in any order
- The connection is established on first real use
Implementation:
```typescript
// Redis lazy connect
const redis = new Redis({
  lazyConnect: true,
  // ... other options
});

// The connection happens on the first operation
await redis.get("key"); // Triggers the connection if not already connected
```
Reference File: algesta-api-gateway-nestjs/src/config/config.transport.ts
Timeout Configuration
Health Check Timeout
Configuration:
```bash
HEALTH_CHECK_TIMEOUT=5000 # 5 seconds
```
Behavior:
- Each microservice health check has an independent timeout
- The timeout prevents hung requests
- Failed health checks are marked as "unhealthy"
Redis Connection Timeout
Configuration:
```bash
REDIS_CONNECT_TIMEOUT=60000 # 60 seconds
```
Behavior:
- Timeout for the initial connection attempt
- Applies to lazy connection establishment
- After the timeout, retries begin (if configured)
Microservice Call Timeouts
Current Gap: a global timeout policy is not implemented for microservice calls.
Recommended Implementation:
```typescript
// Timeout interceptor
@Injectable()
export class TimeoutInterceptor implements NestInterceptor {
  constructor(private readonly timeout: number = 30000) {} // 30 seconds default

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    return next.handle().pipe(
      timeout(this.timeout),
      catchError((err) => {
        if (err.name === "TimeoutError") {
          throw new GatewayTimeoutException("Request timeout");
        }
        throw err;
      })
    );
  }
}
```
Recommendations:
- Implement a global timeout interceptor
- Configure per-endpoint timeouts for long-running operations
- Set a default timeout of 30 seconds for most operations
- Increase the timeout for specific operations (PDF generation, data export, etc.)
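If this interceptor is adopted, it could be registered globally in main.ts, as in the sketch below (assuming the TimeoutInterceptor class shown above and the gateway's existing AppModule):

```typescript
// main.ts (sketch): apply the recommended TimeoutInterceptor globally
import { NestFactory } from "@nestjs/core";
import { AppModule } from "./app.module";
import { TimeoutInterceptor } from "./infrastructure/rest/interceptors/timeout.interceptor"; // assumed location

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  app.useGlobalInterceptors(new TimeoutInterceptor(30000)); // 30-second default for all routes
  await app.listen(3000);
}
bootstrap();
```

Long-running endpoints (PDF generation, data export) could then override the default with `@UseInterceptors(new TimeoutInterceptor(120000))` at the controller or handler level.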
Logging and Monitoring
Winston Logger Configuration
The gateway uses Winston for structured logging:
Log Levels:
- error: error events that might still allow the application to keep running
- warn: warning events that indicate potential problems
- info: informational messages about application state
- debug: detailed debugging information
Log Format:
```json
{
  "level": "info",
  "message": "Request received",
  "timestamp": "2025-01-15T10:30:00.000Z",
  "context": "OrdersController",
  "traceId": "550e8400-e29b-41d4-a716-446655440000",
  "method": "GET",
  "path": "/api/orders",
  "userId": "user-123"
}
```
Log Categories
1. Application Startup/Shutdown:
```typescript
logger.info("API Gateway starting...", {
  environment: process.env.NODE_ENV,
  version: process.env.SERVICE_VERSION,
  port: process.env.PORT,
});

logger.info("All microservice connections established");

// On shutdown
logger.info("API Gateway shutting down gracefully...");
```
2. Request/Response Logging:
```typescript
logger.info("Request received", {
  traceId: request.traceId,
  method: request.method,
  path: request.path,
  userId: request.user?.userId,
  userAgent: request.headers["user-agent"],
});

logger.info("Response sent", {
  traceId: request.traceId,
  statusCode: response.statusCode,
  duration: Date.now() - request.startTime,
});
```
3. Authentication Events:
```typescript
logger.info("User login successful", {
  userId: user.userId,
  email: user.email,
  role: user.role,
});

logger.warn("Failed login attempt", {
  email: loginDto.email,
  reason: "Invalid password",
  ipAddress: request.ip,
});
```
4. Microservice Communication:
```typescript
logger.debug("Calling microservice", {
  service: "MS_ORDERS",
  pattern: "orders.create",
  traceId: request.traceId,
});

logger.error("Microservice call failed", {
  service: "MS_ORDERS",
  pattern: "orders.create",
  error: error.message,
  traceId: request.traceId,
});
```
5. Circuit Breaker State Changes:
```typescript
logger.warn("Circuit breaker opened", {
  service: "MS_ORDERS",
  failureCount: 5,
  threshold: 5,
});

logger.info("Circuit breaker closed", {
  service: "MS_ORDERS",
  message: "Service recovered",
});
```
6. Health Check Results:
```typescript
logger.info("Health check performed", {
  service: "ms-auth",
  status: "healthy",
  responseTime: 45,
});

logger.error("Health check failed", {
  service: "ms-orders",
  status: "unhealthy",
  error: "Connection timeout",
});
```
7. Errors and Exceptions:
```typescript
logger.error("Unhandled exception", {
  error: error.message,
  stack: error.stack,
  traceId: request.traceId,
  path: request.path,
});
```
Integration with External Monitoring
Prometheus Metrics (Not Currently Implemented):
Recommended metrics to expose:
```
// Request metrics
http_requests_total{method="GET", path="/api/orders", status="200"}
http_request_duration_seconds{method="GET", path="/api/orders"}

// Circuit breaker metrics
circuit_breaker_state{service="MS_ORDERS", state="OPEN"}
circuit_breaker_failures_total{service="MS_ORDERS"}

// Health check metrics
health_check_status{service="ms-auth", status="healthy"}
health_check_duration_seconds{service="ms-auth"}

// Authentication metrics
auth_attempts_total{result="success"}
auth_attempts_total{result="failure"}
```
Grafana Dashboards (Not Currently Implemented):
Recommended dashboard panels:
- Request rate (requests/second)
- Response time (p50, p95, p99)
- Error rate by endpoint
- Circuit breaker state
- Microservice health status
- Active user sessions
ELK Stack Integration:
Winston logs can be shipped to Elasticsearch:
```typescript
import { ElasticsearchTransport } from "winston-elasticsearch";

const logger = winston.createLogger({
  transports: [
    new ElasticsearchTransport({
      level: "info",
      clientOpts: {
        node: process.env.ELASTICSEARCH_URL,
      },
      index: "api-gateway-logs",
    }),
  ],
});
```
Reference File: algesta-api-gateway-nestjs/src/shared/logger/logger.service.ts
Graceful Shutdown
Proper shutdown handling ensures in-flight requests complete and connections close cleanly.
Current Implementation Status
Gap: explicit shutdown handling is not visible in the codebase. NestJS provides default lifecycle hooks, but custom shutdown logic may be needed.
Recommended Implementation
Shutdown Handler:
```typescript
import { NestFactory } from "@nestjs/core";

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Enable shutdown hooks
  app.enableShutdownHooks();

  // Custom shutdown handler
  process.on("SIGTERM", async () => {
    logger.info("SIGTERM received, starting graceful shutdown...");

    // Stop accepting new requests
    await app.close();

    logger.info("Graceful shutdown completed");
    process.exit(0);
  });

  await app.listen(3000);
}
```
Shutdown Steps:
- Stop Accepting New Requests: close the HTTP server listener
- Drain In-Flight Requests: wait for active requests to complete (with a timeout)
- Close Microservice Connections: disconnect from Redis/Kafka gracefully
- Flush Logs: ensure all logs are written
- Signal the Load Balancer: update the health check to "unhealthy" during shutdown
- Exit Process: terminate with exit code 0
Kubernetes Integration:
```yaml
# Deployment with graceful shutdown
spec:
  template:
    spec:
      containers:
        - name: api-gateway
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
      terminationGracePeriodSeconds: 30
```
Recommended Timeouts:
- Graceful shutdown timeout: 30 seconds
- Load balancer deregistration: 15 seconds
- Request drain period: 10 seconds
- Connection close period: 5 seconds
Resilience Testing
Circuit Breaker Behavior Testing
Simulate a Microservice Failure:
```bash
# 1. Make repeated requests to trigger the circuit breaker
for i in {1..10}; do
  curl http://localhost:3000/api/orders \
    -H "Authorization: Bearer <token>"
  sleep 1
done

# If MS_ORDERS is down, the circuit should open after 5 failures
# Subsequent requests should fail immediately with the fallback response
```
Expected Behavior:
- The first 5 requests attempt to call the microservice (timeout/error)
- The circuit breaker opens after the 5th failure
- Requests 6-10 fail immediately without calling the microservice
- The fallback response is returned
Health Check Testing
Test Gateway Health:
```bash
# Should always return 200 if the gateway is running
curl http://localhost:3000/health

# Expected response
{
  "status": "ok",
  "service": "api-gateway",
  "timestamp": "2025-01-15T10:30:00Z",
  "uptime": 3600,
  "version": "1.0.0",
  "environment": "development"
}
```
Test Services Health:
```bash
# Requires authentication
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"

# The expected response shows the status of all microservices
```
Simulate Service Degradation:
```bash
# Stop one microservice (e.g., MS_ORDERS)
# Then check services health
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"

# Expected: status "degraded" or "unhealthy"
# ms-orders should show "unhealthy" with an error message
```
Error Handling Testing
Trigger a Validation Error:
```bash
curl -X POST http://localhost:3000/api/orders \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{}'

# Expected: 400 Bad Request with validation errors
```
Trigger an Authentication Error:
```bash
curl http://localhost:3000/api/orders \
  -H "Authorization: Bearer invalid-token"

# Expected: 401 Unauthorized
```
Trigger an Authorization Error:
```bash
# CLIENT token trying an ADMIN endpoint
curl -X PATCH http://localhost:3000/api/orders/order-123 \
  -H "Authorization: Bearer <client-token>" \
  -H "Content-Type: application/json" \
  -d '{"status":"PUBLISHED"}'

# Expected: 403 Forbidden
```
Timeout Scenario Testing
Simulate a Slow Microservice:
```bash
# If a microservice takes longer than the health check timeout (5s),
# you should see a degraded status with a timeout error

curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"

# A service with a >5s response time is marked "timeout" and the overall status becomes "degraded"
```
Operational Runbook
Checking System Health
1. Check Gateway Health:
```bash
curl http://localhost:3000/health
```
Healthy response:
```
{ "status": "ok", ... }
```
Unhealthy response:
```
{ "status": "error", "error": "..." }
```
Note: the actual status value when healthy is "ok"; error-state status values are implementation-dependent.
2. Check All Services Health:
```bash
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <admin-token>"
```
3. Interpret Health Check Responses:
| Status | Meaning | Action Required |
|---|---|---|
| healthy | All services operational | None |
| degraded | Some services slow/degraded | Monitor, investigate slow services |
| unhealthy | One or more services down | Immediate investigation required |
Investigating Errors Using the TraceId
Scenario: a user reports an error
Steps:
- Get the traceId from the user or from the error response
- Search the logs for the traceId:
```bash
grep "550e8400-e29b-41d4-a716-446655440000" /var/log/api-gateway/*.log
```
- Review the complete request lifecycle
- Identify the failure point (gateway, microservice, database, etc.)
- Check microservice health
- Review the circuit breaker state
Example Log Analysis:
```bash
# Request received
10:30:00 [INFO] Request received { traceId: "550e8400...", path: "/api/orders" }

# Microservice called
10:30:00 [DEBUG] Calling MS_ORDERS { traceId: "550e8400...", pattern: "orders.getAll" }

# Error occurred
10:30:05 [ERROR] Microservice timeout { traceId: "550e8400...", service: "MS_ORDERS", duration: 5000ms }

# Circuit breaker opened
10:30:05 [WARN] Circuit breaker opened { service: "MS_ORDERS", failureCount: 5 }
```
Monitoring Circuit Breaker State
Check Circuit Breaker State (if a monitoring endpoint exists):
```bash
curl http://localhost:3000/api/health/circuit-breaker \
  -H "Authorization: Bearer <admin-token>"

# Expected response
{
  "MS_ORDERS": {
    "state": "OPEN",
    "failureCount": 7,
    "lastFailureTime": "2025-01-15T10:30:00Z",
    "nextRetryTime": "2025-01-15T10:31:00Z"
  },
  "MS_AUTH": {
    "state": "CLOSED",
    "failureCount": 0
  }
}
```
Manual Circuit Breaker Reset (not currently exposed; recommended feature):
```bash
# After fixing the downstream service
curl -X POST http://localhost:3000/api/admin/circuit-breaker/reset \
  -H "Authorization: Bearer <admin-token>" \
  -d '{ "service": "MS_ORDERS" }'
```
Escalation Procedures
Degraded State:
- Review service health check responses
- Check service-specific logs
- Investigate slow queries, resource constraints
- Monitor for recovery
- If it persists for more than 15 minutes, escalate to the on-call engineer
Unhealthy State:
- Identify which service(s) are unhealthy
- Check service availability (network, process, health endpoint)
- Review recent deployments/changes
- Check infrastructure (servers, databases, message queues)
- Restart the service if needed
- If the service doesn't recover within 5 minutes, page the on-call engineer
Circuit Breaker Open:
- Identify the affected microservice
- Check microservice health and logs
- Investigate the root cause (database issue, infinite loop, external API failure)
- Fix the underlying issue
- Manually reset the circuit breaker if needed
- Monitor for successful recovery
Gaps and Recommendations
Cross-Reference: Rate Limiting and Security
Rate Limiting: not currently implemented at the gateway level. For details on this and other security gaps, see the "Security Gaps" section in api-gateway.md. Consider using @nestjs/throttler for request-level rate limiting and DoS protection.
Security Headers (Helmet): not currently implemented. Helmet middleware can protect against some classes of attacks and is part of the broader resilience story. Refer to api-gateway.md for recommendations on implementing Helmet and other security headers.
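If those recommendations are adopted, the wiring could look roughly like the sketch below. It assumes @nestjs/throttler v5+ and the helmet package; the TTL and limit are placeholder values, not agreed policy.

```typescript
// app.module.ts (sketch): request-level rate limiting with @nestjs/throttler
import { Module } from "@nestjs/common";
import { APP_GUARD } from "@nestjs/core";
import { ThrottlerModule, ThrottlerGuard } from "@nestjs/throttler";

@Module({
  imports: [
    // Placeholder limit: 100 requests per 60 seconds per client
    ThrottlerModule.forRoot([{ ttl: 60000, limit: 100 }]),
  ],
  providers: [{ provide: APP_GUARD, useClass: ThrottlerGuard }],
})
export class AppModule {}

// main.ts (sketch): security headers with helmet
// import helmet from "helmet";
// app.use(helmet());
```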
Current Gaps
1. Global Timeout Policy
- Status: Not implemented for microservice calls
- Impact: Requests may hang indefinitely
- Recommendation: Implement a timeout interceptor with a 30-second default
- Priority: High
2. Prometheus Metrics
- Status: Not exposed
- Impact: Limited observability of performance and usage
- Recommendation: Add @nestjs/prometheus for metrics collection
- Metrics to Track:
  - Request rate, duration, errors
  - Circuit breaker state
  - Health check status
  - Active connections
- Priority: High
3. Distributed Tracing (OpenTelemetry)
- Status: Not implemented
- Impact: Cannot trace requests across microservices
- Recommendation: Implement OpenTelemetry for distributed tracing
- Benefits: End-to-end request visibility, performance analysis
- Priority: Medium
4. Bulkhead Pattern
- Status: Not implemented
- Impact: One slow microservice can exhaust all connections
- Recommendation: Isolate thread pools per microservice
- Priority: Medium
5. Retry with Exponential Backoff (Application Level)
- Status: Only implemented for Redis/Kafka connections
- Impact: Transient microservice errors are not retried
- Recommendation: Implement a retry decorator for handlers
- Example:
```typescript
@Retry({ maxAttempts: 3, backoff: 'exponential' })
async execute(query: GetOrdersQuery) {
  return await this.msOrdersClient.send('orders.getAll', query).toPromise();
}
```
- Priority: Low
6. Dead Letter Queue
- Status: Not configured for failed messages
- Impact: Failed messages are lost
- Recommendation: Configure a DLQ for Kafka/Redis to capture failed messages
- Priority: Medium
7. Chaos Engineering Testing
- Status: Not performed
- Impact: Unknown system behavior under failure conditions
- Recommendation: Implement chaos testing (random service failures, latency injection)
- Tools: Chaos Monkey, Toxiproxy
- Priority: Low
8. Request Rate Limiting per Client
- Status: Not implemented
- Impact: No protection against abusive clients
- Recommendation: Implement per-user and per-IP rate limiting
- Priority: High (see the security recommendations in the authentication docs)
9. Adaptive Timeouts
- Status: Static timeouts only
- Impact: Fixed timeouts may be too short or too long
- Recommendation: Adjust timeouts dynamically based on service performance
- Priority: Low
10. Manual Circuit Breaker Management
- Status: No admin endpoint for circuit breaker control
- Impact: Cannot manually reset the circuit breaker
- Recommendation: Add an admin endpoint for manual reset and state checking
- Priority: Medium
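A minimal version of the admin endpoint suggested in gap 10 could reuse the existing getStats() and reset() methods of CircuitBreakerService. The sketch below is illustrative only; the route names and the JwtAuthGuard/RolesGuard guards are assumptions, not existing code.

```typescript
// Illustrative admin controller exposing circuit breaker state and a manual reset
import { Controller, Get, Post, UseGuards } from "@nestjs/common";
import { CircuitBreakerService } from "@/infrastructure/messaging/circuit-breaker.service";

@Controller("admin/circuit-breaker")
@UseGuards(JwtAuthGuard, RolesGuard) // assumed guards: restrict access to ADMIN users
export class CircuitBreakerAdminController {
  constructor(private readonly circuitBreaker: CircuitBreakerService) {}

  @Get("stats")
  getStats() {
    return this.circuitBreaker.getStats(); // { state, failureCount, lastFailureTime, nextRetryTime }
  }

  @Post("reset")
  reset() {
    this.circuitBreaker.reset();
    return { message: "Circuit breaker reset" };
  }
}
```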
Recommended Improvements
Priority: High
1. Implement a global timeout policy for all microservice calls
2. Add Prometheus metrics for observability
3. Implement rate limiting per client
4. Add manual circuit breaker management endpoints
Priority: Medium
5. Implement distributed tracing with OpenTelemetry
6. Add the bulkhead pattern for connection isolation
7. Configure a dead letter queue for failed messages
8. Implement comprehensive audit logging
Priority: Low
9. Add retry with exponential backoff at the application level
10. Implement adaptive timeouts based on service performance
11. Perform chaos engineering testing
Monitoring Dashboard Recommendations
Essential Dashboards:
1. Request Dashboard:
   - Request rate (total, per endpoint)
   - Response time (p50, p95, p99)
   - Error rate
   - Active requests
2. Microservice Health Dashboard:
   - Health check status (healthy/degraded/unhealthy)
   - Response time trends
   - Availability percentage (uptime)
3. Circuit Breaker Dashboard:
   - State per microservice (CLOSED/OPEN/HALF_OPEN)
   - Failure counts
   - Time spent in each state
4. Authentication Dashboard:
   - Login attempts (success/failure)
   - Active sessions
   - Token validation rate
5. Infrastructure Dashboard:
   - CPU and memory usage
   - Network throughput
   - Connection pool state
Cross-References
Related Documentation
- Main Gateway Documentation: API Gateway Architecture
- Authentication Details: API Gateway Authentication
- API Reference: API Gateway API Reference
- Inter-Service Communication: Inter-Service Communication
- Backend Microservices: Backend Microservices Overview
Referenced Files
- algesta-api-gateway-nestjs/src/infrastructure/messaging/circuit-breaker.service.ts
- algesta-api-gateway-nestjs/src/shared/health/services-health.service.ts
- algesta-api-gateway-nestjs/src/shared/health/health.service.ts
- algesta-api-gateway-nestjs/src/infrastructure/rest/filters/http-exception.filter.ts
- algesta-api-gateway-nestjs/src/infrastructure/rest/interceptors/response.interceptor.ts
- algesta-api-gateway-nestjs/src/config/config.transport.ts
- algesta-api-gateway-nestjs/src/main.ts
Summary
The API Gateway implements comprehensive resilience patterns to ensure high availability and graceful degradation in the face of failures. The Circuit Breaker pattern prevents cascading failures, health checks provide visibility, global error handling ensures consistency, and request tracking enables debugging.
Key Strengths:
- Circuit breaker prevents cascading failures
- Comprehensive health checks for the gateway and microservices
- Standardized error responses with traceId
- Automatic retry mechanisms for connections
- Structured logging for observability
Priority Improvements:
- Add global timeout policy
- Implement Prometheus Métricas and Grafana dashboards
- Add distributed tracing with OpenTelemetry
- Implement manual circuit breaker management
- Configure dead letter queue for failed messages
For operational procedures and monitoring setup, refer to the Operational Runbook section above.