API Gateway Resilience and Error Handling
Overview
In a microservices architecture, resilience is critical to prevent cascading failures and to ensure graceful degradation when downstream services experience problems. The API Gateway implements multiple resilience patterns to maintain service availability and provide consistent error handling.
Implemented Patterns:
- Circuit Breaker: prevent cascading failures and fail fast when services are unavailable
- Health Checks: monitor the state of the gateway and downstream services
- Global Error Handling: standardized error responses with request tracking
- Request Tracking: a unique traceId for end-to-end request correlation
- Retry Mechanisms: automatic connection retries for Redis/Kafka
Goals:
- Prevent cascading failures across microservices
- Provide fast failure detection and recovery
- Enable graceful degradation when services are unavailable
- Maintain observability through logging and monitoring
- Ensure consistent error responses for clients
Circuit Breaker Pattern
The Circuit Breaker pattern prevents repeated calls to failing services, giving them time to recover while protecting the gateway from cascading failures.
Circuit Breaker States
stateDiagram-v2
[*] --> CLOSED
CLOSED --> OPEN: Failure threshold exceeded<br/>(5 consecutive failures)
OPEN --> HALF_OPEN: Recovery timeout elapsed<br/>(60 seconds)
HALF_OPEN --> CLOSED: Successful request
HALF_OPEN --> OPEN: Failed request
note right of CLOSED
Normal operation
All requests pass through
Track failure count
end note
note right of OPEN
Service unavailable
Reject requests immediately
No calls to microservice
end note
note right of HALF_OPEN
Testing recovery
Allow limited requests
Evaluate service health
end note
State Descriptions
CLOSED (Normal Operation):
- All requests pass through to the microservices
- Failures are tracked and counted
- Transitions to OPEN when the failure threshold is exceeded
OPEN (Circuit Open):
- Requests fail immediately without calling the microservice
- Fast-fail behavior protects downstream services
- Fallback mechanisms provide degraded functionality
- Transitions automatically to HALF_OPEN after the recovery timeout
HALF_OPEN (Testing Recovery):
- A limited number of requests are allowed through to probe service health
- Success → transition to CLOSED (service recovered)
- Failure → transition back to OPEN (service still failing)
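These transitions map to a small state machine. The following is a minimal, illustrative sketch of how a class such as CircuitBreakerService might implement them; the actual implementation in the repository may differ in structure and naming.

```typescript
// Minimal sketch of the CLOSED -> OPEN -> HALF_OPEN cycle (illustrative, not the real service)
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

export class SimpleCircuitBreaker {
  private state: CircuitState = "CLOSED";
  private failureCount = 0;
  private lastFailureTime = 0;

  constructor(
    private readonly failureThreshold = 5, // CIRCUIT_BREAKER_FAILURE_THRESHOLD
    private readonly recoveryTimeoutMs = 60000 // CIRCUIT_BREAKER_RECOVERY_TIMEOUT
  ) {}

  async execute<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      // Fail fast until the recovery timeout has elapsed, then allow a probe request
      if (Date.now() - this.lastFailureTime < this.recoveryTimeoutMs) {
        return fallback();
      }
      this.state = "HALF_OPEN";
    }

    try {
      const result = await primary();
      // A successful request (including a HALF_OPEN probe) closes the circuit
      this.state = "CLOSED";
      this.failureCount = 0;
      return result;
    } catch (error) {
      this.failureCount++;
      this.lastFailureTime = Date.now();
      // Open after the threshold, or immediately when a HALF_OPEN probe fails
      if (this.state === "HALF_OPEN" || this.failureCount >= this.failureThreshold) {
        this.state = "OPEN";
      }
      return fallback();
    }
  }
}
```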
Configuration
Circuit breaker behavior is controlled through environment variables:
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| CIRCUIT_BREAKER_FAILURE_THRESHOLD | Consecutive failures before opening | 5 | 3-10 depending on criticality |
| CIRCUIT_BREAKER_RECOVERY_TIMEOUT | Time before attempting recovery (ms) | 60000 | 30000-120000 (30s-2min) |
| CIRCUIT_BREAKER_MONITORING_PERIOD | Window for tracking failures (ms) | 300000 | 300000-600000 (5-10min) |
Configuration Example:
```bash
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
CIRCUIT_BREAKER_RECOVERY_TIMEOUT=60000
CIRCUIT_BREAKER_MONITORING_PERIOD=300000
```
Usage Example
```typescript
import { CircuitBreakerService } from "@/infrastructure/messaging/circuit-breaker.service";

@Injectable()
export class GetOrdersQueryHandler {
  constructor(
    @Inject("MS_ORDERS") private msOrdersClient: ClientProxy,
    private circuitBreaker: CircuitBreakerService
  ) {}

  async execute(query: GetOrdersQuery): Promise<Order[]> {
    return await this.circuitBreaker.execute(
      // Primary function: call the microservice
      async () => {
        return await this.msOrdersClient
          .send("orders.getAll", query)
          .toPromise();
      },
      // Fallback function: return a cached or degraded response
      async () => {
        console.warn("Circuit breaker open, using fallback");
        return {
          status: "service_unavailable",
          message: "Order service temporarily unavailable",
          cached: true,
          data: [], // Return an empty array or cached data
        };
      }
    );
  }
}
```
Fallback Mechanisms
Graceful Degradation Strategies:
1. Return Cached Data:
```typescript
async () => {
  const cachedOrders = await this.cacheService.get("orders");
  return cachedOrders || [];
};
```
2. Return Partial Data:
```typescript
async () => {
  return {
    status: "degraded",
    message: "Some features unavailable",
    availableData: {
      /* partial data */
    },
  };
};
```
3. Return an Error with User Guidance:
```typescript
async () => {
  throw new ServiceUnavailableException(
    "Order service is temporarily unavailable. Please try again later."
  );
};
```
4. Use an Alternative Service:
```typescript
async () => {
  // Fall back to a read replica or alternative data source
  return await this.readReplicaClient.send("orders.getAll", query).toPromise();
};
```
Manual Circuit Breaker Control
```typescript
// Get circuit breaker statistics
const stats = this.circuitBreaker.getStats();
console.log(stats);
// {
//   state: 'OPEN',
//   failureCount: 7,
//   lastFailureTime: '2025-01-15T10:30:00Z',
//   nextRetryTime: '2025-01-15T10:31:00Z'
// }

// Manually reset the circuit breaker (operational control)
this.circuitBreaker.reset();
```
Use Cases for Manual Reset:
- After deploying a fix to the downstream service
- During maintenance windows
- When monitoring confirms service recovery
- For testing purposes
Reference File: algesta-api-gateway-nestjs/src/infrastructure/messaging/circuit-breaker.service.ts
Health Checks
Health checks provide visibility into the operational state of the gateway and the health of connected microservices.
Gateway Health Endpoint
GET /health (no authentication required)
Returns the health status of the gateway itself for load balancers and monitoring tools.
Response Format:
{ "status": "ok", "service": "api-gateway", "timestamp": "2025-01-15T10:30:00Z", "uptime": 3600, "version": "1.0.0", "environment": "production"}Campos de Respuesta:
| Campo | Tipo | Descripción |
|---|---|---|
| status | string | "ok" cuando está saludable; estados de error dependientes de implementación |
| service | string | Nombre del servicio (api-gateway) |
| timestamp | string | Timestamp UTC actual |
| uptime | number | Tiempo de actividad del proceso en segundos |
| version | string | Versión de la aplicación |
| environment | string | Entorno (development, production, etc.) |
Nota: El campo status está configurado como "ok" por la implementación actual (HealthService o handler inline en main.ts). Este Endpoint no usa el wrapper de respuesta estándar.
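For reference, a minimal handler producing this payload could look like the sketch below. It only mirrors the documented response fields; the actual wiring (HealthService, the inline handler in main.ts, and the registration before the global /api prefix) may differ.

```typescript
// Illustrative /health handler sketch; field values mirror the documented response format.
// Note: in the real gateway this route is registered before the global /api prefix (see main.ts).
import { Controller, Get } from "@nestjs/common";

@Controller("health")
export class HealthController {
  @Get()
  getHealth() {
    return {
      status: "ok",
      service: "api-gateway",
      timestamp: new Date().toISOString(),
      uptime: process.uptime(),
      version: process.env.SERVICE_VERSION || "1.0.0",
      environment: process.env.NODE_ENV || "development",
    };
  }
}
```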
Load Balancer Integration:
```yaml
# Kubernetes liveness probe example
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
Note: This endpoint is registered before the global /api prefix for easy access by infrastructure tooling.
Reference Files:
- algesta-api-gateway-nestjs/src/main.ts (endpoint registration)
- algesta-api-gateway-nestjs/src/shared/health/health.service.ts
Downstream Services Health Check
GET /api/health/services (authentication required)
Checks the health of all connected microservices in parallel and provides an aggregated status.
Response Format:
{ "status": "degraded", "timestamp": "2025-01-15T10:30:00Z", "services": [ { "service": "ms-auth", "status": "healthy", "responseTime": 45, "url": "http://localhost:3001/health" }, { "service": "ms-patient", "status": "healthy", "responseTime": 32, "url": "http://localhost:3003/health" }, { "service": "ms-ai-integrator", "status": "timeout", "responseTime": 5100, "error": "Connection timeout after 5000ms", "url": "http://localhost:3002/health" } ], "summary": { "total": 3, "healthy": 2, "unhealthy": 0, "degraded": 1 }}Valores de Estado (basado en ServicesHealthService y ServiceHealthEstado):
- Estado de nivel superior: Uno de
'healthy','degraded', o'unhealthy' - Estado por servicio: Uno de
'healthy','unhealthy', o'timeout'
Aggregate Status Logic:
- healthy: all services are healthy
- degraded: some services are healthy (a mix of healthy/timeout/unhealthy)
- unhealthy: no service is healthy (all timeout or unhealthy)
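This aggregation rule maps naturally onto a small pure function. A possible sketch, assuming the per-service and top-level status types described above:

```typescript
type ServiceStatus = "healthy" | "unhealthy" | "timeout";
type AggregateStatus = "healthy" | "degraded" | "unhealthy";

// Derive the top-level status from the individual service checks
function aggregateStatus(services: { status: ServiceStatus }[]): AggregateStatus {
  const healthy = services.filter((s) => s.status === "healthy").length;
  if (healthy === services.length) return "healthy"; // all services healthy
  if (healthy > 0) return "degraded"; // mixed results
  return "unhealthy"; // no service healthy
}
```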
Currently Configured Services (from ServicesHealthService.getConfiguredServices()):
- ms-auth (default URL: http://localhost:3001)
- ms-patient (default URL: http://localhost:3003)
- ms-ai-integrator (default URL: http://localhost:3002)
Service URLs can be overridden through the MS_AUTH_URL, MS_PATIENT_URL, and MS_AI_INTEGRATOR_URL environment variables.
Configuration:
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| HEALTH_CHECK_TIMEOUT | Timeout for each service check (ms) | 5000 | 3000-10000 |
| MS_AUTH_URL | Auth service health endpoint | http://localhost:3001 | Production URL |
| MS_PATIENT_URL | Patient service health endpoint | http://localhost:3003 | Production URL |
| MS_AI_INTEGRATOR_URL | AI Integrator health endpoint | http://localhost:3002 | Production URL |
Implementation Details:
- Parallel Checks: all services are checked concurrently for better performance
- Timeout Handling: each check has an independent timeout
- Error Handling: failed checks do not break the endpoint
- Response Time Tracking: round-trip time is measured for each service
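A sketch of how these checks could run concurrently with independent timeouts is shown below. It is illustrative only (ServicesHealthService may use a different HTTP client and structure) and assumes Node 18+ for the global fetch and AbortController.

```typescript
// Illustrative parallel health check with a per-service timeout
type ServiceCheck = {
  service: string;
  status: "healthy" | "unhealthy" | "timeout";
  responseTime: number;
  url: string;
  error?: string;
};

async function checkService(service: string, baseUrl: string, timeoutMs = 5000): Promise<ServiceCheck> {
  const url = `${baseUrl}/health`;
  const started = Date.now();
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return { service, status: res.ok ? "healthy" : "unhealthy", responseTime: Date.now() - started, url };
  } catch (error: any) {
    const timedOut = error?.name === "AbortError";
    return {
      service,
      status: timedOut ? "timeout" : "unhealthy",
      responseTime: Date.now() - started,
      url,
      error: timedOut ? `Connection timeout after ${timeoutMs}ms` : error?.message,
    };
  } finally {
    clearTimeout(timer);
  }
}

async function checkAllServices(): Promise<ServiceCheck[]> {
  const targets: [string, string][] = [
    ["ms-auth", process.env.MS_AUTH_URL || "http://localhost:3001"],
    ["ms-patient", process.env.MS_PATIENT_URL || "http://localhost:3003"],
    ["ms-ai-integrator", process.env.MS_AI_INTEGRATOR_URL || "http://localhost:3002"],
  ];
  // All checks run concurrently; a failing check resolves to an "unhealthy"/"timeout"
  // entry instead of rejecting, so one bad service never breaks the endpoint.
  return Promise.all(targets.map(([service, url]) => checkService(service, url)));
}
```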
Testing the Health Checks:
```bash
# Check gateway health
curl http://localhost:3000/health

# Check all services health (requires authentication)
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"
```
Reference Files:
- algesta-api-gateway-nestjs/src/shared/health/services-health.service.ts
- algesta-api-gateway-nestjs/src/shared/health/health.service.ts
Error Handling
The gateway implements global error handling to ensure consistent error responses and proper propagation of errors from the microservices.
Global Exception Filter
HttpExceptionFilter catches all exceptions and formats them consistently.
Implementation:
```typescript
@Catch()
export class HttpExceptionFilter implements ExceptionFilter {
  catch(exception: unknown, host: ArgumentsHost) {
    const ctx = host.switchToHttp();
    const response = ctx.getResponse();
    const request = ctx.getRequest();

    const status =
      exception instanceof HttpException ? exception.getStatus() : 500;

    const message =
      exception instanceof HttpException
        ? exception.message
        : "Internal server error";

    const errorResponse = {
      statusCode: status,
      message,
      error:
        exception instanceof HttpException
          ? exception.name
          : "InternalServerError",
      timestamp: new Date().toISOString(),
      path: request.url,
      traceId: request.traceId || uuidv4(),
    };

    // Log the error with Winston
    this.logger.error(message, {
      ...errorResponse,
      stack: exception instanceof Error ? exception.stack : undefined,
    });

    response.status(status).json(errorResponse);
  }
}
```
Error Response Format
Standard Error Response:
```json
{
  "statusCode": 400,
  "message": "Validation failed",
  "error": "BadRequestException",
  "timestamp": "2025-01-15T10:30:00Z",
  "path": "/api/orders",
  "traceId": "550e8400-e29b-41d4-a716-446655440000"
}
```
Response Fields:
| Field | Type | Description |
|---|---|---|
| statusCode | number | HTTP status code (400, 401, 404, 500, etc.) |
| message | string | Human-readable error message |
| error | string | Error type/name |
| timestamp | string | UTC timestamp when the error occurred |
| path | string | Request path that caused the error |
| traceId | string | Unique request identifier for tracking |
Exception Types
The gateway handles several exception types:
Client Errors (4xx):
| Exception | Status Code | Usage |
|---|---|---|
| BadRequestException | 400 | Invalid request data, validation errors |
| UnauthorizedException | 401 | Missing or invalid authentication token |
| ForbiddenException | 403 | Insufficient permissions, unverified email |
| NotFoundException | 404 | Resource not found |
| ConflictException | 409 | Resource already exists |
| UnprocessableEntityException | 422 | Business logic validation failed |
Server Errors (5xx):
| Exception | Status Code | Usage |
|---|---|---|
| InternalServerErrorException | 500 | Unexpected errors |
| ServiceUnavailableException | 503 | Microservice unavailable, circuit breaker open |
| GatewayTimeoutException | 504 | Microservice timeout |
Validation Errors
Class-validator integration provides detailed validation errors:
Invalid Request:
```bash
curl -X POST http://localhost:3000/api/orders \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "service": "",
    "priority": "INVALID"
  }'
```
Validation Error Response:
```json
{
  "statusCode": 400,
  "message": [
    "service should not be empty",
    "priority must be one of: LOW, MEDIUM, HIGH, URGENT"
  ],
  "error": "BadRequestException",
  "timestamp": "2025-01-15T10:30:00Z",
  "path": "/api/orders",
  "traceId": "uuid-here"
}
```
Microservice Error Propagation
When microservices return errors, the gateway wraps and propagates them:
Microservice Error Flow:
sequenceDiagram
participant Client
participant Gateway
participant Handler
participant MS as Microservice
Client->>Gateway: POST /api/orders
Gateway->>Handler: CreateOrderCommand
Handler->>MS: Send 'orders.create' message
MS-->>Handler: Error: "Service not available"
Handler->>Handler: Wrap error
Handler-->>Gateway: throw ServiceUnavailableException
Gateway->>Gateway: HttpExceptionFilter
Gateway-->>Client: 503 Service Unavailable<br/>+ traceId
Error Wrapping Example:
```typescript
try {
  const result = await this.msOrdersClient
    .send("orders.create", createOrderDto)
    .toPromise();
  return result;
} catch (error) {
  // Preserve the original error context
  if (error.statusCode === 503) {
    throw new ServiceUnavailableException(
      `Orders service unavailable: ${error.message}`
    );
  }

  if (error.statusCode === 400) {
    throw new BadRequestException(error.message);
  }

  // Unexpected errors
  throw new InternalServerErrorException(
    "Failed to create order. Please try again later."
  );
}
```
Error Context Preservation:
- The original error message is included in the gateway error
- The HTTP status code is mapped appropriately
- The traceId is attached for correlation
- The error is logged with the full stack trace
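The snippet above does not show the traceId explicitly. A hedged variant that carries it into the wrapped error could look like the following sketch; the helper name and the assumption that the traceId has already been attached to the request are illustrative only.

```typescript
import { ServiceUnavailableException } from "@nestjs/common";

// Illustrative helper: wrap a downstream error and keep the traceId for correlation
function wrapServiceError(error: any, traceId: string): never {
  throw new ServiceUnavailableException({
    statusCode: 503,
    message: `Orders service unavailable: ${error?.message ?? "unknown error"}`,
    traceId, // lets the HttpExceptionFilter and the logs correlate this failure with the request
  });
}
```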
Reference File: algesta-api-gateway-nestjs/src/infrastructure/rest/filters/http-exception.filter.ts
Request Tracking
Each request receives a unique traceId for end-to-end tracking and debugging.
ResponseInterceptor
Implementation:
```typescript
@Injectable()
export class ResponseInterceptor implements NestInterceptor {
  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const request = context.switchToHttp().getRequest();
    const response = context.switchToHttp().getResponse();

    // Generate or reuse an existing traceId
    const traceId = request.traceId || uuidv4();
    request.traceId = traceId;

    return next.handle().pipe(
      map((data) => ({
        statusCode: response.statusCode,
        message: this.getSuccessMessage(context),
        timestamp: new Date().toISOString(),
        path: request.url,
        method: request.method,
        data,
        traceId,
      }))
    );
  }
}
```
Successful Response with TraceId
Example Request:
```bash
curl http://localhost:3000/api/orders/order-123 \
  -H "Authorization: Bearer <token>"
```
Response:
```json
{
  "statusCode": 200,
  "message": "Order retrieved successfully",
  "timestamp": "2025-01-15T10:30:00Z",
  "path": "/api/orders/order-123",
  "method": "GET",
  "data": {
    "orderId": "order-123",
    "status": "IN_PROGRESS"
    // ... order data
  },
  "traceId": "550e8400-e29b-41d4-a716-446655440000"
}
```
TraceId Benefits
End-to-End Request Tracking:
- Unique identifier for each request
- Included in logs, responses, and error messages
- Enables correlation across distributed systems
Debugging and Troubleshooting:
- Search the logs by traceId to see the full request lifecycle
- Identify performance bottlenecks
- Trace errors across microservices
Performance Analysis:
- Measure end-to-end request duration
- Identify slow microservice calls
- Detect timeout issues
Example: Searching Logs by TraceId
```bash
# Search logs for a specific request
grep "550e8400-e29b-41d4-a716-446655440000" /var/log/api-gateway/*.log

# Example log entries with traceId
# 2025-01-15 10:30:00 [INFO] Request received { traceId: "550e8400...", method: "GET", path: "/api/orders/order-123" }
# 2025-01-15 10:30:00 [DEBUG] Calling MS_ORDERS { traceId: "550e8400...", pattern: "orders.getById" }
# 2025-01-15 10:30:01 [INFO] Response sent { traceId: "550e8400...", statusCode: 200, duration: 1050ms }
```
User Guidance:
- Include the traceId when reporting issues
- Use the traceId to track request status
- Reference the traceId in support tickets
Reference File: algesta-api-gateway-nestjs/src/infrastructure/rest/interceptors/response.interceptor.ts
Retry Mechanisms
The gateway implements automatic retry mechanisms for transient failures on Redis and Kafka connections.
Redis Retry Configuration
Configuration:
```typescript
{
  host: process.env.REDIS_HOST,
  port: parseInt(process.env.REDIS_PORT),
  password: process.env.REDIS_PASSWORD,
  db: parseInt(process.env.REDIS_DB),
  retryStrategy: (times) => {
    const maxRetries = parseInt(process.env.REDIS_RETRY_ATTEMPTS) || 5;
    const retryDelay = parseInt(process.env.REDIS_RETRY_DELAY) || 3000;

    if (times > maxRetries) {
      return null; // Stop retrying
    }

    // Linearly increasing backoff: base delay * attempt number
    return retryDelay * times;
  },
  connectTimeout: parseInt(process.env.REDIS_CONNECT_TIMEOUT) || 60000,
  lazyConnect: true, // Connect on first use
}
```
Environment Variables:
| Variable | Description | Default | Example |
|---|---|---|---|
| REDIS_RETRY_ATTEMPTS | Maximum number of retry attempts | 5 | 10 |
| REDIS_RETRY_DELAY | Base delay between retries (ms) | 3000 | 5000 |
| REDIS_CONNECT_TIMEOUT | Connection timeout (ms) | 60000 | 30000 |
Retry Behavior:
- 1st retry: wait 3 seconds (1 × 3000ms)
- 2nd retry: wait 6 seconds (2 × 3000ms)
- 3rd retry: wait 9 seconds (3 × 3000ms)
- 4th retry: wait 12 seconds (4 × 3000ms)
- 5th retry: wait 15 seconds (5 × 3000ms)
- After 5 attempts: stop retrying and throw an error
Kafka Retry Configuration
Configuration:
```typescript
{
  client: {
    clientId: process.env.KAFKA_CLIENT_ID,
    brokers: process.env.KAFKA_BROKERS?.split(','),
    retry: {
      initialRetryTime: 3000,
      retries: 5,
      multiplier: 2, // Exponential backoff
      maxRetryTime: 60000,
    },
  },
  consumer: {
    groupId: process.env.KAFKA_GROUP_ID,
    allowAutoTopicCreation: true,
    sessionTimeout: 30000,
    retry: {
      retries: 5,
    },
  },
}
```
Exponential Backoff:
- 1st retry: wait 3 seconds
- 2nd retry: wait 6 seconds (3s × 2)
- 3rd retry: wait 12 seconds (6s × 2)
- 4th retry: wait 24 seconds (12s × 2)
- 5th retry: wait 48 seconds (24s × 2)
- Maximum wait: capped at 60 seconds
Lazy Connection Strategy
Benefits:
- The gateway can start even if Redis/Kafka is temporarily unavailable
- Reduces startup time
- Services can start in any order
- The connection is established on first real use
Implementation:
```typescript
// Redis lazy connect
const redis = new Redis({
  lazyConnect: true,
  // ... other options
});

// The connection happens on the first operation
await redis.get("key"); // Triggers the connection if not already connected
```
Reference File: algesta-api-gateway-nestjs/src/config/config.transport.ts
Timeout Configuration
Health Check Timeout
Configuration:
```bash
HEALTH_CHECK_TIMEOUT=5000 # 5 seconds
```
Behavior:
- Each microservice health check has an independent timeout
- The timeout prevents hung requests
- Failed health checks are marked as "unhealthy"
Redis Connection Timeout
Configuration:
```bash
REDIS_CONNECT_TIMEOUT=60000 # 60 seconds
```
Behavior:
- Timeout for the initial connection attempt
- Applies to lazy connection establishment
- After the timeout, retries begin (if configured)
Microservice Call Timeouts
Current Gap: a global timeout policy is not implemented for microservice calls.
Recommended Implementation:
```typescript
// Timeout interceptor
@Injectable()
export class TimeoutInterceptor implements NestInterceptor {
  constructor(private readonly timeout: number = 30000) {} // 30 seconds default

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    return next.handle().pipe(
      timeout(this.timeout),
      catchError((err) => {
        if (err.name === "TimeoutError") {
          throw new GatewayTimeoutException("Request timeout");
        }
        throw err;
      })
    );
  }
}
```
Recommendations:
- Implement a global timeout interceptor
- Configure per-endpoint timeouts for long-running operations
- Set a default timeout of 30 seconds for most operations
- Increase the timeout for specific operations (PDF generation, data export, etc.)
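If this interceptor is adopted, it could be registered globally in main.ts, as in the sketch below (assuming the TimeoutInterceptor class shown above and the gateway's existing AppModule):

```typescript
// main.ts (sketch): apply the recommended TimeoutInterceptor globally
import { NestFactory } from "@nestjs/core";
import { AppModule } from "./app.module";
import { TimeoutInterceptor } from "./infrastructure/rest/interceptors/timeout.interceptor"; // assumed location

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  app.useGlobalInterceptors(new TimeoutInterceptor(30000)); // 30-second default for all routes
  await app.listen(3000);
}
bootstrap();
```

Long-running endpoints (PDF generation, data export) could then override the default with `@UseInterceptors(new TimeoutInterceptor(120000))` at the controller or handler level.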
Logging and Monitoring
Winston Logger Configuration
The gateway uses Winston for structured logging:
Log Levels:
- error: error events that might still allow the application to keep running
- warn: warning events that indicate potential problems
- info: informational messages about application state
- debug: detailed debugging information
Log Format:
```json
{
  "level": "info",
  "message": "Request received",
  "timestamp": "2025-01-15T10:30:00.000Z",
  "context": "OrdersController",
  "traceId": "550e8400-e29b-41d4-a716-446655440000",
  "method": "GET",
  "path": "/api/orders",
  "userId": "user-123"
}
```
Log Categories
1. Application Startup/Shutdown:
```typescript
logger.info("API Gateway starting...", {
  environment: process.env.NODE_ENV,
  version: process.env.SERVICE_VERSION,
  port: process.env.PORT,
});

logger.info("All microservice connections established");

// On shutdown
logger.info("API Gateway shutting down gracefully...");
```
2. Request/Response Logging:
```typescript
logger.info("Request received", {
  traceId: request.traceId,
  method: request.method,
  path: request.path,
  userId: request.user?.userId,
  userAgent: request.headers["user-agent"],
});

logger.info("Response sent", {
  traceId: request.traceId,
  statusCode: response.statusCode,
  duration: Date.now() - request.startTime,
});
```
3. Authentication Events:
```typescript
logger.info("User login successful", {
  userId: user.userId,
  email: user.email,
  role: user.role,
});

logger.warn("Failed login attempt", {
  email: loginDto.email,
  reason: "Invalid password",
  ipAddress: request.ip,
});
```
4. Microservice Communication:
```typescript
logger.debug("Calling microservice", {
  service: "MS_ORDERS",
  pattern: "orders.create",
  traceId: request.traceId,
});

logger.error("Microservice call failed", {
  service: "MS_ORDERS",
  pattern: "orders.create",
  error: error.message,
  traceId: request.traceId,
});
```
5. Circuit Breaker State Changes:
```typescript
logger.warn("Circuit breaker opened", {
  service: "MS_ORDERS",
  failureCount: 5,
  threshold: 5,
});

logger.info("Circuit breaker closed", {
  service: "MS_ORDERS",
  message: "Service recovered",
});
```
6. Health Check Results:
```typescript
logger.info("Health check performed", {
  service: "ms-auth",
  status: "healthy",
  responseTime: 45,
});

logger.error("Health check failed", {
  service: "ms-orders",
  status: "unhealthy",
  error: "Connection timeout",
});
```
7. Errors and Exceptions:
```typescript
logger.error("Unhandled exception", {
  error: error.message,
  stack: error.stack,
  traceId: request.traceId,
  path: request.path,
});
```
Integration with External Monitoring
Prometheus Metrics (Not Currently Implemented):
Recommended metrics to expose:
```
// Request metrics
http_requests_total{method="GET", path="/api/orders", status="200"}
http_request_duration_seconds{method="GET", path="/api/orders"}

// Circuit breaker metrics
circuit_breaker_state{service="MS_ORDERS", state="OPEN"}
circuit_breaker_failures_total{service="MS_ORDERS"}

// Health check metrics
health_check_status{service="ms-auth", status="healthy"}
health_check_duration_seconds{service="ms-auth"}

// Authentication metrics
auth_attempts_total{result="success"}
auth_attempts_total{result="failure"}
```
Grafana Dashboards (Not Currently Implemented):
Recommended dashboard panels:
- Request rate (requests/second)
- Response time (p50, p95, p99)
- Error rate by endpoint
- Circuit breaker state
- Microservice health status
- Active user sessions
ELK Stack Integration:
Winston logs can be shipped to Elasticsearch:
```typescript
import { ElasticsearchTransport } from "winston-elasticsearch";

const logger = winston.createLogger({
  transports: [
    new ElasticsearchTransport({
      level: "info",
      clientOpts: {
        node: process.env.ELASTICSEARCH_URL,
      },
      index: "api-gateway-logs",
    }),
  ],
});
```
Reference File: algesta-api-gateway-nestjs/src/shared/logger/logger.service.ts
Graceful Shutdown
Proper shutdown handling ensures in-flight requests complete and connections close cleanly.
Current Implementation Status
Gap: explicit shutdown handling is not visible in the codebase. NestJS provides default lifecycle hooks, but custom shutdown logic may be needed.
Recommended Implementation
Shutdown Handler:
```typescript
import { NestFactory } from "@nestjs/core";

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Enable shutdown hooks
  app.enableShutdownHooks();

  // Custom shutdown handler
  process.on("SIGTERM", async () => {
    logger.info("SIGTERM received, starting graceful shutdown...");

    // Stop accepting new requests
    await app.close();

    logger.info("Graceful shutdown completed");
    process.exit(0);
  });

  await app.listen(3000);
}
```
Shutdown Steps:
- Stop Accepting New Requests: close the HTTP server listener
- Drain In-Flight Requests: wait for active requests to complete (with a timeout)
- Close Microservice Connections: disconnect from Redis/Kafka gracefully
- Flush Logs: ensure all logs are written
- Signal the Load Balancer: update the health check to "unhealthy" during shutdown
- Exit Process: terminate with exit code 0
Kubernetes Integration:
```yaml
# Deployment with graceful shutdown
spec:
  template:
    spec:
      containers:
        - name: api-gateway
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
      terminationGracePeriodSeconds: 30
```
Recommended Timeouts:
- Graceful shutdown timeout: 30 seconds
- Load balancer deregistration: 15 seconds
- Request drain period: 10 seconds
- Connection close period: 5 seconds
Resilience Testing
Circuit Breaker Behavior Testing
Simulate a Microservice Failure:
```bash
# 1. Make repeated requests to trigger the circuit breaker
for i in {1..10}; do
  curl http://localhost:3000/api/orders \
    -H "Authorization: Bearer <token>"
  sleep 1
done

# If MS_ORDERS is down, the circuit should open after 5 failures
# Subsequent requests should fail immediately with the fallback response
```
Expected Behavior:
- The first 5 requests attempt to call the microservice (timeout/error)
- The circuit breaker opens after the 5th failure
- Requests 6-10 fail immediately without calling the microservice
- The fallback response is returned
Health Check Testing
Test Gateway Health:
```bash
# Should always return 200 if the gateway is running
curl http://localhost:3000/health

# Expected response
{
  "status": "ok",
  "service": "api-gateway",
  "timestamp": "2025-01-15T10:30:00Z",
  "uptime": 3600,
  "version": "1.0.0",
  "environment": "development"
}
```
Test Services Health:
```bash
# Requires authentication
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"

# The expected response shows the status of all microservices
```
Simulate Service Degradation:
```bash
# Stop one microservice (e.g., MS_ORDERS)
# Then check services health
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"

# Expected: status "degraded" or "unhealthy"
# ms-orders should show "unhealthy" with an error message
```
Error Handling Testing
Trigger a Validation Error:
```bash
curl -X POST http://localhost:3000/api/orders \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{}'

# Expected: 400 Bad Request with validation errors
```
Trigger an Authentication Error:
```bash
curl http://localhost:3000/api/orders \
  -H "Authorization: Bearer invalid-token"

# Expected: 401 Unauthorized
```
Trigger an Authorization Error:
```bash
# CLIENT token trying an ADMIN endpoint
curl -X PATCH http://localhost:3000/api/orders/order-123 \
  -H "Authorization: Bearer <client-token>" \
  -H "Content-Type: application/json" \
  -d '{"status":"PUBLISHED"}'

# Expected: 403 Forbidden
```
Timeout Scenario Testing
Simulate a Slow Microservice:
```bash
# If a microservice takes longer than the health check timeout (5s),
# you should see a degraded status with a timeout error

curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <token>"

# A service with a >5s response time is marked "timeout" and the overall status becomes "degraded"
```
Operational Runbook
Checking System Health
1. Check Gateway Health:
```bash
curl http://localhost:3000/health
```
Healthy response:
```
{ "status": "ok", ... }
```
Unhealthy response:
```
{ "status": "error", "error": "..." }
```
Note: the actual status value when healthy is "ok"; error-state status values are implementation-dependent.
2. Check All Services Health:
```bash
curl http://localhost:3000/api/health/services \
  -H "Authorization: Bearer <admin-token>"
```
3. Interpret Health Check Responses:
| Status | Meaning | Action Required |
|---|---|---|
| healthy | All services operational | None |
| degraded | Some services slow/degraded | Monitor, investigate slow services |
| unhealthy | One or more services down | Immediate investigation required |
Investigating Errors Using the TraceId
Scenario: a user reports an error
Steps:
- Get the traceId from the user or from the error response
- Search the logs for the traceId:
```bash
grep "550e8400-e29b-41d4-a716-446655440000" /var/log/api-gateway/*.log
```
- Review the complete request lifecycle
- Identify the failure point (gateway, microservice, database, etc.)
- Check microservice health
- Review the circuit breaker state
Example Log Analysis:
```bash
# Request received
10:30:00 [INFO] Request received { traceId: "550e8400...", path: "/api/orders" }

# Microservice called
10:30:00 [DEBUG] Calling MS_ORDERS { traceId: "550e8400...", pattern: "orders.getAll" }

# Error occurred
10:30:05 [ERROR] Microservice timeout { traceId: "550e8400...", service: "MS_ORDERS", duration: 5000ms }

# Circuit breaker opened
10:30:05 [WARN] Circuit breaker opened { service: "MS_ORDERS", failureCount: 5 }
```
Monitoring Circuit Breaker State
Check Circuit Breaker State (if a monitoring endpoint exists):
```bash
curl http://localhost:3000/api/health/circuit-breaker \
  -H "Authorization: Bearer <admin-token>"

# Expected response
{
  "MS_ORDERS": {
    "state": "OPEN",
    "failureCount": 7,
    "lastFailureTime": "2025-01-15T10:30:00Z",
    "nextRetryTime": "2025-01-15T10:31:00Z"
  },
  "MS_AUTH": {
    "state": "CLOSED",
    "failureCount": 0
  }
}
```
Manual Circuit Breaker Reset (not currently exposed; recommended feature):
```bash
# After fixing the downstream service
curl -X POST http://localhost:3000/api/admin/circuit-breaker/reset \
  -H "Authorization: Bearer <admin-token>" \
  -d '{ "service": "MS_ORDERS" }'
```
Escalation Procedures
Degraded State:
- Review service health check responses
- Check service-specific logs
- Investigate slow queries, resource constraints
- Monitor for recovery
- If it persists for more than 15 minutes, escalate to the on-call engineer
Unhealthy State:
- Identify which service(s) are unhealthy
- Check service availability (network, process, health endpoint)
- Review recent deployments/changes
- Check infrastructure (servers, databases, message queues)
- Restart the service if needed
- If the service doesn't recover within 5 minutes, page the on-call engineer
Circuit Breaker Open:
- Identify the affected microservice
- Check microservice health and logs
- Investigate the root cause (database issue, infinite loop, external API failure)
- Fix the underlying issue
- Manually reset the circuit breaker if needed
- Monitor for successful recovery
Gaps and Recommendations
Cross-Reference: Rate Limiting and Security
Rate Limiting: not currently implemented at the gateway level. For details on this and other security gaps, see the "Security Gaps" section in api-gateway.md. Consider using @nestjs/throttler for request-level rate limiting and DoS protection.
Security Headers (Helmet): not currently implemented. Helmet middleware can protect against some classes of attacks and is part of the broader resilience story. Refer to api-gateway.md for recommendations on implementing Helmet and other security headers.
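If those recommendations are adopted, the wiring could look roughly like the sketch below. It assumes @nestjs/throttler v5+ and the helmet package; the TTL and limit are placeholder values, not agreed policy.

```typescript
// app.module.ts (sketch): request-level rate limiting with @nestjs/throttler
import { Module } from "@nestjs/common";
import { APP_GUARD } from "@nestjs/core";
import { ThrottlerModule, ThrottlerGuard } from "@nestjs/throttler";

@Module({
  imports: [
    // Placeholder limit: 100 requests per 60 seconds per client
    ThrottlerModule.forRoot([{ ttl: 60000, limit: 100 }]),
  ],
  providers: [{ provide: APP_GUARD, useClass: ThrottlerGuard }],
})
export class AppModule {}

// main.ts (sketch): security headers with helmet
// import helmet from "helmet";
// app.use(helmet());
```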
Current Gaps
1. Global Timeout Policy
- Status: Not implemented for microservice calls
- Impact: Requests may hang indefinitely
- Recommendation: Implement a timeout interceptor with a 30-second default
- Priority: High
2. Prometheus Metrics
- Status: Not exposed
- Impact: Limited observability of performance and usage
- Recommendation: Add @nestjs/prometheus for metrics collection
- Metrics to Track:
  - Request rate, duration, errors
  - Circuit breaker state
  - Health check status
  - Active connections
- Priority: High
3. Distributed Tracing (OpenTelemetry)
- Status: Not implemented
- Impact: Cannot trace requests across microservices
- Recommendation: Implement OpenTelemetry for distributed tracing
- Benefits: End-to-end request visibility, performance analysis
- Priority: Medium
4. Bulkhead Pattern
- Status: Not implemented
- Impact: One slow microservice can exhaust all connections
- Recommendation: Isolate thread pools per microservice
- Priority: Medium
5. Retry with Exponential Backoff (Application Level)
- Status: Only implemented for Redis/Kafka connections
- Impact: Transient microservice errors are not retried
- Recommendation: Implement a retry decorator for handlers
- Example:
```typescript
@Retry({ maxAttempts: 3, backoff: 'exponential' })
async execute(query: GetOrdersQuery) {
  return await this.msOrdersClient.send('orders.getAll', query).toPromise();
}
```
- Priority: Low
6. Dead Letter Queue
- Status: Not configured for failed messages
- Impact: Failed messages are lost
- Recommendation: Configure a DLQ for Kafka/Redis to capture failed messages
- Priority: Medium
7. Chaos Engineering Testing
- Status: Not performed
- Impact: Unknown system behavior under failure conditions
- Recommendation: Implement chaos testing (random service failures, latency injection)
- Tools: Chaos Monkey, Toxiproxy
- Priority: Low
8. Request Rate Limiting per Client
- Status: Not implemented
- Impact: No protection against abusive clients
- Recommendation: Implement per-user and per-IP rate limiting
- Priority: High (see the security recommendations in the authentication docs)
9. Adaptive Timeouts
- Status: Static timeouts only
- Impact: Fixed timeouts may be too short or too long
- Recommendation: Adjust timeouts dynamically based on service performance
- Priority: Low
10. Manual Circuit Breaker Management
- Status: No admin endpoint for circuit breaker control
- Impact: Cannot manually reset the circuit breaker
- Recommendation: Add an admin endpoint for manual reset and state checking
- Priority: Medium
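A minimal version of the admin endpoint suggested in gap 10 could reuse the existing getStats() and reset() methods of CircuitBreakerService. The sketch below is illustrative only; the route names and the JwtAuthGuard/RolesGuard guards are assumptions, not existing code.

```typescript
// Illustrative admin controller exposing circuit breaker state and a manual reset
import { Controller, Get, Post, UseGuards } from "@nestjs/common";
import { CircuitBreakerService } from "@/infrastructure/messaging/circuit-breaker.service";

@Controller("admin/circuit-breaker")
@UseGuards(JwtAuthGuard, RolesGuard) // assumed guards: restrict access to ADMIN users
export class CircuitBreakerAdminController {
  constructor(private readonly circuitBreaker: CircuitBreakerService) {}

  @Get("stats")
  getStats() {
    return this.circuitBreaker.getStats(); // { state, failureCount, lastFailureTime, nextRetryTime }
  }

  @Post("reset")
  reset() {
    this.circuitBreaker.reset();
    return { message: "Circuit breaker reset" };
  }
}
```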
Recommended Improvements
Priority: High
1. Implement a global timeout policy for all microservice calls
2. Add Prometheus metrics for observability
3. Implement rate limiting per client
4. Add manual circuit breaker management endpoints
Priority: Medium
5. Implement distributed tracing with OpenTelemetry
6. Add the bulkhead pattern for connection isolation
7. Configure a dead letter queue for failed messages
8. Implement comprehensive audit logging
Priority: Low
9. Add retry with exponential backoff at the application level
10. Implement adaptive timeouts based on service performance
11. Perform chaos engineering testing
Monitoring Dashboard Recommendations
Essential Dashboards:
1. Request Dashboard:
   - Request rate (total, per endpoint)
   - Response time (p50, p95, p99)
   - Error rate
   - Active requests
2. Microservice Health Dashboard:
   - Health check status (healthy/degraded/unhealthy)
   - Response time trends
   - Availability percentage (uptime)
3. Circuit Breaker Dashboard:
   - State per microservice (CLOSED/OPEN/HALF_OPEN)
   - Failure counts
   - Time spent in each state
4. Authentication Dashboard:
   - Login attempts (success/failure)
   - Active sessions
   - Token validation rate
5. Infrastructure Dashboard:
   - CPU and memory usage
   - Network throughput
   - Connection pool state
Cross-References
Related Documentation
- Main Gateway Documentation: API Gateway Architecture
- Authentication Details: API Gateway Authentication
- API Reference: API Gateway API Reference
- Inter-Service Communication: Inter-Service Communication
- Backend Microservices: Backend Microservices Overview
Referenced Files
- algesta-api-gateway-nestjs/src/infrastructure/messaging/circuit-breaker.service.ts
- algesta-api-gateway-nestjs/src/shared/health/services-health.service.ts
- algesta-api-gateway-nestjs/src/shared/health/health.service.ts
- algesta-api-gateway-nestjs/src/infrastructure/rest/filters/http-exception.filter.ts
- algesta-api-gateway-nestjs/src/infrastructure/rest/interceptors/response.interceptor.ts
- algesta-api-gateway-nestjs/src/config/config.transport.ts
- algesta-api-gateway-nestjs/src/main.ts
Summary
The API Gateway implements comprehensive resilience patterns to ensure high availability and graceful degradation in the face of failures. The Circuit Breaker pattern prevents cascading failures, health checks provide visibility, global error handling ensures consistency, and request tracking enables debugging.
Key Strengths:
- Circuit breaker prevents cascading failures
- Comprehensive health checks for the gateway and microservices
- Standardized error responses with traceId
- Automatic retry mechanisms for connections
- Structured logging for observability
Priority Improvements:
- Add global timeout policy
- Implement Prometheus Métricas and Grafana dashboards
- Add distributed tracing with OpenTelemetry
- Implement manual circuit breaker management
- Configure dead letter queue for failed messages
For operational procedures and monitoring setup, refer to the Operational Runbook section above.