
Incident Response

Table of Contents

  1. Purpose
  2. Incident Severity Levels
  3. Incident Response Process
  4. Incident Response Roles
  5. Communication Procedures
  6. Common Incident Scenarios
  7. Post-Incident Review

Purpose

This document defines the incident response procedures for the Algesta platform, including severity classification, the response process, roles and responsibilities, communication protocols, and post-incident review procedures.


Incident Severity Levels

| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|----------|
| SEV-1 (Critical) | Complete service outage, data loss, security breach | Immediate response (< 15 min) | API gateway down, database unreachable, all pods crashing |
| SEV-2 (High) | Major functionality degraded, significant customer impact | 1 hour | Single microservice down, high error rate (>10%), slow response times |
| SEV-3 (Medium) | Minor functionality degraded, limited customer impact | 4 hours | Non-critical feature broken, intermittent errors (<5%), certificate expiring soon |
| SEV-4 (Low) | No customer impact, informational | Next business day | Low disk space warning, outdated dependencies, documentation issues |

Escalation Criteria:

  • SEV-2 → SEV-1: If incident persists for >1 hour or impacts multiple services
  • SEV-3 → SEV-2: If incident impacts >20% of users or critical business operations

Incident Response Process

graph TD
    A[Incident Detected<br/>Alert or User Report] --> B{Assess Severity}
    B -->|SEV-1| C[Immediate Response<br/>Page On-Call]
    B -->|SEV-2/3/4| D[Create Incident Ticket]

    C --> E[Incident Manager Assigned]
    D --> E

    E --> F[Assemble Response Team]
    F --> G[Investigate Root Cause]
    G --> H[Implement Mitigation]

    H --> I{Issue Resolved?}
    I -->|No| J[Escalate / Implement<br/>Disaster Recovery]
    J --> G
    I -->|Yes| K[Verify Service Restored]

    K --> L[Update Status Page]
    L --> M[Close Incident]
    M --> N[Post-Incident Review]

Phase 1: Detection and Triage (0-15 minutes)

1. Incident Detected:

  • Monitoring alert (Grafana, Prometheus)
  • User report (support ticket, social media)
  • Proactive monitoring (health checks failing)

2. Acknowledge Incident:

Terminal window
# Check cluster status
kubectl get pods -A
kubectl get nodes
# Check Grafana dashboards
# Access: https://algesta.grafana.3astronautas.com
# Check recent deployments
kubectl rollout history deployment -n production

3. Assess Severity:

  • SEV-1: API returning 5xx errors for all requests, database unreachable
  • SEV-2: Single microservice down, error rate >10%
  • SEV-3: Minor feature broken, <5% error rate
  • SEV-4: Warning alerts, no user impact

4. Create Incident Ticket:

  • Title: “[SEV-X] Brief Description”
  • Description: What’s broken, observed symptoms, impact
  • Assign to on-call engineer
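
As a hedged illustration only (this document does not specify the ticketing tool; the example assumes GitHub Issues via the gh CLI, and the label is a placeholder):

Terminal window
# Hypothetical: open the incident ticket as a GitHub issue (adapt to the actual ticketing system)
gh issue create \
  --title "[SEV-1] API Gateway down - all production traffic failing" \
  --body "Symptoms: all requests returning 5xx. Impact: full outage. Detected via Grafana alert." \
  --label incident \
  --assignee "@me"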

Phase 2: Investigation (15-60 minutes)

1. Gather Information:

Terminal window
# Check pod status
kubectl get pods -n production
# Check logs
kubectl logs -l app=api-gateway -n production --tail=100 --timestamps
# Check recent events
kubectl get events -n production --sort-by='.lastTimestamp'
# Check resource usage
kubectl top pods -n production
kubectl top nodes

2. Check Monitoring:

  • Grafana: Request rate, error rate, latency
  • Prometheus: CPU, memory, disk usage
  • Loki: Application logs, error patterns
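
If a quick command-line check is preferred over the dashboards, the error ratio can be pulled from the Prometheus HTTP API. This is only a sketch: the Prometheus service name/namespace and the metric names (standard NGINX ingress controller metrics) are assumptions and should be adjusted to what the cluster actually exposes.

Terminal window
# Port-forward to Prometheus (service name and namespace are assumptions)
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
# 5xx error ratio over the last 5 minutes (metric name is an assumption)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) / sum(rate(nginx_ingress_controller_requests[5m]))'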

3. Identify Root Cause:

  • Recent deployments (check rollout history; see the sketch below)
  • Infrastructure changes (Terraform apply logs)
  • External dependencies (MongoDB Atlas status, Azure status)
  • Resource exhaustion (CPU, memory, disk)
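
To correlate the incident start with recent releases, rollout history can be inspected revision by revision. The revision number below is a placeholder, and the helm command only applies if these deployments are managed with Helm (release name assumed):

Terminal window
# List recent revisions of the suspect deployment
kubectl rollout history deployment/api-gateway-production -n production
# Inspect a specific revision (revision number is a placeholder)
kubectl rollout history deployment/api-gateway-production -n production --revision=3
# If the service is deployed via Helm, compare the release history as well
helm history api-gateway -n production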

Phase 3: Mitigation (Varies by incident)

Common Mitigation Strategies:

Rollback Failed Deployment:

Terminal window
kubectl rollout undo deployment/api-gateway-production -n production

Restart Crashed Pods:

Terminal window
kubectl rollout restart deployment/api-gateway-production -n production

Scale Up Resources:

Terminal window
kubectl scale deployment/api-gateway-production --replicas=5 -n production
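
Whichever rollout-based mitigation above is applied, watch the rollout until the new pods are ready before declaring the mitigation complete; a minimal verification sketch (the label selector matches the one used earlier in this document):

Terminal window
# Wait for the rollout to converge (fails after the timeout if it does not)
kubectl rollout status deployment/api-gateway-production -n production --timeout=5m
# Confirm the pods backing the deployment are Running and Ready
kubectl get pods -l app=api-gateway -n production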

Database Connection Issues:

Terminal window
# Verify MongoDB Atlas status
# Check connection string in secrets
kubectl get secret mongodb-credentials -n production -o jsonpath='{.data.uri}' | base64 -d
# Test connectivity from pod
kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017

Refer to Runbooks for detailed procedures

Phase 4: Resolution and Verification

1. Verify Service Restored:

Terminal window
# Check health endpoints
curl https://algesta-api-prod.3astronautas.com/health
# Verify all pods running
kubectl get pods -n production
# Check error rate in Grafana (should be <1%)

2. Monitor Stability:

  • Watch for 15-30 minutes to ensure issue doesn’t recur
  • Check key metrics: request rate, latency, error rate (see the sketch below)
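
A simple way to keep watching during this window, reusing the health endpoint shown above:

Terminal window
# Poll the health endpoint every 30 seconds and print the HTTP status code
watch -n 30 "curl -s -o /dev/null -w '%{http_code}\n' https://algesta-api-prod.3astronautas.com/health"
# In a second terminal, watch for pod restarts
kubectl get pods -n production -w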

3. Update Stakeholders:

  • Post status update: “Incident resolved, monitoring for stability”
  • Communicate via Slack, email, status page

Phase 5: Post-Incident Activities

1. Document Incident:

  • Root cause
  • Timeline of events
  • Actions taken
  • Lessons learned

2. Close Incident Ticket:

  • Final status: Resolved
  • Total duration: Detection → Resolution
  • Customer impact: Estimated downtime

3. Schedule Post-Incident Review (See Post-Incident Review)


Incident Response Roles

Incident Manager (IM)

Responsibilities:

  • Coordinate response efforts
  • Communicate with stakeholders
  • Make decisions on escalation and disaster recovery
  • Ensure documentation

Who: On-call DevOps engineer or senior SRE

Technical Lead (TL)

Responsibilities:

  • Lead technical investigation
  • Implement mitigation strategies
  • Coordinate with developers and platform engineers

Who: On-call backend engineer or DevOps engineer

Communications Lead (CL)

Responsibilities:

  • Update internal stakeholders (Slack, email)
  • Update status page
  • Communicate with customers if needed
  • Draft post-incident report

Who: Product manager or customer support lead

Subject Matter Expert (SME)

Responsibilities:

  • Provide specialized knowledge (database, networking, security)
  • Support technical investigation
  • Review mitigation strategies

Who: Database admin, network engineer, security engineer (as needed)


Communication Procedures

Internal Communication (Slack)

Incident Channel: #incidents

SEV-1 Announcement:

@here SEV-1 INCIDENT
[12:34 PM] API Gateway Down - All production traffic failing
Incident Manager: @john.doe
Status: Investigating
Live doc: [Link to incident document]

Regular Updates (every 15-30 minutes):

[12:45 PM] UPDATE: Root cause identified - database connection pool exhausted. Implementing mitigation (restart pods).

Resolution:

[1:15 PM] RESOLVED: All services restored. Monitoring for stability. Post-incident review scheduled for tomorrow 10 AM.

External Communication (Status Page)

Tools: GitHub Pages, Atlassian Statuspage, or a custom status page

SEV-1 Example:

INVESTIGATING - API Unavailable
Posted: Nov 20, 2025 at 12:35 PM
We are currently experiencing issues with the Algesta API.
Our team is actively investigating. Updates will be posted here.

Update:

UPDATE - API Unavailable
Posted: Nov 20, 2025 at 12:50 PM
We have identified the root cause and are implementing a fix.
We expect service to be restored within 15 minutes.

Resolution:

RESOLVED - API Unavailable
Posted: Nov 20, 2025 at 1:15 PM
The issue has been resolved. All services are operational.
We will publish a post-incident report within 48 hours.

Common Incident Scenarios

Scenario 1: Complete Service Outage (SEV-1)

Symptoms:

  • All API requests returning 5xx errors
  • Health checks failing
  • Grafana shows 0 req/sec

Response:

  1. Check AKS cluster status: kubectl get nodes
  2. Check pod status: kubectl get pods -A
  3. Check ingress: kubectl get ingress -A
  4. If pods CrashLoopBackOff: kubectl logs <pod> -n production
  5. Rollback recent deployment: kubectl rollout undo deployment/<name> -n production
  6. If infrastructure issue: Escalate to Azure support, initiate DR plan
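
If the ingress resources look fine but traffic still fails at the edge, check the ingress controller itself. The namespace and label below assume the community NGINX ingress controller installed in ingress-nginx; adjust to whatever controller is actually in use:

Terminal window
# Ingress controller pods and recent logs (namespace and label are assumptions)
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100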

Escalation: If not resolved in 30 minutes, escalate to senior SRE and consider DR


Scenario 2: High Error Rate (SEV-2)

Symptoms:

  • Error rate >10% (normal <1%)
  • 5xx errors in logs
  • User complaints

Response:

  1. Check Grafana for error patterns
  2. Check Loki logs: {namespace="production"} |= "ERROR"
  3. Identify failing service
  4. Check database connectivity
  5. Check external dependencies (SendGrid, Twilio)
  6. Restart pods or rollback if needed
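
To narrow down which service is failing, the Loki query from step 2 can also be run from the command line, or recent error lines pulled per service with kubectl. logcli and the LOKI_ADDR environment variable are assumptions; <service> is a placeholder:

Terminal window
# Run the Loki query from the CLI (assumes logcli is installed and LOKI_ADDR is set)
logcli query '{namespace="production"} |= "ERROR"' --since=30m --limit=200
# Or pull recent error lines for a suspect service directly
kubectl logs -l app=<service> -n production --since=30m | grep -i "error"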

Scenario 3: Database Connection Failure (SEV-1/SEV-2)

Symptoms:

  • “MongoNetworkError” in logs
  • Pods restarting frequently
  • Health checks failing

Response:

  1. Check MongoDB Atlas status: https://status.mongodb.com
  2. Verify connection string in secrets
  3. Test connectivity: kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017
  4. Check IP whitelist in MongoDB Atlas
  5. If Atlas down: Wait for resolution or restore from backup to new cluster

Escalation: If the Atlas issue persists >1 hour, initiate database failover to the backup cluster


Scenario 4: Certificate Expiration (SEV-3, escalates to SEV-1 if not resolved)

Symptoms:

  • Certificate expiring in <7 days
  • HTTPS connections failing with SSL errors
  • cert-manager logs show renewal failures

Response:

  1. Check certificate status: kubectl get certificates -n production
  2. Check cert-manager logs: kubectl logs -n cert-manager -l app=cert-manager
  3. Delete certificate to trigger renewal: kubectl delete certificate <name> -n production
  4. If auto-renewal fails: Manual certificate upload required
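
Two quick checks that complement these steps, as a sketch (the hostname is the production API endpoint used earlier in this document; <name> is a placeholder):

Terminal window
# Inspect the certificate actually served at the edge and its expiry date
echo | openssl s_client -connect algesta-api-prod.3astronautas.com:443 \
  -servername algesta-api-prod.3astronautas.com 2>/dev/null | openssl x509 -noout -enddate
# cert-manager's view of the certificate, including recent renewal events
kubectl describe certificate <name> -n production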

Prevention: Configure monitoring alerts for certificates expiring in 14 days


Scenario 5: High Memory Usage / OOMKilled Pods (SEV-2)

Symptoms:

  • Pods constantly restarting
  • “OOMKilled” in pod status
  • Grafana shows memory usage at 100%

Response:

  1. Check memory usage: kubectl top pods -n production
  2. Check for memory leaks in logs
  3. Increase memory limits temporarily:
    Terminal window
    kubectl set resources deployment/<name> -n production --limits=memory=2Gi
  4. Investigate code for memory leaks
  5. Restart pods to clear memory
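
A sketch for confirming which pods were OOMKilled and what limits they run with (assumes jq is available; <pod> is a placeholder):

Terminal window
# List pods whose last container termination reason was OOMKilled (requires jq)
kubectl get pods -n production -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'
# Compare current usage against the configured limits for a suspect pod
kubectl top pod <pod> -n production
kubectl get pod <pod> -n production -o jsonpath='{.spec.containers[*].resources}'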

Follow-up: Performance optimization, code review for memory leaks


Post-Incident Review

Objective: Learn from incidents, prevent recurrence

Schedule: Within 48 hours of incident resolution (SEV-1/SEV-2)

Attendees:

  • Incident Manager
  • Technical Lead
  • Engineering team members involved
  • Product/management stakeholders (SEV-1 only)

Agenda:

  1. Incident Timeline (10 minutes)

    • Detection: When and how was incident detected?
    • Response: Who responded, when?
    • Mitigation: What actions were taken?
    • Resolution: When was service restored?
  2. Root Cause Analysis (15 minutes)

    • What was the root cause?
    • Why did it happen?
    • Why wasn’t it caught before production?
  3. What Went Well (10 minutes)

    • Fast detection
    • Effective communication
    • Successful mitigation
  4. What Could Be Improved (15 minutes)

    • Detection delays
    • Documentation gaps
    • Tool limitations
  5. Action Items (10 minutes)

    • Preventive measures (monitoring, alerting, testing)
    • Process improvements (runbooks, automation)
    • Technical debt (code fixes, infrastructure upgrades)
    • Assign owners and due dates

Post-Incident Report Template:

# Post-Incident Report: [SEV-X] [Brief Description]
**Date:** 2025-11-20
**Duration:** 1 hour 15 minutes (12:30 PM - 1:45 PM)
**Severity:** SEV-1
**Incident Manager:** John Doe
## Summary
Brief description of incident and customer impact.
## Timeline
- 12:30 PM: Incident detected via Grafana alert
- 12:35 PM: On-call engineer acknowledged
- 12:40 PM: Root cause identified (database connection pool exhausted)
- 12:50 PM: Mitigation implemented (restarted pods)
- 1:15 PM: Service restored
- 1:45 PM: Monitoring confirmed stable
## Root Cause
Database connection pool size set too low (10 connections) for production traffic.
## Impact
- 100% of API requests failed for 45 minutes
- Estimated 500 customer requests affected
- Revenue impact: $X (estimated)
## What Went Well
- Fast detection (< 5 minutes via monitoring)
- Clear runbook for pod restart
- Effective internal communication
## What Could Be Improved
- Database connection pool should have been load tested
- Alerting on connection pool exhaustion
- Better capacity planning
## Action Items
1. [OWNER] Increase connection pool size to 50 (Due: Nov 21)
2. [OWNER] Add monitoring for database connection pool usage (Due: Nov 23)
3. [OWNER] Conduct load testing for all services (Due: Dec 1)
4. [OWNER] Update deployment checklist to include load testing (Due: Nov 22)
## Lessons Learned
- Always load test before production deployment
- Monitor resource exhaustion (connections, file descriptors, etc.)
- Document capacity limits in service README

Related Documentation:

For Support:

  • On-call engineer: Check on-call schedule in PagerDuty/Opsgenie
  • Escalation: Senior SRE → Engineering Manager → CTO
  • Emergency contacts: [Internal wiki or contact list]