
Incident Response

Table of Contents

  1. Purpose
  2. Incident Severity Levels
  3. Incident Response Process
  4. Incident Response Roles
  5. Communication Procedures
  6. Common Incident Scenarios
  7. Post-Incident Review

Purpose

This document defines the incident response procedures for the Algesta platform, including severity classification, the response process, roles and responsibilities, communication protocols, and post-incident review procedures.


Incident Severity Levels

| Severity | Impact | Response Time | Examples |
|----------|--------|---------------|----------|
| SEV-1 (Critical) | Complete service outage, data loss, security breach | Immediate response (< 15 min) | API gateway down, database unreachable, all pods crashing |
| SEV-2 (High) | Major functionality degraded, significant customer impact | 1 hour | Single microservice down, high error rate (>10%), slow response times |
| SEV-3 (Medium) | Minor functionality degraded, limited customer impact | 4 hours | Non-critical feature broken, intermittent errors (<5%), certificate expiring soon |
| SEV-4 (Low) | No customer impact, informational | Next business day | Low disk space warning, outdated dependencies, documentation issues |

Escalation Criteria:

  • SEV-2 → SEV-1: If incident persists for >1 hour or impacts multiple services
  • SEV-3 → SEV-2: If incident impacts >20% of users or critical business operations

Incident Response Process

graph TD
    A[Incident Detected<br/>Alert or User Report] --> B{Assess Severity}
    B -->|SEV-1| C[Immediate Response<br/>Page On-Call]
    B -->|SEV-2/3/4| D[Create Incident Ticket]

    C --> E[Incident Manager Assigned]
    D --> E

    E --> F[Assemble Response Team]
    F --> G[Investigate Root Cause]
    G --> H[Implement Mitigation]

    H --> I{Issue Resolved?}
    I -->|No| J[Escalate / Implement<br/>Disaster Recovery]
    J --> G
    I -->|Yes| K[Verify Service Restored]

    K --> L[Update Status Page]
    L --> M[Close Incident]
    M --> N[Post-Incident Review]

Phase 1: Detection and Triage (0-15 minutes)

1. Incident Detected:

  • Monitoring alert (Grafana, Prometheus)
  • User report (support ticket, social media)
  • Proactive monitoring (health checks failing)

2. Acknowledge Incident:

Terminal window
# Check cluster status
kubectl get pods -A
kubectl get nodes
# Check Grafana dashboards
# Access: https://algesta.grafana.3astronautas.com
# Check recent deployments
kubectl rollout history deployment -n production

3. Assess Severity:

  • SEV-1: API returning 5xx errors for all requests, database unreachable
  • SEV-2: Single microservice down, error rate >10%
  • SEV-3: Minor feature broken, <5% error rate
  • SEV-4: Warning alerts, no user impact

4. Create Incident Ticket:

  • Title: “[SEV-X] Brief Description”
  • Description: What’s broken, observed symptoms, impact
  • Assign to on-call engineer
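
As a hedged illustration only (this document does not specify the ticketing tool; the example assumes GitHub Issues via the gh CLI, and the label is a placeholder):

Terminal window
# Hypothetical: open the incident ticket as a GitHub issue (adapt to the actual ticketing system)
gh issue create \
  --title "[SEV-1] API Gateway down - all production traffic failing" \
  --body "Symptoms: all requests returning 5xx. Impact: full outage. Detected via Grafana alert." \
  --label incident \
  --assignee "@me"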

Phase 2: Investigation (15-60 minutes)

1. Gather Information:

Terminal window
# Check pod status
kubectl get pods -n production
# Check logs
kubectl logs -l app=api-gateway -n production --tail=100 --timestamps
# Check recent events
kubectl get events -n production --sort-by='.lastTimestamp'
# Check resource usage
kubectl top pods -n production
kubectl top nodes

2. Check Monitoring:

  • Grafana: Request rate, error rate, latency
  • Prometheus: CPU, memory, disk usage
  • Loki: Application logs, error patterns
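
If a quick command-line check is preferred over the dashboards, the error ratio can be pulled from the Prometheus HTTP API. This is only a sketch: the Prometheus service name/namespace and the metric names (standard NGINX ingress controller metrics) are assumptions and should be adjusted to what the cluster actually exposes.

Terminal window
# Port-forward to Prometheus (service name and namespace are assumptions)
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &
# 5xx error ratio over the last 5 minutes (metric name is an assumption)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) / sum(rate(nginx_ingress_controller_requests[5m]))'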

3. Identify Root Cause:

  • Recent deployments (check rollout history; see the sketch below)
  • Infrastructure changes (Terraform apply logs)
  • External dependencies (MongoDB Atlas status, Azure status)
  • Resource exhaustion (CPU, memory, disk)
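
To correlate the incident start with recent releases, rollout history can be inspected revision by revision. The revision number below is a placeholder, and the helm command only applies if these deployments are managed with Helm (release name assumed):

Terminal window
# List recent revisions of the suspect deployment
kubectl rollout history deployment/api-gateway-production -n production
# Inspect a specific revision (revision number is a placeholder)
kubectl rollout history deployment/api-gateway-production -n production --revision=3
# If the service is deployed via Helm, compare the release history as well
helm history api-gateway -n production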

Phase 3: Mitigation (Varies by incident)

Common Mitigation Strategies:

Rollback Failed Deployment:

Terminal window
kubectl rollout undo deployment/api-gateway-production -n production

Restart Crashed Pods:

Terminal window
kubectl rollout restart deployment/api-gateway-production -n production

Scale Up Resources:

Terminal window
kubectl scale deployment/api-gateway-production --replicas=5 -n production
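
Whichever rollout-based mitigation above is applied, watch the rollout until the new pods are ready before declaring the mitigation complete; a minimal verification sketch (the label selector matches the one used earlier in this document):

Terminal window
# Wait for the rollout to converge (fails after the timeout if it does not)
kubectl rollout status deployment/api-gateway-production -n production --timeout=5m
# Confirm the pods backing the deployment are Running and Ready
kubectl get pods -l app=api-gateway -n production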

Database Connection Issues:

Terminal window
# Verify MongoDB Atlas status
# Check connection string in secrets
kubectl get secret mongodb-credentials -n production -o jsonpath='{.data.uri}' | base64 -d
# Test connectivity from pod
kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017

Refer to Runbooks for detailed procedures

Phase 4: Resolution and Verification

1. Verify Service Restored:

Terminal window
# Check health endpoints
curl https://algesta-api-prod.3astronautas.com/health
# Verify all pods running
kubectl get pods -n production
# Check error rate in Grafana (should be <1%)

2. Monitor Stability:

  • Watch for 15-30 minutes to ensure issue doesn’t recur
  • Check key metrics: request rate, latency, error rate (see the sketch below)
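
A simple way to keep watching during this window, reusing the health endpoint shown above:

Terminal window
# Poll the health endpoint every 30 seconds and print the HTTP status code
watch -n 30 "curl -s -o /dev/null -w '%{http_code}\n' https://algesta-api-prod.3astronautas.com/health"
# In a second terminal, watch for pod restarts
kubectl get pods -n production -w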

3. Update Stakeholders:

  • Post status update: “Incident resolved, monitoring for stability”
  • Communicate via Slack, email, status page

Phase 5: Post-Incident Activities

1. Document Incident:

  • Root cause
  • Timeline of events
  • Actions taken
  • Lessons learned

2. Close Incident Ticket:

  • Final status: Resolved
  • Total duration: Detection → Resolution
  • Customer impact: Estimated downtime

3. Schedule Post-Incident Review (See Post-Incident Review)


Incident Response Roles

Incident Manager (IM)

Responsibilities:

  • Coordinate response efforts
  • Communicate with stakeholders
  • Make decisions on escalation and disaster recovery
  • Ensure documentation

Who: On-call DevOps engineer or senior SRE

Technical Lead (TL)

Responsibilities:

  • Lead technical investigation
  • Implement mitigation strategies
  • Coordinate with developers and platform engineers

Who: On-call backend engineer or DevOps engineer

Communications Lead (CL)

Responsibilities:

  • Update internal stakeholders (Slack, email)
  • Update status page
  • Communicate with customers if needed
  • Draft post-incident report

Who: Product manager or customer support lead

Subject Matter Expert (SME)

Responsibilities:

  • Provide specialized knowledge (database, networking, security)
  • Support technical investigation
  • Review mitigation strategies

Who: Database admin, network engineer, security engineer (as needed)


Communication Procedures

Internal Communication (Slack)

Incident Channel: #incidents

SEV-1 Announcement:

@here SEV-1 INCIDENT
[12:34 PM] API Gateway Down - All production traffic failing
Incident Manager: @john.doe
Status: Investigating
Live doc: [Link to incident document]

Regular Updates (every 15-30 minutes):

[12:45 PM] UPDATE: Root cause identified - database connection pool exhausted. Implementing mitigation (restart pods).

Resolution:

[1:15 PM] RESOLVED: All services restored. Monitoring for stability. Post-incident review scheduled for tomorrow 10 AM.

External Communication (Status Page)

Tools: GitHub Pages, Atlassian Statuspage, or a custom status page

SEV-1 Example:

INVESTIGATING - API Unavailable
Posted: Nov 20, 2025 at 12:35 PM
We are currently experiencing issues with the Algesta API.
Our team is actively investigating. Updates will be posted here.

Update:

UPDATE - API Unavailable
Posted: Nov 20, 2025 at 12:50 PM
We have identified the root cause and are implementing a fix.
We expect service to be restored within 15 minutes.

Resolution:

RESOLVED - API Unavailable
Posted: Nov 20, 2025 at 1:15 PM
The issue has been resolved. All services are operational.
We will publish a post-incident report within 48 hours.

Common Incident Scenarios

Scenario 1: Complete Service Outage (SEV-1)

Symptoms:

  • All API requests returning 5xx errors
  • Health checks failing
  • Grafana shows 0 req/sec

Response:

  1. Check AKS cluster status: kubectl get nodes
  2. Check pod status: kubectl get pods -A
  3. Check ingress: kubectl get ingress -A
  4. If pods CrashLoopBackOff: kubectl logs <pod> -n production
  5. Rollback recent deployment: kubectl rollout undo deployment/<name> -n production
  6. If infrastructure issue: Escalate to Azure support, initiate DR plan
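
If the ingress resources look fine but traffic still fails at the edge, check the ingress controller itself. The namespace and label below assume the community NGINX ingress controller installed in ingress-nginx; adjust to whatever controller is actually in use:

Terminal window
# Ingress controller pods and recent logs (namespace and label are assumptions)
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100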

Escalation: If not resolved in 30 minutes, escalate to senior SRE and consider DR


Scenario 2: High Error Rate (SEV-2)

Symptoms:

  • Error rate >10% (normal <1%)
  • 5xx errors in logs
  • User complaints

Response:

  1. Check Grafana for error patterns
  2. Check Loki logs: {namespace="production"} |= "ERROR"
  3. Identify failing service
  4. Check database connectivity
  5. Check external dependencies (SendGrid, Twilio)
  6. Restart pods or rollback if needed
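
To narrow down which service is failing, the Loki query from step 2 can also be run from the command line, or recent error lines pulled per service with kubectl. logcli and the LOKI_ADDR environment variable are assumptions; <service> is a placeholder:

Terminal window
# Run the Loki query from the CLI (assumes logcli is installed and LOKI_ADDR is set)
logcli query '{namespace="production"} |= "ERROR"' --since=30m --limit=200
# Or pull recent error lines for a suspect service directly
kubectl logs -l app=<service> -n production --since=30m | grep -i "error"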

Scenario 3: Database Connection Failure (SEV-1/SEV-2)

Symptoms:

  • “MongoNetworkError” in logs
  • Pods restarting frequently
  • Health checks failing

Response:

  1. Check MongoDB Atlas status: https://status.mongodb.com
  2. Verify connection string in secrets
  3. Test connectivity: kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017
  4. Check IP whitelist in MongoDB Atlas
  5. If Atlas down: Wait for resolution or restore from backup to new cluster

Escalation: If the Atlas issue persists >1 hour, initiate database failover to the backup cluster


Scenario 4: Certificate Expiration (SEV-3, escalates to SEV-1 if not resolved)

Symptoms:

  • Certificate expiring in <7 days
  • HTTPS connections failing with SSL errors
  • cert-manager logs show renewal failures

Response:

  1. Check certificate status: kubectl get certificates -n production
  2. Check cert-manager logs: kubectl logs -n cert-manager -l app=cert-manager
  3. Delete certificate to trigger renewal: kubectl delete certificate <name> -n production
  4. If auto-renewal fails: Manual certificate upload required
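
Two quick checks that complement these steps, as a sketch (the hostname is the production API endpoint used earlier in this document; <name> is a placeholder):

Terminal window
# Inspect the certificate actually served at the edge and its expiry date
echo | openssl s_client -connect algesta-api-prod.3astronautas.com:443 \
  -servername algesta-api-prod.3astronautas.com 2>/dev/null | openssl x509 -noout -enddate
# cert-manager's view of the certificate, including recent renewal events
kubectl describe certificate <name> -n production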

Prevention: Configure monitoring alerts for certificates expiring in 14 days


Scenario 5: High Memory Usage / OOMKilled Pods (SEV-2)

Symptoms:

  • Pods constantly restarting
  • “OOMKilled” in pod status
  • Grafana shows memory usage at 100%

Response:

  1. Check memory usage: kubectl top pods -n production
  2. Check for memory leaks in logs
  3. Increase memory limits temporarily:
    Terminal window
    kubectl set resources deployment/<name> -n production --limits=memory=2Gi
  4. Investigate code for memory leaks
  5. Restart pods to clear memory
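
A sketch for confirming which pods were OOMKilled and what limits they run with (assumes jq is available; <pod> is a placeholder):

Terminal window
# List pods whose last container termination reason was OOMKilled (requires jq)
kubectl get pods -n production -o json | \
  jq -r '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'
# Compare current usage against the configured limits for a suspect pod
kubectl top pod <pod> -n production
kubectl get pod <pod> -n production -o jsonpath='{.spec.containers[*].resources}'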

Follow-up: Performance optimization, code review for memory leaks


Post-Incident Review

Objective: Learn from incidents, prevent recurrence

Schedule: Within 48 hours of incident resolution (SEV-1/SEV-2)

Attendees:

  • Incident Manager
  • Technical Lead
  • Engineering team members involved
  • Product/management stakeholders (SEV-1 only)

Agenda:

  1. Incident Timeline (10 minutes)

    • Detection: When and how was incident detected?
    • Response: Who responded, when?
    • Mitigation: What actions were taken?
    • Resolution: When was service restored?
  2. Root Cause Analysis (15 minutes)

    • What was the root cause?
    • Why did it happen?
    • Why wasn’t it caught before production?
  3. What Went Well (10 minutes)

    • Fast detection
    • Effective communication
    • Successful mitigation
  4. What Could Be Improved (15 minutes)

    • Detection delays
    • Documentation gaps
    • Tool limitations
  5. Action Items (10 minutes)

    • Preventive measures (monitoring, alerting, testing)
    • Process improvements (runbooks, automation)
    • Technical debt (code fixes, infrastructure upgrades)
    • Assign owners and due dates

Post-Incident Report Template:

# Post-Incident Report: [SEV-X] [Brief Description]
**Date:** 2025-11-20
**Duration:** 1 hour 15 minutes (12:30 PM - 1:45 PM)
**Severity:** SEV-1
**Incident Manager:** John Doe
## Summary
Brief description of incident and customer impact.
## Timeline
- 12:30 PM: Incident detected via Grafana alert
- 12:35 PM: On-call engineer acknowledged
- 12:40 PM: Root cause identified (database connection pool exhausted)
- 12:50 PM: Mitigation implemented (restarted pods)
- 1:15 PM: Service restored
- 1:45 PM: Monitoring confirmed stable
## Root Cause
Database connection pool size set too low (10 connections) for production traffic.
## Impact
- 100% of API requests failed for 45 minutes
- Estimated 500 customer requests affected
- Revenue impact: $X (estimated)
## What Went Well
- Fast detection (< 5 minutes via monitoring)
- Clear runbook for pod restart
- Effective internal communication
## What Could Be Improved
- Database connection pool should have been load tested
- Alerting on connection pool exhaustion
- Better capacity planning
## Action Items
1. [OWNER] Increase connection pool size to 50 (Due: Nov 21)
2. [OWNER] Add monitoring for database connection pool usage (Due: Nov 23)
3. [OWNER] Conduct load testing for all services (Due: Dec 1)
4. [OWNER] Update deployment checklist to include load testing (Due: Nov 22)
## Lessons Learned
- Always load test before production deployment
- Monitor resource exhaustion (connections, file descriptors, etc.)
- Document capacity limits in service README

Related Documentation:

For Support:

  • On-call engineer: Check on-call schedule in PagerDuty/Opsgenie
  • Escalation: Senior SRE → Engineering Manager → CTO
  • Emergency contacts: [Internal wiki or contact list]