Incident Response
Table of Contents
- Purpose
- Incident Severity Levels
- Incident Response Process
- Incident Response Roles
- Communication Procedures
- Common Incident Scenarios
- Post-Incident Review
Purpose
This document defines the incident response procedures for the Algesta platform, including severity classification, the response process, roles and responsibilities, communication protocols, and post-incident review procedures.
Incident Severity Levels
| Severity | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 (Critical) | Complete service outage, data loss, security breach | Immediate response (< 15 min) | API gateway down, database unreachable, all pods crashing |
| SEV-2 (High) | Major functionality degraded, significant customer impact | 1 hour | Single microservice down, high error rate (>10%), slow response times |
| SEV-3 (Medium) | Minor functionality degraded, limited customer impact | 4 hours | Non-critical feature broken, intermittent errors (<5%), certificate expiring soon |
| SEV-4 (Low) | No customer impact, informational | Next business day | Low disk space warning, outdated dependencies, documentation issues |
Escalation Criteria:
- SEV-2 → SEV-1: If incident persists for >1 hour or impacts multiple services
- SEV-3 → SEV-2: If incident impacts >20% of users or critical business operations
Incident Response Process
```mermaid
graph TD
    A[Incident Detected<br/>Alert or User Report] --> B{Assess Severity}
    B -->|SEV-1| C[Immediate Response<br/>Page On-Call]
    B -->|SEV-2/3/4| D[Create Incident Ticket]
    C --> E[Incident Manager Assigned]
    D --> E
    E --> F[Assemble Response Team]
    F --> G[Investigate Root Cause]
    G --> H[Implement Mitigation]
    H --> I{Issue Resolved?}
    I -->|No| J[Escalate / Implement<br/>Disaster Recovery]
    J --> G
    I -->|Yes| K[Verify Service Restored]
    K --> L[Update Status Page]
    L --> M[Close Incident]
    M --> N[Post-Incident Review]
```
Phase 1: Detection and Triage (0-15 minutes)
1. Incident Detected:
- Monitoring alert (Grafana, Prometheus)
- User report (support ticket, social media)
- Proactive monitoring (health checks failing)
2. Acknowledge Incident:
```bash
# Check cluster status
kubectl get pods -A
kubectl get nodes

# Check Grafana dashboards
# Access: https://algesta.grafana.3astronautas.com

# Check recent deployments
kubectl rollout history deployment -n production
```
3. Assess Severity:
- SEV-1: API returning 5xx errors for all requests, database unreachable
- SEV-2: Single microservice down, error rate >10%
- SEV-3: Minor feature broken, <5% error rate
- SEV-4: Warning alerts, no user impact
4. Create Incident Ticket:
- Title: "[SEV-X] Brief Description"
- Description: What's broken, observed symptoms, impact
- Assign to on-call engineer
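How the ticket is created depends on your tracker. As a sketch, if incidents are tracked as GitHub issues (an assumption; the repository, labels, and assignee below are placeholders), the on-call engineer can open one from the terminal:
```bash
# Hypothetical example: open an incident ticket with the GitHub CLI.
# Repo, labels, and body are placeholders for your actual setup.
gh issue create \
  --repo 3astronautas/algesta \
  --title "[SEV-1] API Gateway down - all production traffic failing" \
  --label incident --label sev-1 \
  --assignee "@me" \
  --body "Symptoms: all requests returning 5xx since 12:30 PM. Impact: full outage."
```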
Phase 2: Investigation (15-60 minutes)
1. Gather Information:
```bash
# Check pod status
kubectl get pods -n production

# Check logs
kubectl logs -l app=api-gateway -n production --tail=100 --timestamps

# Check recent events
kubectl get events -n production --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods -n production
kubectl top nodes
```
2. Check Monitoring:
- Grafana: Request rate, error rate, latency
- Prometheus: CPU, memory, disk usage
- Loki: Application logs, error patterns
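If Grafana itself is unavailable during the incident, Prometheus can be queried directly over its HTTP API. This is a sketch only: the `monitoring` namespace, the `prometheus-server` service name and port, and the `http_requests_total` metric name are assumptions to adjust to your setup.
```bash
# Forward the Prometheus API locally (service name and port are assumptions)
kubectl port-forward -n monitoring svc/prometheus-server 9090:80 &

# 5xx error rate over the last 5 minutes (metric name is an assumption)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{namespace="production",status=~"5.."}[5m]))'
```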
3. Identify Root Cause:
- Recent deployments (check rollout history)
- Infrastructure changes (Terraform apply logs)
- External dependencies (MongoDB Atlas status, Azure status)
- Resource exhaustion (CPU, memory, disk)
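To correlate a suspect deployment with what is actually running, list each production deployment with the image it currently runs (standard kubectl output formatting; `<name>` is whichever deployment looks suspicious):
```bash
# List every deployment in production together with its current image tag
kubectl get deployments -n production \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Then inspect the rollout history of the suspect deployment
kubectl rollout history deployment/<name> -n production
```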
Phase 3: Mitigation (Varies by incident)
Common Mitigation Strategies:
Rollback Failed Deployment:
```bash
kubectl rollout undo deployment/api-gateway-production -n production
```
Restart Crashed Pods:
```bash
kubectl rollout restart deployment/api-gateway-production -n production
```
Scale Up Resources:
```bash
kubectl scale deployment/api-gateway-production --replicas=5 -n production
```
Database Connection Issues:
```bash
# Verify MongoDB Atlas status
# Check connection string in secrets
kubectl get secret mongodb-credentials -n production -o jsonpath='{.data.uri}' | base64 -d

# Test connectivity from pod
kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017
```
Refer to Runbooks for detailed procedures.
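After any rollback or restart, confirm the rollout actually completed before moving on; for example, for the API gateway deployment used in the commands above:
```bash
# Blocks until the rollout finishes, or fails after the timeout
kubectl rollout status deployment/api-gateway-production -n production --timeout=120s
```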
Phase 4: Resolution and Verification
1. Verify Service Restored:
```bash
# Check health endpoints
curl https://algesta-api-prod.3astronautas.com/health

# Verify all pods running
kubectl get pods -n production

# Check error rate in Grafana (should be <1%)
```
2. Monitor Stability:
- Watch for 15-30 minutes to ensure issue doesn’t recur
- Check key metrics: request rate, latency, error rate
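A small loop makes the stability watch easier than refreshing manually. A sketch using the health endpoint from step 1 (the interval and duration are arbitrary):
```bash
# Poll the health endpoint once a minute for 30 minutes, logging HTTP status codes
for i in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://algesta-api-prod.3astronautas.com/health)
  echo "$(date +%H:%M:%S) health check -> HTTP $code"
  sleep 60
done
```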
3. Update Stakeholders:
- Post status update: "Incident resolved, monitoring for stability"
- Communicate via Slack, email, status page
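Status updates to Slack can be scripted. A sketch assuming an incoming-webhook URL is available in `SLACK_WEBHOOK_URL` (the webhook itself is an assumption about your workspace setup):
```bash
# Post a resolution update to the incident channel via a Slack incoming webhook
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text":"[1:15 PM] RESOLVED: All services restored. Monitoring for stability."}' \
  "$SLACK_WEBHOOK_URL"
```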
Phase 5: Post-Incident Activities
1. Document Incident:
- Root cause
- Timeline of events
- Actions taken
- Lessons learned
2. Close Incident Ticket:
- Final status: Resolved
- Total duration: Detection → Resolution
- Customer impact: Estimated downtime
3. Schedule Post-Incident Review (See Post-Incident Review)
Incident Response Roles
Incident Manager (IM)
Responsibilities:
- Coordinate response efforts
- Communicate with stakeholders
- Make decisions on escalation and disaster recovery
- Ensure documentation
Who: On-call DevOps engineer or senior SRE
Technical Lead (TL)
Responsibilities:
- Lead technical investigation
- Implement mitigation strategies
- Coordinate with developers and platform engineers
Who: On-call backend engineer or DevOps engineer
Communications Lead (CL)
Responsibilities:
- Update internal stakeholders (Slack, email)
- Update status page
- Communicate with customers if needed
- Draft post-incident report
Who: Product manager or customer support lead
Subject Matter Expert (SME)
Responsibilities:
- Provide specialized knowledge (database, networking, security)
- Support technical investigation
- Review mitigation strategies
Who: Database admin, network engineer, security engineer (as needed)
Communication Procedures
Internal Communication (Slack)
Incident Channel: #incidents
SEV-1 Announcement:
```
@here SEV-1 INCIDENT
[12:34 PM] API Gateway Down - All production traffic failing
Incident Manager: @john.doe
Status: Investigating
Live doc: [Link to incident document]
```
Regular Updates (every 15-30 minutes):
```
[12:45 PM] UPDATE: Root cause identified - database connection pool exhausted. Implementing mitigation (restart pods).
```
Resolution:
```
[1:15 PM] RESOLVED: All services restored. Monitoring for stability. Post-incident review scheduled for tomorrow 10 AM.
```
External Communication (Status Page)
Tools: GitHub Pages, Atlassian Statuspage, or a custom status page
SEV-1 Example:
```
INVESTIGATING - API Unavailable
Posted: Nov 20, 2025 at 12:35 PM

We are currently experiencing issues with the Algesta API.
Our team is actively investigating. Updates will be posted here.
```
Update:
```
UPDATE - API Unavailable
Posted: Nov 20, 2025 at 12:50 PM

We have identified the root cause and are implementing a fix.
We expect service to be restored within 15 minutes.
```
Resolution:
```
RESOLVED - API Unavailable
Posted: Nov 20, 2025 at 1:15 PM

The issue has been resolved. All services are operational.
We will publish a post-incident report within 48 hours.
```
Common Incident Scenarios
Scenario 1: Complete Service Outage (SEV-1)
Symptoms:
- All API requests returning 5xx errors
- Health checks failing
- Grafana shows 0 req/sec
Response:
- Check AKS cluster status: `kubectl get nodes`
- Check pod status: `kubectl get pods -A`
- Check ingress: `kubectl get ingress -A`
- If pods are in CrashLoopBackOff, check logs: `kubectl logs <pod> -n production`
- Rollback recent deployment: `kubectl rollout undo deployment/<name> -n production`
- If infrastructure issue: Escalate to Azure support, initiate DR plan
Escalation: If not resolved in 30 minutes, escalate to senior SRE and consider DR
Scenario 2: High Error Rate (SEV-2)
Symptoms:
- Error rate >10% (normal <1%)
- 5xx errors in logs
- User complaints
Response:
- Check Grafana for error patterns
- Check Loki logs: `{namespace="production"} |= "ERROR"`
- Identify failing service
- Check database connectivity
- Check external dependencies (SendGrid, Twilio)
- Restart pods or rollback if needed
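The same Loki query can be run from a terminal with `logcli` instead of Grafana Explore (assumes `logcli` is installed and `LOKI_ADDR` points at your Loki endpoint, which is an assumption about your setup):
```bash
# Point logcli at your Loki endpoint (placeholder URL)
export LOKI_ADDR=https://<your-loki-endpoint>

# Show recent ERROR lines from the production namespace
logcli query '{namespace="production"} |= "ERROR"' --since=15m --limit=100
```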
Scenario 3: Database Connection Failure (SEV-1/SEV-2)
Symptoms:
- “MongoNetworkError” in logs
- Pods restarting frequently
- Health checks failing
Response:
- Check MongoDB Atlas status: https://status.mongodb.com
- Verify connection string in secrets
- Test connectivity: `kubectl exec <pod> -n production -- nc -zv cluster.mongodb.net 27017`
- Check IP whitelist in MongoDB Atlas
- If Atlas down: Wait for resolution or restore from backup to new cluster
Escalation: If Atlas issue persists >1 hour, initiate database failover to backup cluster
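Note that `nc` only proves the TCP port is reachable. A ping through `mongosh` also exercises DNS SRV resolution and authentication; a sketch, assuming `mongosh` is installed wherever you run it and reusing the secret shown above:
```bash
# Read the connection string from the secret and run a server ping
URI=$(kubectl get secret mongodb-credentials -n production -o jsonpath='{.data.uri}' | base64 -d)
mongosh "$URI" --eval "db.adminCommand('ping')"
```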
Scenario 4: Certificate Expiration (SEV-3, escalates to SEV-1 if not resolved)
Symptoms:
- Certificate expiring in <7 days
- HTTPS connections failing with SSL errors
- cert-manager logs show renewal failures
Response:
- Check certificate status: `kubectl get certificates -n production`
- Check cert-manager logs: `kubectl logs -n cert-manager -l app=cert-manager`
- Delete certificate to trigger renewal: `kubectl delete certificate <name> -n production`
- If auto-renewal fails: Manual certificate upload required
Prevention: Configure monitoring alerts for certificates expiring in 14 days
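Independently of cert-manager, you can check how close the serving certificate is to expiry directly against the endpoint; `openssl x509 -checkend` exits non-zero if the certificate expires within the given number of seconds (14 days = 1209600 seconds; the hostname matches the health-check examples above):
```bash
HOST=algesta-api-prod.3astronautas.com
# Print the expiry date and warn if the certificate expires within 14 days
echo | openssl s_client -servername "$HOST" -connect "$HOST":443 2>/dev/null \
  | openssl x509 -noout -enddate -checkend 1209600 \
  && echo "Certificate valid for at least 14 more days" \
  || echo "WARNING: certificate expires within 14 days"
```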
Scenario 5: High Memory Usage / OOMKilled Pods (SEV-2)
Symptoms:
- Pods constantly restarting
- "OOMKilled" in pod status
- Grafana shows memory usage at 100%
Response:
- Check memory usage: `kubectl top pods -n production`
- Check for memory leaks in logs
- Increase memory limits temporarily: `kubectl set resources deployment/<name> -n production --limits=memory=2Gi`
- Investigate code for memory leaks
- Restart pods to clear memory
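To confirm the restarts are really OOM kills rather than application crashes, inspect the last terminated state of the containers (standard kubectl output formatting; replace `<pod>` as usual):
```bash
# Shows the reason for the last container termination, e.g. OOMKilled
kubectl get pod <pod> -n production \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
```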
Follow-up: Performance optimization, code review for memory leaks
Post-Incident Review
Objective: Learn from incidents, prevent recurrence
Schedule: Within 48 hours of incident resolution (SEV-1/SEV-2)
Attendees:
- Incident Manager
- Technical Lead
- Engineering team members involved
- Product/management stakeholders (SEV-1 only)
Agenda:
1. Incident Timeline (10 minutes)
- Detection: When and how was the incident detected?
- Response: Who responded, and when?
- Mitigation: What actions were taken?
- Resolution: When was service restored?
2. Root Cause Analysis (15 minutes)
- What was the root cause?
- Why did it happen?
- Why wasn't it caught before production?
3. What Went Well (10 minutes)
- Fast detection
- Effective communication
- Successful mitigation
4. What Could Be Improved (15 minutes)
- Detection delays
- Documentation gaps
- Tool limitations
5. Action Items (10 minutes)
- Preventive measures (monitoring, alerting, testing)
- Process improvements (runbooks, automation)
- Technical debt (code fixes, infrastructure upgrades)
- Assign owners and due dates
Post-Incident Report Template:
# Post-Incident Report: [SEV-X] [Brief Description]
**Date:** 2025-11-20
**Duration:** 1 hour 15 minutes (12:30 PM - 1:45 PM)
**Severity:** SEV-1
**Incident Manager:** John Doe

## Summary
Brief description of incident and customer impact.

## Timeline
- 12:30 PM: Incident detected via Grafana alert
- 12:35 PM: On-call engineer acknowledged
- 12:40 PM: Root cause identified (database connection pool exhausted)
- 12:50 PM: Mitigation implemented (restarted pods)
- 1:15 PM: Service restored
- 1:45 PM: Monitoring confirmed stable

## Root Cause
Database connection pool size set too low (10 connections) for production traffic.

## Impact
- 100% of API requests failed for 45 minutes
- Estimated 500 customer requests affected
- Revenue impact: $X (estimated)

## What Went Well
- Fast detection (< 5 minutes via monitoring)
- Clear runbook for pod restart
- Effective internal communication

## What Could Be Improved
- Database connection pool should have been load tested
- Alerting on connection pool exhaustion
- Better capacity planning

## Action Items
1. [OWNER] Increase connection pool size to 50 (Due: Nov 21)
2. [OWNER] Add monitoring for database connection pool usage (Due: Nov 23)
3. [OWNER] Conduct load testing for all services (Due: Dec 1)
4. [OWNER] Update deployment checklist to include load testing (Due: Nov 22)

## Lessons Learned
- Always load test before production deployment
- Monitor resource exhaustion (connections, file descriptors, etc.)
- Document capacity limits in service README

Related Documentation:
- Runbooks: Detailed procedures for common issues
- Monitoring & Logging: How to use Grafana and Loki during incidents
- Backup & DR: Disaster recovery procedures
For Support:
- On-call engineer: Check on-call schedule in PagerDuty/Opsgenie
- Escalation: Senior SRE → Engineering Manager → CTO
- Emergency contacts: [Internal wiki or contact list]