Creating and Managing Incidents

Incidents are events that affect the availability or performance of your services. They let you communicate issues to users and track the resolution process with detailed updates.
<h2>Incident Properties</h2>
Each incident has the following properties:
<ul>
<li>Title - Clear description of the issue (e.g., "API latency affecting mobile app")</li>
<li>Status Type - Severity level (degraded_performance, partial_outage, major_outage, incident)</li>
<li>State - Current stage (active or resolved)</li>
<li>Impact - How broadly users are affected (minor, major, critical)</li>
<li>Affected Services - List of services impacted by this incident</li>
<li>Description - Initial description of the incident</li>
<li>Created At - When the incident was first reported</li>
<li>Resolved At - When the incident was marked as resolved (if applicable)</li>
</ul>
<h2>Creating an Incident</h2>
When an issue occurs, create an incident to communicate with users:
<ol>
<li>Identify the Issue - Determine which services are affected and how severe the issue is</li>
<li>Create the Incident - Add it to your status page with a clear title</li>
<li>Set the Status Type - Choose the appropriate severity (partial_outage, major_outage, etc.)</li>
<li>Mark Affected Services - Select all services impacted by this incident</li>
<li>Add Initial Description - Provide initial details about what's happening</li>
<li>Publish - Make the incident visible to users on your status page</li>
</ol>
<h2>Incident Lifecycle</h2>
An incident moves through four states:
<h3>1. Investigating</h3>
The initial state when an incident is first created. The team is gathering information and assessing the impact.
<h3>2. Identified</h3>
The root cause has been identified and a fix is being prepared. This gives users confidence that the issue is understood.
<h3>3. Monitoring</h3>
A fix has been deployed and the team is monitoring for resolution. Services should be improving at this stage.
<h3>4. Resolved</h3>
The incident has been fixed and services have been restored to normal operation. Users can resume using affected services.
<h2>Incident Updates</h2>
Each incident can have multiple updates that track the resolution progress:
<ul>
<li>Initial Update - Created automatically when the incident is created</li>
<li>Status Updates - Add updates when the incident state changes</li>
<li>Progress Updates - Communicate progress even when the state doesn't change</li>
<li>Resolution Update - Final update when the incident is resolved</li>
</ul>
<h2>Updating an Incident</h2>
As you work on resolving an incident, update it regularly:
<ol>
<li>Change State - Update to "Identified", "Monitoring", or "Resolved"</li>
<li>Add Message - Describe what's new (e.g., "Deploying fix now")</li>
<li>Update Affected Services - Add or remove services as needed</li>
<li>Save Update - Each update creates a timeline entry</li>
</ol>
<h2>Impact Levels</h2>
<h3>Minor Impact</h3>
<ul>
<li>Affects a small subset of users</li>
<li>Service degradation is minimal</li>
<li>Workarounds may be available</li>
<li>Example: Mobile app shows errors for 5% of users</li>
</ul>
<h3>Major Impact</h3>
<ul>
<li>Affects many users</li>
<li>Significant service degradation</li>
<li>Core functionality impaired</li>
<li>Example: API response times are 3x normal</li>
</ul>
<h3>Critical Impact</h3>
<ul>
<li>Affects most or all users</li>
<li>Services are completely unavailable</li>
<li>No workarounds available</li>
<li>Example: Entire API is down</li>
</ul>
<h2>Affected Services</h2>
Mark which services are impacted by each incident:
<ul>
<li>Multi-select - Choose multiple affected services</li>
<li>Dynamic Updates - Update affected services as you learn more</li>
<li>Service Status - Affected services automatically show an outage status</li>
<li>Visual Indicators - Users can see which services are impacted</li>
</ul>
<h2>Incident History</h2>
Resolved incidents are displayed in the "Past Incidents" section of your status page:
<ul>
<li>Title and Date - Shows what happened and when</li>
<li>Duration - Displays how long the incident lasted</li>
<li>Impact Badge - Shows severity level</li>
<li>Resolved Badge - Indicates the incident is resolved</li>
<li>Clickable - Users can click to view full incident details</li>
</ul>
<h2>Best Practices</h2>
<ul>
<li>Create incidents promptly when issues are discovered</li>
<li>Update incidents at least every 30 minutes during active resolution</li>
<li>Use clear, non-technical language in incident descriptions</li>
<li>Always mark incidents as resolved when services are fully restored</li>
<li>Conduct post-incident reviews to prevent recurrence</li>
<li>Keep incident titles concise but informative</li>
<li>Include affected services to help users understand scope</li>
</ul>
<h2>Example Incident Timeline</h2>
<ol>
<li>10:00 AM - Incident created: "API latency affecting mobile app" (Investigating)</li>
<li>10:15 AM - Update: "We're investigating increased response times on our API servers"</li>
<li>10:30 AM - State change to "Identified": "Database load caused by a recent deployment"</li>
<li>10:45 AM - Update: "Rolling back the problematic changes"</li>
<li>11:00 AM - State change to "Monitoring": "Rollback complete. Monitoring API performance"</li>
<li>11:15 AM - State change to "Resolved": "API performance back to normal"</li>
</ol>
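For readers who want to reason about the lifecycle and timeline in code, the following is a minimal Python sketch that replays the example timeline above. It is not the product's API: the names <code>State</code>, <code>Update</code>, <code>Incident</code>, <code>add_update</code>, <code>longest_gap_minutes</code>, and <code>replay_example</code> are hypothetical, the calendar date is arbitrary (the timeline gives only times), the service names are illustrative, and rejecting updates after resolution is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class State(Enum):
    """The four lifecycle states described in this article."""
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


@dataclass
class Update:
    at: datetime
    state: State
    message: str


@dataclass
class Incident:
    title: str
    created_at: datetime
    affected_services: List[str] = field(default_factory=list)
    state: State = State.INVESTIGATING
    resolved_at: Optional[datetime] = None
    updates: List[Update] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Mirrors the "Initial Update": the first timeline entry exists
        # as soon as the incident does.
        if not self.updates:
            self.updates.append(Update(self.created_at, self.state, self.title))

    def add_update(self, at: datetime, message: str,
                   state: Optional[State] = None) -> None:
        """Append a timeline entry, optionally changing the state."""
        # Assumption of this sketch: a resolved incident takes no more updates.
        if self.state is State.RESOLVED:
            raise ValueError("incident is already resolved")
        if state is not None:
            self.state = state
        if self.state is State.RESOLVED:
            self.resolved_at = at
        self.updates.append(Update(at, self.state, message))

    def duration_minutes(self) -> Optional[int]:
        """Minutes from creation to resolution, or None while still open."""
        if self.resolved_at is None:
            return None
        return int((self.resolved_at - self.created_at).total_seconds() // 60)

    def longest_gap_minutes(self) -> int:
        """Longest silence between consecutive updates, in minutes."""
        times = [u.at for u in self.updates]
        gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
        return int(max(gaps, default=0))


def replay_example() -> Incident:
    day = datetime(2024, 1, 1)  # arbitrary date, not from the article

    def t(hour: int, minute: int) -> datetime:
        return day.replace(hour=hour, minute=minute)

    inc = Incident("API latency affecting mobile app", t(10, 0),
                   ["API", "Mobile App"])  # illustrative service names
    inc.add_update(t(10, 15), "We're investigating increased response times "
                              "on our API servers")
    inc.add_update(t(10, 30), "Database load caused by a recent deployment",
                   State.IDENTIFIED)
    inc.add_update(t(10, 45), "Rolling back the problematic changes")
    inc.add_update(t(11, 0), "Rollback complete. Monitoring API performance",
                   State.MONITORING)
    inc.add_update(t(11, 15), "API performance back to normal", State.RESOLVED)
    return inc


if __name__ == "__main__":
    incident = replay_example()
    for u in incident.updates:
        print(u.at.strftime("%I:%M %p"), u.state.value, "-", u.message)
```

Comparing <code>longest_gap_minutes()</code> against 30 is one way to check the "update at least every 30 minutes" practice, and <code>duration_minutes()</code> corresponds to the Duration shown under Incident History.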