Stable Recipe

Change requires PR.

Reviewers: default_reviewers

[Project Name] Production Runbook

This runbook provides operational procedures, troubleshooting guidelines, and support information for managing the application in production environments.

Version Information

  • Version Number: [e.g., 1.0.0]
  • Last Updated: [e.g., April 22, 2025]
  • Author(s): [Names of contributors]

System Architecture

For a general project overview and the technical architecture, see README-template.md and OVERVIEW-template.md.

Production Architecture Diagram

[Insert production-specific architecture diagram here]

Component Inventory

| Component | Description | Server/Host | Dependencies |
| --- | --- | --- | --- |
| Component 1 | Brief description | hostname/URL | List of dependencies |
| Component 2 | Brief description | hostname/URL | List of dependencies |
| Component 3 | Brief description | hostname/URL | List of dependencies |

Network Configuration

  • Ports: [e.g., 80, 443, 3306]
  • Firewall Rules: [e.g., Allow inbound 443 from public, 3306 from app server only]
  • DNS Configuration: [DNS entries and routing information]
  • VPN/VPC Settings: [Virtual private network/cloud configuration]

External Integrations

| Integration | Type | Endpoint | Authentication Method | Rate Limits |
| --- | --- | --- | --- | --- |
| Integration 1 | [REST, SOAP, etc.] | URL | [OAuth, API Key, etc.] | [e.g., 1000 req/min] |
| Integration 2 | [REST, SOAP, etc.] | URL | [OAuth, API Key, etc.] | [e.g., 1000 req/min] |
| Integration 3 | [REST, SOAP, etc.] | URL | [OAuth, API Key, etc.] | [e.g., 1000 req/min] |

Service Level Agreements (SLAs)

  • Availability: [e.g., 99.9% uptime]
  • Response Time: [e.g., 95% of requests < 500ms]
  • Recovery Time Objective (RTO): [e.g., 4 hours]
  • Recovery Point Objective (RPO): [e.g., 15 minutes]
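
The availability and response-time targets above can be turned into concrete numbers. The following is a minimal sketch, assuming a fixed measurement period and a list of per-request latencies in milliseconds; the function names `downtime_budget` and `meets_latency_target` are hypothetical and not part of any existing tooling.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


def meets_latency_target(latencies_ms, threshold_ms: float, required_fraction: float) -> bool:
    """Return True if at least `required_fraction` of requests finished under `threshold_ms`."""
    if not latencies_ms:
        # Assumption: a period with no traffic does not violate the target.
        return True
    under = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return under / len(latencies_ms) >= required_fraction
```

With the example values from this section, a 99.9% target over a 30-day period allows about 43.2 minutes of downtime, and `meets_latency_target(latencies, 500, 0.95)` checks the "95% of requests < 500ms" target.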

Production Environments

For deployment instructions, see BUILD-template.md.

Production Environment

  • URL: [Production URL]
  • Region/Location: [e.g., Azure East US, AWS us-east-1]
  • Resource Group/Cluster: [Resource group or cluster name]
  • Instance Type/Size: [e.g., Standard_DS3_v2, m5.large]

DR/Backup Environment

  • URL: [DR URL]
  • Region/Location: [e.g., Azure West US, AWS us-west-2]
  • Failover Mechanism: [Description of failover process]
  • Testing Schedule: [e.g., Quarterly DR tests]

Operational Procedures

Startup Procedure

  1. [Step 1 for starting the application]
  2. [Step 2 for starting the application]
  3. [Step 3 for starting the application]

Verification:

  • [How to verify successful startup]
  • [Expected startup time]

Shutdown Procedure

  1. [Step 1 for graceful shutdown]
  2. [Step 2 for graceful shutdown]
  3. [Step 3 for graceful shutdown]

Verification:

  • [How to verify successful shutdown]

Restart Procedure

  1. [Step 1 for restarting the application]
  2. [Step 2 for restarting the application]
  3. [Step 3 for restarting the application]

Verification:

  • [How to verify successful restart]
  • [Expected restart time]

Scaling Procedures

Horizontal Scaling

  1. [Step 1 for adding/removing instances]
  2. [Step 2 for adding/removing instances]
  3. [Step 3 for adding/removing instances]

Vertical Scaling

  1. [Step 1 for increasing/decreasing resources]
  2. [Step 2 for increasing/decreasing resources]
  3. [Step 3 for increasing/decreasing resources]

Database Procedures

Connection Information

  • Host: [Database host]
  • Port: [Database port]
  • Admin Tool: [e.g., pgAdmin, MySQL Workbench]
  • Access Method: [e.g., Jump box, VPN]

Common Database Operations

  • Query Performance: [Command to check query performance]
  • Connection Management: [Command to view/manage connections]
  • Schema Changes: [Process for schema migrations in production]

Scheduled Jobs

| Job Name | Description | Schedule | Average Runtime | Owner |
| --- | --- | --- | --- | --- |
| Job 1 | Description | [e.g., Daily at 2 AM] | [e.g., 15 min] | [Team/Owner] |
| Job 2 | Description | [e.g., Weekly on Sundays] | [e.g., 30 min] | [Team/Owner] |
| Job 3 | Description | [e.g., Monthly on the 1st] | [e.g., 1 hour] | [Team/Owner] |

Log Management

  • Application Logs: [Location and access method]
  • System Logs: [Location and access method]
  • Access Logs: [Location and access method]
  • Log Retention: [e.g., 30 days online, 1 year archived]
  • Log Search Tool: [e.g., Kibana, Splunk]

Monitoring

Monitoring Tools

  • System Monitoring: [e.g., Prometheus, Azure Monitor]
  • Application Monitoring: [e.g., New Relic, Application Insights]
  • Synthetic Monitoring: [e.g., Pingdom, Uptrends]
  • Dashboard URL: [URL to monitoring dashboard]

Key Metrics

| Metric | Description | Warning Threshold | Critical Threshold | Action |
| --- | --- | --- | --- | --- |
| CPU Usage | Server CPU utilization | 70% | 90% | [Action to take] |
| Memory Usage | Server memory utilization | 80% | 95% | [Action to take] |
| Response Time | API response time | 500ms | 2000ms | [Action to take] |
| Error Rate | Percentage of 5xx errors | 1% | 5% | [Action to take] |
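
The warning and critical thresholds in the Key Metrics table can be applied mechanically. The sketch below is one possible way to do that; the metric keys, the `Threshold` class, and the `classify` function are hypothetical names, and treating a value equal to a threshold as a breach is an assumption, not something this runbook specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Values copied from the Key Metrics table.
# Units: percent for CPU, memory, and error rate; milliseconds for response time.
THRESHOLDS = {
    "cpu_usage": Threshold(warning=70, critical=90),
    "memory_usage": Threshold(warning=80, critical=95),
    "response_time_ms": Threshold(warning=500, critical=2000),
    "error_rate": Threshold(warning=1, critical=5),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading."""
    threshold = THRESHOLDS[metric]
    # Assumption: a reading exactly at a threshold counts as a breach.
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"
```

For example, `classify("cpu_usage", 75)` falls between the two CPU thresholds and is reported as a warning.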

Alerts

| Alert | Condition | Priority | Notification | Owner |
| --- | --- | --- | --- | --- |
| Alert 1 | [e.g., CPU > 90% for 5 min] | [High/Medium/Low] | [Email, SMS, etc.] | [Team/Owner] |
| Alert 2 | [e.g., Error rate > 5%] | [High/Medium/Low] | [Email, SMS, etc.] | [Team/Owner] |
| Alert 3 | [e.g., Disk space < 10%] | [High/Medium/Low] | [Email, SMS, etc.] | [Team/Owner] |
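
A condition such as "CPU > 90% for 5 min" is a sustained-breach check: every reading in the window must exceed the threshold. Most monitoring tools provide this natively; the sketch below only illustrates the logic, assuming readings arrive as `(timestamp, value)` pairs. The function name `sustained_breach` is hypothetical.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, threshold: float, window: timedelta, now: datetime) -> bool:
    """Return True if every reading in the last `window` is above `threshold`.

    `samples` is an iterable of (timestamp, value) pairs.
    """
    samples = list(samples)
    window_start = now - window
    in_window = [value for ts, value in samples if window_start <= ts <= now]
    # Assumption: require data reaching back to the start of the window,
    # so a series that only just started reporting cannot fire the alert.
    has_full_history = any(ts <= window_start for ts, _ in samples)
    return bool(in_window) and has_full_history and all(v > threshold for v in in_window)
```

A single reading at or below the threshold inside the window resets the condition, which matches the usual meaning of "for 5 min" in alert rules.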

Troubleshooting

Common Issues and Solutions

| Issue | Symptoms | Troubleshooting Steps | Resolution |
| --- | --- | --- | --- |
| Issue 1 | [What users/monitors observe] | [Steps to diagnose] | [Steps to resolve] |
| Issue 2 | [What users/monitors observe] | [Steps to diagnose] | [Steps to resolve] |
| Issue 3 | [What users/monitors observe] | [Steps to diagnose] | [Steps to resolve] |

Diagnostic Commands

| Purpose | Command | Expected Output |
| --- | --- | --- |
| Check service status | [command] | [Expected normal output] |
| Check connectivity | [command] | [Expected normal output] |
| Check resource usage | [command] | [Expected normal output] |
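
Where no dedicated tool is available, a connectivity check can be as simple as attempting a TCP connection to a component's host and port. The following is a minimal sketch using Python's standard `socket` module; the function name `can_connect` and the 3-second default timeout are assumptions for illustration, and the host and port come from the Component Inventory and Network Configuration sections.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A successful connection only shows that the port is reachable and accepting connections; it does not confirm that the service behind it is healthy.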

Support Escalation Path

First Level Support

  • Team: [e.g., Operations]
  • Contact: [Contact information]
  • Hours: [e.g., 24/7, Business hours]
  • Response Time: [e.g., 30 minutes]

Second Level Support

  • Team: [e.g., Application Support]
  • Contact: [Contact information]
  • Hours: [e.g., 24/7, Business hours]
  • Response Time: [e.g., 1 hour]

Third Level Support

  • Team: [e.g., Development]
  • Contact: [Contact information]
  • Hours: [e.g., Business hours]
  • Response Time: [e.g., 4 hours]

Maintenance

Backup Procedures

Database Backups

  • Type: [e.g., Full, Differential]
  • Frequency: [e.g., Daily at 1 AM]
  • Location: [Where backups are stored]
  • Retention: [e.g., 30 days]
  • Verification: [How backups are verified]
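
A retention period such as "30 days" translates into a cutoff date: backups taken before the cutoff are eligible for deletion. The sketch below shows that calculation under the assumption that each backup is identified by the date it was taken; the function name `expired_backups` is hypothetical, and a backup taken exactly on the cutoff date is kept.

```python
from datetime import date, timedelta


def expired_backups(backup_dates, retention_days: int, today: date):
    """Return the backup dates that fall outside the retention period, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    # Assumption: a backup taken exactly on the cutoff date is still retained.
    return sorted(d for d in backup_dates if d < cutoff)
```

The function only selects candidates; the actual deletion should follow the storage location's own procedure and be verified against the backup policy.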

Application Backups

  • Components: [What is backed up]
  • Frequency: [e.g., Weekly on Sundays]
  • Location: [Where backups are stored]
  • Retention: [e.g., 90 days]

Restore Procedures

Database Restore

  1. [Step 1 for database restore]
  2. [Step 2 for database restore]
  3. [Step 3 for database restore]

Application Restore

  1. [Step 1 for application restore]
  2. [Step 2 for application restore]
  3. [Step 3 for application restore]

Regular Maintenance Tasks

| Task | Frequency | Description | Owner |
| --- | --- | --- | --- |
| Task 1 | [e.g., Daily] | [Description of maintenance task] | [Team/Owner] |
| Task 2 | [e.g., Weekly] | [Description of maintenance task] | [Team/Owner] |
| Task 3 | [e.g., Monthly] | [Description of maintenance task] | [Team/Owner] |

Security

Access Management

  • Production Access Process: [Process for requesting production access]
  • Access Review: [e.g., Quarterly review of access permissions]
  • Privileged Access: [Process for obtaining privileged access]

Certificate Management

| Certificate | Purpose | Expiration | Renewal Process | Owner |
| --- | --- | --- | --- | --- |
| Certificate 1 | [e.g., TLS/SSL] | [Date] | [Process for renewal] | [Team/Owner] |
| Certificate 2 | [e.g., Client Auth] | [Date] | [Process for renewal] | [Team/Owner] |
| Certificate 3 | [e.g., Signing] | [Date] | [Process for renewal] | [Team/Owner] |
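
The expiration dates in the certificate table can be checked against a renewal lead time so that renewals start before a certificate lapses. This is a minimal sketch; the function names and the 30-day default lead time are assumptions for illustration, not a policy stated in this runbook.

```python
from datetime import date


def days_until_expiry(expires_on: date, today: date) -> int:
    """Return the number of days until the certificate expires (negative if already expired)."""
    return (expires_on - today).days


def needs_renewal(expires_on: date, today: date, lead_days: int = 30) -> bool:
    """Return True if the certificate expires within `lead_days` or has already expired."""
    # Assumption: 30 days is a placeholder lead time; set it to match the renewal process.
    return days_until_expiry(expires_on, today) <= lead_days
```

Running such a check from a scheduled job and routing the result to the certificate owner is one way to tie this table to the Alerts section.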

Security Monitoring

  • Vulnerability Scanning: [e.g., Weekly scans using Tool X]
  • Penetration Testing: [e.g., Annual tests by Vendor Y]
  • Compliance Checks: [e.g., Monthly checks against Standard Z]

Contact Information

Product Team

| Role | Name | Email | Phone |
| --- | --- | --- | --- |
| Product Owner | [Name] | [Email] | [Phone] |
| Product Manager | [Name] | [Email] | [Phone] |
| Technical Lead | [Name] | [Email] | [Phone] |

Operations Team

| Role | Name | Email | Phone |
| --- | --- | --- | --- |
| Operations Manager | [Name] | [Email] | [Phone] |
| DevOps Engineer | [Name] | [Email] | [Phone] |
| Database Administrator | [Name] | [Email] | [Phone] |

Vendor Contacts

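| Vendor | Service | Contact | Email | Phone | Support URL |
| --- | --- | --- | --- | --- | --- |
| Vendor 1 | [Service] | [Name] | [Email] | [Phone] | [URL] |
| Vendor 2 | [Service] | [Name] | [Email] | [Phone] | [URL] |
| Vendor 3 | [Service] | [Name] | [Email] | [Phone] | [URL] |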

Appendix

Glossary

| Term | Definition |
| --- | --- |
| Term 1 | Definition of Term 1 |
| Term 2 | Definition of Term 2 |
| Term 3 | Definition of Term 3 |

Reference Documents

  • [Link to related documentation]
  • [Link to vendor documentation]
  • [Link to compliance requirements]

Change Log

| Date | Version | Author | Description of Changes |
| --- | --- | --- | --- |
| [Date] | [Version] | [Author] | [Description] |
| [Date] | [Version] | [Author] | [Description] |
| [Date] | [Version] | [Author] | [Description] |