Major Incident - 14/09/2020 - AMS Login Issue
NOTE: All times are UK (UTC).
1 - Incident Report Summary
The incident report summary is to be filled in by the Major Incident Manager in collaboration with the Problem Manager and technical team.
It provides the high-level information for sharing with executives.
| Major Incident Report Summary | |
|---|---|
| Major Incident Title | AMS Login Issue |
| System(s) Impacted | AMS |
| JIRA Ticket Number | AMSIII-1944 |
| Region(s) Impacted | All |
| Time/Date of First Report | 15:37, 14 September 2020 |
| Total Reported Outage Time | 8 hr 53 min |
| Time/Date of Incident Closure | 00:30, 15 September 2020 |
| Nature of Problem | The AMS login page kept displaying the message "An application error occurred (503)", which prevented users from logging into the AMS system. |
| Impact | Global |
| Resolution Details (permanent fix/workaround) | Refer to the resolution in the Technical Summary. |
| Root Cause | Refer to the root cause in the Technical Summary. |
| Post Incident Recommendations | Refer to the post-incident recommendations in the Technical Summary. |
2 - Technical Summary
The Technical Summary is to be filled in by the Problem Manager and technical team.
It contains the technical information and terminology the technical team will need when troubleshooting similar incidents in future.
| Major Incident Technical Summary | |
|---|---|
| Resolution Details (including steps, screenshots, permanent fix or workaround) | Microsoft resumed the affected resources in phases; the SQL database was the last to be restored, at around 00:30. AMS was slow with delayed response times during the morning, and service returned to normal by early morning on 15 September. |
| Root Cause | Azure UK South experienced Virtual Machine, Storage, SQL Database, and Networking failures, which impacted AMS because all AMS resources are deployed in the UK South region only. AMS was mainly impacted by the SQL Database, Storage, and Networking failures. |
| Post Incident Recommendations | Explore AMS alerting options so that the support team and key people are notified automatically if the server goes down. |
| Additional comments (if required) | |
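The alerting recommendation above could start with something as simple as a scheduled availability probe that checks the AMS login endpoint and raises an alert message on a non-200 response (such as the 503 seen in this incident). A minimal sketch, assuming a placeholder URL and leaving the actual delivery channel (email, Teams, etc.) to the support team:

```python
"""Minimal AMS availability probe (sketch for the post-incident
recommendation). AMS_URL is a placeholder, not the real endpoint."""
import urllib.error
import urllib.request

AMS_URL = "https://ams.example.org/login"  # placeholder URL


def check_ams(url: str = AMS_URL, timeout: int = 10) -> tuple[bool, str]:
    """Return (healthy, detail) for a single probe of the AMS login page."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # e.g. the "An application error occurred (503)" seen in this incident
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, TimeoutError) as exc:
        return False, f"unreachable: {exc}"


def alert_message(healthy: bool, detail: str) -> str:
    """Message to send to the support team and key people."""
    return "AMS OK" if healthy else f"ALERT: AMS appears down ({detail})"
```

Run on a schedule (cron, Azure Function timer, etc.) from outside the UK South region, so the probe itself survives a regional outage; Azure Monitor availability tests would be the managed alternative.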
3 - Incident Management
The Incident Management section details how people were first notified about this incident, who the issue was escalated to, the team that was mobilised, and the communications that were sent out.
| Major Incident Management | |
|---|---|
| Notification | First notified via: JIRA |
| JIRA Ticket Number(s) | |
| Escalation/Mobilisation | A Major Incident Team was formed, consisting of the following people: Major Incident Manager: Jay Padharia; Problem Manager: Imran Beg; Major Incident Team (MIT): |
| Communications (descending order) | Copy and paste all comms associated with this major incident: |
4 - Incident Timeline
The incident timeline should capture all actions and events that took place until the incident closure.
Everyone involved with the incident should contribute towards the timeline.
| Time (HH:MM) | Date (DD-MON-YY) | Observation/Action | Action Taken By |
|---|---|---|---|
| 15:37 | 14-Sep-20 | First report of AMS being down, raised via JIRA (AMSIII-1944) | User |
| 16:22 | 14-Sep-20 | Sibani Gale sent an email to IT stakeholders notifying them that AMS was down | Sibani Gale |
| 16:33 | 14-Sep-20 | Jay Padharia sent an IT Announcement to the user base notifying them that AMS was down | Jay Padharia |
| 16:35 | 14-Sep-20 | Imran Beg notified relevant stakeholders of a major issue with the Microsoft data centre affecting the AMS service | Imran Beg |
| 17:00 | 14-Sep-20 | Major Incident Meeting | MIT |
| 00:30 | 15-Sep-20 | Microsoft confirmed the AMS service was restored; Imran Beg verified | Imran Beg |
| 00:56 | 15-Sep-20 | IT Announcement sent to the user base confirming that AMS was back online | Cristian Alfaro |
The following sections DO NOT need to be completed during a Major Incident. They are post-incident activities that the IT Service Delivery Manager will conduct with the relevant stakeholders.
5 - Problem Management: Ticket AMSIII-1944
(Refer to the ADI problem management module for help.) Click here to access ADI Problem Management Module
This part of the MI report is to ensure that the underlying causes of this incident are known and that preventive actions have been taken to avoid it in future.
5.1. Define the Issue (What do we need to resolve?)
5.1.1. Write the problem description below, describing the gap between the actual and desired state.
| On 14 September 2020 at around 15:30, all AMS users across the organization were unable to access the system, preventing the Award Management functions in SCI and Members from conducting their Award-related operations on AMS, and impacting their ability to capture information, process approvals, generate reports, and provide spend authorization to FMS. |
|---|
5.2. Measure (What is the current situation and business pain?)
5.2.1. Report showing measurement of the current situation (insert current situation report here in any format)
| Microsoft restored the region's services at around 00:30. Initial performance issues were identified but were resolved within 12 hours. AMS is now completely up and running with no further issues. |
|---|
5.3. Analyze the Causes (What caused it and how do we know it?)
Analyze the incident to find root causes using the ADI-recommended tools: 5 Whys or Ishikawa.
The following are the key categories/dimensions to use for the analysis; consider more if required.
People and Organization
Information and Technology
Processes and Value Streams
Vendors and Suppliers
5.3.1. Record the root causes identified from the Ishikawa or 5 Whys analysis and their possible solutions.
| No | Root Cause | Category/Dimension | Possible Solution |
|---|---|---|---|
| 1 | Azure UK South experienced Virtual Machine, Storage, SQL Database, and Networking failures, which impacted AMS because all AMS resources are deployed in the UK South region only. AMS was mainly impacted by the SQL Database, Storage, and Networking failures. | | Solution 1: Initiate the Disaster Recovery solution and bring the AMS system up in the UK West region. Solution 2: Wait for Microsoft to restore the UK South region. |
| 2 | | | |
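Solution 1 (failing AMS over to UK West) is typically implemented for the database tier with an Azure SQL failover group. A hedged sketch, assuming a failover group with a UK West secondary has already been pre-configured; all resource names (`ams-fg`, `ams-rg`, `ams-sql-ukwest`) are placeholders, not the real AMS resources:

```shell
# Check current replication roles before failing over
az sql failover-group show \
    --name ams-fg \
    --resource-group ams-rg \
    --server ams-sql-ukwest

# Promote the UK West secondary to primary
# (planned failover; add --allow-data-loss only if the
# UK South primary region is unreachable, as in this incident)
az sql failover-group set-primary \
    --name ams-fg \
    --resource-group ams-rg \
    --server ams-sql-ukwest
```

This covers only the SQL Database tier; the Virtual Machine, Storage, and Networking resources would also need pre-provisioned UK West counterparts for a full regional failover.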
5.4. Improve the Situation (How can it be fixed?)
Assess the solutions' value and associated risks to select the most suitable option to implement.
5.4.1. Assess the solutions' value and effort (green is the better choice).
| Solutions Value and Effort Assessment | | |
|---|---|---|
| No | Possible Solution (from step 5.3.1) | Low Value/High Effort (No Go Area) |