Major Incident - 14/09/2020 - AMS Login Issue

Make a copy of this template and use it for a new Major Incident.

For an example of a completed template use the following URL: https://savethechildrenintl-sandbox.atlassian.net/wiki/pages/viewpage.action?pageId=20022393

NOTE: All times are UK time (UTC).

1 - Incident Report Summary

The incident report summary is to be filled in by the Major Incident Manager in collaboration with the Problem Manager and technical team.

The incident report summary is the high-level information for sharing with executives.

Major Incident Report Summary


Major Incident Title	AMS Login Issue
System(s) Impacted	AMS
JIRA Ticket Number	AMSIII-1944
Region(s) Impacted	All
Time/Date of First Report	15:37, 14 September 2020
Total Reported Outage Time	8 hr 53 min
Time/Date of Incident Closure	00:30, 15 September 2020
Nature of Problem	The AMS login page repeatedly displayed the message "An application error occurred (503)". This prevented users from logging in to the AMS system.
Impact	Global
Resolution details including: permanent fix/workaround	(Refer to the resolution in the Technical Summary)
Root Cause	(Refer to the root cause in the Technical Summary)
Post Incident Recommendations	(Refer to the post incident recommendations in the Technical Summary)
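
The total outage time in the table above can be checked from the first-report and closure timestamps. The following is a minimal sketch, assuming both timestamps are in the same time zone; the function name `format_outage` is illustrative and not part of any AMS tooling.

```python
from datetime import datetime

# Timestamps copied from the summary table (treated as naive times in one zone).
first_report = datetime(2020, 9, 14, 15, 37)
closure = datetime(2020, 9, 15, 0, 30)


def format_outage(start: datetime, end: datetime) -> str:
    """Return the elapsed time between two timestamps as 'H hr MM min'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hr {minutes:02d} min"


print(format_outage(first_report, closure))  # prints: 8 hr 53 min
```

This reproduces the 8 hours 53 minutes reported between 15:37 on 14 September and 00:30 on 15 September.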

2 - Technical Summary

The incident technical summary is to be filled in by the Problem Manager and technical team.

This will involve technical information and terminology that will be relevant for the technical team for troubleshooting future similar incidents.

Major Incident Technical Summary


Resolution Details (including steps, screenshots, permanent fix or workaround)	Microsoft restored its resources in phases; the SQL database was the last to be restored, at around 00:30 on 15 September. AMS was initially slow in the morning, with delayed response times, and returned to normal by early morning on 15 September.
Root cause	Azure UK South experienced Virtual Machine, Storage, SQL Database, and Networking failures. This impacted AMS because all of the AMS resources are deployed in the UK South region only. AMS was mainly impacted by the SQL Database, Storage, and Networking failures.
Post Incident Recommendations	Investigate options for AMS alerts that notify the support team and key people if the server goes down.
Additional comments (if required)
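
The alerting recommendation above could be prototyped as a simple availability probe. The following is a minimal sketch only, not an agreed design: the URL `https://ams.example.org/login`, the threshold of three consecutive failures, and all function names are assumptions made for illustration, and the notification step (email, Teams, SMS) is left out.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Hypothetical values: the real AMS URL and alert threshold are not given in this report.
AMS_LOGIN_URL = "https://ams.example.org/login"
FAILURES_BEFORE_ALERT = 3


def probe(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 503, the error users saw during this incident
    except OSError:
        return 0  # DNS failure, refused connection, or timeout


def is_failure(status: int) -> bool:
    """Treat an unreachable server (0) and any 5xx response as a failed probe."""
    return status == 0 or status >= 500


def should_alert(recent_statuses: list[int], threshold: int = FAILURES_BEFORE_ALERT) -> bool:
    """Return True when the most recent `threshold` probes all failed."""
    if len(recent_statuses) < threshold:
        return False
    return all(is_failure(status) for status in recent_statuses[-threshold:])
```

A scheduler would call `probe(AMS_LOGIN_URL)` at a fixed interval, keep the recent status codes, and notify the support team when `should_alert` returns True. Requiring several consecutive failures avoids alerting on a single transient error.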

3 - Incident Management

The Incident Management section details how people were first notified about this incident, who the issue was escalated to, the team that was mobilised, and the communications that were sent out.

Major Incident Management


Notification	First notified via: JIRA
JIRA Ticket Number(s)	AMSIII-1944
Escalation/Mobilisation:	A Major Incident team was formed, consisting of the following people. Major Incident Manager: Jay Padharia. Problem Manager: Imran Beg. Major Incident Team (MIT): Sibani Gale, Imran Beg, Jay Padharia, Khady Ndiaye, Mahmood Khan, Gerald Waterfield, Jitin Mangal, Cristian Alfaro.
Communications (descending order):	Copy and paste all comms associated with this major incident:

4 - Incident Timeline:

The incident timeline should capture all actions and events that took place until the incident closure.

Everyone involved with the incident should contribute towards the timeline.

Time (HH:MM)	Date (DD-MON-YY)	Observation/Action	Action taken by
15:37	14-Sep-20	AMS first reported down via JIRA (AMSIII-1944)	User
16:22	14-Sep-20	Sibani Gale sent an email to IT Stakeholders notifying them that AMS is down	Sibani Gale
16:33	14-Sep-20	Jay Padharia sent an IT Announcement to the user base notifying them that AMS is down	Jay Padharia
16:35	14-Sep-20	Imran Beg notified relevant stakeholders of a major issue at the Microsoft data centre that is affecting the AMS service	Imran Beg
17:00	14-Sep-20	Major Incident Meeting	MIT
00:30	15-Sep-20	Microsoft reports that the AMS service is restored; Imran Beg confirms	Imran Beg
00:56	15-Sep-20	IT Announcement sent to the user base confirming that AMS is back online	Cristian Alfaro

The following sections DO NOT need to be completed during a Major Incident. They are post-incident activities that the IT Service Delivery Manager will conduct with the relevant stakeholders.

5 - Problem Management: Ticket AMSIII-1944

(Refer to the ADI Problem Management Module for help.)

This part of the MI report ensures that the underlying causes of this incident are known and that preventive actions have been taken to avoid a recurrence.

5.1. Define the Issue (What do we need to resolve?)

5.1.1. Write the problem description below describing the gap between actual and desired state.


On 14 September 2020, at around 15:30, all AMS users across the organization were unable to access the system. This prevented the Award Management functions in SCI and Members from conducting their Award-related operations on AMS, and affected the functions' ability to capture information, process approvals, generate reports, and provide spend authorization to FMS.

5.2. Measure (What is current situation and business pain?)

5.2.1. Report showing measurement of current situation (Insert current situation report here in any format)

Microsoft restored services in the region at around 00:30 on 15 September. Initial performance issues were identified, but these were resolved within 12 hours. AMS is now fully up and running with no further issues.

5.3. Analyze the Causes (What caused it and how do we know it?)

Analyze the incident to find root causes using the ADI-recommended tools: 5 Whys or Ishikawa.

The following are the key categories/dimensions to use for analysis; consider more if required.

People and Organization
Information and Technology
Processes and Value Streams
Vendors and Suppliers

5.3.1. Record the identified root causes from Ishikawa or 5 Whys and their possible solutions

No	Root Cause	Category/Dimension	Possible Solution
1	Azure UK South experienced Virtual Machine, Storage, SQL Database, and Networking failures. This impacted AMS because all of the AMS resources are deployed in the UK South region only. AMS was mainly impacted by the SQL Database, Storage, and Networking failures.		Solution 1: Initiate the Disaster Recovery Solution and bring the AMS system up in the UK West region. Solution 2: Wait for Microsoft to restore the UK South region.
2

5.4. Improve the Situation (How can it be fixed?)

Assess each solution's value and associated risks to select the most suitable option to implement.

5.4.1. Assess each solution's value and effort. (Green is the better choice.)

Solutions Value and Effort Assessment

No	Possible Solution (From step 3.1)	Low Value/High Effort (No Go Area)
