Major Incident - 14/09/2020 - AMS Login Issue

 

Make a copy of this template and use it for a new Major Incident.

For an example of a completed template use the following URL: https://savethechildrenintl-sandbox.atlassian.net/wiki/pages/viewpage.action?pageId=20022393

 

NOTE: All times are in UK time (UTC)

 


1 - Incident Report Summary

The incident report summary is to be filled in by the Major Incident Manager in collaboration with the Problem Manager and technical team. 

The incident report summary provides high-level information for sharing with executives.

 

Major Incident Report Summary

 

Major Incident Title

AMS Login Issue

System(s) Impacted

AMS

JIRA Ticket Number

AMSIII-1944

Region(s) Impacted

All

Time/Date of First Report

15:37, 14 September 2020

Total Reported Outage Time

8 hr 53 min

Time/Date of Incident Closure

00:30, 15 September 2020

Nature of Problem

The AMS login page repeatedly displayed the message "An application error occurred (503)". This prevented users from logging in to the AMS system.

Impact

Global

Resolution details including: permanent fix/workaround 

(Refer to the resolution in the Technical Summary)

Root Cause 

(Refer to the root cause in the Technical Summary)

Post Incident Recommendations 

(Refer to the post incident recommendations in the Technical Summary)
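The total reported outage time in the summary above can be checked from the first-report and closure timestamps. The following is a minimal sketch, assuming both timestamps are in the same time zone as stated in this report; the variable names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the summary table (same time zone assumed for both).
first_report = datetime(2020, 9, 14, 15, 37)
closure = datetime(2020, 9, 15, 0, 30)

# Subtracting two datetimes gives a timedelta covering the outage window.
outage = closure - first_report
hours, remainder = divmod(int(outage.total_seconds()), 3600)
minutes = remainder // 60

outage_text = f"{hours} hr {minutes} min"
print(outage_text)  # 8 hr 53 min
```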

 

 


2 - Technical Summary

The incident technical summary is to be filled in by the Problem Manager and technical team.

It contains technical information and terminology that the technical team can use when troubleshooting similar incidents in future.

 

Major Incident Technical Summary

 

Resolution Details (including steps, screenshots, permanent fix or workaround)

Microsoft restored its resources in phases; the SQL database was the last to be restored, at around 00:30 on 15 September 2020. AMS was initially slow, with delayed response times, and returned to normal by early morning on 15 September.

Root cause

Azure UK South experienced Virtual Machine, Storage, SQL Database, and Networking failures. This impacted AMS because all AMS resources are deployed in the UK South region only.

AMS was mainly impacted by the SQL Database, Storage, and Networking failures.

Post Incident Recommendations

Investigate options for AMS alerts that notify the support team and key people when a server goes down.
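As one illustration of what such an alert could look like, the sketch below probes a login URL and raises an alert only after several consecutive failed probes, so that a single transient error does not page the team. This is a sketch under stated assumptions, not the AMS team's chosen solution: the function names, the threshold of three probes, and any URL passed to `probe` are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Sequence


def probe(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered with an error status, e.g. the 503 seen in this incident.
        return err.code
    except OSError:
        # Covers URLError, timeouts, and connection failures (no HTTP response).
        return 0


def is_healthy(status: int) -> bool:
    """Treat any 2xx or 3xx status as healthy; 0 and 4xx/5xx count as failures."""
    return 200 <= status < 400


def should_alert(recent_statuses: Sequence[int], threshold: int = 3) -> bool:
    """Alert once the last `threshold` probes have all failed (threshold is an assumption)."""
    if len(recent_statuses) < threshold:
        return False
    return not any(is_healthy(s) for s in recent_statuses[-threshold:])


if __name__ == "__main__":
    # Example history: one good probe followed by three 503 responses.
    history = [200, 503, 503, 503]
    print(should_alert(history))  # True
```

In practice the status history would come from calling `probe` on a schedule, and the alert would be routed to the support team through whatever notification channel is agreed.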

Additional comments (if required)

 

 

 


 

3 - Incident Management

The Incident Management section details how people were first notified about this incident, who the issue was escalated to, the team that was mobilised, and the communications that were sent out.

 

Major Incident Management

 

Notification 

First notified via: JIRA

JIRA Ticket Number(s)

AMSIII-1944

Escalation/Mobilisation:

A Major Incident Team was formed, consisting of the following people:

Major Incident Manager: Jay Padharia

Problem Manager: Imran Beg

Major Incident Team (MIT):

  • Sibani Gale

  • Imran Beg

  • Jay Padharia

  • Khady Ndiaye

  • Mahmood Khan

  • Gerald Waterfield

  • Jitin Mangal

  • Cristian Alfaro

 

 

Communications (descending order):

Copy and paste all comms associated with this major incident:

4 - Incident Timeline:

The incident timeline should capture all actions and events that took place until the incident closure.

Everyone involved with the incident should contribute towards the timeline. 

 

| Time (HH:MM) | Date (DD-MON-YY) | Observation/Action | Action taken by |
|---|---|---|---|
| 15:37 | 14-Sep-20 | First report of AMS down, received via JIRA (AMSIII-1944) | User |
| 16:22 | 14-Sep-20 | Sibani Gale sent an email to IT stakeholders notifying them that AMS was down | Sibani Gale |
| 16:33 | 14-Sep-20 | Jay Padharia sent an IT Announcement to the user base notifying them that AMS was down | Jay Padharia |
| 16:35 | 14-Sep-20 | Imran Beg notified relevant stakeholders of a major issue with the Microsoft Data Centre that was affecting the AMS service | Imran Beg |
| 17:00 | 14-Sep-20 | Major Incident Meeting | MIT |
| 00:30 | 15-Sep-20 | Microsoft reported that the AMS service was restored; Imran Beg confirmed | Imran Beg |
| 00:56 | 15-Sep-20 | IT Announcement sent to the user base confirming that AMS was back online | Cristian Alfaro |

The following sections DO NOT need to be completed during a Major Incident. They are post-incident activities that the IT Service Delivery Manager will conduct with the relevant stakeholders.

5 - Problem Management: Ticket AMSIII-1944

(Refer to the ADI Problem Management Module for help)

This part of the MI report ensures that the underlying reasons for this incident are known and that preventive actions have been taken to avoid a recurrence.

5.1. Define the Issue (What do we need to resolve?)

5.1.1.   Write the problem description below describing the gap between actual and desired state.  

On 14 September 2020 at around 15:30, all AMS users across the organization were unable to access the system. This prevented the Award Management functions in SCI and Members from conducting their award-related operations on AMS, impacting the functions’ ability to capture information, process approvals, generate reports, and provide spend authorization to FMS.

5.2. Measure (What is current situation and business pain?)

5.2.1.   Report showing measurement of current situation (Insert current situation report here in any format)

Microsoft restored services in the region at around 00:30 on 15 September 2020. Initial performance issues were identified, but these were resolved within 12 hours.

AMS is fully up and running with no further issues.

5.3. Analyze the Causes (What caused it and how do we know it?)

Analyze the incident to find root causes using the ADI-recommended tools: 5 Whys or Ishikawa.

The following are the key categories/dimensions to use for analysis; consider more if required.

  1. People and Organization

  2. Information and Technology 

  3. Processes and Value Streams

  4. Vendors and Suppliers

5.3.1. Record the identified root causes from Ishikawa or 5 Whys and their possible solutions 

No

Root Cause

Category/Dimension

Possible Solution

1

Azure UK South experienced Virtual Machine, Storage, SQL Database, and Networking failures. This impacted AMS because all AMS resources are deployed in the UK South region only.

AMS was mainly impacted by the SQL Database, Storage, and Networking failures.

 

Solution 1: Initiate the Disaster Recovery solution and bring the AMS system up in the UK West region

Solution 2: Wait for Microsoft to restore the UK South region

2

 

 

 

5.4. Improve the Situation (How can it be fixed?)

Assess each solution's value and associated risks to select the most suitable option to implement.

5.4.1.   Assess each solution's value and effort. (Green is the better choice)

 

 

Solutions Value and Effort Assessment

No

Possible Solution

(From step 5.3.1)

Low Value/High Effort 

(No Go Area)