DESIGN NAME: IBM Cloud Event Management
PRIMARY FUNCTION: Software Application
INSPIRATION: IT Operations teams are often faced with high expectations to resolve and reduce product service outages, while working with teams and tools that are not cohesive and are out of sync with each other. Ops teams feel pressured and spend a lot of time and energy trying to do their job with outdated tools and team structures. We aimed to create a software service to help IT Ops teams prioritize their work and connect them with the right people to resolve product outages.
UNIQUE PROPERTIES / PROJECT DESCRIPTION: IBM Cloud Event Management (CEM) is a holistic app that changes the way IT Ops teams work in identifying, diagnosing and resolving service outages. CEM features an intuitive workflow that brings tasks normally spread across different systems and users together into one compelling experience. Data from thousands of sources are correlated, prioritized, and routed to the right people, with built-in procedures to fix problems quickly, so that teams are able to collaborate and stay in sync.
OPERATION / FLOW / INTERACTION: IBM CEM allows users to automatically ingest thousands of events from disparate sources and consolidate them into prioritized incidents that tell them what's really causing faults in their environment. Users can resolve simple outages automatically, freeing up first responders to concentrate on faults that need their intervention. CEM relieves pressure on teams by automatically routing incidents to the right people and bringing together development and operations teams to resolve complex faults fast.
PROJECT DURATION AND LOCATION: The design team started the journey in January 2016 in London. It went experimental in March 2017, Beta in July 2017 and finally GA in October 2017. It was exhibited at Think in Las Vegas in February 2017.
PRODUCTION / REALIZATION TECHNOLOGY: -
SPECIFICATIONS / TECHNICAL PROPERTIES: -
TAGS: Cloud, Event, management, correlation, runbook, incident, notification, policies, event source, fault, alarm, alert, outage, DevOps, Ops, IT Ops, on-prem, Cloud native
RESEARCH ABSTRACT: Taking a qualitative UX research approach, our objective was to investigate how to improve IT engineers’ ability to manage large amounts of unstructured network data efficiently. We conducted interviews, surveys, site visits and usability testing, taking video recordings and using online collaboration tools such as Mural, Box and GitHub to share findings. Our research participants were sponsored users, internal SMEs and users recruited online. We learned that users needed to trust the system to correlate data, so we designed a way to generate sample data so that they could try it before full setup.
CHALLENGE: We researched, designed, and developed Cloud Event Management in a series of iterative loops. The challenge was to start with working hypotheses, act on them, and, without fear of failure, validate them and iterate again based on the feedback. Our research challenge lay in developing our hypothesis about who Cloud Event Management is for. Our initial designs were for an IT Ops operator who had few IT skills, little understanding of what she or he worked on beyond following protocols and documented procedures, and little autonomy. One of the consequences, which informed our hypothesis, was that operators don’t stay in the job for long, so there is high turnover. When we tested this hypothesis, we discovered that some operators had over 20 years’ experience on the job, had developed extensive skills and were highly autonomous. We therefore had to modify what had previously been a very tight workflow of operators triaging and attempting to resolve incidents and quickly handing over unresolved problems to more skilled engineers. The changes allowed an operator to dive into the details of an incident and exposed more complexity for isolation and diagnosis. The resulting workflows had a high-level outline of each incident, with access to documented or automated procedures that are likely to resolve the problem. Beyond that, experienced and skilled operators could access the raw data received from the monitored apps and services that generated the incident, so that they could apply their in-depth knowledge to diagnosing the underlying fault.
ADDED DATE: 2018-02-23 19:03:13
TEAM MEMBERS (9) : Kieran Geoghegan, Daniel Brown, Britta Binning, Fabienne Schoeman, Shaun Lynch, Dave Clark, Angela Boodoo, Myriam Battelli and Randa Salem
IMAGE CREDITS: IBM Design
PATENTS/COPYRIGHTS: IBM, 2017