Staffing levels within IT operations (ITOps) departments are flat or declining, enterprise IT environments are more complex by the day, and the transition to the cloud is accelerating. Meanwhile, the volume of data generated by monitoring and alerting systems is skyrocketing, and Ops teams are under pressure to respond faster to incidents.
Faced with these challenges, companies are increasingly turning to AIOps — using machine learning and artificial intelligence to analyze large volumes of IT operations data — to help automate and optimize IT operations. Yet before investing in new technology, leaders want confidence that it will indeed bring value to end-users, customers, and the business at large.
Leaders looking to measure the benefits of AIOps and build KPIs for both IT and business audiences should focus on a few critical factors: uptime, incident response and remediation time, and predictive maintenance that prevents potential outages before they affect employees and customers.
Business KPIs connected to AIOps include employee productivity, customer satisfaction, and website metrics such as conversion rate or lead generation. Bottom line, AIOps can help companies cut IT operations costs through automation and rapid analysis, and it supports revenue growth by enabling business processes to run smoothly and with excellent user experiences.
Specific benefits of AIOps
AIOps can digest operational data and spit out actionable recommendations for keeping critical systems running at peak efficiency. These are the most often cited benefits of the technology:
- Alert management: In most cases, the first problem that IT groups will address with AIOps systems is reducing noise volume: the torrent of alerts that inundate IT operations groups. AIOps uses clustering and pattern matching algorithms to eliminate as much as 90% of false alarms and other types of redundant or irrelevant alerts, making it far easier for staffers to focus on what matters.
- Incident prioritization and routing: AIOps systems can then learn over time which types of alerts should be sent to which teams, reducing redundancy and confusion when, say, networking and database teams both get the same alert related to an incident.
- Event correlation: AIOps can correlate alerts and event data to identify the root cause of an outage or application slowdown so that IT teams can respond faster.
- Advanced anomaly detection: AIOps systems can detect abnormal conditions and proactively relate them to business impact. For example, they can project whether a system will run out of disk space based on growth trends or seasonal patterns, even if the growth is non-linear. Or, if there’s a sudden increase in the number of failed server requests, they can tell whether the server in question is handling a mission-critical task or merely performing routine backups.
- Automation: AIOps can be used to handle routine tasks such as backups, server restarts, and other low-risk maintenance activities that take heavy manual effort.
- Predictive analytics: A more advanced use of AI in IT is predicting events before they happen, such as detecting when network bandwidth is reaching its limit or storage capacity is nearing the threshold.
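To make the alert management benefit concrete, here is a minimal Python sketch of noise reduction. It is not how any particular AIOps product works: real systems use clustering and pattern matching, while this example assumes a much simpler rule, folding alerts that share a host and check name and arrive within a fixed time window into one group. The `Alert` fields, the `group_alerts` and `noise_reduction` helpers, and the 300-second window are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point (assumed unit)
    host: str
    check: str
    message: str


def group_alerts(alerts, window_seconds=300):
    """Fold alerts with the same (host, check) into one group when each
    arrives within window_seconds of the previous alert in that group."""
    groups = []
    latest_group = {}  # (host, check) -> index of the most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        idx = latest_group.get(key)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)
        else:
            groups.append([alert])
            latest_group[key] = len(groups) - 1
    return groups


def noise_reduction(alerts, groups):
    """Fraction of raw alerts that were absorbed into an existing group."""
    if not alerts:
        return 0.0
    return 1 - len(groups) / len(alerts)


# Illustrative data only, not taken from a real system.
raw = [
    Alert(0, "db1", "disk", "disk usage above 90%"),
    Alert(60, "db1", "disk", "disk usage above 90%"),
    Alert(90, "web1", "cpu", "cpu load high"),
    Alert(120, "db1", "disk", "disk usage above 91%"),
    Alert(1000, "db1", "disk", "disk usage above 90%"),
]
grouped = group_alerts(raw)
print(f"{len(raw)} alerts -> {len(grouped)} groups "
      f"({noise_reduction(raw, grouped):.0%} fewer items to triage)")
```

Grouping on an exact key is the crudest possible form of correlation. It is used here only because it makes the noise-reduction figure easy to see and to measure.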
Seven KPIs for AIOps
These common KPIs can measure the impact of AIOps on business processes:
- Mean time to detect (MTTD): This KPI refers to how long it takes for an issue to be identified. AIOps can help companies drive down MTTD by using machine learning to detect patterns, block out the noise, and identify outages. Amid an avalanche of alerts, ITOps can understand the importance and scope of an issue, which leads to faster identification of an incident, reduced downtime, and better performance of business processes.
- Mean time to acknowledge (MTTA): Once an issue has been detected, IT teams need to acknowledge it and determine who will address it. AIOps can use machine learning to automate that decision-making process and quickly ensure that the right teams are working on the problem.
- Mean time to restore/resolve (MTTR): When a critical business process or application goes down, speedy restoration of service is vital. AIOps plays an essential role here by using machine learning to determine whether the issue has been seen previously and, based on past experience, to recommend the most effective way to get the service back up and running.
- Service availability: Service availability is often expressed as a percentage of uptime or as outage minutes over a period. AIOps can help boost it through the application of predictive maintenance.
- Percentage of automated versus manual resolution: Increasingly, organizations leverage intelligent automation to resolve issues without manual intervention. Machine learning techniques can be trained to identify patterns, such as previous scripts that had been executed to remedy a problem, and take the place of a human operator.
- User-reported versus monitoring-detected incidents: IT operations should detect and remediate a problem before the end-user is even aware of it. For example, if application or website performance is slowing down by milliseconds, ITOps wants to get an alert and fix the issue before the slowdown worsens and affects users. AIOps enables the use of dynamic thresholds so that alerts are generated automatically and either routed to the correct team for investigation or auto-remediated when policies dictate.
- Time savings and associated cost savings: The use of AIOps, whether to perform automation or more quickly identify and resolve issues, will result in savings both in operator time and business time to value. These have a direct impact on the bottom line.
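The first six KPIs above reduce to simple arithmetic over incident records, and the Python sketch below shows one way to compute them. The `Incident` fields and the `compute_kpis` helper are hypothetical, and several definitions are assumptions, since teams define these metrics differently: MTTR is measured here from fault start to resolution, and availability assumes incidents do not overlap and that every incident is a full outage.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    started: float       # minute the fault began (assumed unit: minutes)
    detected: float      # minute the issue was identified
    acknowledged: float  # minute a team took ownership
    resolved: float      # minute service was restored
    detected_by: str     # "monitoring" or "user"
    auto_resolved: bool  # True if fixed without manual intervention


def compute_kpis(incidents, period_minutes):
    """Return KPI values for a reporting period of period_minutes."""
    n = len(incidents)
    if n == 0:
        raise ValueError("need at least one incident to compute means")
    downtime = sum(i.resolved - i.started for i in incidents)
    return {
        "mttd_min": mean(i.detected - i.started for i in incidents),
        "mtta_min": mean(i.acknowledged - i.detected for i in incidents),
        "mttr_min": mean(i.resolved - i.started for i in incidents),
        "availability_pct": 100 * (1 - downtime / period_minutes),
        "automated_pct": 100 * sum(i.auto_resolved for i in incidents) / n,
        "monitoring_detected_pct":
            100 * sum(i.detected_by == "monitoring" for i in incidents) / n,
    }


# Illustrative records only, not real measurements.
incidents = [
    Incident(0.0, 5.0, 8.0, 35.0, "monitoring", True),
    Incident(100.0, 115.0, 125.0, 165.0, "user", False),
]
for name, value in compute_kpis(incidents, period_minutes=10_000).items():
    print(f"{name}: {value:.2f}")
```

Comparing these values for periods before and after an AIOps rollout gives the operator-time figures that the seventh KPI, time and cost savings, depends on.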
These seven KPIs can be correlated to business KPIs around user experience, application performance, customer satisfaction, improved e-commerce sales, employee productivity, and increased revenue. ITOps teams need to quickly connect the dots between infrastructure and business metrics so that IT is prioritizing spend and effort on real business needs. Hopefully, as machine learning matures, AIOps tools can recommend ways to improve business outcomes or provide insights into why digital programs succeed or miss the mark.
Ciaran Byrne is vice president of product management at OpsRamp.
The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.