Ready or Not, AI Is Coming to IT Operations

By Akash Bhatia, Usama Gill, and Remco Mol

Artificial intelligence is shaking up business as usual, enabling corporate functions from marketing and sales to finance to deliver predictive insights at extraordinary speed and scale. Given the possibilities, it’s little wonder that adoption of AI across the enterprise has attracted enormous interest and investment.

But like the cobbler who resoles everyone else’s shoes before fixing his own, IT organizations have been slower to employ machine learning (ML) technologies within their own functions. That may soon change. In the past few years, a number of vendors have begun designing and developing powerful analytics tools to address the particular challenges that IT personnel face in managing, updating, and running IT hardware and software across the enterprise.

The market for AI in IT operations (or AIOps, as it is known) is new, but its potential is huge. A BCG survey of 112 CIOs across multiple industries found that these technologies could significantly improve the cost-effectiveness and performance of IT operations, allowing organizations to drive innovation more quickly without sacrificing security, stability, or service.

It may be too early for AIOps to fully replace manual IT operations activities, but that should not lull CIOs and IT directors into thinking that they can sit back and wait. While observers and analysts may spill ink over whether this emerging technology is worthy of the hype, leading IT organizations have already resolved that question and are moving decisively down the path toward implementing AIOps tools and developing the capabilities to use them. The question isn’t whether AIOps will revolutionize IT operations; it’s how long the rest of the field will take to recognize that the shift is already underway.

Challenges Facing IT Operations

IT operations can feel like a thankless job. Teams may have the knowledge and experience to be a strategic partner to the business, but demonstrating value can be a Sisyphean task. The daily grind of chasing down alerts and patching problems can lock IT personnel into a cycle in which they are continually playing catch-up instead of preventing problems from arising. Increasing use of the cloud can alleviate some of these issues, but it doesn’t make the operational complexity go away: someone still has to manage those cloud services and organizational interconnections. With workloads rapidly growing and with no consistent, effective way to prioritize activities, IT operations are constantly on the back foot, perpetuating the stereotype that the function is reactive and slow to move, the very perception that IT functions have tried so hard to shake.

IT operations must deal with a number of key challenges:

  • Escalating Service Expectations, with Little Margin for Error. As end users’ expectations continue to grow, service-level agreement (SLA) requirements have become more stringent. At the same time, organizations increasingly expect IT operations to deliver both near-perfect service availability and a shorter mean time to recovery when incidents do occur.
  • A Dizzying Number of Services, Released at Faster Rates. Although modular architectures are a boon to innovation, they’ve contributed to the creation of hundreds of new APIs and microservices that IT operations must monitor and maintain. Complex interdependencies among these services make finding the root cause of outages or other issues exponentially more difficult. In addition, the release cycle has accelerated as agile development practices and iterative launches become mainstream.
  • Torrents of Data and Alerts Without an Easy, Reliable Way to Filter Them. The manual and rules-based monitoring systems that most IT operations now have in place can’t cope with the demands of today’s complex and dynamic environments. Chasing down thousands of alerts—many of which turn out to be false positives—often leads to “alert fatigue,” which may result in more fire drills and actual emergencies down the road.
  • A Fragmented and Increasingly Borderless IT Operations Landscape. With the introduction of DevOps, more operations activities are now being managed by feature teams, whose members lack the specialized operations experience needed to address the nonautomated portion of those activities. And because those activities are less centralized, they’re harder for IT functions to coordinate. Likewise, as companies have expanded their partner ecosystems, IT operations have had to extend their monitoring activities across company boundaries and look for new ways to measure—and sometimes charge for—IT service consumption.

These demands are growing at a time when IT operations budgets are under increasing pressure. [1] The only way for IT leaders to deliver the stability and cost-effectiveness that their budgets demand is to make their operations more predictive, proactive, and automated.

Notes: [1] Gartner’s IT Key Metrics 2019 report indicates that, on average, data centers, service desks, and voice and data networks commanded 35% of the total IT budget in 2018, down from 46% in 2014.

IT Operations Could Soon Get a Lot Smarter

AIOps is a new field, and a confusing one. Only in 2017 did Google begin receiving a significant volume of search requests for “AIOps,” and most vendors are still refining their solution set. For clarity, we define AIOps as comprising all solutions that use big data, AI, and ML to enhance and automate IT operations and monitoring. [2] (See Exhibit 1.)

Notes: [2] By BCG’s definition, AIOps includes all artificial intelligence (AI) and machine learning (ML) in application performance monitoring (APM), IT infrastructure monitoring (ITIM), network performance monitoring and diagnostics (NPMD), and IT event correlation and analysis (ITEC&A). Our definition excludes AI and ML in areas such as experience management operations, cybersecurity operations, and delivery automation.

Within the IT operations and monitoring space, AIOps is most suitable for application performance monitoring (APM), IT infrastructure monitoring (ITIM), network performance monitoring and diagnostics (NPMD), and IT event correlation and analysis (ITEC&A), where it can help automate routine manual operations activities. That automation potential has prompted a surge in development. The market for core AIOps is projected to grow from $9.4 billion in 2017 to $13.8 billion in 2021, a compound annual growth rate of 10%. [3] AIOps orchestrators (platforms built to orchestrate insight and actions on the basis of log data from various monitoring solutions) are expected to grow by 26% over the same period. (See Exhibit 2.)

Notes: [3] Areas encompassed by core AIOps include APM, ITIM, NPMD, and ITEC&A.
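The growth rate quoted for core AIOps can be checked with the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The short sketch below applies it to the two market-size figures in the text; the function name `cagr` is ours and purely illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Market-size figures from the text: $9.4 billion (2017) to $13.8 billion (2021).
core_aiops_cagr = cagr(9.4, 13.8, 2021 - 2017)
print(f"Core AIOps CAGR, 2017-2021: {core_aiops_cagr:.1%}")  # roughly 10%
```

Four compounding periods separate 2017 and 2021, which is why the exponent is 1/4 rather than 1/5.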

As the underlying technologies become more established, AI-enabled tools and platforms will be able to automate core monitoring and management activities at scale, relieving many of the most pressing IT operations challenges quickly, reliably, and consistently.

We believe that AIOps will help transform IT operations in three critical ways:

  • Provide end-to-end visibility. Modern ML, powered by fine-grained IT operations and performance data, will give IT personnel and teams the robust monitoring capabilities they need to observe the behavior and performance of applications and IT infrastructure across both the enterprise and the broader ecosystem, allowing them to preemptively identify risks and impending issues.
  • Generate evidence-backed insights and recommendations. As AIOps algorithms become more refined, IT personnel will be able to analyze historical operations data, eliminate noise, and identify root cause issues more consistently and effectively. Those insights, in turn, will enable them to make more-accurate predictions that lead to speedier problem resolution as well as to anticipate and gauge the probable business impact of such issues before they occur.
  • Execute recommendations automatically. In time, the algorithms powering AIOps platforms may advance to the level of automatically altering the configuration of IT services and environments. For example, sophisticated AIOps tools might automatically determine whether to scale new infrastructure or containers, allocate more or less virtual capacity to a project, or spin the number of application servers up or down—and then take agreed-upon action.
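To make the idea of automated execution within agreed-upon limits more concrete, here is a minimal sketch of a proportional scaling rule. It is not drawn from any particular AIOps product: the function name, the 60% utilization target, and the minimum and maximum server counts are all hypothetical values chosen for illustration.

```python
def desired_server_count(
    current_servers: int,
    observed_utilization_pct: int,
    target_utilization_pct: int = 60,  # hypothetical target, not from the article
    min_servers: int = 2,              # hypothetical guardrail
    max_servers: int = 20,             # hypothetical guardrail
) -> int:
    """Suggest how many application servers to run.

    Scales the current count in proportion to observed vs. target
    utilization, rounds up, and clamps the result to agreed limits.
    """
    if current_servers < 1:
        raise ValueError("current_servers must be at least 1")
    if not 0 < target_utilization_pct <= 100:
        raise ValueError("target_utilization_pct must be in (0, 100]")
    if observed_utilization_pct < 0:
        raise ValueError("observed_utilization_pct must not be negative")

    load = current_servers * observed_utilization_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    proposed = -(-load // target_utilization_pct)
    return max(min_servers, min(max_servers, proposed))


scale_up = desired_server_count(4, 90)     # busy servers: add capacity
scale_down = desired_server_count(10, 30)  # idle servers: release capacity
print(scale_up, scale_down)
```

The clamp to `min_servers` and `max_servers` stands in for the guardrails an organization would agree on before letting a tool change configurations by itself.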

With sufficient refinement, AIOps may someday be able to automate a significant portion of all IT operations and monitoring activities.

Double-Digit Growth Will Ride the Maturity Curve

So far, only the most progressive IT organizations have implemented AIOps solutions, but adoption is likely to grow significantly over the next three to five years. More than 40% of CIOs say that they plan to begin using AIOps solutions by the end of 2021, up from about 15% who indicated they were using AIOps as of January 2019, when the survey was conducted. Other CIOs are watching the space but intend to wait for the tools to mature further.

As with most other emerging technologies, solution development will occur in stages. Use cases that focus on providing visibility and insight are likely to mature first, with more sophisticated execution capabilities added over time.

User adoption will follow the technology maturity curve. (See Exhibit 3.) Of the roughly nine core use cases that the AIOps vendor community is actively building, the ones that have attracted the greatest interest among early adopters focus on pattern recognition and data consolidation. By contrast, few CIOs have indicated that they are ready to embrace truly bleeding-edge use cases, such as ones that rely on machine-generated recommendations to fully automate issue remediation.

In our view, three use cases hold the greatest short-term potential:

  • Anomaly Detection. Sophisticated pattern detection tools could help IT operations teams detect unusual behavior—such as a sudden spike in application use—in IT infrastructure, automatically and at scale. A combination of robust data-processing engines and ML could regulate expected CPU utilization thresholds on the basis of historical patterns and workloads, reducing the need for manual observation and hard-coded rules that require teams to define anomalies up front. Our survey indicates that only about 11% of CIOs currently use anomaly detection AIOps tools, but that figure should grow to 42% by 2021.
  • Noise Reduction. ML algorithms built into AIOps platforms could prioritize alerts on the basis of their business impact and filter out false positives, freeing IT operations teams to spend their time addressing critical alerts instead of managing static filters, writing rules, and adjusting thresholds to reduce alert noise. About 9% of CIOs now use noise reduction tools in AIOps, but our survey data shows that the percentage could rise to 42% by 2021.
  • Triaging and Alert Correlation. AIOps could automatically associate alerts that cut across various IT services into a single incident to speed up triage. By analyzing the topology, time, and context of various alerts, the AIOps solution could automatically determine whether different alerts are related, and then cluster the results into a single, unified incident. For example, a monitoring tool might create multiple memory and page-fault alerts from hosts of the same SQL cluster. The ML algorithm in the AIOps tool, when properly trained through supervised learning, could correlate alerts into a single incident, allowing the IT operations team to distinguish between alerts belonging to that incident and similar but unrelated alerts. A sharing rule built into the solution would then automatically create a service desk ticket. Approximately 10% of the CIOs in our survey say that they already use some sort of AIOps-enabled triaging solution today, and 40% say that they’re open to using this type of solution within the next three years.
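Two of the use cases above lend themselves to a small illustration. The sketch below is a simplified stand-in, not how any vendor's tool works: anomaly detection is approximated with a threshold derived from a trailing window of history (rather than a hard-coded limit), and alert correlation is approximated with a plain rule that groups alerts from the same cluster arriving close together in time, where a real AIOps tool would use a trained ML model. All names, window sizes, and sample data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(samples: Sequence[float], window: int = 12, k: float = 3.0) -> List[int]:
    """Return indices of samples that deviate from the trailing-window mean
    by more than k standard deviations of that window."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        # Floor the spread at 1.0 so a perfectly flat history
        # does not flag every tiny wiggle.
        spread = max(stdev(history), 1.0)
        if abs(samples[i] - baseline) > k * spread:
            anomalies.append(i)
    return anomalies


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    host: str
    cluster: str      # topology hint: which cluster the host belongs to
    kind: str


def correlate(alerts: Sequence[Alert], max_gap_seconds: float = 300) -> List[List[Alert]]:
    """Group alerts into incidents: same cluster, and each alert within
    max_gap_seconds of the previous alert in that incident."""
    incidents: List[List[Alert]] = []
    latest_by_cluster = {}  # cluster name -> most recent incident for it
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_by_cluster.get(alert.cluster)
        if current and alert.timestamp - current[-1].timestamp <= max_gap_seconds:
            current.append(alert)
        else:
            current = [alert]
            incidents.append(current)
            latest_by_cluster[alert.cluster] = current
    return incidents


# Hypothetical CPU utilization readings (%), with one sudden spike.
cpu = [40, 42, 41, 39, 40, 43, 41, 40, 42, 41, 39, 40, 41, 95, 42]
spikes = find_anomalies(cpu)

# Hypothetical alerts: three from one SQL cluster, one unrelated, one much later.
alerts = [
    Alert(0, "db1", "sql-a", "memory"),
    Alert(60, "db2", "sql-a", "page-fault"),
    Alert(90, "web1", "web", "cpu"),
    Alert(120, "db3", "sql-a", "memory"),
    Alert(4000, "db1", "sql-a", "memory"),
]
incidents = correlate(alerts)
print(spikes, [len(group) for group in incidents])
```

In this toy data the three SQL cluster alerts collapse into one incident, while the web alert and the much later memory alert stay separate. A fixed time gap is the crudest possible correlation rule; the supervised approach described above would learn from labeled incidents which alerts belong together.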

In the medium term, several other use cases hold strong promise. These include root cause analysis tools, which are designed to help IT personnel understand the issues triggering different incidents and alerts and allow them to prioritize remediation; incident prediction tools, which can identify causal patterns between IT variables; and automated issue remediation, which has the potential to automatically address routine problems and events—for instance, by adjusting IT configurations or workflows.

Consolidation Could Help Bring Needed Clarity

Although few vendors have developed a full suite of enterprise-grade AIOps tools as yet, several categories of vendors are staking claims in the burgeoning AIOps market. They include incumbent monitoring players, entrants from adjacent tool markets, emerging AI/ML challengers, and hyperscalers. (See Exhibit 4.)

That crowded playing field is good for innovation. But it also creates more work for CIOs and other potential customers in the meantime, since they need to sift through multiple tools and options to develop a working knowledge of vendor capabilities. Over the next several years, though, we’re likely to see considerable vendor consolidation, which should make selection and management less of a chore. That consolidation will be due to three factors:

  • Data Advantage. Vendors that amass significant volumes of high-quality, relevant IT operations data will achieve superior AIOps accuracy, creating a virtuous circle that attracts more clients, more data, and better solutions.
  • Horizontal Expansion. AIOps players that dominate a specific use case can use their existing client base to cross-sell additional use cases.
  • Vertical Consolidation. Tools integrated across different layers of the infrastructure stack can deliver superior insights, making them more valuable than niche tools that focus on individual stack layers.

Do You Have a Plan?

AIOps is still in its early stages, but it won’t stay that way for long. It seems likely that, five years from now, all top-performing IT organizations will have adopted some form of AIOps—with third-party tools or with hyperscalers’ natively provided tools—achieving advantages in cost, service, and stability over slower-moving peers.

CIOs have less time to prepare for AIOps than they may think. The tasks of selecting the right vendor and identifying the right use cases are challenging enough; but beyond that, getting ready for AIOps internally can be a heavy lift. AIOps requires structured access to significant amounts of historical and real-time IT operations data, as well as dedicated resources to enable data access and configure the AIOps tools. In addition, successfully automating the resulting insights and actions will demand significant process reengineering to connect the dots end-to-end and ensure that the right operational handoffs are in place.

Organizations that want to capture maximum upside should start by actively sponsoring a cross-functional team, pinpointing one or two high-value use cases, and assembling the right mix of data engineering, data science, and development talent to test and refine the appropriate tools. Anticipating risk, legal concerns, compliance issues, and other considerations and securing ongoing engagement from these functions are also crucial, not only to reduce potential business exposure, but also to help build confidence and trust among non-IT-based functions in the quality and efficacy of AI-based risk modeling and detection.

Training and change management are also essential. AIOps is a very different way of working, and it will require organizations to retrain some IT operations teams and redeploy others. Teams must learn to work with AIOps tools through relevant, labeled examples, operationalize them, and understand their output. And CIOs must ensure that the development effort has a sufficient budget and firm leadership commitment, with funding and reviews based on achieving predefined business outcomes and not simply development milestones.

While AIOps may look like a “watch this space” opportunity, CIOs and IT leaders who sit back now risk being unable to capture the full benefit of this technological revolution later, when full-scale implementation goes mainstream.
