Scaling AI: 5 Reasons why it is difficult


This post was originally published by Was Rahman at Towards Data Science

How to ensure your AI solutions can deal with real data, real customers & real business circumstances.

Scaling AI solutions to deal with real data, business users and customers is fraught with risks and difficulties. Even experienced AI-savvy organisations have fallen foul of growing AI solutions to production size.

This article highlights 5 areas where scaling AI for production can be problematic. If you want to introduce AI into your business, it’s critical to get this right. Understanding why it’s tricky is the first step towards that.

Scaling AI from development or test into the real business is not for the faint-hearted. Real data, business users and customers can be more demanding and overwhelming than their test counterparts. And even small mistakes can quickly escalate into substantial damage.

AI systems create risks beyond those of traditional IT, primarily because of the massive step-up in the scale of data and processing involved. There is also less experience to draw on, both within individual organisations and across the industry in general.

As with many other AI risks and pitfalls, avoiding problems around scaling AI for production is primarily technical work. But business and project management can still play a key role ensuring the issues have been thought about adequately.

If things go badly wrong when taking an IT system live, there have been many high profile reminders that it may be business executives who pay the price.

AI algorithms usually require intensive computer processing, typically involving matrix manipulation, linear algebra and statistical analysis. They also require calculations to be re-run over and over again, until results converge on a valid conclusion.

Most AI systems require at least hundreds of thousands of data points to perform meaningfully, often millions or more. This combination means that the processing and storage requirements of an AI system are immense, far greater than anything most businesses have used before.

When building an AI solution, early work is done on small sub-sets of data, which needs computer resources orders of magnitude smaller than those required later. It is this step-up in processing, possibly exponential, that makes performance such a critical area of AI design.

Insufficient attention to performance at scale can create AI solutions that appear to work well during testing, but are unusable by the business.

For example:

  • An AI system might instantly flag possible fraudulent transactions in real time during testing, but be unacceptable if it takes minutes to do the same with live data.
  • A report predicting the likely performance of different marketing campaigns may take minutes to run in test, but is far less useful if it takes days or weeks to run with live data.
  • Product recommendations are helpful for customers, but of no business value if the cost of generating them for real customers exceeds the profit they produce.
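One lightweight safeguard is to benchmark a candidate model against an explicit latency budget before go-live. A minimal sketch, where the scoring function and the 50 ms budget are purely hypothetical stand-ins:

```python
import time

LATENCY_BUDGET_MS = 50  # hypothetical budget for real-time fraud flagging

def score_transaction(txn):
    # stand-in for a real model's scoring call
    return sum(txn) % 10

def meets_budget(txns, budget_ms=LATENCY_BUDGET_MS):
    """Return True if the average scoring latency stays within the budget."""
    start = time.perf_counter()
    for txn in txns:
        score_transaction(txn)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms / len(txns) <= budget_ms

sample = [(1, 2, 3)] * 1000
print(meets_budget(sample))
```

Running the same check against production-scale data volumes, rather than the test sub-set, is what catches the problems described above.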

The best-known example of this is probably the (now defunct) Netflix Prize competition. Entrants were challenged to find a better way to identify recommendations for viewers to watch next. The winning solution worked very well, but could not be used. Among the reasons were the operational computing costs. Even for Netflix, the cost of scaling AI outweighed the business value of the answers.

Complex Technology Implications & Choices

The technical issues of dealing with a 100x or greater increase in data volumes are complex, and raise fundamental questions of architecture and technology. For example, a wrong database choice can render a working test system unusable at scale.

Relational databases have dominated commercial IT for decades, and work very well. But the volumes of data used by AI have challenged their physical limitations. Many AI specialists recommend other technologies, architectures and techniques that scale better.

Some of these are cutting edge, even if many of their origins go back decades. Examples include NoSQL (including Graph databases), Time Series databases, MapReduce programming and even Shared Nothing architectures.
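The MapReduce idea mentioned above can be illustrated in a few lines: split the work into independent "map" tasks, then merge their partial results in a "reduce" step. A toy word-count sketch (real deployments use frameworks such as Hadoop or Spark to distribute the map tasks across machines):

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    # map step: count words within one independent chunk of the data
    return Counter(chunk.split())

def merge_counts(a, b):
    # reduce step: combine partial counts from two chunks
    return a + b

chunks = ["ai at scale", "scale ai systems", "ai"]
partials = [map_chunk(c) for c in chunks]  # these could run in parallel
total = reduce(merge_counts, partials)
print(total["ai"])  # 3
```

Because each map task touches only its own chunk, the work scales out horizontally, which is exactly what relational designs struggle to do at these volumes.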

Skills and experience in these are rarer than for established technologies. The costs of both the technology and people to work with them are higher.

Onerous Data Cleansing & Preparation Tasks

There’s another data issue when scaling AI, separate from the technical challenges of massive data volumes. This is the work required to ensure all the data is reliable.

The steps required to clean and prepare raw data are different at scale. For example, manual interventions (such as human inspection) may be effective for preparing test data. But they need to be used judiciously for production, assuming they’re even viable.

This kind of work is covered by terms such as “data pipeline” and “ETL” (Extract, Transform & Load). There are semantic distinctions between such labels, and they may exclude other data preparation steps that also need to scale. But whatever the label, if this process is not designed well, it can prevent an otherwise perfect AI system from working.
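At small scale, a preparation step can be as simple as the sketch below. The point is that each stage must be automated and auditable before it can run over millions of records; the field names here are invented for illustration:

```python
def extract(rows):
    # in production this would stream from a database or file store
    return iter(rows)

def transform(rows):
    # reject records with missing amounts; normalise currency codes
    for row in rows:
        if row.get("amount") is None:
            # at scale, rejects are logged and counted, not inspected by hand
            continue
        yield {**row, "currency": row["currency"].upper()}

def load(rows):
    # stand-in for writing to the production data store
    return list(rows)

raw = [{"amount": 10, "currency": "gbp"},
       {"amount": None, "currency": "usd"}]
clean = load(transform(extract(raw)))
print(len(clean))  # 1: the record with a missing amount is rejected
```

The manual inspection that worked for a test data set becomes the `continue` branch here: a rule that must be explicit, because no human will see the individual records in production.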

Internal (to Your Organisation) Changes

With any introduction of new technology, the biggest problem may well be the human beings who use it. AI is no different. Unless a business prepares its people properly to use an AI solution, its transition to production can cause its demise.

This isn’t just about training users. It’s also about amending processes, updating policies and putting in place the right kind of business support. (Not just technical support).

Customer-Facing Changes

Additionally, AI work often has an impact on customer experience, sometimes very visible, other times indirect. Examples include product recommendations, chatbots, personalised campaigns and new customer service scripts.

Part of scaling AI solutions that affect customers is ensuring customer-facing channels are ready to deal with customer reactions. This may just be a temporary spike, but such episodes can be very high profile. A common example could be the introduction of an AI chatbot. Perhaps counter-intuitively, this may lead to an initial increase in telephone calls, especially if it doesn’t work as planned.

Much AI work is about complementing and enhancing human work. But realising the business value requires human work to change accordingly. And that can be harder than the AI work itself.

For example, adding intelligent FAQs to a website should free up contact centre staff time. But creating a business benefit from that extra time requires thought, preparation and perhaps difficult decisions. Many AI project managers wouldn’t consider that part of their remit; in fact, it may not even appear on a technically-focused AI project plan.

Support for users of a new AI solution includes helping them use the technology. But a trickier issue is supporting business issues, especially ones that didn’t appear in testing. Scaling AI for production should allow for situations not designed or planned for.

Initially, this will mostly be about exceptions to expected results. For example, if a fraud detection score is meant to be either below or above 5, processes will cater for each case. But users need to know what to do if the score comes back as something else, such as “unknown” or “error”.
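The same discipline applies in code: anything consuming model output should handle values outside the expected range explicitly. A hedged sketch, assuming a score is expected to be a number between 0 and 10:

```python
def route_transaction(score):
    """Decide what to do with a fraud score; anything unexpected goes to review."""
    if isinstance(score, (int, float)) and 0 <= score <= 10:
        return "investigate" if score > 5 else "approve"
    # "unknown", "error", None, or out-of-range values: never guess
    return "manual_review"

print(route_transaction(7))          # investigate
print(route_transaction(3))          # approve
print(route_transaction("unknown"))  # manual_review
```

The explicit fall-through to manual review is the code-level counterpart of telling users what to do when the system returns something nobody planned for.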

However, over time new situations may arise because of changes within the AI system. This is more likely if the AI incorporates Machine Learning, and is designed to improve itself over time. Usually this means improving the accuracy of an algorithm, but it could be more open-ended, such as identifying new patterns of customer behaviour.

There is also the risk — perhaps likelihood — of unintended business consequences of the AI. Teams can flush out many of these in pilot programmes and “soft launches”, but not always.

So an important part of scaling AI involves working through difficult or critical scenarios. One focus should be ensuring contingency options are both technically and operationally feasible. For example, is it straightforward to “switch off” the AI solution temporarily?
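One concrete contingency is a simple “kill switch”: a flag that routes requests to a safe, non-AI fallback while the model is disabled. A minimal sketch, where the flag source and the fallback behaviour are assumptions (in practice the flag would live in a config service, not a module variable):

```python
AI_ENABLED = True  # in practice, read from a config service or feature-flag store

def model_recommendations(customer_id):
    # stand-in for the real model call
    return ["personalised-item"]

def fallback_recommendations():
    # safe, non-AI behaviour, e.g. current best-sellers
    return ["best-seller"]

def recommend(customer_id):
    if not AI_ENABLED:
        return fallback_recommendations()
    return model_recommendations(customer_id)

AI_ENABLED = False          # operations "switch off" the AI
print(recommend(42))        # the business keeps running on the fallback
```

The operational question is as important as the technical one: who is authorised to flip the flag, and does the fallback path get tested as often as the AI path?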

By definition, unexpected behaviours are tricky to predict. That doesn’t mean they’re impossible to prepare for, just more difficult.

The final problem area around scaling AI for production is underestimating the security implications. For perhaps the first time, substantial amounts of diverse data will all be in one place. This creates potentially new vulnerabilities, and represents a new type of business risk to mitigate. Such issues can be magnified by publicity, and security problems can have significant consequences on a brand.

There’s the obvious risk of data being stolen or directly exploited, which requires cyber and IT security generally to be strong. And this doesn’t just apply to AI computers and databases: there are also new potential points of breach in communications infrastructure and cloud facilities.

Security and privacy considerations also need to account for the way AI solutions are built and processed. Copies or extracts of sensitive data may exist in places that are not as secure as the live environments. Traditional IT test and production environments may not be set up for AI data validation and experimentation activities.

AI also creates a new, different security risk because of its “intelligence”. The ability to infer meaning from incomplete and disparate data, or intelligently fill gaps in data, creates new challenges. For example, previously “safe” data — perhaps anonymised or incomplete — may become sensitive if conflated with other data. What’s worse is that this “other data” may sit outside the AI system, or even the business. The implication is that all data needs to be considered when planning security measures, not just sensitive data.
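The risk can be illustrated with a toy join: two datasets that are individually “safe” can identify a person when combined on quasi-identifiers. All data here is invented:

```python
# "anonymised" internal data: no names, just postcode area and birth year
internal = [{"postcode": "AB1", "birth_year": 1980, "diagnosis": "X"}]

# external, publicly available data (e.g. an electoral roll extract)
external = [{"name": "Jane Doe", "postcode": "AB1", "birth_year": 1980}]

# joining on the shared quasi-identifiers re-identifies the record
matches = [
    {**e, **i}
    for i in internal
    for e in external
    if (e["postcode"], e["birth_year"]) == (i["postcode"], i["birth_year"])
]
print(matches[0]["name"], matches[0]["diagnosis"])
```

Neither dataset is sensitive on its own; the combination is, which is why security planning has to consider all data, not just the obviously sensitive fields.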

Businesses accept that stringent controls, policies and systems are necessary to protect financial resources and assets. Many now treat data as an asset that also needs strong protection and oversight. AI systems take this to another level.

When scaling AI, data protection, security and governance are no longer a matter of ticking a regulatory box.

Building working AI systems and solutions is hard. The complexities to overcome involve technology, data and business. Scaling AI for production creates further constraints on this complexity, and adds additional challenges.

But AI in business needs to generate business improvements, and it’s arguably incomplete without achieving them. So the issues around scaling AI are in some ways as important as choosing the right algorithm.

Bear this in mind when evaluating, planning and budgeting for AI in your organisation. The kinds of factors described in this article should be included in your thinking, and be part of your risk awareness. Even if you’re not yet ready to address them, they should exist as placeholders to consider in detail later.
