Since late 2022, we've seen an extraordinary boom in Generative AI investment. No doubt, large language models (or LLMs) represent a genuine paradigm-changing innovation in data science. They extend the capabilities of machine learning models to generating relevant text and images in response to a wide array of qualitative prompts.
In a podcast interview in early April 2024, Dario Amodei, the chief executive officer of OpenAI rival Anthropic, said the current crop of AI models on the market cost around $100 million to train. Looking ahead, "The models that are in training now and that will come out at various times later this year or early next year are closer in cost to $1 billion. And then, I think in 2025 and 2026, we'll get more towards $5-to-$10 billion."
Yet despite their high cost and difficulty to build, LLMs have become "the next big thing." Multitudes of users rely on them to quickly and cheaply perform some of the language-based tasks that formerly only humans could do.
This raises the possibility that many human jobs will soon be performed by LLMs. However, these new tools have yet to demonstrate that they can satisfactorily perform all of the tasks that knowledge workers execute in any given job.
Unlike conventional automation tools, which presume a fixed input, an explicit process, and a single correct outcome, LLM tools' input and output can vary, and the process through which the response is produced is a "black box." Managers can't evaluate and control these tools the same way they do conventional machines. That means there are serious problems which enterprises must resolve before using these tools in a mainstream organizational context.
According to Wharton-based technology gurus Peter Cappelli, Prasanna (Sonny) Tambe, and Valery Yakubovich, the top five challenges are:
1. The Knowledge Capture Problem
2. The Output Verification Problem
3. The Output Adjudication Problem
4. The Cost-Benefit Problem
5. The Job Transformation Problem
Any combination of these can potentially derail or seriously delay a generative AI initiative. The big insight here is that these five problems are making it more challenging than expected for companies to bring mainstream LLM-based business solutions online, limiting the explosive take-off of user-based revenues.
Let's examine each of these problems and how they might be resolved in the real world. Let's start with…
1. The Knowledge Capture Problem
The humans in organizations produce huge volumes of proprietary, written information that they cannot easily process themselves, including strategic plans, job descriptions, organizational and process charts, product documentation, performance evaluations, and so on. An LLM trained on such data can produce insights that the organization likely did not have access to before. And this may prove to be the company's most important advantage in using LLMs.
That's because the organizations that make the most of LLMs will use them to generate outputs that pertain specifically to their needs and are informed by their data sources.
Feeding the right information to the LLM is no small task, given the considerable effort required to sort out the volumes of irrelevant data organizations produce. Useful knowledge about organizational culture and survey results from employees take time to assemble and organize. Even then, a lot of important knowledge might be known to individuals but not documented. In one recent study, only about 11% of data scientists reported that they have been able to fine-tune their LLMs with the data needed to produce good and appropriate answers specific to their organization. The process is expensive and requires powerful processors, thousands of high-quality training and verification examples, extensive engineering, and ongoing updates.
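For a sense of what that preparation involves, here is a minimal Python sketch of turning curated internal Q&A pairs into fine-tuning data. The example records, file name, and chat-style JSONL layout are illustrative assumptions, not any particular vendor's required format.

```python
import json
from pathlib import Path

# Hypothetical curated Q&A pairs drawn from internal documents; in
# practice a data librarian would assemble and vet thousands of these.
examples = [
    {
        "question": "What is our standard escalation path for a Sev-1 outage?",
        "answer": "Page the on-call engineer, then notify the incident channel.",
    },
]

def to_chat_record(example: dict) -> dict:
    """Convert one vetted Q&A pair into a chat-style JSONL record of the
    kind several fine-tuning services accept."""
    return {
        "messages": [
            {"role": "system", "content": "Answer using company policy only."},
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]},
        ]
    }

# Write one JSON object per line -- the usual fine-tuning input format.
with Path("finetune_train.jsonl").open("w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(to_chat_record(ex)) + "\n")
```

Even in this toy form, the hard part is plainly not the code: it is sourcing, vetting, and maintaining the examples themselves.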
LLMs are already very helpful with some applications such as answering programming questions. And there are numerous LLM-based tools, like GitHub's Copilot and Hugging Face's StarCoder, that assist human programmers in real time. One study suggests that programmers prefer using LLM-based tools for generating code because they provide a better starting point than the alternative of searching online for existing code to reuse. But surprisingly, this approach alone does not improve the success rate of programming tasks. That's because additional time is required to debug and understand the code the LLM has generated.
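That overhead is why some teams wrap generated code in human-written acceptance tests before adopting it. The sketch below illustrates the idea; `parse_invoice_total` stands in for hypothetical LLM-generated code, and the tests are the reviewer's, not the model's.

```python
import unittest

# Suppose an assistant generated this helper; the body below is
# hypothetical LLM output pasted in for review.
def parse_invoice_total(line: str) -> float:
    """Extract the trailing amount from a line like 'Total: 42.50'."""
    return float(line.rsplit(":", 1)[1].strip())

class TestGeneratedCode(unittest.TestCase):
    """Acceptance tests written by the human reviewer, not the model."""

    def test_simple_total(self):
        self.assertEqual(parse_invoice_total("Total: 42.50"), 42.50)

    def test_extra_whitespace(self):
        self.assertEqual(parse_invoice_total("Total:   7.00  "), 7.00)

    def test_malformed_line_raises(self):
        with self.assertRaises((ValueError, IndexError)):
            parse_invoice_total("no amount here")

if __name__ == "__main__":
    unittest.main()
```

Writing and running such tests is exactly the "additional time" the study describes, which is why the raw generation speed of the tool overstates its net benefit.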
What does this tell us? Rather than eliminating jobs, the difficulty of the knowledge capture task is likely to drive the creation of new ones. For instance, data librarians, who catalog and curate organization-specific data that can be used to train LLM applications, could become critical in some contexts.
With that in mind, let's consider…
2. The Output Verification Problem
Not all LLM applications are created equal, so success in some areas is racing ahead of success in others. Computer programming is an area where explicit knowledge can be particularly important. The kinds of LLM outputs used in programming tasks have the advantage of being testable for correctness and usefulness before they are rolled out and used in situations with real consequences. Unfortunately, most LLM outputs are not in that category.
For instance, strategic recommendations or marketing ideas are not outputs that can be tested or verified easily. For these kinds of prompts, the output simply has to be "good enough" rather than perfectly correct in order to be useful. That raises the question, "When is an LLM answer good enough?" For simple tasks, employees with the relevant knowledge can judge for themselves simply by reading the LLM's answer.
Unfortunately, research on whether users will take the task of checking LLM output seriously is not encouraging. In one experiment, white-collar workers were given the option to use an LLM for a writing task. Those who chose to use the tool could then opt to either edit the text or turn it in unedited. Most participants chose the latter.
Worse yet, what happens if employees lack the knowledge required to judge an LLM's more complicated, unusual, and consequential outputs? They may realistically ask questions for which they do not know what good enough answers look like. This calls for a higher degree of skilled human judgment in assessing and implementing LLM outputs.
A key problem is that LLMs, unlike humans, are algorithmic "black boxes." For example, an LLM, unlike a human employee, is unaccountable for its outputs. A track record of accuracy or good judgment can allow a human's employer to gauge their future outputs. A human can also explain how they reached certain conclusions or made certain decisions. This is not the case with LLMs. Each prompt sends a question on a complex path through the model's body of knowledge to produce a response that is unique and unexplainable. Further, LLMs can "forget" how to do tasks that they previously did well, making it hard to provide a reliability guarantee for these models.
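One partial safeguard, sketched below, is a frozen regression suite: a fixed set of prompts with vetted answers that gets re-run whenever the model or its prompts change, so that "forgetting" shows up as a falling pass rate. The prompts, answers, and `ask_llm` client here are placeholders invented for illustration.

```python
# `ask_llm` is a placeholder for whatever client the organization uses;
# wire it to your provider before running this check.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("connect to your LLM provider here")

# A frozen set of prompts with vetted answers. Both entries are invented
# for illustration; real suites hold hundreds of cases.
REGRESSION_SET = [
    ("What is our refund window, in days?", "30"),
    ("Which form starts a vendor security review?", "VSR-1"),
]

def regression_pass_rate(threshold: float = 0.95) -> bool:
    """Re-run the frozen suite and flag any drop below the threshold."""
    passed = sum(
        1 for prompt, expected in REGRESSION_SET
        if expected.lower() in ask_llm(prompt).lower()
    )
    rate = passed / len(REGRESSION_SET)
    print(f"regression pass rate: {rate:.0%}")
    return rate >= threshold
```

A check like this cannot explain the model's reasoning, but it can at least detect when previously reliable behavior degrades.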
Ultimately, a human is needed to assess whether LLM output is good enough, and they must take that task seriously. One challenge when integrating LLM output with human oversight is that in many contexts, the human must know something about the domain to be able to assess whether the LLM output is valuable. This suggests that specific knowledge cannot be "outsourced" to an LLM. So, when it comes to important functions, human domain experts are still needed to evaluate whether LLM output is any good before it is put into use.
3. The Output Adjudication Problem
LLMs excel at summarizing large volumes of text. This might help bring valuable data to bear on decision-making and allow managers to check the state of knowledge on a particular topic, such as what employees have said about a particular benefit in past surveys. However, that does not mean that LLM responses are more reliable or less biased than human decisions. That's because LLMs can be prompted to draw different conclusions based on the same data, and their responses can vary even when they're given the same prompt at different times.
This makes it easy for different parties within an organization to generate conflicting outputs, and that requires companies to develop means of adjudicating between LLM outputs.
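One simple adjudication tactic is to sample the same prompt several times and accept an answer only when a clear majority emerges, escalating disagreements to a person. Here is a minimal sketch, again with a placeholder `ask_llm` client standing in for whatever provider an organization uses:

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("connect to your LLM provider here")

def sample_and_adjudicate(prompt: str, n: int = 5, quorum: float = 0.6) -> dict:
    """Ask the same question n times; accept the majority answer only if
    it clears the quorum, otherwise escalate to a human adjudicator."""
    answers = [ask_llm(prompt).strip() for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / n >= quorum:
        return {"status": "accepted", "answer": top_answer,
                "agreement": count / n}
    return {"status": "escalate_to_human", "candidates": answers}
```

Majority voting helps with variability from a single model, but it cannot settle disputes between outputs that different teams generated from different prompts; that still takes human judgment.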
Whether the task of adjudicating LLM outputs is added to existing jobs or will create new ones will depend on how easy it is to learn. The hopeful idea that lower-level employees will be empowered by access to LLMs to take on more of the tasks of higher-level employees requires particularly optimistic assumptions. The long-standing view about job hierarchies is that incumbents need skills and judgment that are acquired through practice, and the disposition to handle certain jobs, not just textbook knowledge made available on the fly by LLMs. The challenge has long been to get managers to empower employees to use more of that knowledge as opposed to making decisions for them. That reluctance has been much more about a lack of trust than a lack of employee knowledge or ability. As just discussed, effective adjudication of LLM output might also require a great deal of domain expertise, which further limits the extent to which this task can be delegated to lower-level employees.
At this point, the output adjudication problem is one of the thorniest aspects of using LLMs to eliminate jobs. There are no widely accepted methods for selecting among competing outputs in high-stakes situations.
Understanding the costs of input prep as well as output verification and adjudication provides half the solution to…
4. The Cost-Benefit Problem
The incremental benefits of using LLM output within an organization can be even more unpredictable than the costs. For instance, LLMs are terrific at drafting simple correspondence, which often just needs to be good enough. But simple correspondence that occurs repeatedly, such as customer notifications about late payments, has already been automated with form letters. Interactive connections with customers and other individuals are already handled rather well with simple bots that direct them to solutions the organization wants them to have (though not necessarily what those customers actually want). And call centers are already replete with templates and prepared text tailored to the most common questions that customers ask.
So, it's obvious that the additional time and cost savings enabled by many LLM solutions could realistically be undone by the other costs they impose.
Consider some real-world research.
A study of customer service representatives who already had some computer-based aids in place found that adding a combination of an LLM and machine-learning algorithms trained on successful customer interactions improved problem resolution by 14%. But that raises the questions, "Is that a lot or a little for a job often described as uniquely suited to LLM output?" and "Is the result enough to justify the cost of implementation?"
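A back-of-the-envelope calculation shows why the answer is genuinely unclear. In the sketch below, every figure except the 14% lift is an assumption invented for illustration, and even the lift is applied under just one possible reading of the study:

```python
# Illustrative only: all inputs below except `relative_lift` are
# assumptions invented for this sketch. Swap in your own numbers.
agents = 200
cases_per_agent_per_year = 4_000
baseline_resolution_rate = 0.70              # assumed
relative_lift = 0.14                         # the study's figure, one reading
value_per_extra_resolution = 25.0            # assumed avoided cost, USD
implementation_cost_year_one = 1_500_000.0   # assumed build + licences, USD

# Extra cases resolved per year if the resolution rate rises 14% relatively.
extra_resolutions = (
    agents * cases_per_agent_per_year
    * baseline_resolution_rate * relative_lift
)
annual_benefit = extra_resolutions * value_per_extra_resolution

print(f"extra resolutions/yr: {extra_resolutions:,.0f}")
print(f"benefit ${annual_benefit:,.0f} vs year-one cost "
      f"${implementation_cost_year_one:,.0f}")
```

Under these assumptions the project barely clears its year-one cost; modestly different assumptions flip the sign, which is precisely what makes the cost-benefit case so hard to settle.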
The Wharton-based experts cite a preregistered experiment with 758 consultants from Boston Consulting Group which showed that GPT-4 drastically increased consultants' productivity on some tasks, but it significantly decreased it on others. These were jobs where the central tasks were well suited to being done by LLMs, and the productivity effects were real but well short of impressive. That leaves the cost-benefit case ambiguous.
Additional analysis reinforces the point that these savings can be undone by the other costs LLMs impose. For instance, converting chatbots to leverage LLMs is a considerable undertaking, even if it might eventually prove useful.
And even if customers and Generative AI vendors can overcome the four problems we've examined, they still face…
5. The Job Transformation Problem
That challenge requires answering a deceptively simple question: how will LLMs work alongside human workers?
Answering this question is far from straightforward. First, given that employees are typically engaged in multiple tasks and responsibilities that are dynamic in nature, LLMs that take over one task cannot replace the whole job and all of its separate subtasks. Consider the effects of introducing ATMs; even though the machines were able to do many of the tasks that bank tellers performed, they did not significantly reduce the number of human workers because tellers had other tasks besides handling cash and were freed up to take on new responsibilities.
The variability and unpredictability of the need for LLMs in any given workflow is a factor that essentially protects existing jobs. At this point, it seems that most jobs don't have a need to use LLMs very often, and it can be difficult to predict when they will need them.
The jobs that LLMs are most likely to replace are, of course, those where the tasks that take up most of people's time can consistently be done correctly by Generative AI. But even in those cases, there are serious caveats. The projections of enormous job losses from LLMs rely on the unstated assumption that tasks can simply be redistributed among workers. This might have worked with old-fashioned typing pools, where all of the employees performed identical tasks: if the pool's productivity increased by 10%, it would be possible to reallocate the work and cut the number of typists by 10%. Most modern jobs, however, bundle varied tasks, so freed-up time cannot be pooled and trimmed so neatly.
Another possibility is that LLMs could improve productivity enough across an entire organization that it has an effect not on specific occupations but on the overall need for labor. There is no evidence of this yet, but it would be a welcome effect for many business leaders, given how slow productivity growth has been in the US and elsewhere and the difficulty so many employers report in expanding their workforces.
So, what's the bottom line?
At Trends, we believe Generative AI is the next big thing. However, that's mostly because it will contribute to fully exploiting Analytic AI and provide a real-world pathway to realizing the potential of robotics in the 2030s and beyond. Meanwhile, companies will be able to address many important revenue and cost-saving opportunities in the shorter term. However, we believe it will not be as easy as most managers expect for companies to solve the Knowledge Capture Problem, the Output Verification Problem, the Output Adjudication Problem, the Cost-Benefit Problem, and especially the Job Transformation Problem.
As history shows, the impact of IT-related innovations varies enormously depending on the job, organization, and industry; and they typically take a lot longer than expected to play out. The fact that LLM tools are constantly becoming easier to use, and that they are being incorporated into widely adopted software products like Microsoft Office, makes it likely that they will see faster uptake than previous waves of IT innovation did.
As of mid-year 2024, it seems that most organizations are simply experimenting with LLMs in small ways. That implies we'll soon see the real pace and scale of this transformation.
Given this trend, we offer the following forecasts for your consideration.
First, the Generative AI market will experience its first shakeout by sometime in 2025. That's because costs will prove higher and revenues more elusive than most investors expect. Such a shakeout is natural and healthy for both consumers and the survivors: it helps rapidly redeploy talent and capital to new opportunities. Hardware suppliers like Nvidia will continue to prosper in spite of the shakeout. Meanwhile, end users will benefit from dramatically falling prices.
Second, most companies that hope to effectively leverage LLMs will start by establishing ground rules for their use, such as prohibiting proprietary data from being uploaded to third-party LLMs, and disclosing whether and how LLMs were used in preparing any documents that are being shared. In most companies, "acceptable use policies" already limit how employees can use company equipment and tools. Some experts suggest that this be augmented by the use of a tool like Amazon Q, a generative AI-powered chatbot that can be customized to adhere to an organization's acceptable use policies around who can access an LLM and what data can be used.
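As a sketch of what such a guardrail might look like, the snippet below screens outbound prompts against a blocklist before they reach a third-party LLM. The patterns are hypothetical; a production policy engine would lean on document labels and trained classifiers rather than a few regexes.

```python
import re

# Hypothetical markers of proprietary data; real deployments would use
# document classification and labeling, not just pattern matching.
BLOCKLIST = [
    re.compile(r"\bconfidential\b", re.I),
    re.compile(r"\bproject\s+atlas\b", re.I),  # hypothetical codename
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-like pattern
]

def check_outbound_prompt(prompt: str) -> tuple[bool, list[str]]:
    """Return (allowed, reasons); block prompts matching any marker."""
    hits = [p.pattern for p in BLOCKLIST if p.search(prompt)]
    return (not hits, hits)

allowed, reasons = check_outbound_prompt(
    "Summarize the CONFIDENTIAL Q3 plan for Project Atlas."
)
print(allowed, reasons)  # False, with the matched patterns listed
```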
Third, to address the Knowledge Capture Problem, successful companies will typically create a central office to produce all important LLM output, at least initially, to help ensure that acceptable use standards are followed and to help manage problems like "data pollution." Central offices can provide guidance in "best practices" for creating prompts and interpreting the variability of answers. They also offer the opportunity for economies of scale. Having one data librarian in charge of all the company data that could be used in analyses is far more efficient and easier to manage than having each possible user manage it themselves.
Fourth, in order to get ahead of the Output Verification Problem, successful companies will require everyone who is likely to use LLM reports to receive basic training on the quirks of the tool. This must cover its propensity to hallucinate as well as how to evaluate AI-generated documents and reports. The next step should be to train employees in prompt design and refinement. It is also important to articulate and communicate a standard for what constitutes clearing the organization's "good enough bar" for using LLM output. A central LLM office could facilitate training that best fits the organization.
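One way to make that bar concrete is to encode it as an explicit sign-off checklist that gates release of any LLM output. The items below are hypothetical examples; each organization would define its own.

```python
from dataclasses import dataclass

@dataclass
class ReviewChecklist:
    """A hypothetical 'good enough bar': every item must be signed off
    by a human reviewer before LLM output leaves the team."""
    facts_spot_checked: bool = False
    sources_cited_or_flagged: bool = False
    tone_matches_audience: bool = False
    no_proprietary_data_exposed: bool = False
    reviewer: str = ""

    def clears_bar(self) -> bool:
        # All checks ticked and a named reviewer on record.
        return all([
            self.facts_spot_checked,
            self.sources_cited_or_flagged,
            self.tone_matches_audience,
            self.no_proprietary_data_exposed,
            bool(self.reviewer),
        ])
```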
And, fifth, the many claims in the popular media about how Generative AI will eliminate enormous numbers of jobs will create pressure from investors and other stakeholders to change hiring criteria for future jobs or to start planning where jobs can be cut. In most cases, those discussions will prove premature. It might help to remind those stakeholders how inaccurate similar forecasts have been; for example, predictions that truck drivers would be largely replaced by robotic drivers by now have not come to pass.
In the longer term, once a company figures out the different ways in which LLMs might be put to work, it will become clearer whether tasks can be reorganized to create efficiencies. In the meantime, the prudent move is to begin rewriting contracts with vendors to maximize flexibility.