The Expertise Ceiling: How Leading LLMs Perform on Specialized Consulting Work

Published June 11, 2026 in Digital & AI

As AI continues to advance in analytical capability, consultants are using it to move faster and deliver better results. This has upended legacy consulting models, as firms lay off junior resources in favor of AI and reduce hiring for entry-level roles. However, large language models (LLMs) cannot yet match the expertise of those with years of operational and consulting experience.

To explore where the current leading LLMs actually stand on complex consulting work, we assembled a group of Catalant consultants with deep functional credentials to design a set of tasks and evaluate how five leading models performed on them. These are practitioners with deep domain expertise and backgrounds at major consulting firms like McKinsey and Bain and industry-leading companies like GE, JPMorgan Chase, Lockheed Martin, Amazon, and more. As operators and advisors, these consultants have run P&Ls, led supply chain transformations, built commercial functions from scratch, and navigated the specific complexity of executing this work inside large, innovative organizations.

The results were eye-opening for both companies and consultants as they navigate AI adoption. While LLMs can capably execute a wide range of consulting tasks, the knowledge gap between an LLM and an experienced practitioner is still significant. True domain expertise, precisely what Catalant consultants provide, remains indispensable for bridging that gap both for companies leveraging AI in consulting engagements and for AI labs working to push model capabilities forward.

Model rankings for consulting tasks

The group of experts designed tasks spanning three domains:

  • Commercial excellence
  • Operational excellence
  • Procurement and supply chain excellence

Each task included a detailed case study, supporting data, and a scoring rubric. We ran these tasks through five leading LLMs and used a judge LLM to assess each model's responses against the expert-created rubrics, reporting the final results as percentages. Across the tasks, Claude Opus 4.7 achieved the highest average score overall.

The full ranking was:

  1. Claude Opus 4.7
  2. Claude Sonnet 4.6
  3. Gemini 2.5 Pro
  4. GPT 5.4
  5. Gemini Flash 3.5

The results were more nuanced by domain. No single model dominated every category, which matters for organizations building AI into specific workflows. The right model depends on the functional domain and the type of reasoning the task demands.
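The scoring flow described above (expert rubric, judge model, percentage score, per-domain comparison) could be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the evaluation: the equal weighting of rubric criteria, the function names `rubric_score` and `rank_by_domain`, and the placeholder model names and verdicts are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def rubric_score(verdicts):
    """Percentage of rubric criteria the judge marked as met.

    Assumption: every criterion carries equal weight.
    """
    if not verdicts:
        raise ValueError("rubric must contain at least one criterion")
    return 100.0 * sum(verdicts) / len(verdicts)


def rank_by_domain(results):
    """Average task scores per (domain, model), then rank models per domain.

    `results` is a list of (domain, model, verdicts) tuples, one per task.
    """
    scores = defaultdict(list)
    for domain, model, verdicts in results:
        scores[(domain, model)].append(rubric_score(verdicts))

    rankings = defaultdict(list)
    for (domain, model), task_scores in scores.items():
        rankings[domain].append((model, mean(task_scores)))
    for domain in rankings:
        rankings[domain].sort(key=lambda pair: pair[1], reverse=True)
    return dict(rankings)


# Hypothetical judge verdicts (True = criterion met); not data from the study.
example_results = [
    ("commercial", "model_a", [True, True, True, False]),
    ("commercial", "model_b", [True, False, True, False]),
    ("procurement", "model_a", [True, False, False, False]),
    ("procurement", "model_b", [True, True, True, False]),
]

domain_rankings = rank_by_domain(example_results)
for domain, ranking in domain_rankings.items():
    print(domain, ranking)
```

In this made-up data the leader differs by domain, which is the pattern that matters when choosing a model for a specific workflow.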

Commercial excellence

The commercial excellence tasks asked models to perform work such as market sizing and competitive intelligence, customer segmentation, sales territory prioritization, expansion analysis, and acquisition target evaluation. These are tasks where the inputs often look clean on the surface but the analytical judgment underneath is highly specialized.

Within commercial excellence tasks, Claude Opus 4.7 was the top-performing model.

Models performing tasks like these often make errors such as drawing on the wrong data sources, failing to answer core questions in the prompt, or applying inappropriate logic to the work.

For example:

In a shopper data analysis designed to evaluate expansion into a new food category, a model calculated penetration rates and purchase frequency using weighted averages rather than minimums, defaulting to the most frequently applied statistical method rather than the domain-appropriate one. As a result, outputs overstated attractiveness for fragmented category groups and understated purchase frequency for high-frequency groups, distorting the data that would be used to drive investment decisions.
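To see why the choice of aggregation matters, the sketch below compares a weighted average with a minimum over the same set of rates. The article does not give the task's data or its exact metric definitions, so the sub-category penetration rates, the sales weights, and the function name `weighted_average` are hypothetical and serve only to show how far the two figures can diverge for a fragmented group.

```python
def weighted_average(values, weights):
    """Weighted mean of `values`, using `weights` as relative importance."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical fragmented category group: penetration rate per sub-category
# (share of shoppers buying it) and a relative sales weight for each.
penetration = [0.40, 0.10, 0.05]
sales_weight = [6.0, 3.0, 1.0]

avg_view = weighted_average(penetration, sales_weight)
min_view = min(penetration)

print(f"weighted average: {avg_view:.3f}")
print(f"minimum:          {min_view:.3f}")
```

A weighted average can never fall below the minimum of the same values, so whenever the minimum is the appropriate measure, defaulting to the weighted average can only inflate the figure.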

Operational excellence

Operational excellence tasks covered areas like warehouse footprint analysis; inventory, schedule, and process optimization; resource forecasting; and organizational model design. These problems require not just quantitative reasoning but an understanding of how operations actually function in the physical world.

Within operational excellence tasks, Claude Opus 4.7 was the top-performing model.

Models working on complex operational tasks often err by failing to account for physical operating constraints, to consider systems as a whole, or to weigh solution costs against the problems they solve.

For example:

In a manufacturing optimization task, a model recommended cleaning a mounting surface while the machine was running. This is physically impossible and a safety violation. The model had no mechanism for applying real-world operating constraints (machine states, safety rules, shift capacity limitations) that any experienced operations consultant would treat as non-negotiable boundaries.

Procurement and supply chain excellence

Procurement and supply chain excellence tasks spanned subjects like procurement cost analysis, cost reduction strategies, and raw material cost tracking and analysis. These are domains where the precision of the underlying analysis has direct and quantifiable financial implications.

Within procurement and supply chain excellence tasks, Gemini 2.5 Pro was the top-performing model.

Models executing these types of tasks often fail when they leverage inappropriate data or analyze it incorrectly, identify the wrong data sources, select inefficient cost models, or otherwise steer leaders in the wrong direction.

For example:

In a cost benchmarking task for an industrial OEM, the model identified incorrect outliers and miscalculated volume based on a flawed linear regression. Misidentifying those outliers means missing a savings opportunity worth approximately $150,000 annually.
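One common way to screen for outliers in this kind of benchmarking is to fit a line and flag points whose residuals are unusually large. The sketch below assumes that approach for illustration; the article does not say which method the task called for, and the data, the `z_threshold` of 2.0, and the function names `fit_line` and `residual_outliers` are hypothetical.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def residual_outliers(xs, ys, z_threshold=2.0):
    """Indices of points whose residual exceeds z_threshold standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) > z_threshold * spread]


# Hypothetical parts: x = relative order volume, y = unit price.
# Seven points follow price = 10 - 0.5 * volume; index 4 is priced well above it.
volume = [1, 2, 3, 4, 5, 6, 7, 8]
unit_price = [9.5, 9.0, 8.5, 8.0, 12.0, 7.0, 6.5, 6.0]

flagged = residual_outliers(volume, unit_price)
print("flagged indices:", flagged)
```

Note that the extreme point also pulls the fitted line toward itself, so a screen like this depends on the regression being sound in the first place, which is where the model in the example above went wrong.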

What the results reveal about the path forward

The models that will provide real value in consulting are being built differently: their developers are investing in domain-specific training and building the feedback loops, rubrics, and expert review processes that make AI meaningfully better at the specific tasks that matter.

This is the core promise of reinforcement learning from human feedback (RLHF). Fine-tuning a model requires more than programmatic rules; it requires exposing the AI to the unspoken pattern recognition, nuanced judgment, and rigorous standards that can only come from years of doing the work. Done right, the general-purpose engine begins to internalize the bespoke judgment of a senior partner.

Recently, Catalant ran exactly this kind of pilot with a leading frontier AI lab, training one of the most advanced LLMs available on consulting tasks, using the same type of practitioner expertise that drove this evaluation.

The insight is simple but consequential: training LLMs well requires the right humans in the loop. Not just people who know the answer, but people who know precisely why and can articulate the reasoning, the tradeoffs, and the tells that separate strong work from “AI slop.”

That is the quality of judgment Catalant consultants bring, and it is what will ultimately determine whether AI training programs in consulting produce models that are genuinely more capable or just more confidently wrong. The ceiling on how good these models can get at hard consulting tasks is set entirely by the ceiling on the expertise of the people training them.

Meet the Author

Diana Rumman is Director, Strategic Solutions at Catalant, where she focuses on the intersection of enterprise AI adoption and expert talent deployment. As an AI transformation specialist and experienced engineer, she brings deep technical expertise to developing solutions that most effectively serve Catalant’s Fortune 500 and private equity clients and deliver on the promise of Consulting 2.0. Diana holds a Master of Business Administration from The Wharton School and a Bachelor of Science in Engineering from Queen’s University.

How does the replacement of entry-level consulting resources with AI alter the talent requirements for enterprise execution?

The displacement of junior analysts by automation creates an urgent enterprise requirement for elite independent operators who possess senior-level pattern recognition and tactical execution capabilities. While LLMs can capably accelerate baseline document production, the knowledge gap between AI and senior advisors remains substantial. A lack of nuanced judgment in AI requires enterprises to engage specialized independent practitioners to de-risk strategic implementations.

Why is practitioner-led reinforcement learning from human feedback (RLHF) a key determinant of future model capability?

The ceiling of LLM capability on complex business tasks is impacted by the domain expertise of the human practitioners training the network. General-purpose AI cannot internalize the bespoke judgment of a senior consulting partner through programmatic rules alone. As demonstrated in a recent pilot conducted by Catalant alongside a leading frontier artificial intelligence laboratory, effective fine-tuning requires exposing models to human operators who can articulate complex trade-offs. This expert review process ensures that AI algorithms internalize rigorous industry standards rather than becoming more confidently incorrect.

How does practitioner-led reinforcement learning from human feedback (RLHF) prevent models from generating “AI slop” during complex strategic task execution?

Fine-tuning a general-purpose engine requires exposing the algorithm to the unspoken pattern recognition and rigorous standards held by senior human operators. Programmatic rules and basic data scraping cannot teach software the nuanced judgment required for advanced consulting assignments. The most advanced LLMs still require highly skilled experts in the loop. These practitioners must explicitly detail the reasoning, the delicate trade-offs, and the “tells” that separate true strategic depth from superficial text generation.