Is agentic AI more than hype? This company thinks it knows how to find out

Over the past five years, advances in AI models’ data processing and reasoning capabilities have driven enterprise and industrial developers to pursue larger models and more ambitious benchmarks. Now, with agentic AI emerging as the successor to generative AI, demand for smarter, more nuanced agents is growing. Yet too often “smart AI” is measured by model size or the volume of its training data.

Data analytics and artificial intelligence company Databricks argues that today’s AI arms race misses a crucial point: In production, what matters most is not what a model “knows,” but how it performs when stakeholders rely on it. Jonathan Frankle, chief AI scientist at Databricks, emphasizes that real-world trust and return on investment come from how AI models behave in production, not from how much information they contain.

Unlike traditional software, AI models generate probabilistic outputs rather than deterministic ones. “The only thing you can measure about an AI system is how it behaves. You can’t look inside it. There’s no equivalent to source code,” Frankle tells Fast Company. He contends that while public benchmarks are useful for gauging general capability, enterprises often over-index on them. 

What matters far more, he says, is rigorous evaluation on business-specific data to measure quality, refine outputs, and guide reinforcement learning strategies. “Today, people often deploy agents by writing a prompt, trying a couple of inputs, checking their vibes, and deploying. We would never do that in software—and we shouldn’t do it in AI, either,” he says.
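In software terms, evaluation-first deployment looks like a test suite for behavior. Below is a minimal sketch of such an evaluation gate, assuming a hypothetical agent callable and a JSONL eval file; none of this is a Databricks API.

```python
# A minimal sketch of an evaluation gate run before an agent ships,
# analogous to a unit-test suite. The agent callable, eval file, and
# 0.9 threshold are hypothetical placeholders, not a Databricks API.
import json

def eval_gate(agent, eval_path="evals.jsonl", threshold=0.9):
    """Score the agent on a curated eval set; block deployment on regression."""
    passed = total = 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)        # e.g. {"input": ..., "expected": ...}
            output = agent(case["input"])  # outputs are probabilistic, so rerun per release
            passed += int(case["expected"].lower() in output.lower())
            total += 1
    score = passed / total
    if score < threshold:
        raise RuntimeError(f"Eval score {score:.1%} is below the {threshold:.0%} gate")
    return score
```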

Frankle explains that for AI agents, evaluations replace many traditional engineering artifacts: the design discussion, the design document, the unit tests, and the integration tests. There's no equivalent to a code review because there is no code behind an agent, and prompts aren't code. That, he argues, is precisely why evaluations matter and should be the foundation of responsible AI deployment.

This shift in focus, from what a model knows to how it behaves, is the foundation of two major innovations by Databricks this year: Test-Time Adaptive Optimization (TAO) and Agent Bricks. Together, these technologies seek to make behavioral evaluation the first step in enterprise AI, rather than an afterthought.

AI behavior matters more than raw knowledge

Traditional AI evaluation often relies on benchmark scores and labeled datasets derived from academic exercises. While those metrics have value, they rarely reflect the contextual, domain-specific decisions businesses face. In production, agents may need to generate structured query language (SQL) in a company’s proprietary dialect, accurately interpret regulatory documents, or extract highly specific fields from messy, unstructured data.
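The extraction case, for instance, is directly scoreable behavior. Here is a minimal sketch of a domain-aware scorer; the clinical-trial field names are illustrative assumptions, not any vendor's schema.

```python
# Sketch of a domain-aware scorer for structured extraction: exact match
# per field, averaged over the gold fields. Field names are illustrative.
def extraction_score(predicted: dict, gold: dict) -> float:
    if not gold:
        return 0.0
    hits = sum(
        str(predicted.get(field, "")).strip().lower() == str(value).strip().lower()
        for field, value in gold.items()
    )
    return hits / len(gold)

gold = {"trial_id": "NCT0123", "phase": "III", "enrollment": "412"}
pred = {"trial_id": "NCT0123", "phase": "3", "enrollment": "412"}
print(extraction_score(pred, gold))  # ~0.67 -- "III" vs. "3" shows why domain rules matter
```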

Naveen Rao, vice president of AI at Databricks, says these are fundamentally behavioral challenges, requiring iterative feedback, domain-aware scoring, and continuous tuning, not simply more baseline knowledge.

“Generic knowledge might be useful to consumers, but not necessarily to enterprises. Enterprises need differentiation; they must leverage their assets to compete effectively,” he tells Fast Company. “Interaction and feedback are critical to understanding what is important to a user group and when to present it. What’s more, there are certain ways information needs to be formatted depending on the context. All of this requires bespoke tuning, either in the form of context engineering or actually modifying the weights of the neural network.”

In either case, he says, a robust reinforcement learning harness is essential, paired with a user interface to capture feedback effectively. That is the promise of TAO, the Databricks research team’s model fine-tuning method: improving performance using inputs enterprises already generate, and scaling quality through compute power rather than costly data labeling and annotation.

While most companies treat evaluation as an afterthought at the end of the pipeline, Databricks makes it central to the process. TAO uses test-time compute to generate multiple responses, scores them with automated or custom judges, and feeds those scores into reinforcement learning updates to fine-tune the base model. The result is a tuned model with the same inference cost as the original: the heavy compute is applied only once, during tuning, not on every query.
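Databricks has not published TAO's full internals, but the description implies a recognizable recipe: best-of-N sampling, judge scoring, and preference data feeding a reinforcement-learning-style update. What follows is a hedged sketch of that general pattern, assuming hypothetical model.sample and judge interfaces rather than TAO itself.

```python
# Hedged sketch of the pattern the article describes, not TAO itself:
# sample several candidates per prompt, score them with a judge, and keep
# best/worst pairs as preference data for an RL-style fine-tuning step.
def build_preference_data(model, judge, prompts, k=8):
    pairs = []
    for prompt in prompts:
        # The heavy test-time compute happens here, once, during tuning only.
        candidates = [model.sample(prompt) for _ in range(k)]
        scored = sorted(candidates, key=lambda r: judge(prompt, r))
        pairs.append({
            "prompt": prompt,
            "chosen": scored[-1],   # highest-judged response
            "rejected": scored[0],  # lowest-judged response
        })
    return pairs  # feed to e.g. DPO/RLHF; the tuned model's inference cost is unchanged
```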

“The hard part is getting AI models to do well at your specific task, using the knowledge and data you have, within your cost and speed envelope. That’s the shift from general intelligence to data intelligence,” Frankle says. “TAO can help tune inexpensive, open-source models to be surprisingly powerful using a type of data we’ve found to be common in the enterprise.” 

According to a Databricks blog, TAO improved open-source Llama variants, with tuned models scoring significantly higher on enterprise benchmarks such as FinanceBench, DB Enterprise Arena, and BIRD-SQL. The company claims the method brought Llama models within range of proprietary systems like GPT-4o and o3-mini on tasks such as document Q&A and SQL generation, while keeping inference costs low. In a broader multitask run using 175,000 prompts, TAO boosted Llama 3.3 70B performance by about 2.4 points and Llama 3.1 70B by roughly 4.0 points, narrowing the gap with contemporary large models.

To complement its model fine-tuning technique, Databricks has introduced Agent Bricks, an agentic AI-powered feature within its Data Intelligence Platform. It enables enterprises to customize AI agents with their own data, adjust neural network weights, and build custom judges to enforce domain-specific rules. The product aims to automate much of agent development: Teams define an agent’s purpose and connect data sources, and Agent Bricks generates evaluation datasets, creates judges, and tests optimization methods.
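A custom judge can be as simple as a programmatic rule. The sketch below enforces one hypothetical domain rule, that generated SQL may reference only approved tables; it illustrates the pattern, not the Agent Bricks judge API.

```python
# Sketch of a rule-based custom judge: score 1.0 only if the generated SQL
# references approved tables. The allow-list and rule are hypothetical;
# this illustrates the pattern, not the Agent Bricks judge API.
import re

APPROVED_TABLES = {"sales", "customers", "orders"}

def sql_table_judge(sql: str) -> float:
    referenced = {
        t.lower()
        for t in re.findall(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", sql, re.I)
    }
    return 1.0 if referenced and referenced <= APPROVED_TABLES else 0.0

print(sql_table_judge("SELECT * FROM sales JOIN customers ON sales.id = customers.id"))  # 1.0
print(sql_table_judge("SELECT * FROM hr.salaries"))                                      # 0.0
```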

Customers can choose to optimize for maximum quality or lower cost, enabling faster iteration with human oversight and fewer manual tweaks.

“Databricks’ latest research techniques, including TAO and Agent Learning from Human Feedback (ALHF), power Agent Bricks. Some use cases call for proprietary models, and when that’s the case, it connects them securely to your enterprise data and applies techniques like retrieval and structured output to maximize quality. But in many scenarios, a fine-tuned open model may outperform at a lower cost,” Rao says.

He adds that Agent Bricks is designed so domain experts—regardless of coding ability—can actively shape and improve AI agents. Subject matter experts can review agent responses with simple thumbs-up or thumbs-down feedback, while technical users can analyze results in depth and provide detailed guidance. “This ensures that AI agents reflect enterprise goals, domain knowledge, and evolving expectations,” Rao says, noting that early customers saw rapid gains.
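Mechanically, that review loop needs little more than a durable record of expert judgments for later tuning runs to consume. A minimal sketch follows, with a JSONL schema that is assumed here rather than taken from ALHF.

```python
# Sketch of capturing expert thumbs-up/down review as reusable tuning and
# evaluation data. The JSONL schema is an assumption, not an ALHF format.
import json, time

def record_feedback(log_path, prompt, response, thumbs_up, note=""):
    """Append one reviewed interaction to a JSONL log for later tuning and evals."""
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "prompt": prompt,
            "response": response,
            "label": "good" if thumbs_up else "bad",
            "note": note,  # optional detailed guidance from technical reviewers
        }) + "\n")
```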

AstraZeneca processed more than 400,000 clinical trial documents and extracted structured data in less than an hour with Agent Bricks. Likewise, the feature enabled Flo Health to double its medical-accuracy metric compared with commercial large language models while maintaining strict privacy and safety. “Their approach blends Flo’s specialized health expertise and data with Agent Bricks, which leverages synthetic data and tailored evaluation to deliver reliable, cost-effective AI health support at scale—uniquely positioning us to advance women’s health,” Rao explains.

From benchmarks to business data

The shift toward behavior-first evaluation is pragmatic but not a cure-all. Skeptics warn that automated evaluations and tuning can just as easily reinforce bias, lock in flawed outputs, or allow performance to drift unnoticed.

“In some domains we truly have automatic verification that we can trust, like theorem proving in formal systems. In other domains, human judgment is still crucial,” says Phillip Isola, associate professor and principal investigator at MIT’s Computer Science & Artificial Intelligence Laboratory. “If we use an AI as the critic for self-improvement, and if the AI is wrong, the system could go off the rails.”

Isola points out that while self-improving AI systems are generating excitement, they also carry heightened safety and security risks. “They are less constrained, lacking direct supervision, and can develop strategies that might be unexpected and have negative side effects,” he says, also warning that companies may game benchmarks by overfitting to them. “The key is to keep updating evaluations every year so we’re always testing models on new problems they haven’t already memorized.”

Databricks acknowledges the risks. Frankle stresses the difference between bypassing human labeling and bypassing human oversight, noting that TAO is “simply a fine-tuning technique fed by data enterprises already have.” In sensitive applications, he says, safeguards remain essential and no agent should be deployed without rigorous performance evaluation.

Other experts note that greater efficiency doesn't automatically improve AI model alignment, and there is currently no clear way to measure alignment. “For a well-defined task where an agent takes action, you could add human feedback, but for a more creative or open-ended task, is it clear how to improve alignment? Mechanistic interpretability isn’t strong enough yet,” says Matt Zeiler, CEO of Clarifai.

Zeiler argues that the industry’s reliance on a mix of general and specific benchmarks needs to evolve. These tests condense many complex factors into a few simple numbers, but models with similar scores don’t always “feel” equally good in use.

“That ‘feeling’ isn’t captured in today’s benchmarks, but either we’ll figure out how to measure it, or we’ll just accept it as a subjective aspect of human preference; some people will simply like some models more than others,” he says.

If the results from Databricks hold, enterprises may rethink their AI strategy, prioritizing feedback loops, evaluation pipelines, and governance over sheer model size or massive labeled datasets, and treating AI as a system that evolves with use rather than a onetime product.

“We believe the future of AI lies not in bigger models, but in adaptive, agentic systems that learn and reason over enterprise data,” Rao says. “This is where infrastructure and intelligence blur: You need orchestration, data connectivity, evaluation, and optimization working together.”


https://www.fastcompany.com/91384747/databricks-wants-enterprises-to-rethink-how-they-measure-ai-intelligence?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss
