On this episode: Stack Overflow senior data scientist Michael Geden tells Ryan and Ben about how data scientists evaluate large language models (LLMs) and their output. They cover the challenges involved in evaluating LLMs, how LLMs are being used to evaluate other LLMs, the importance of data validation, the need for human raters, and the tradeoffs involved in selecting and fine-tuning LLMs. https://stackoverflow.blog/2024/04/16/how-do-you-evaluate-an-llm-try-an-llm/
Other posts in this group

Quinn Slack, CEO and co-founder of Sourcegraph, joins the show to dive into the implications of AI coding tools on the software engineering lifecycle. They explore how AI tools are transforming the wo

Innovation is at the heart of any successful, growing company, and often that culture begins with an engaged, interconnected organization. https://stackoverflow.blog/2025/08/04/cross-pollination-as-a

Ryan and Eira welcome Erin Yepis, Senior Analyst at Stack Overflow, to the show to discuss the newly released 2025 Developer Survey results. They explore the decline in trust in AI tools, shifts in po


In this episode of Leaders of Code, Jody Bailey, Stack Overflow’s CPO, Anirudh Kaul, Senior Director of Software Engineering, and Paul Petersen, Cloud Platform Engineering Manager, discuss the U.S. Ba

No need to bury the lede: more developers are using AI tools, but their trust in those tools is falling. https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-20

Ryan welcomes Mahir Yavuz, Senior Director of Engineering at Etsy, to the show to explore the unique challenges that Etsy’s marketplace faces and how Etsy’s teams leverage machine learning and AI to m