On this episode: Stack Overflow senior data scientist Michael Geden tells Ryan and Ben about how data scientists evaluate large language models (LLMs) and their output. They cover the challenges involved in evaluating LLMs, how LLMs are being used to evaluate other LLMs, the importance of data validation, the need for human raters, and the tradeoffs involved in selecting and fine-tuning LLMs. https://stackoverflow.blog/2024/04/16/how-do-you-evaluate-an-llm-try-an-llm/
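As a rough illustration of the LLM-as-judge pattern the episode discusses (a sketch of the general technique, not code from the episode), here is a minimal Python example in which one model grades another model's answer against a simple rubric. The call_judge_model function is a hypothetical placeholder for whichever LLM API you use.

# Minimal LLM-as-judge sketch: one model grades another model's output.
JUDGE_PROMPT = (
    "You are grading an answer for factual accuracy and helpfulness.\n"
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Reply with a single integer score from 1 (poor) to 5 (excellent)."
)

def call_judge_model(prompt: str) -> str:
    # Hypothetical placeholder: send `prompt` to your judge LLM and return its reply.
    raise NotImplementedError("Wire this to your LLM provider of choice.")

def judge_answer(question: str, answer: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(reply.strip())  # naive parse; real pipelines validate judge output
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned out-of-range score: {score}")
    return score

Scores from a judge model are themselves model output, which is one reason the episode also stresses data validation and human raters as checks on automated evaluation.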
More posts in this group

An update on the research the User Experience team will be running over the next quarter. https://stackoverflow.blog/2025/05/19/research-roadmap-update-may-2025/

Christophe Coenraets, SVP of Developer Relations at Salesforce, tells Eira and Ben about building the new Salesforce Developer Edition, which includes access to the company’s agentic AI platform, Agentforce.

Money is pouring into the AI industry. Will software survive the disruption it causes? https://stackoverflow.blog/2025/05/15/whether-ai-is-a-bubble-or-revolution-how-does-software-survive/

On this episode, Ryan chats with Hendrik Rexed, Cloud Native Advocate at Dynatrace, about debugging cloud-based applications like you would a local app. https://stackoverflow.blog/2025/05/13/next-lev

Maryam Ashoori, Head of Product for watsonx.ai at IBM, joins Ryan and Eira to talk about the complexity of enterprise AI, the role of governance, the AI skill gap among developers, how AI coding tools

If velocity is just a tool and not a goal, how do you measure real success for engineering teams? https://stackoverflow.blog/2025/05/12/beyond-speed-measuring-engineering-success-by-impact-not-velocit

Ben Popper chats with CTO Abby Kearns about how Alembic is using composite AI and lessons learned from contact tracing and epidemiology to help companies map customer journeys and understand the ROI