Although AI is changing the media, how much it’s changing journalism is unclear. Most editorial policies forbid using AI to help write stories, and journalists typically don’t want the help anyway. But when consulting with editorial teams, I often point out that, even if you never publish a single word of AI-generated text, it still has a lot to offer as a research assistant.
Well, that assertion might be a bit more questionable now that the Columbia Journalism Review has gone and published its own study about how AI tools performed in that role for some specific journalistic use cases. The result, according to CJR: AI can be a surprisingly uninformed researcher, and it might not even be a great summarizer, at least in some cases.
Let me stress: CJR tested AI models in journalistic use cases, not generic ones. For summarization in particular, the tools—including ChatGPT, Claude, Perplexity, and Gemini—were asked to summarize transcripts and minutes from local government meetings, not articles or PowerPoints. So some of the results may go against intuition, but that also makes them much more useful: For AI to be the force for workplace transformation it’s often hyped to be, it needs to give helpful output in workplace-specific use cases.
The CJR report reveals some interesting things about those use cases and how journalists approach AI in general. But mostly it shows how badly we need more of this: systematic testing of AI that goes beyond the ad hoc experimentation that has too long been the default at many organizations. If the study shows nothing else, it’s this: you don’t need to be an engineer or a product designer to judge how well AI can help you do your job.
Putting AI to the newsroom test
To test AI’s summarization abilities, the evaluators—who included academics, journalists, and research assistants—wrote multiple prompts to generate short and long summaries from each tool, then ran them several times. A weakness of the report is that it doesn’t reveal the outputs, so we can’t see for ourselves how well the tools did. But it does say it quantified factual errors to evaluate accuracy, comparing the AI summaries against human-written ones.
Without seeing the outputs, it’s hard to know how to improve the prompts to get better results. The study says it got good results for short (200-word) summaries but saw inaccuracies and missed facts in longer ones. One surprising outcome was that the simplest prompt, “Give me a short summary of this document,” produced the most consistently good results, but only for short summaries.
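For a sense of what that kind of testing involves, here’s a minimal sketch of the consistency check the study describes: run the same short-summary prompt over a single meeting transcript a few times and compare what comes back. It assumes the OpenAI Python SDK with an API key in the environment; the file name and model name are placeholders of mine, not anything CJR used.

```python
# Minimal sketch: run one summarization prompt several times and compare the outputs.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment;
# the transcript file and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
transcript = open("city_council_transcript.txt").read()   # hypothetical meeting transcript
prompt = "Give me a short summary of this document."      # the study's simplest prompt

summaries = []
for run in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whatever model your team has access to
        messages=[{"role": "user", "content": f"{prompt}\n\n{transcript}"}],
    )
    summaries.append(response.choices[0].message.content)

# Crude consistency check: how long is each run, and how much does it drift?
# Judging factual accuracy still takes a human read against the source.
for i, summary in enumerate(summaries, 1):
    print(f"Run {i}: {len(summary.split())} words")
    print(summary[:200])  # skim the opening of each run
```

Even a crude word count per run makes the drift visible; deciding whether the facts hold up still takes a human read, which is exactly what the CJR evaluators did against human-written summaries.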
The study also looked at research tools, specifically for science reporting. I love the specificity of the use case here: giving the tool a paper and then asking it to perform a literature review (finding related papers, citing them, and extracting the overall consensus). The researchers also chose their targets deliberately, evaluating AI-powered research services like Consensus and Semantic Scholar instead of the usual general-purpose chatbots.
On this front, the results were arguably even worse. The tools would typically find and cite papers that were completely different from those a human picked for a manually created literature review, and different from what the other tools chose, too. And when the researchers ran the same prompts a few days later, the results changed again.
Getting closer to the metal
I think the study is instructive beyond the straightforward takeaways, such as using AI only for short summaries and thinking twice before using AI research apps for literature review.
- Prompt engineering matters: I get that the three different prompts for summaries were probably designed to simulate casual use—the kind of natural-language request a busy journalist might dash off. And maybe AI should ultimately produce good results when you do that. But for out-of-the-box tools (which is what the researchers used), I would recommend more thoughtful prompting.
This doesn’t have to be a big exercise. Simply going over your prompt to make vague language (“short summary”) more precise (“200-word summary”) would help. The researchers did ask for more detail in two of the three prompts, but the study criticizes the longer summaries for not being comprehensive even though the prompts never explicitly asked for comprehensiveness. Asking the AI to check its own work sometimes helps too (there’s a sketch of both tweaks after this list).
- The app layer struggles: Reading the part about the various research apps not producing good results had me nodding along. I don’t want to read too much into this since the study was narrowly focused on research apps with a very specific use case, but I’m currently living through something similar while experimenting with AI content platforms for my plans at The Media Copilot. When you use a third-party tool, you’re an extra step removed from the foundation model, and you miss having the flexibility of being “closer to the metal.”
I think this points to a fundamental misunderstanding of the so-called “app layer.” Most AI apps will put a veil over system prompts and model pickers in the name of simplification, but it isn’t the UX win that many think it is. Yes, more controls might confuse AI newbies, but power users want them, and it turns out the gap between the two groups might not be very large.
I think this same misunderstanding is what stymied the GPT-5 launch. Removing the model picker—where you could pick between GPT-4o, o4-mini, o3, etc.—seemed like a smart, simplifying idea, but it turned out ChatGPT users were more sophisticated than anyone had thought. The average ChatGPT Plus subscriber might not have understood what every model does, but they knew which ones worked for them.
- Iterate, iterate, iterate: The study’s results are helpful, but they’re also incomplete. Testing outputs from models is only the beginning of building an AI workflow. Once you have those outputs, you iterate: adjust your prompts, refine your model choice, and try again. And again. Producing consistent results that save time isn’t something you’ll get right on the first try. Once you’ve found the right combination of prompt, model, and context, you’ll have something repeatable (a rough version of that loop is sketched below).
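To make that concrete, here’s a minimal sketch of one iteration loop, combining the prompt tightening from the first bullet with the compare-and-refine habit from this one: the same document run through a vague prompt, a more precise one, and a precise one that asks the model to check its own work. The prompts, file name, and model are my own illustrative assumptions, not CJR’s.

```python
# Sketch of prompt iteration: compare a vague prompt against more precise variants
# on the same document and see which one reliably lands where you want it.
# Assumes the OpenAI Python SDK; prompts, file name, and model are illustrative only.
from openai import OpenAI

client = OpenAI()
document = open("planning_board_minutes.txt").read()  # hypothetical source document

prompt_variants = {
    "vague": "Give me a short summary of this document.",
    "precise": "Write a 200-word summary of this document, covering every vote taken.",
    "precise + self-check": (
        "Write a 200-word summary of this document, covering every vote taken. "
        "Then re-read your summary and list any claims not supported by the document."
    ),
}

for label, prompt in prompt_variants.items():
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap in the model you're evaluating
        messages=[{"role": "user", "content": f"{prompt}\n\n{document}"}],
    )
    output = response.choices[0].message.content
    print(f"--- {label} ({len(output.split())} words) ---")
    print(output[:300])  # skim each variant; the full accuracy check is still a human job
```

None of this replaces a human read for accuracy. It just makes it obvious, quickly, which combination of prompt and model is worth that human’s time.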
Coming halfway
Where does this leave newsrooms? This might sound self-serving since I train editorial teams for a living, but after reading this report, I’m more convinced than ever that, despite predictions that apps and software design will abstract away prompting, AI literacy still matters. Getting the most out of these tools means equipping journalists with the skills they need to craft effective prompts, evaluate results, and iterate when necessary.
Also, the CJR study is an excellent template for testing tools internally. Get a team together (they don’t need to be technical), craft prompts methodically, and evaluate the outputs—then iterate. Keep experimenting. Find what consistently gets good results: not just quality outputs, but a process that actually saves time. Vibe checks alone won’t get you very far.
Because there is one more thing the study is off-target about. When a journalist considers how to complete a task, the choice usually isn’t between a machine output and a human one. It’s the machine output or nothing at all. Some might say that’s lowering the bar, but it’s also putting a bar in more places. And with some training, experimentation, and iteration, raising it inch by inch.