Amplifying Human Expertise
Just finished reading a fascinating paper, “Towards Artificial Intelligence Research Assistant for Expert-Involved Learning” (ARIEL), which explores how AI, particularly large language models (LLMs) and large multimodal models (LMMs), can act as a scientific co-pilot in biomedical research. Some of its findings reinforced my assumptions; others prompted me to think more deeply about how AI can scale human expertise.
Here are a few reflections and takeaways:
🔬 The Two Datasets
The authors evaluated AI models across two distinct datasets:
- Long-form biomedical articles (focused on COVID-19 and healthcare), to test summarization and reasoning abilities.
- Biomedical figures from multiomics studies (e.g., genomics + proteomics), used to assess multimodal reasoning capabilities.
They benchmarked a variety of open- and closed-source models, with a clear focus on evaluating reasoning, summarization, and figure interpretation.
| Task | Best Performer | Fine-tuned Open-Source Performance | Takeaway |
|---|---|---|---|
| Text summarization | ChatGLM-LoRA (open) | Outperformed GPT-4 and Claude 3.5 | Domain fine-tuning works extremely well |
| Figure understanding | o1 (closed) | Open-source models lag far behind | Test-time computation helps |
🧪 Fine-tuning Can Beat the Giants
The standout open-source model in their tests? ChatGLM, a 9B-parameter model. After fine-tuning on domain-specific biomedical data with LoRA (low-rank adaptation), the resulting ChatGLM-LoRA outperformed even closed-source SOTA models like GPT-4 and Claude 3.5 across multiple text summarization benchmarks.
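To make the idea concrete, here's a minimal sketch of a LoRA setup using Hugging Face's `peft` library. The checkpoint name, target modules, and hyperparameters below are my own illustrative assumptions, not the paper's exact configuration:

```python
# Minimal LoRA setup sketch (assumed hyperparameters, not the paper's exact recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "THUDM/glm-4-9b-chat"  # assumed checkpoint; the paper uses a 9B ChatGLM variant
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# LoRA freezes the base weights and injects small trainable low-rank matrices
# into the attention projections, so only a tiny fraction of parameters train.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    target_modules=["query_key_value"],   # fused attention projection in ChatGLM-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

From here, training proceeds as usual (e.g., with the standard `Trainer`) on domain-specific summarization pairs; the low-rank adapters are all that get updated.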
This is huge: with access to the right data and tools, high-quality open-source models can not only match but surpass proprietary alternatives.
🧠 Augmenting Human Intelligence, Not Replacing It
One of the most interesting experiments involved comparing AI ensembles against human expert ensembles. Even the best AI ensemble couldn’t outperform the top-performing human. But here’s the twist: human performance improved significantly with AI assistance.
In essence, AI can elevate even average domain experts to the level of top experts — not by replacing them, but by augmenting their decision-making with rapid summarization, figure comprehension, and hypothesis generation.
📐 Evaluating Text with Embeddings
The authors used LLM-generated embeddings and cosine similarity to compare AI- and human-generated summaries, a useful reminder that embedding similarity remains a de facto approach for evaluating free-form text in research, and perhaps even in clinical workflows.
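In practice this is only a few lines of code. Here's a hedged sketch: the paper uses LLM-generated embeddings, while I've swapped in a `sentence-transformers` model as a readily available stand-in, and the example summaries are invented for illustration:

```python
# Embedding-based comparison of two summaries via cosine similarity.
# sentence-transformers is a stand-in for the paper's LLM-generated embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

ai_summary = "The study links a proteomic signature to COVID-19 severity."
human_summary = "Researchers identified protein markers associated with severe COVID-19."

a, b = model.encode([ai_summary, human_summary])
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")  # closer to 1.0 = more semantically similar
```

The appeal is scalability: the same scoring loop works across thousands of summary pairs with no human rater in the loop.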
🚀 Lessons for Healthcare + AI
Here are a few takeaways that resonated with me for real-world applications in healthcare and biomedical AI:
- Fine-tuned open-source models can outperform closed-source models if the domain-specific data is strong.
- Embedding-based comparison offers a scalable way to evaluate free-text clinical notes and AI summaries.
- AI is a powerful amplifier, but it’s not a great equalizer. Those who learn to use these tools well will see massive productivity gains. Those who don’t will fall behind — widening, not narrowing, the performance gap.
🧭 Final Thought
I believe we’re at a stage where AI won’t flatten expertise, but rather magnify it. Those who master these tools will build, discover, and understand faster than ever — and it’s on all of us to stay on the right side of that curve.