It’s Not Rocket Science, But We Don’t Know What It Is Yet!

🧠 AI Research Assistants and Human Experts Will Be Best Friends!

Amplifying Human Expertise

Paper Link

Just finished reading a fascinating paper titled “Towards Artificial Intelligence Research Assistant for Expert-Involved Learning” (ARIEL), which explores how AI — particularly large language models (LLMs) and large multimodal models (LMMs) — can act as scientific co-pilots in biomedical research. Some of its findings reinforced my assumptions, and others prompted me to think deeply about the impact of AI on scaling human expertise.

Here are a few reflections and takeaways:

🔬 The Two Datasets

The authors evaluated AI models across two distinct datasets:

  1. Long-form biomedical articles (focused on COVID-19 and healthcare), to test summarization and reasoning abilities.
  2. Biomedical figures from multiomics studies (e.g., genomics + proteomics), used to assess multimodal reasoning capabilities.

They benchmarked a variety of open- and closed-source models, with a clear focus on evaluating reasoning, summarization, and figure interpretation.

| Task | Best Performer | Fine-tuned Open-source Performance | Takeaway |
| --- | --- | --- | --- |
| Text summarization | ChatGLM-LoRA (open) | Better than GPT-4, Claude 3.5 | Fine-tuning works extremely well |
| Figure understanding | o1 (closed) | Open-source models far behind | Test-time computation helps |

🧪 Fine-tuning Can Beat the Giants

The standout open-source model in their tests? ChatGLM, a 9B-parameter model. After fine-tuning on domain-specific biomedical data with LoRA (low-rank adaptation), the resulting ChatGLM-LoRA outperformed even closed-source SOTA models like GPT-4 and Claude 3.5 across multiple text summarization benchmarks.

This is huge: with access to the right data and tools, high-quality open-source models can not only match, but surpass, proprietary alternatives.
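The core trick behind LoRA is worth spelling out: instead of updating a large frozen weight matrix W, you learn a small low-rank delta B·A and add it (scaled) at inference time. Here's a minimal NumPy sketch of that idea — the dimensions, rank, and scaling factor are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight (tiny illustrative size; real layers are far larger).
d_out, d_in = 64, 64
W = rng.standard_normal((d_out, d_in))

# LoRA: learn a low-rank update B @ A instead of touching W.
r, alpha = 8, 16                       # rank and scaling factor (hyperparameters)
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))               # B starts at zero, so the delta is zero at init

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, applied without ever
    # materializing the full-rank update.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
# At initialization the adapter contributes nothing:
assert np.allclose(lora_forward(x), x @ W.T)

# Only A and B are trained: r * (d_in + d_out) numbers vs d_in * d_out for W.
print("trainable:", r * (d_in + d_out), "vs full:", d_in * d_out)
```

That parameter count is why fine-tuning a 9B model on biomedical text becomes tractable: only the small adapter matrices need gradients and optimizer state, while the base weights stay frozen.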

🧠 Augmenting Human Intelligence, Not Replacing It

One of the most interesting experiments involved comparing AI ensembles against human expert ensembles. Even the best AI ensemble couldn’t outperform the top-performing human. But here’s the twist: human performance improved significantly with AI assistance.

In essence, AI can elevate even average domain experts to the level of top experts — not by replacing them, but by augmenting their decision-making with rapid summarization, figure comprehension, and hypothesis generation.

📐 Evaluating Text with Embeddings

The authors used LLM-generated embeddings and cosine similarity to compare AI- and human-generated summaries — a useful reminder that this remains the de facto approach for evaluating free-form text in research and perhaps even clinical workflows.
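The comparison itself is simple once you have the embeddings: score how closely two summaries point in the same direction in vector space. A minimal sketch, assuming the vectors come from some embedding model (the toy three-dimensional vectors below are purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # 1.0 = same direction (semantically similar), 0.0 = orthogonal.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice these would be high-dimensional embeddings of the two
# summaries produced by an LLM embedding model; toy vectors here.
ai_summary_emb    = np.array([0.8, 0.1, 0.6])
human_summary_emb = np.array([0.7, 0.2, 0.6])

score = cosine_similarity(ai_summary_emb, human_summary_emb)
print(f"similarity: {score:.3f}")  # close to 1.0 for near-identical summaries
```

Because the metric is a single dot product per pair, it scales easily to thousands of AI-vs-human summary comparisons — which is exactly what makes it attractive for research and clinical evaluation pipelines.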

🚀 Lessons for Healthcare + AI

Here are a few takeaways that resonated with me for real-world applications in healthcare and biomedical AI:

  • Fine-tuned open-source models can outperform closed-source models if the domain-specific data is strong.
  • Embedding-based comparison offers a scalable way to evaluate free-text clinical notes and AI summaries.
  • AI is a powerful amplifier, but it’s not a great equalizer. Those who learn to use these tools well will see massive productivity gains. Those who don’t will fall behind — widening, not narrowing, the performance gap.

🧭 Final Thought

I believe we’re at a stage where AI won’t flatten expertise, but rather magnify it. Those who master these tools will build, discover, and understand faster than ever — and it’s on all of us to stay on the right side of that curve.
