Artificial intelligence for the science of evidence synthesis: how good are AI-powered tools for automatic literature screening? | BMC Medical Research Methodology

The integration of artificial intelligence (AI) in evidence synthesis has become increasingly relevant, particularly in the context of automatic literature screening. Recent studies assess the performance of various AI-powered tools designed to assist researchers in efficiently identifying relevant literature while reducing the burden of manual screening. This report delves into the findings from an investigation of five prominent automated literature screening tools: ChatGPT, Claude, Gemini, DeepSeek, and Robotsearch.

Keywords: AI-powered literature screening, evidence synthesis, automatic literature screening

Overview of AI-Powered Literature Screening Tools

AI technology, especially large language models (LLMs), has shown considerable promise in speeding up literature review processes. The study evaluated the efficiency and accuracy of selected AI tools, comparing their performance with traditional literature screening methods.

Screening Efficiency

Screening efficiency is a critical metric in this research. The study revealed that ChatGPT, Gemini, and DeepSeek led in processing speed, completing evaluations in less than three seconds per article. In contrast, Claude lagged behind, processing articles in about six seconds. For context, human reviewers took approximately two weeks to screen the same dataset, underscoring the time-saving potential of AI tools.

While AI tools can facilitate faster screening, they still require human supervision. Future research aims to explore how human and AI-driven methods can be better integrated to further improve screening performance.

False Negative and False Positive Rates

The study also examined both the false negative fraction (FNF) and the false positive fraction (FPF) of the AI tools. Robotsearch achieved the lowest FNF at 6.4%, while Gemini had the highest at 13.0%. A lower FNF is vital for an effective automated literature screening model because it reduces the manual workload for human reviewers.
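These two fractions follow directly from the confusion-matrix counts of a screening run. As a minimal sketch (the counts below are hypothetical, not taken from the study), the FNF is the share of truly relevant articles a tool wrongly excluded, and the FPF is the share of truly irrelevant articles it wrongly included:

```python
def screening_error_fractions(tp, fp, tn, fn):
    """Compute the false negative fraction (FNF) and false positive
    fraction (FPF) from literature-screening counts.

    tp: relevant articles correctly included
    fp: irrelevant articles wrongly included
    tn: irrelevant articles correctly excluded
    fn: relevant articles wrongly excluded
    """
    fnf = fn / (fn + tp)  # miss rate among truly relevant articles
    fpf = fp / (fp + tn)  # false-alarm rate among irrelevant articles
    return fnf, fpf

# Hypothetical counts for illustration only:
fnf, fpf = screening_error_fractions(tp=90, fp=3, tn=97, fn=10)
print(f"FNF = {fnf:.1%}, FPF = {fpf:.1%}")  # → FNF = 10.0%, FPF = 3.0%
```

A low FNF matters most in evidence synthesis, since a wrongly excluded relevant article is usually never seen again, whereas a false positive only costs a human reviewer a second look.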

Interestingly, the tools demonstrated lower FPF rates compared to traditional methods, particularly Robotsearch. The four LLMs assessed maintained FPF rates below 4%, which significantly contributes to reducing the number of articles requiring human review. The ability of LLMs to capture contextual information, unlike traditional methods that often segment text, allows for a more nuanced understanding of the literature.

Diagnostic Performance

Using Youden’s Index, which evaluates both sensitivity and specificity, the performance of different AI tools was measured. ChatGPT emerged as the most balanced with an index of 0.89, reflecting its robust literature screening capabilities. The Number Needed to Screen (NNS) metric further illustrated ChatGPT’s efficiency, requiring the least manual re-screening at 1.123, compared to Robotsearch at 1.4.
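Both metrics are simple to compute. Youden's J statistic is sensitivity plus specificity minus one; for the NNS, the study's exact formula is not reproduced here, so the sketch below uses one common definition (articles a reviewer must re-check per truly relevant article retained, i.e. the reciprocal of positive predictive value). All numeric inputs are hypothetical:

```python
def youden_index(sensitivity, specificity):
    # Youden's J statistic: balances sensitivity and specificity;
    # ranges from 0 (no discriminative value) to 1 (perfect).
    return sensitivity + specificity - 1

def number_needed_to_screen(tp, fp):
    # Assumed definition: included articles per true positive,
    # i.e. 1 / positive predictive value. Lower is better.
    return (tp + fp) / tp

# Hypothetical values for illustration only:
j = youden_index(sensitivity=0.93, specificity=0.96)      # ≈ 0.89
nns = number_needed_to_screen(tp=400, fp=50)              # 1.125
```

Under this reading, an NNS near 1 (like ChatGPT's 1.123) means almost every article the tool flags for inclusion is genuinely relevant, so very little manual re-screening is wasted.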

Other performance metrics like Risk Difference (RD) and Risk Ratio (RR) also reinforced the effectiveness of ChatGPT and Claude, indicating their reliability in literature screening tasks.
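The report does not spell out how RD and RR were operationalized; as a generic sketch, assuming they compare an error rate (or any per-article rate) between an AI tool and human screening:

```python
def risk_difference(p_tool, p_human):
    # Absolute difference in a rate between AI-assisted and human
    # screening; a negative value favors the tool.
    return p_tool - p_human

def risk_ratio(p_tool, p_human):
    # Relative rate; values below 1 favor the tool.
    return p_tool / p_human

# Hypothetical error rates for illustration only:
rd = risk_difference(p_tool=0.05, p_human=0.10)  # -0.05
rr = risk_ratio(p_tool=0.05, p_human=0.10)       # 0.5
```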

Comparison of Tools

The analysis indicated that both ChatGPT and DeepSeek emerged as strong contenders for literature screening, particularly with the latter’s upgraded V3 version. While both excelled, comprehensive testing on larger and more diverse datasets is still necessary to confirm these findings universally.

Advantages of AI in Literature Screening

AI tools provide significant advantages in the literature screening process:

  1. Efficiency: They can rapidly process large volumes of literature, thus supporting extensive research initiatives.

  2. Reduction of Human Error: By mitigating risks associated with fatigue and personal biases, AI enhances the objectivity of the screening process.

  3. Cost Reduction: Leveraging AI contributes to cost-effective research strategies.

  4. Continuous Learning: AI tools can learn and adapt to evolving research trends, improving their applicability over time.

Limitations and Future Research Directions

Certain limitations of this study should be acknowledged. Firstly, the literature dataset was limited to a retraction database as of April 26, 2023, without representation from other disease types, which could skew outcomes. Secondly, the variability in outputs generated by LLMs was not fully assessed, nor was the consistency of results across multiple iterations. Finally, the screening approach focused primarily on randomized controlled trials (RCTs), leaving the effectiveness of this methodology for other literature types unverified.

Future research will need to address these limitations by diversifying datasets, examining the consistency of results from LLMs, and exploring the implications of screening methodologies across different types of research articles.

Conclusion

The findings from this investigation underscore the considerable potential of AI-powered tools in enhancing the efficiency and accuracy of literature screening in evidence synthesis. While tools like ChatGPT and DeepSeek are showing significant promise, the current technology must still work in tandem with human expertise to optimize literature screening outcomes effectively. The ongoing integration of AI into research methodologies will not only facilitate more effective literature reviews but may also redefine the research landscape as a whole.

Looking to the future, the development of automated literature screening tools should involve continuous refinement and human collaboration to unlock their full potential. The synergy of AI and human insight will likely pave the way for more efficient and effective research practices across disciplines.
