- In recent years, there has been explosive, exponential growth in medical evidence generation. Applying machine learning (ML) systems to automate the systematic review (SR) process could therefore benefit SR teams by reducing the time and labor required to conduct a SR manually.
- A number of ML systems are already available for performing a SR, some of which (e.g., ML systems for literature search or screening) are relatively more mature than others (e.g., ML systems for information extraction or synthesis).
- The majority of these ML models semi-automate the SR process and adopt a “human-in-the-loop” design, requiring SR reviewers to make the final decision.
- OE is keen to develop ML systems for evidence synthesis. The application of ML systems to automate the SR process is a promising direction, both for ML researchers and for future SR reviewers.
The Cochrane Library defines a systematic review (SR) as the application of strategies which attempt to “identify, appraise and synthesize all the empirical evidence that meets pre-specified eligibility criteria to answer a specific research question.” In a SR, if the results from individual studies can be pooled or aggregated using statistical methods to provide a single best estimate of the effect along with the confidence it warrants, this is called meta-analysis (Guyatt et al., 2015, pp. 460-461).
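In the simplest (fixed-effect) case, the pooled estimate described above is an inverse-variance weighted average of the individual study effects. A minimal sketch in Python, where the effect sizes and standard errors are made-up numbers purely for illustration:

```python
import math

# Hypothetical per-study effect estimates and their standard errors (made-up numbers).
effects = [0.5, 0.3, 0.7]
std_errors = [0.2, 0.1, 0.3]

# Fixed-effect (inverse-variance) pooling: each study is weighted by 1/SE^2,
# so more precise studies contribute more to the pooled estimate.
weights = [1 / se ** 2 for se in std_errors]
pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# 95% confidence interval around the pooled estimate.
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
print(round(pooled, 3), tuple(round(x, 3) for x in ci))  # prints 0.369 (0.201, 0.537)
```

Real meta-analyses typically also consider random-effects models and heterogeneity statistics, which this sketch omits.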
It is well known that a SR, especially one with a meta-analysis, is a critical and key source to provide answers to clinical questions. However, conducting a SR is usually a resource-intensive and time-consuming process. As Lau (2019) claimed, “Given the comprehensive number of healthcare topics requiring formal systematic review and the need to update existing reviews or establish a living systematic review, one can readily see that the demands of research synthesis far outstrip the current supply and capacity of systematic review production, even with new generations of trained reviewers and thousands of Cochrane volunteers, and other groups worldwide.” Given this reality, automation is a promising solution to reduce the time and workload required to conduct SRs and meta-analyses.
Artificial intelligence (AI) refers to “the use of computers and technology to simulate intelligent behavior and critical thinking comparable to a human being” (Amisha et al., 2019). Machine learning (ML), an approach to achieving AI and responsible for the vast majority of AI advancements, is “an umbrella term that refers to a broad range of algorithms that perform intelligent predictions based on a data set. These data sets are often large, perhaps consisting of millions of unique data points” (Nichols et al., 2019). Natural language processing (NLP) is a field of ML, which aims to “enable computers to parse human language as humans do. ... it is composed of many techniques grouped together by this common aim” (Chary et al., 2019).
There has been an increasing interest in applying ML to automate and expedite the SR processes. In this OE Original, we provide an overview of SR automation, specifically focusing on the current state of using machine learning systems in supporting different steps of the SR process.
1. What stages and steps do we need to go through to complete a SR?
The SR process consists of a number of stages, which can be further divided into small steps. However, each step has different features, and not every step is appropriate for automation. Understanding these stages and steps is essential for exploring the possibility and potential of applying ML.
Tsafnat et al. (2014), in a review on SR automation technologies, broke the SR process into 15 steps and classified them into five stages, including preparation, retrieval, appraisal, synthesis and write-up.
Building on Tsafnat et al.’s (2014) classification, Jaspers et al. (2018) added one more step, the critical appraisal of the risk of bias of the included studies, and modified the five stages to preparation, retrieval, screening, synthesis, and write-up.
A recent case study, which reportedly used automation tools to complete a full SR in just two weeks, described about 19 steps (Clark et al., 2020).
We summarized the stages and steps from Clark et al. (2020), Jaspers et al. (2018), and Tsafnat et al. (2014), further optimized the classification, and presented the detailed stages and steps of a SR process in Table 1.
Table 1: Different stages and steps to complete a systematic review
- Formulate review question(s)
- Identify existing systematic reviews on the same review question(s)
- Write the protocol of the systematic review
- Develop search strategies
- Design the data extraction form and pilot it
- Conduct the literature search
- De-duplicate the retrieved records
- Conduct title and abstract screening
- Obtain the full texts
- Conduct full-text screening
- Identify additional eligible studies from other sources, such as reference lists and trial registries
- Extract relevant data
- Assess risk of bias
- Convert extracted data to a common representation (e.g., common format, scale)
- Update the literature search
- Write up the systematic review
The information was summarized from Clark et al. (2020), Jaspers et al. (2018), and Tsafnat et al. (2014)
2. Current state of applying ML in the SR process
2.1 Application of ML at the retrieval stage
Currently, SR automation using ML mainly focuses on the retrieval stage of the SR process. The retrieval stage involves three basic tasks: literature search, study screening, and information extraction.
2.1.1 Literature search
Filtering studies by study design is one of the key areas under intensive investigation. Currently, text classification systems for searching and identifying randomized controlled trials (RCTs) are considered mature SR ML systems (Marshall et al., 2019).
Text classification is one of the core NLP technologies used in automating SRs. Text classification refers to a set of models capable of sorting unstructured texts written in natural language into one or more suitable predefined categories of interest (García Adeva et al., 2014).
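To give a flavour of the idea (this is a toy sketch, not any of the published systems), a minimal bag-of-words classifier can be written in a few lines. The training abstracts, labels, and the naive Bayes model below are hypothetical simplifications:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: abstract snippets labelled RCT vs. non-RCT.
train = [
    ("patients were randomly assigned to treatment or placebo", "RCT"),
    ("participants were randomised in a double blind trial", "RCT"),
    ("we retrospectively reviewed hospital charts", "non-RCT"),
    ("a case report of a rare complication", "non-RCT"),
]

def tokenize(text):
    return text.lower().split()

# Count class frequencies and per-class word frequencies.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for tok in tokenize(text):
        word_counts[label][tok] += 1
        vocab.add(tok)

def predict(text):
    """Multinomial naive Bayes with Laplace smoothing (log-space)."""
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / len(train))  # log prior
        total = sum(word_counts[label].values())
        for tok in tokenize(text):
            # Laplace-smoothed log likelihood of each token given the class.
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("subjects were randomly assigned to two groups"))  # prints RCT
```

Production systems replace this toy model with far richer features and far larger labelled corpora, but the underlying task, mapping unstructured text to a predefined category, is the same.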
An approach combining text classification using ML with the crowdsourcing strategy used in the Cochrane Crowd project (i.e., a Cochrane project involving volunteers manually reviewing descriptions of research to identify and classify clinical trials) has been investigated (Wallace et al., 2017). This study by Wallace et al. showed that, compared to relying on crowdsourcing alone, combining text classification using ML with a crowdsourcing strategy was able to reach a high level of sensitivity in RCT identification (i.e., about 95% to 99% of RCTs were identified) and required substantially less effort (i.e., a 60% to 80% reduction in workload) (Wallace et al., 2017). This hybrid strategy can be accessed via the Cochrane Register of Studies (CRS), but only by authorized personnel.
Cohen et al. (2015) created and validated a text classification model requiring only the abstract, citation, and MeSH terms for each article in MEDLINE. They showed that the model could save time and effort in conducting SRs, identifying about 3% additional RCTs in MEDLINE compared to manual identification (Cohen et al., 2015). Additionally, the model had the potential to find errors in the MEDLINE publication type, as about 5% of articles tagged as RCTs in MEDLINE were not recognized as RCTs by the model (Cohen et al., 2015). The tool based on Cohen et al.’s (2015) model, called RCT tagger, can be freely accessed.
Another freely available literature search tool is called RobotSearch, which is based on Marshall et al.’s 2018 model. Marshall et al. (2018) developed and optimized classification models (i.e., support vector machine and convolutional neural network models) to classify articles based on the titles and abstracts of the RCT set from Cochrane Crowd, and directly compared the models with traditional database search filters. They found that their ML classification model was more accurate than database searches with manually crafted Boolean strings (Marshall et al., 2018).
ML can also be used to search for relevant studies by topic. This application of ML is considered more challenging than searching by study design because topic-related criteria vary across SRs (Marshall et al., 2019). Weißer et al. (2020) proposed and validated a clustering algorithm that applies NLP to article titles, keywords, and abstracts, allowing large article corpora to be automatically broken down into distinct topical groups. Such clustering could help reviewers identify clusters of articles of low or high relevance to the topic of interest.
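To give a flavour of topical grouping (this is not Weißer et al.’s algorithm, which uses much richer NLP features), a toy sketch might cluster titles by token overlap. The titles and seed choices below are invented for illustration:

```python
# Toy sketch: group article titles into topical clusters by Jaccard token overlap.

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

titles = [
    "knee arthroplasty outcomes after surgery",
    "total knee arthroplasty pain outcomes",
    "machine learning for trial screening",
    "deep learning models for abstract screening",
]
tokens = [set(t.lower().split()) for t in titles]

# Use two topically distinct titles as cluster seeds, then assign
# every title to the seed it overlaps with most.
seeds = [tokens[0], tokens[2]]
clusters = [[], []]
for title, toks in zip(titles, tokens):
    best = max(range(len(seeds)), key=lambda i: jaccard(toks, seeds[i]))
    clusters[best].append(title)

print(clusters)  # the two arthroplasty titles group together, as do the two ML titles
```

A real pipeline would use TF-IDF or embedding vectors and a proper clustering algorithm (e.g., k-means) rather than hand-picked seeds, but the goal is the same: partitioning a corpus into topical groups a reviewer can triage.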
2.1.2 Study screening
The application of ML systems to title and abstract screening is mature (Marshall et al., 2019). These ML systems are semi-automatic, so-called human-in-the-loop systems, in which the ML model is trained and optimized through interaction between human reviewers and the ML algorithms (van de Schoot et al., 2021).
Specifically, human reviewers need to screen and label a set of abstracts in order to provide adequate information for the ML model to generate a classifier. The ML model then predicts the relevance of the remaining unscreened abstracts using the classifier and orders them by their relevance. The most relevant abstracts are then presented to the human reviewers to examine and label, which provides additional feedback to further optimize the ML model. This process is an iterative loop that ends when the human reviewers are satisfied that no more eligible studies are being presented to them.
An active learning (AL) approach can also be used in human-in-the-loop systems to determine which set of abstracts should be presented (Marshall et al., 2019; van de Schoot et al., 2021). AL can significantly reduce the total number of records requiring manual screening by actively selecting which data points (in our case, the unscreened abstracts) the human reviewers should examine and label next under the current model parameters (Kremer et al., 2014; Miwa et al., 2014). Under the AL approach, the ML model asks human reviewers to label the set of abstracts that either have the highest probability of being relevant (certainty-based criteria) or whose relevance the model is least sure about, i.e., whose predictions lie closest to the decision boundary (uncertainty-based criteria) (Miwa et al., 2014).
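The two selection criteria can be sketched in a few lines; the abstracts and the model's predicted probabilities below are invented for illustration:

```python
# Hypothetical pool of unscreened abstracts with the model's current
# predicted probability that each one is relevant (made-up numbers).
unlabelled = {
    "abstract_1": 0.95,
    "abstract_2": 0.10,
    "abstract_3": 0.55,
    "abstract_4": 0.80,
}

def select_certain(pool, k):
    """Certainty-based: pick the abstracts most likely to be relevant."""
    return sorted(pool, key=pool.get, reverse=True)[:k]

def select_uncertain(pool, k):
    """Uncertainty-based: pick the abstracts the model is least sure about,
    i.e., with predicted probability closest to the 0.5 decision boundary."""
    return sorted(pool, key=lambda a: abs(pool[a] - 0.5))[:k]

print(select_certain(unlabelled, 2))    # prints ['abstract_1', 'abstract_4']
print(select_uncertain(unlabelled, 2))  # prints ['abstract_3', 'abstract_4']
```

In a screening tool the selected abstracts would be shown to the reviewers, their labels fed back, and the model retrained before the next selection round.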
A number of ML systems adopting an AL approach have been developed to facilitate this stage of the SR process, including Abstrackr, ASReview, Colandr, Rayyan, and RobotAnalyst (Harrison et al., 2020; van de Schoot et al., 2021). These systems can be useful; however, some of them are closed-source applications, which raises concerns about data ownership and transparency (van de Schoot et al., 2021).
Finally, despite the maturity of title and abstract screening, one key concern for human-in-the-loop systems is when it is safe for the human reviewers to conclude that the ML model has been sufficiently trained (Marshall et al., 2019). Stopping the loop too early might result in missing eligible studies. Currently, the optimal stopping point can only be identified retrospectively, once all records have been labeled (Marshall et al., 2019).
2.1.3 Information extraction
In scientific publications, information from RCTs is predominantly disseminated as unstructured free-texts. Automatic information extraction using ML systems aims to produce structured output to satisfy various extraction purposes, such as getting information on trial PICO (text excerpts), getting the value of an outcome (structured data), and assessing risk of bias (analytical tasks) (Marshall et al., 2020).
An in-depth technical discussion of the ML models used in automatic information extraction is beyond the scope of this OE Original. Unlike the text classification methods used in literature search, the ML approaches used for information extraction, often labelled sequence tagging models, incorporate the structure of texts or the sequence of inputs into their predictions (Marshall et al., 2020). In simpler terms, if a particular word is determined to be relevant to the purpose, the probability that neighbouring words are also relevant increases.
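A toy sketch of this intuition (not an actual sequence tagging model such as a CRF, and with made-up per-token scores) might look like:

```python
# Hypothetical per-token relevance scores; a real model learns these from data.
WORD_SCORE = {"pain": 0.6, "intensity": 0.3, "score": 0.3}

def tag(tokens, transition_bonus=0.3, threshold=0.5):
    """Tag each token as part of an OUTCOME span ('OUTCOME') or not ('O').

    A token's score is its own lexicon score plus a bonus when the
    previous token was tagged relevant, mimicking how sequence models
    let context raise the probability of neighbouring tokens."""
    tags = []
    prev_relevant = False
    for tok in tokens:
        score = WORD_SCORE.get(tok, 0.0)
        if prev_relevant:
            score += transition_bonus  # a relevant neighbour raises the odds
        prev_relevant = score >= threshold
        tags.append("OUTCOME" if prev_relevant else "O")
    return tags

# "intensity" alone scores below threshold, but following "pain" it is tagged.
print(tag("the outcome was pain intensity".split()))  # prints ['O', 'O', 'O', 'OUTCOME', 'OUTCOME']
print(tag("signal intensity on mri".split()))         # prints ['O', 'O', 'O', 'O']
```

Actual sequence taggers (e.g., conditional random fields or neural taggers) learn both the per-token scores and the transition behaviour jointly, rather than using fixed hand-set values as here.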
Supervised learning is often used to train information extraction ML models (Marshall et al., 2020). Specifically, this means the ML models are trained on a dataset (e.g., full texts of studies) where the relevant sentence(s) have already been labelled by human reviewers. However, the effort of compiling manually labeled corpora scales with the volume of training data, and the process can be expensive, labor-intensive, and error-prone (Marshall et al., 2020; Spasic et al., 2020).
In the event that a manually labeled dataset is not available, we can also borrow existing structured resources and rules to train the ML models. This is called supervised distant supervision (Wallace et al., 2016). For example, some researchers have exploited the Cochrane Database of Systematic Reviews (CDSR) to derive labels to train their ML models for PICO extraction (Wallace et al., 2016).
A SR conducted by Jonnalagadda et al. (2015), which identified and examined 26 studies describing automatic extraction used in SRs, concluded that NLP techniques have not been “fully utilized to fully or even partially automate the data extraction step of systematic reviews.” Marshall et al. (2019) claimed that “extraction technologies remain in formative stages and are not readily accessible by practitioners.”
Although not mature, a number of studies have proposed ML models for information extraction from clinical trials. Some of them aimed to extract characteristic descriptions of RCTs (Kiritchenko et al., 2010; Marshall et al., 2017); some explored how to extract PICO information (Wallace et al., 2016); some tried to identify text snippets for risks of bias assessment (Marshall et al., 2017); and some worked on extracting outcome measures of RCTs (Summerscales et al., 2011).
2.2 Application of ML at other stages of the SR process
Potentially, ML systems can also be applied to automate other stages and steps of the SR process, such as the synthesis stage. However, these applications are still far from mature (Marshall et al., 2020).
Michie et al. (2017) explored the use of ML models to automatically extract, synthesize, and interpret findings from behaviour change intervention evaluation reports. The project aimed to inform practitioners, researchers, and policymakers about “What works, compared with what, how well, with what exposure, with what behaviours (for how long), for whom, in what settings and why?” (Michie et al., 2017; Michie et al., 2020).
Lehman et al. (2019) also proposed an ML model for automated synthesis that infers whether a treatment works (e.g., whether the intervention of interest significantly increases, has no significant effect on, or significantly decreases an outcome) from RCT reports written in natural language.
3. Application of ML at OrthoEvidence
At OE, we are dedicated to applying ML to draw insights from RCTs to inform clinicians and researchers in the field of musculoskeletal disease.
3.1 We developed a working version of the "Human-assisted Artificial Intelligence Cumulative Meta-analysis" tool. Leveraging the data extraction and tagging system, the tool automatically generates reports on any possible comparison between any two treatments. Using human-assisted AI based on supervised learning, the tool automatically identifies interesting reports with temporal trends and detects important changes in their comparative effects over time as new studies are published.
3.2 We developed a working version of a "Trending Topics" tool. Again, leveraging the tagging system and tracking of ACE report views for supervised learning, the tool was developed to track topics that are trending in the orthopedic literature and also what's popular on the OE website based on user views.
3.3 We built a Tag Extraction algorithm involving machine learning techniques (entity extraction frameworks such as Open Calais, IBM Watson/Alchemy, MetaMap, and other text extraction technologies). Tags (topics) under five categories (drugs, anatomical terms, general, company, and device) have been extracted from ACE reports. OE performed sampled reviews by domain experts, which showed high accuracy (0% false positives, 92% recall).
All of the applications are currently undergoing testing and optimization. We believe once completed, these ML systems will greatly benefit our orthopedic community.
4. Closing Remarks
In the present OE Original, we introduced the current state of applying ML models in automating the SR process. We decided not to focus on the technical details of ML models and instead wrote this OE Original from the standpoint of SR reviewers. We hope to inform SR reviewers about the existence and availability of ML systems for SR automation and to help them realize that ML systems have the potential to assist them in conducting SRs in an automated, relatively fast, labor-saving, and highly efficient manner.
A number of ML systems, especially those automating the retrieval stage of the SR process, are already available for SR reviewers. Some of them, such as ML systems for literature search or screening, are relatively more mature than others (e.g., ML systems for information extraction or synthesis). The majority of the ML models for SR automation are semi-automated and have “human-in-the-loop” features, which allow the human SR conductors to make the final decision.
Marshall et al. (2019) have called for pioneering SR reviewers to adopt and pilot some of the ML systems to expedite their SR process. However, promoting the adoption of ML systems in conducting SRs faces a major barrier: the absence of trust in automation among SR reviewers (O’Connor et al., 2019). To solve this issue, the establishment of a trusted evidence base and leadership from pioneering SR reviewers have been suggested (O’Connor et al., 2019).
Nonetheless, the application of ML to automate the SR process, which has great potential to reduce the time and labor required to conduct SRs manually, is without doubt an important and promising direction, both for ML researchers and for future SR conductors. OE echoes Marshall et al.’s (2019) call and invites SR conductors and clinicians to pilot current ML systems and provide feedback where possible, which may greatly improve and optimize current ML systems for SR automation.
Amisha, et al. (2019). Overview of artificial intelligence in medicine. Journal of family medicine and primary care, 8(7), 2328-2331. doi:10.4103/jfmpc.jfmpc_440_19
Chary, M., et al. (2019). A Review of Natural Language Processing in Medical Education. The western journal of emergency medicine, 20(1), 78-86. doi:10.5811/westjem.2018.11.39725
Clark, J., et al. (2020). A full systematic review was completed in 2 weeks using automation tools: a case study. Journal of Clinical Epidemiology, 121, 81-90. doi:10.1016/j.jclinepi.2020.01.008
Cohen, A. M., et al. (2015). Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. Journal of the American Medical Informatics Association : JAMIA, 22(3), 707-717. doi:10.1093/jamia/ocu025
García Adeva, J. J., et al. (2014). Automatic text classification to support systematic reviews in medicine. Expert Systems with Applications, 41(4, Part 1), 1498-1508. doi:10.1016/j.eswa.2013.08.047
Guyatt, G., et al. (2015). Users' Guides to the Medical Literature (3rd ed.), Chapter 22: The Process of a Systematic Review and Meta-analysis. The JAMA Network.
Harrison, H., et al. (2020). Software tools to support title and abstract screening for systematic reviews in healthcare: an evaluation. BMC Medical Research Methodology, 20(1), 7. doi:10.1186/s12874-020-0897-3
Jaspers, S., et al. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in European Food Safety Authority (EFSA). Retrieved from https://efsa.onlinelibrary.wiley.com/doi/pdf/10.2903/sp.efsa.2018.EN-1427
Jonnalagadda, S. R., et al. (2015). Automating data extraction in systematic reviews: a systematic review. Systematic Reviews, 4(1), 78. doi:10.1186/s13643-015-0066-7
Kiritchenko, S., et al. (2010). ExaCT: automatic extraction of clinical trial characteristics from journal publications. BMC Med Inform Decis Mak, 10, 56. doi:10.1186/1472-6947-10-56
Kremer, J., et al. (2014). Active learning with support vector machines. WIREs Data Mining and Knowledge Discovery, 4(4), 313-326. doi:10.1002/widm.1132
Lau, J. (2019). Editorial: Systematic review automation thematic series. Systematic Reviews, 8(1), 70. doi:10.1186/s13643-019-0974-z
Lehman, E., et al. (2019). Inferring which medical treatments work from reports of clinical trials. 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics Minneapolis, MN.
Marshall, I. J., et al. (2020). Semi-Automated evidence synthesis in health psychology: current methods and future prospects. Health psychology review, 14(1), 145-158. doi:10.1080/17437199.2020.1716198
Marshall, I. J., et al. (2017). Automating Biomedical Evidence Synthesis: RobotReviewer. Proc Conf Assoc Comput Linguist Meet, 2017, 7-12. doi:10.18653/v1/P17-4002
Marshall, I. J., et al. (2018). Machine learning for identifying Randomized Controlled Trials: An evaluation and practitioner's guide. Research Synthesis Methods, 9(4), 602-614. doi:10.1002/jrsm.1287
Marshall, I. J., et al. (2019). Toward systematic review automation: A practical guide to using machine learning tools in research synthesis. Systematic Reviews, 8(1), 163. doi:10.1186/s13643-019-1074-9
Michie, S., et al. (2017). The Human Behaviour-Change Project: harnessing the power of artificial intelligence and machine learning for evidence synthesis and interpretation. Implementation science : IS, 12(1), 121-121. doi:10.1186/s13012-017-0641-5
Michie, S., et al. (2020). The Human Behaviour-Change Project: An artificial intelligence system to answer questions about changing behaviour [version 1; peer review: not peer reviewed]. Wellcome Open Research, 5(122). doi:10.12688/wellcomeopenres.15900.1
Miwa, M., et al. (2014). Reducing systematic review workload through certainty-based screening. Journal of Biomedical Informatics, 51, 242-253. doi:10.1016/j.jbi.2014.06.005
Nichols, J. A., et al. (2019). Machine learning: applications of artificial intelligence to imaging and diagnosis. Biophys Rev, 11(1), 111-118. doi:10.1007/s12551-018-0449-9
O’Connor, A. M., et al. (2019). A question of trust: can we build an evidence base to gain trust in systematic review automation technologies? Systematic Reviews, 8(1), 143. doi:10.1186/s13643-019-1062-0
Spasic, I., et al. (2020). Clinical Text Data in Machine Learning: Systematic Review. JMIR medical informatics, 8(3), e17984-e17984. doi:10.2196/17984
Summerscales, R. L., et al. (2011). Automatic Summarization of Results from Clinical Trials. Paper presented at the 2011 IEEE International Conference on Bioinformatics and Biomedicine.
Thomas, J., et al. (2020). Machine learning reduced workload with minimal risk of missing studies: development and evaluation of a randomized controlled trial classifier for Cochrane Reviews. Journal of Clinical Epidemiology. doi:10.1016/j.jclinepi.2020.11.003
Tsafnat, G., et al. (2014). Systematic review automation technologies. Syst Rev, 3(74). doi:10.1186/2046-4053-3-74
van de Schoot, R., et al. (2021). An open source machine learning framework for efficient and transparent systematic reviews. Nature Machine Intelligence, 3(2), 125-133. doi:10.1038/s42256-020-00287-7
Wallace, B. C., et al. (2016). Extracting PICO Sentences from Clinical Trial Reports using Supervised Distant Supervision. Journal of machine learning research : JMLR, 17, 132.
Wallace, B. C., et al. (2017). Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach. Journal of the American Medical Informatics Association : JAMIA, 24(6), 1165-1168. doi:10.1093/jamia/ocx053
Weißer, T., et al. (2020). A clustering approach for topic filtering within systematic literature reviews. MethodsX, 7, 100831. doi:10.1016/j.mex.2020.100831