A human-LLM collaborative annotation approach for screening articles on precision oncology randomized controlled trials

Data source and screening criteria

In this study, we validated our method by screening articles on precision oncology randomized controlled trials (RCTs). Since there are no specific subject terms for “precision” in this context, we retrieved articles from PubMed [17] using the search query: “randomized controlled trial”[pt] AND “cancer”[MeSH Major Topic] AND “humans”[mh]. This search yielded 23,521 articles published between January 1, 2012, and December 31, 2023.
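The retrieval step can be reproduced with standard tooling. The sketch below uses Biopython's Entrez module to run the same query against PubMed; the authors do not specify their retrieval pipeline, so this code is illustrative only.

```python
# Illustrative PubMed retrieval with Biopython's Entrez module (not
# necessarily the pipeline used in the study).
from Bio import Entrez

Entrez.email = "your.email@example.org"  # required by NCBI; placeholder

QUERY = ('"randomized controlled trial"[pt] AND "cancer"[MeSH Major Topic] '
         'AND "humans"[mh]')

handle = Entrez.esearch(
    db="pubmed",
    term=QUERY,
    datetype="pdat",           # restrict by publication date
    mindate="2012/01/01",
    maxdate="2023/12/31",
    retmax=10000,              # NCBI caps esearch at 10,000 IDs per request;
)                              # larger sets need paging or the history server
record = Entrez.read(handle)

print("Total matching articles:", record["Count"])
pmids = record["IdList"]
```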

To identify articles on precision oncology RCTs from this set, we established four criteria, shown in Table 1: (1) the article must be a randomized controlled trial, (2) the study population must be cancer patients, (3) the study purpose must be cancer treatment evaluation, and (4) the study must involve biomarkers related to genetic and molecular characteristics. An article is considered a precision oncology RCT only if it meets all four criteria.

Table 1 The screening criteria for precision oncology RCTs
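In practice, the screening decision is a simple conjunction of the four per-criterion judgments. The sketch below illustrates this rule; the criterion names are hypothetical, not the authors' variables.

```python
# Illustrative only: an article is labeled a precision oncology RCT only when
# all four criteria from Table 1 are satisfied.
CRITERIA = ("is_rct", "cancer_patient_population",
            "treatment_evaluation_purpose", "molecular_biomarker")

def is_precision_oncology_rct(judgments: dict) -> bool:
    """judgments maps each criterion name to True/False."""
    return all(judgments[criterion] for criterion in CRITERIA)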

During the manual annotation process, experts assessed whether each article met these criteria based on the title and abstract. To ensure accuracy and reliability, two experts conducted the initial annotations independently. Any discrepancies were reviewed by an additional annotator, who made the final decision. We used the MedTator [18] annotation tool for this process.

Design of ChatGPT prompt

Using the OpenAI API (https://platform.openai.com/docs/api-reference), we selected “gpt-3.5-turbo” as our base model. We used the role attribute of the message objects to define prompts for the “system” and “user” roles. The “system” role set the context and guidelines, establishing ChatGPT’s persona and relevant background information. The “user” role described the specific requirements, detailing the task and the expected response format. In our API implementation, we kept the top-p parameter at its default value of 1 and set the temperature to 0, a configuration that minimizes the variability of the returned responses. For additional details on the API usage and settings, please refer to our code repository on GitHub.
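A minimal sketch of this API configuration, using the OpenAI Python client (openai >= 1.0), is shown below; the prompt texts are placeholders, and the authors' exact prompts and settings are available in their GitHub repository.

```python
# Illustrative call mirroring the configuration described above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,   # minimize variability of responses
    top_p=1,         # default value
    messages=[
        {"role": "system", "content": "<context, guidelines, and persona>"},
        {"role": "user", "content": "<task description and expected response format>"},
    ],
)
print(response.choices[0].message.content)
```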

Fig. 2 The schema for article screening using ChatGPT

For the task of screening articles on precision oncology RCTs, the prompt design for the “system” and “user” messages is illustrated in Fig. 2. The “system” message defined the LLM’s role as “an expert annotator specializing in scientific article content analysis” and included the specific article content: the title and abstract. The “user” message specified the screening and annotation task, requiring a determination of whether an article meets all criteria. We provided the expected response format in JSON, along with an example. To ensure consistent response formatting, we instructed the model to answer with yes/no options and appended “Answer:” after the article content to mark the starting point of the response. This structure helps the model recognize where to generate its response and avoids confusion when handling long inputs.
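The sketch below reconstructs this prompt structure as Python string templates. The exact wording, JSON keys, and placement of the article content and the “Answer:” marker are illustrative; the authors' actual prompts are shown in Fig. 2 and their GitHub repository.

```python
# Illustrative templates, not the authors' exact prompts.
SYSTEM_TEMPLATE = (
    "You are an expert annotator specializing in scientific article content analysis.\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n"
)

USER_TEMPLATE = (
    "Determine whether the article meets ALL of the following criteria, "
    "answering each with yes or no.\n"
    "1. The article is a randomized controlled trial.\n"
    "2. The study population is cancer patients.\n"
    "3. The study purpose is cancer treatment evaluation.\n"
    "4. The study involves biomarkers related to genetic and molecular characteristics.\n"
    'Respond in JSON, for example: {"criterion_1": "yes", "criterion_2": "yes", '
    '"criterion_3": "no", "criterion_4": "no"}\n'
    "Answer:"
)
```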

Prompt optimization

To achieve near-perfect recall and high precision, we iteratively and manually refined the LLM’s prompt using a human-annotated standard dataset. The dataset was divided into a tuning set and a validation set, and the LLM was initially prompted to annotate both sets. Performance was assessed by comparing the labels generated by the LLM with the human annotations for both sets. If the performance metrics (recall and precision) were satisfactory, the process was concluded; if not, the prompt was revised based on an analysis of misclassifications in the tuning set. This cycle of evaluation and prompt refinement was repeated until the model demonstrated consistently high performance, characterized by near-perfect recall and high precision.
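The evaluation step of this loop can be expressed compactly; the sketch below compares LLM labels against human labels with scikit-learn, using an illustrative function name and label encoding (1 = precision oncology RCT).

```python
# Illustrative evaluation of LLM annotations against human annotations.
from sklearn.metrics import precision_score, recall_score

def evaluate(human_labels, llm_labels):
    """Both arguments are lists of 0/1 labels; returns recall and precision."""
    return {
        "recall": recall_score(human_labels, llm_labels),
        "precision": precision_score(human_labels, llm_labels),
    }

# The prompt is then revised based on tuning-set misclassifications, and the
# loop repeats until recall is near-perfect and precision is high on both sets.
```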

During the iterative prompt optimization process, we focused on three levels of refinement and adjustment for the LLM’s prompt. The first level addressed the structure of the prompt framework, such as whether the specific content of the article (title and abstract) should be included in the “system” message or the “user” message, and whether GPT should be required to provide reasoning for its answers. The second level addressed how to determine whether an article meets multiple criteria—whether to assess each criterion independently or simultaneously. The third level focused on refining the conceptual description of each criterion and providing corresponding examples when the concepts were ambiguous, enabling the model to accurately classify the articles.
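To make the second level concrete, the sketch below contrasts the two assessment strategies. The ask_llm argument is a hypothetical helper that sends a prompt to the model and returns its judgment; none of these names come from the authors' code.

```python
# Illustrative contrast between simultaneous and independent criterion checks.
CRITERIA = {
    "rct": "the article is a randomized controlled trial",
    "cancer_population": "the study population is cancer patients",
    "treatment_evaluation": "the study purpose is cancer treatment evaluation",
    "molecular_biomarker": "the study involves genetic or molecular biomarkers",
}

def assess_simultaneously(article_text, ask_llm):
    # Single call: the model judges all four criteria in one response.
    question = "Answer yes/no for each criterion: " + "; ".join(CRITERIA.values())
    return ask_llm(article_text, question)

def assess_independently(article_text, ask_llm):
    # One call per criterion: the model judges each criterion on its own.
    return {name: ask_llm(article_text, f"Does the article meet this criterion: {desc}?")
            for name, desc in CRITERIA.items()}
```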

Collaborative annotation

It is important to emphasize that our collaborative annotation approach is specifically designed for article screening tasks with a low prevalence of positive samples, where articles meeting the criteria make up only a small fraction of the retrieved set. The collaborative annotation process is as follows: using the optimized prompt developed in the Prompt optimization section, we employed the LLM to annotate the articles. Given the near-perfect recall achieved by our model, the negative samples identified by the LLM are almost entirely accurate. Although errors may occur among the LLM-annotated positive samples, they are relatively rare, and by manually verifying these positive samples we can effectively correct any misclassifications. This combined approach of LLM pre-annotation followed by manual validation significantly reduces the overall workload for article screening.
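The workflow can be summarized in a few lines; the function names below are illustrative, with llm_label and manual_review standing in for the LLM call and the expert check respectively.

```python
# Illustrative collaborative annotation workflow: LLM-negative articles are
# accepted without review, while the (rare) LLM-positive articles are routed
# to manual verification.
def collaborative_annotation(articles, llm_label, manual_review):
    final_labels = []
    for article in articles:
        if llm_label(article):
            # Positives are rare; manual review corrects occasional false positives.
            final_labels.append(manual_review(article))
        else:
            # Negatives are trusted because near-perfect recall means missed
            # positives are very rare.
            final_labels.append(False)
    return final_labels
```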

Fine-tuning of the supervised model

To further validate the reliability of the human-LLM collaborative annotation data, we trained a supervised model on the collaboratively annotated articles and assessed its performance. We selected the BioBERT [19] model, known for its excellent performance in previous studies [6, 20], and fine-tuned it for the classification task. During preprocessing, we concatenated each article’s title and abstract and fed the result into the BioBERT model. The model generates a probability for each category, and if the probability of a category exceeds a threshold, the article is assigned the corresponding label. We conducted hyperparameter tuning to enhance the model’s reliability, carefully selecting and adjusting hyperparameters such as the learning rate, batch size, and regularization strength to achieve accurate and meaningful results.
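A minimal fine-tuning sketch with the Hugging Face Transformers library is shown below, assuming the publicly available dmis-lab/biobert-base-cased-v1.1 checkpoint; the hyperparameter values are placeholders rather than the tuned values from the study, and the training data wiring is left out.

```python
# Illustrative BioBERT fine-tuning setup for binary article classification.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def encode(title, abstract):
    # Titles and abstracts are concatenated before being fed to the model.
    return tokenizer(title + " " + abstract, truncation=True,
                     padding="max_length", max_length=512)

args = TrainingArguments(
    output_dir="biobert-precision-oncology",
    learning_rate=2e-5,              # illustrative; tuned in the study
    per_device_train_batch_size=16,  # illustrative; tuned in the study
    weight_decay=0.01,               # regularization strength
    num_train_epochs=3,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```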
