In this section, we detail our method. Notably, our study is based on data obtained from open-source websites, as listed in Supplementary Table 5. Therefore, the relevant ethical regulations are governed by the original data-uploading processes outlined in each dataset’s collection pipeline (please refer to each dataset website in Supplementary Table 5 for more details). Specifically, Radiopaedia, which provides the data forming the main component of our newly proposed dataset, is a peer-reviewed, open-edit radiology resource collection website. Its mission is to “create the best radiology reference available and to make it available for free, forever, and for all.” We have obtained non-commercial use permission from the various uploaders as well as the founder of Radiopaedia. The relevant ethical regulations are governed under Radiopaedia’s privacy policy.
Dataset
Here, we describe the procedure for constructing the datasets and benchmark. In the section “Medical Multimodal Dataset (MedMD)”, we present several medical multimodal datasets and merge them with an extensive collection of preexisting datasets, resulting in the Medical Multimodal Dataset (MedMD). MedMD is a large-scale, high-quality medical vision-language dataset, covering a wide range of anatomies with over 5000 diseases, as shown in Fig. 7a. We further construct a filtered radiology subset, the Radiology Multimodal Dataset (RadMD). In the section “Radiology Evaluation Benchmark (RadBench)”, we introduce a new radiology benchmark for evaluation, termed RadBench, with three distinct tasks, i.e., visual question answering, report generation, and rationale diagnosis, aiming to monitor the progress of developing foundation models.
Medical multimodal dataset (MedMD)
To start, we construct a candidate data pool by pulling together a variety of existing visual-language medical datasets, for example, MIMIC-CXR30 and PMC-OA31. Despite the scale of these high-quality datasets, they are fundamentally limited in several aspects: (i) Data format. These datasets are only composed of 2D medical images, which do not fully capture the complexities of clinical use cases, for example, 3D medical imaging modalities, like CT and MRI; (ii) Modality diversity. A noteworthy limitation arises from the fact that only chest X-ray images are provided with medical reports; training models on such data will clearly limit generalizability to a broader range of imaging modalities and anatomical regions; (iii) Report quality. Another critical limitation lies in the use of data extracted from figures and captions in research papers. The gap between research-oriented data and real-world clinical scenarios may not support accurate and reliable clinical diagnoses. Therefore, to support the training of our proposed Radiology Foundation Model (RadFM), we augment the data pool with four new datasets, namely PMC-Inline, PMC-CaseReport, RP3D-Series, and MPx-Series, resulting in MedMD. MedMD has a total of 16M image-text pairs, including around 15.5M 2D images and 500k 3D scans with corresponding captions or diagnosis labels, as shown in Supplementary Table 3. A more detailed introduction of the different data sources can be found in the Supplementary Section “Detailed Introduction for Different Data Sources”.
Generally speaking, we split the candidate data pool into two parts (i) interleaved image-language data that is collected from academic papers and (ii) image-language data constructed for visual-language instruction tuning, as detailed below.
Interleaved dataset
PMC-Inline. PMC-Inline contains 11M 2D radiology images collected from PubMed Central papers. In contrast to existing work, for example, PMC-OA31, which only contains figures and corresponding captions, here we focus on the inline references in the main body of papers. For example, a paper may contain many sentences like “As shown in Fig. 2, we can see …”; we localize the keyword “Fig. 2” and link its corresponding figure back into the sentence, ending up with interleaved images and texts with rich context. This dataset shares the same format as MMC427, which has been shown to be effective for training foundation models in the computer vision community, for example, Flamingo22.
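To make this construction concrete, the sketch below shows one way such inline references could be localized and linked back to their figures with a regular-expression pass; it is a minimal approximation under our own assumptions (a naive sentence splitter and an assumed figure-lookup table), not the exact PMC-Inline pipeline.

```python
import re
from typing import Dict, List

# Hypothetical inline-reference linker; the real PMC-Inline parsing rules are
# more involved, so this only illustrates the idea.
FIG_REF = re.compile(r"\b(?:Fig\.|Figure)\s*(\d+)", re.IGNORECASE)

def link_inline_figures(paragraph: str, figures: Dict[str, str]) -> List[dict]:
    """Attach to each sentence the figures it cites.

    `figures` maps a figure id such as "2" to its image path (assumed input).
    Returns interleaved records of the form {"text": ..., "images": [...]}.
    """
    interleaved = []
    # naive sentence split (requiring a capital letter after the period avoids
    # breaking on "Fig."); a production pipeline would use a proper tokenizer
    for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph):
        cited = [m.group(1) for m in FIG_REF.finditer(sentence)]
        images = [figures[i] for i in cited if i in figures]
        interleaved.append({"text": sentence, "images": images})
    return interleaved

# The sentence citing "Fig. 2" is linked to its image file; the rest stay text-only.
example = link_inline_figures(
    "As shown in Fig. 2, we can see a lesion. Follow-up was unremarkable.",
    {"2": "paper123_fig2.jpg"},
)
```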
Visual-language instruction tuning dataset
PMC-CaseReport. Inspired by former works leveraging clinical case reports42, PMC-CaseReport is a filtered subset of PMC-Inline with around 103K case reports, in which doctors document valuable clinical cases based on their contact with the patients, including family medical history, preliminary diagnosis, radiographic exam results, surgical records, etc., together with critical radiologic scans, generally following the real timeline.
Similar to PMC-VQA8, which generates VQA pairs by querying ChatGPT with image captions, we also generate 1.1M question-answer pairs by querying ChatGPT with the sentences containing inline references in case reports. However, in contrast to PMC-VQA, we keep the background information of the patients to simulate the clinical diagnosis scenario, so the dataset can be seen as a medical contextual VQA dataset. For example, a question-answer pair may look like “Question: A 58-year-old woman presented to the emergency department …Postoperatively, her pain significantly relieved. What did the MRI indicate? Answer: The MRI indicated tumor recurrence at L2 and S1-S2.”
RP3D. RP3D (RadioPaedia 3D) is a novel dataset with 3D radiology scans, sourced from the Radiopaedia website (https://radiopaedia.org/). All privacy issues have already been resolved by the clinician who uploaded the case. Specifically, each patient case comprises one or more images from the same or different modalities, accompanied by high-quality captions that have been meticulously peer-reviewed by experts in the Radiopaedia Editorial Board (https://radiopaedia.org/editors). We have included a response letter from Radiopaedia, with the agreement for us to use the dataset for training under non-commercial cases. In addition, for each disease, we can get corresponding radiological features across different modalities. We convert the image-caption pairs into a variety of formats, namely, RP3D-Caption, RP3D-Modality, RP3D-Rationale, and RP3D-VQA, depending on their corresponding text content. Specifically, RP3D-Caption denotes the images paired with their corresponding captions; RP3D-Modality refers to images with modality labels; RP3D-Rationale incorporates radiological features with disease labels for each case; RP3D-VQA involves visual question-answering pairs generated from captions by querying ChatGPT, as illustrated in Supplementary Fig. 1.
MPx. MPx is collected from the MedPix website (https://medpix.nlm.nih.gov/) and organized by cases. Each case contains multiple radiologic scans, along with general clinical findings, discussions, and diagnostic results. In addition, MPx also provides annotations at the scan level, including information such as image modality, shooting plane, and captions for each scan. Thus, we separate it into MPx-Single and MPx-Multi, containing annotations at the scan level and case level, respectively.
Radiology multimodal dataset (RadMD)
For domain-specific finetuning, we filter out the non-radiology images from MedMD and construct a clean subset, named Radiology Multimodal Dataset (RadMD), dedicated to supervised visual instruction tuning. It contains a total of 3M images, spanning various data formats, modalities, and tasks, and featuring over 5000 diseases, as shown in Fig. 7b.
In general, we have conducted the following filtering process: (i) remove non-radiologic images; (ii) remove the entire PMC-OA and PMC-Inline datasets, as the images in PubMed are 2D-only and thus differ from real clinical cases; additionally, the writing styles of academic papers and real clinical reports are inconsistent; (iii) remove a large portion of 2D image cases from PMC-Series, to emphasize 3D images in training; (iv) filter out information about patient age or structure size, as image spacing and patient background information are not provided. Specifically, we applied string matching with Python’s regular expressions to remove any sentences containing terms related to physical measurements, such as “mm”, “cm”, or decimal numbers (e.g., “2.5 cm”), as these are indicative of missing or incomplete metadata related to patient age, structure size, or image spacing. This step primarily addresses the report generation tasks, where such metadata would otherwise cause incorrect or unpredictable descriptions; (v) balance the number of normal and abnormal patients in the diagnosis datasets, as generative models are sensitive to data imbalance. More comprehensive details regarding the filtering process and the resulting dataset sizes can be found in Supplementary Table 3.
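As a concrete illustration of step (iv), a minimal filtering sketch is given below; the exact patterns used for RadMD are not listed in the text, so the regular expression here is only indicative.

```python
import re

# Minimal sketch of the measurement filter; the exact patterns used for RadMD
# are not specified above, so this regular expression is only indicative.
MEASUREMENT = re.compile(r"\b\d+(\.\d+)?\s*(mm|cm)\b|\b\d+\.\d+\b", re.IGNORECASE)

def drop_measurement_sentences(report: str) -> str:
    """Remove report sentences mentioning physical sizes (e.g., '2.5 cm') or decimals."""
    sentences = re.split(r"(?<=[.!?])\s+", report)
    kept = [s for s in sentences if not MEASUREMENT.search(s)]
    return " ".join(kept)

print(drop_measurement_sentences(
    "There is a 2.5 cm nodule in the right lobe. No pleural effusion is seen."
))  # -> "No pleural effusion is seen."
```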
Radiology evaluation benchmark (RadBench)
In addition to the training set, we also introduce RadBench, a comprehensive evaluation benchmark for monitoring progress in the development of radiology foundation models on generative tasks. Considering that most existing medical benchmarks only include plain labels (like disease categories), which are not suitable for assessing a model’s long-sentence generation ability, RadBench is targeted at compensating for this.
In detail, RadBench is first randomly sampled from the RP3D dataset. Then, we carry out meticulous manual verification on all the samples to ensure data quality. Specifically, we developed a human evaluation interface, visually presenting the data source, image, question, and answer of each case. Eight human annotators were asked to assess the quality of these cases against the following criteria:
- Image types: remove images that do not fall within radiology.
- Question reasonability: keep questions that can be answered from the given radiology image; for example, for visual question answering, remove questions related to size; for report generation, remove cases containing sentences like “Compared with previous cases”; for rationale diagnosis, remove cases lacking corresponding radiological features.
- Answer correctness: keep cases whose answers are correct based on the given text reports.
As a result, we have obtained 4229 samples for visual question answering, 1468 for report generation, and 1000 for rationale diagnosis. Additionally, we also consider nine existing datasets for our evaluation, which include plain diagnosis and medical VQA tasks. A detailed breakdown of each dataset, including task descriptions and modalities, is provided in Supplementary Table 4. Combining them with our RadBench, in evaluation we comprehensively assess models on four tasks, i.e., disease diagnosis, medical VQA, report generation, and rationale diagnosis. The details of the four evaluation tasks and metrics are introduced in the following.
Disease diagnosis
This task involves analyzing radiology images to determine the likelihood of specific diseases. Here, we modify this task into an induction task, which uses introductory text explaining the classification task and providing the name of the queried disease at the beginning of the prompt. Given a medical image, we randomly select a disease and a prompt sentence like “Is {disease} shown in this image” as input, querying the model to determine the existence of a certain disease. Since this is formulated as a generation task, “AUC” cannot be calculated, so we match the output with the ground truth to calculate the ACC and F1 scores. Specifically, we match the output against a closed ground-truth list {“yes”, “no”} using difflib.SequenceMatcher, choosing the most similar one as the model’s prediction. Considering that ACC scores may suffer from data imbalance, we sample positive and negative cases at the same ratio. In our dataset, we do not impose a prior on the diseases, and over 5000 diseases are considered, with a balanced ratio of “yes” and “no” responses.
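The closed-set matching and the resulting metrics can be sketched as below; the use of difflib.SequenceMatcher follows the description above, while the lower-casing and the scikit-learn metric calls are our own illustrative choices.

```python
import difflib
from sklearn.metrics import accuracy_score, f1_score

# Map a free-form generated answer onto the closed ground-truth list {"yes", "no"}
# by string similarity, then compute ACC and F1 against the labels.
def match_to_choices(generated: str, choices=("yes", "no")) -> str:
    scores = [difflib.SequenceMatcher(None, generated.lower(), c).ratio() for c in choices]
    return choices[scores.index(max(scores))]

outputs = ["Yes, edema is shown in this image.", "No evidence of pneumothorax."]
labels = ["yes", "no"]
preds = [match_to_choices(o) for o in outputs]
acc = accuracy_score(labels, preds)
f1 = f1_score(labels, preds, pos_label="yes")
```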
Medical visual question answering
This task is a combination of popular visual question-answering challenges. Given a medical image and a clinically relevant question in natural language as a prompt, the medical VQA system is expected to predict a plausible and convincing answer.
Radiology report generation
This task focuses on the automatic generation of reports, i.e., summarizing the radiologic findings based on radiology images, such as X-rays, CT scans, and MRI scans. Given a medical image, we randomly select a prompt sentence like “Please caption this scan with findings” as input.
Rationale diagnosis
This task involves analyzing radiology images to predict both the underlying disease and the typical radiologic features on different modalities, such as X-rays, CT scans, and MRI scans, associated with that disease. Specifically, we randomly select a prompt sentence like “Determine the disease that corresponds to the given radiographic images, starting with the established radiological features and concluding with the ultimate diagnosis.” Since we have evaluated disease diagnosis accuracy in the common “Disease Diagnosis” setting, for rationale diagnosis we mainly focus on how well the foundation model can give reasons.
Building generalist foundation model for radiology
In this section, we start by describing the paradigm for unifying different medical tasks into a generative framework, followed by detailing the proposed RadFM model, and its training details. Our training adopts two types of datasets, namely, interleaved datasets and visual instruction datasets. It is worth noting that their training objectives differ slightly, which will be detailed in the following.
A unified learning paradigm
In both of our proposed multimodal datasets, i.e., MedMD and RadMD, each training sample essentially consists of two elements, $\mathcal{X}=\{\mathcal{T},\mathcal{V}\}$, where $\mathcal{T}$ refers to the language part of the case, with special placeholder tokens for images, e.g., “The patient is 47-year-old. 〈image-1〉 〈image-2〉 We can see opacity on the X-ray”. $\mathcal{V}$ refers to the visual part, containing a set of 2D or 3D image scans, i.e., $\mathcal{V}=\{v_1,v_2,\ldots,v_N\}$, with $v_i\in\mathbb{R}^{H\times W\times C}$ or $v_i\in\mathbb{R}^{H\times W\times D\times C}$, where $H$, $W$, $D$, $C$ are height, width, depth, and channel, respectively, each scan corresponding to an “〈image-i〉” token in $\mathcal{T}$. In general, $\mathcal{T}$ and $\mathcal{V}$ can be considered as a prompt input to the model, with interleaved language and images.
The goal is to model the likelihood of the generated text tokens in $\mathcal{T}$, conditioned on the interleaved scans, as:
$$p(\mathcal{T}\mid\mathcal{V})=\prod_{l} p(\mathcal{T}_{l}\mid\mathcal{V}_{<l},\mathcal{T}_{<l}), \qquad (1)$$
where $\mathcal{T}_l$ represents the $l$-th token in $\mathcal{T}$, and $\mathcal{V}_{<l}$, $\mathcal{T}_{<l}$ represent the images and text tokens appearing before the $l$-th token. We use a generative model ($\Phi_{\text{RadFM}}$) to parameterize the probability $p$, and our final training objective can be expressed as the negative log-likelihood of the correct next token in the text sequence:
$$\mathcal{L}_{\text{reg}}=-\sum_{l} w_{l}\log \Phi_{\text{RadFM}}(\mathcal{T}_{l}\mid\mathcal{V}_{<l},\mathcal{T}_{<l}), \qquad (2)$$
where $w_l$ refers to a per-token weighting, aiming to either emphasize key tokens or skip special tokens. Its value differs for different datasets and we detail this in the following.
Interleaved datasets. For samples in the visual-language interleaved dataset, i.e., PMC-Inline, there are no strong question-and-answer relationships between contexts. We extract medical-related words from each sentence using the Unified Medical Language System (UMLS)43 and give them a higher loss weight. Additionally, we avoid calculating the loss on the image placeholder tokens. Overall, $w_l$ can be formulated as,
$$w_{l}=\begin{cases}3, & \mathcal{T}_{l}\in\text{UMLS}\\ 1, & \mathcal{T}_{l}\notin\text{UMLS}\\ 0, & \mathcal{T}_{l}=\langle\text{image-i}\rangle\end{cases} \qquad (3)$$
Note that PMC-Inline is the only dataset that fits this case.
Visual instruction datasets. Samples from visual instruction datasets, like PMC-VQA8 or PMC-CaseReport, are often in the format of a dialogue, for example, “What can you see from the image? 〈image-1〉 I can see lesions.” or “Please describe the scans 〈image-1〉. The scan is …”. We further separate the language part $\mathcal{T}$ into an instruction and a response, denoted as $\mathcal{I}$ and $\mathcal{R}$ respectively. For example, in the former two cases, $\mathcal{I}$ refers to “What can you see from the image? 〈image-1〉” and “Please describe the scans 〈image-1〉”. In a practical scenario, $\mathcal{I}$ is expected to be given by users, and the model is only required to output correct responses. Overall, $w_l$ can be formulated as,
$$w_{l}=\begin{cases}3, & \mathcal{T}_{l}\in\mathcal{R}\ \text{and}\ \mathcal{T}_{l}\in\text{UMLS}\\ 1, & \mathcal{T}_{l}\in\mathcal{R}\ \text{and}\ \mathcal{T}_{l}\notin\text{UMLS}\\ 0, & \mathcal{T}_{l}\in\mathcal{I}\end{cases} \qquad (4)$$
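For concreteness, a minimal sketch of the weighted objective in Eqs. (2)–(4) is given below; it assumes the UMLS-term mask and the instruction/placeholder mask have already been computed at the token level, and the normalization by the weight sum is our own readability choice rather than part of the stated formulation.

```python
import torch
import torch.nn.functional as F

# Sketch of the per-token weighted negative log-likelihood of Eqs. (2)-(4).
# `umls_mask` marks target tokens belonging to UMLS terms (weight 3 vs 1);
# `skip_mask` marks instruction or <image-i> placeholder tokens (weight 0).
def weighted_nll(logits: torch.Tensor, targets: torch.Tensor,
                 umls_mask: torch.Tensor, skip_mask: torch.Tensor) -> torch.Tensor:
    """logits: (L, vocab); targets, umls_mask, skip_mask: (L,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # (L,)
    weights = 1.0 + 2.0 * umls_mask.float()        # 3 for UMLS tokens, else 1
    weights = weights * (1.0 - skip_mask.float())  # 0 for instruction / <image-i>
    return (weights * token_nll).sum() / weights.sum().clamp(min=1.0)

loss = weighted_nll(torch.randn(6, 32000), torch.randint(0, 32000, (6,)),
                    torch.tensor([0, 1, 0, 0, 1, 0]), torch.tensor([1, 0, 0, 0, 0, 0]))
```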
Most samples from MedMD fit this weighting formulation. All prompts used for instruction tuning are listed in Supplementary Tables 8–11. We describe the detailed prompting for the different problem settings below:
- Modality recognition. Here, we adopt two types of prompts: (i) inductive prompts, with the 2D or 3D medical scan as input, for example, “〈image-1〉 Is this image captured by {modality}?”, where the modality category is randomly sampled from the modality set, forming the text input $\mathcal{I}$; if the modality matches the ground-truth label, we set $\mathcal{R}$ to “yes”, otherwise to “no”; (ii) open prompts, like “What’s the modality of the input scan 〈image-1〉?”, to form $\mathcal{I}$, with the corresponding modality label translated into $\mathcal{R}$. Samples for training this functionality are from RP3D-Modality and MPx-Single, where modality annotations are available.
- Disease diagnosis. All the datasets listed as “image data” in Supplementary Table 3 are built for diagnosis; they only have binary labels for diseases. Similarly to modality recognition, we use two types of prompts to transform them into our desired format (a construction sketch is given after this list): (i) inductive prompts, like “〈image-1〉 Does the patient have {disease}?”, where the disease category is randomly sampled from a disease set, forming the text input $\mathcal{I}$; if the disease matches the ground-truth labels, we set $\mathcal{R}$ to “yes”, otherwise to “no” (note that, during sampling, we balance the positive and negative ratio); (ii) open diagnosis prompts, like “Please make diagnosis based on the images 〈image-1〉 〈image-2〉.”, to construct the instruction ($\mathcal{I}$), with the positive disease labels translated into the response ($\mathcal{R}$) by simply using their category names. A simple example is $\mathcal{I}$ = “Please make diagnosis based on the image 〈image-1〉.” with $\mathcal{R}$ = “Edema, pneumothorax.”. With such instructions, the model is required to complete a difficult task, i.e., directly outputting the disease names.
- Visual question answering. Beyond the abovementioned task formulations, more complex questions can be asked, such as those about the spatial relationships among objects (“What is the location of the lesion?”) and common-sense reasoning questions (“Given the image context and patient history, what is likely to be the cause of the observed symptoms?”). A robust medical VQA system must be capable of solving a wide range of classic medical diagnosis tasks, as well as reasoning about images. Existing medical VQA datasets like VQA-RAD32, SLAKE33, PMC-VQA8 and RP3D-VQA naturally fit into this paradigm. They contain a mixture of question types, so the language questions can naturally be treated as the text instruction ($\mathcal{I}$) and the corresponding answers as the response ($\mathcal{R}$). It is worth noting that our constructed PMC-CaseReport dataset also falls into this category, with more contextual information available for the instruction, for example, the history diagnosis, thus providing critical information for answering the question.
- Report generation. MIMIC-CXR30, RP3D-Caption, PMC-OA31, MPx-Multi, and MPx-Single are all captioning datasets; the task is to write a long caption or report given one or a set of images. The language instructions for this task are like “What can you find from the scans 〈image-1〉 〈image-2〉?”.
- Rationale diagnosis. We construct RP3D-Rationale based on the RP3D dataset. This task encompasses disease prediction and the generation of the typical radiological features associated with the diagnosed disease. Specifically, we design prompts like “What disease can be diagnosed from these radiological images and what specific features are typically observed on the images? 〈image-1〉 〈image-2〉” as the instruction ($\mathcal{I}$), and the response ($\mathcal{R}$) refers to the disease label along with the radiological features collected from the Radiopaedia website.
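The sketch below illustrates how the inductive and open diagnosis prompts described above could be assembled from a case’s positive labels; the disease set, the coin-flip balancing, and the ASCII “<image-i>” tags are simplifications rather than the exact RadMD construction code.

```python
import random

# Sketch of the two diagnosis prompt styles described above. The disease set,
# the coin-flip balancing, and the ASCII "<image-i>" tags are simplifications.
DISEASE_SET = ["edema", "pneumothorax", "pneumonia", "atelectasis"]  # illustrative

def build_inductive_sample(positive_labels, n_images=1):
    """Yes/no query about a randomly chosen disease, balancing positives/negatives."""
    tags = " ".join(f"<image-{i + 1}>" for i in range(n_images))
    negatives = [d for d in DISEASE_SET if d not in positive_labels]
    if positive_labels and (not negatives or random.random() < 0.5):
        disease, response = random.choice(positive_labels), "yes"
    else:
        disease, response = random.choice(negatives), "no"
    return f"{tags} Does the patient have {disease}?", response

def build_open_sample(positive_labels, n_images=1):
    """Open prompt whose response is the positive disease names themselves."""
    tags = " ".join(f"<image-{i + 1}>" for i in range(n_images))
    instruction = f"Please make diagnosis based on the images {tags}."
    response = ", ".join(positive_labels).capitalize() + "."
    return instruction, response

print(build_open_sample(["edema", "pneumothorax"]))
# -> ('Please make diagnosis based on the images <image-1>.', 'Edema, pneumothorax.')
```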
Architecture detail
In this section, we describe the proposed model in detail. As shown in Fig. 1c, our proposed RadFM model consists of a visual encoder $\Phi_{\text{vis}}$, which can process both 2D and 3D medical scans; a perceiver44 module $\Phi_{\text{per}}$ for aggregating a sequence of scans, for example, taken with different modalities (CT, MRI) or at various time points, into a fixed number of tokens; and a large language model (LLM) $\Phi_{\text{llm}}$ that generates free-form text responses based on the input visual-language information.
Visual encoding. Given one sample instance from our dataset, denoted as $\mathcal{X}=\{\mathcal{T},\mathcal{V}\}$, where $\mathcal{V}=\{v_1,v_2,\ldots,v_N\}$, we first encode each input image separately with an image encoder $\Phi_{\text{vis}}$. Specifically, we adopt a 3D ViT here to be compatible with both 2D and 3D image inputs. For 2D images, we expand a new dimension for depth by replicating the slices. Therefore, each image scan can be denoted as $v_i\in\mathbb{R}^{H\times W\times D_i\times C}$, where $C$ denotes the image channels and $H$, $W$, $D_i$ are the height, width, and depth of the image, respectively. The rationale behind this design choice is as follows: (i) increasingly many radiology diagnoses rely on 3D scans, for example, CT and MRI, so the foundation model should certainly be able to process 3D data input; (ii) in 3D data, consecutive slices are highly similar, so padding 2D into 3D, on the one hand, does not lead to information loss and, on the other hand, provides a good approximation of 3D data; (iii) padding 2D images only affects the tokenization layer, i.e., converting image patches into continuous embeddings, while keeping the rest of the model shared with 3D scans, thus facilitating knowledge sharing.
Note that, compared to the typical visual encoding scenario, which assumes different images have a unified shape, we do not normalize the depth dimension to an exact size; we only round it to a multiple of 4, depending on the original resolution. All 2D images are padded to four slices on the depth channel. We convert each image into 3D patches, embed them into a token sequence, and feed them into the encoder ($\Phi_{\text{vis}}$). To retain the 3D positions of these tokens, we adopt learnable 3D position embeddings. The detailed procedure can be formulated as:
$$\boldsymbol{v}_{i}=\Phi_{\text{vis}}(v_{i})\in\mathbb{R}^{P_{i}\times d}, \qquad (5)$$
where $\boldsymbol{v}_i$ is the output embedding for image $v_i$, encoded with the 3D ViT, $P_i$ is the total number of tokens, and $d$ is the feature dimension. Due to the inconsistency in the depth dimension, $P_i$ varies across 2D and 3D images, and the model can infer the original image size from the positional encoding.
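For illustration, the sketch below shows the 2D-to-3D padding and 3D patch tokenization under the 32 × 32 × 4 patch size reported later in “Training details”; the linear patch projection and the learnable 3D position embeddings of Eq. (5) are omitted, and the shapes are only examples.

```python
import torch

# Sketch of the 2D-to-3D padding and 3D patch tokenization, using the
# 32 x 32 x 4 patch size reported in "Training details". The linear projection
# and learnable 3D position embeddings of Eq. (5) are omitted here.
def to_patch_tokens(image: torch.Tensor, patch=(32, 32, 4)) -> torch.Tensor:
    """image: (C, H, W) for 2D or (C, H, W, D) for 3D -> (P_i, C*32*32*4) patch vectors."""
    if image.dim() == 3:                        # 2D input: replicate to 4 slices
        image = image.unsqueeze(-1).repeat(1, 1, 1, 4)
    C, H, W, D = image.shape
    ph, pw, pd = patch
    x = image.reshape(C, H // ph, ph, W // pw, pw, D // pd, pd)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)          # (nH, nW, nD, C, ph, pw, pd)
    return x.reshape(-1, C * ph * pw * pd)      # P_i flattened patches

tokens_2d = to_patch_tokens(torch.rand(1, 512, 512))      # P_i = 16 * 16 * 1 = 256
tokens_3d = to_patch_tokens(torch.rand(1, 256, 256, 64))  # P_i = 8 * 8 * 16 = 1024
```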
Aggregation with perceiver. After visual encoding, we adopt a perceiver44 module $\Phi_{\text{per}}$ to aggregate the visual representation. Specifically, $\Phi_{\text{per}}$ follows the classical perceiver architecture with a fixed number of learnable queries as the latent-array input, while the visual embedding $\boldsymbol{v}_i$ is treated as the byte-array input, so that the final output embeddings are normalized to the same length as the pre-defined learnable query sequence. The aggregation procedure can be formulated as:
$$\boldsymbol{u}_{i}=\Phi_{\text{per}}(\boldsymbol{v}_{i})\in\mathbb{R}^{P\times d}, \qquad (6)$$
where $\boldsymbol{u}_i$ refers to the aggregated visual embedding and $P$ denotes the number of learnable queries. Leveraging the perceiver architecture, we can map an arbitrary number of patch tokens to the same length, such that images of different sizes can be treated equally in the following fusion flow.
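A minimal perceiver-style resampler sketch is shown below; the layer count, head count, and feature dimension are illustrative placeholders rather than RadFM’s actual configuration (which uses 32 queries of dimension 5120, see “Implementation”).

```python
import torch
import torch.nn as nn

# Minimal perceiver-style resampler: a fixed set of learnable queries
# cross-attends to a variable-length visual token sequence, yielding P tokens.
# Layer count, heads, and dimension here are placeholders, not RadFM's config.
class PerceiverResampler(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        """visual_tokens: (B, P_i, dim) with arbitrary P_i -> (B, P, dim)."""
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(q, visual_tokens, visual_tokens)  # queries attend to visual tokens
            q = q + out                                     # residual update of the latent array
        return q

u_i = PerceiverResampler()(torch.rand(1, 256, 768))  # any P_i (here 256) -> (1, 32, 768)
```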
Multimodal fusion. To fuse the visual-language information, we interleave the visual embeddings with the text embeddings obtained from tokenization, where each special image placeholder token is simply replaced with the corresponding visual embedding. The resulting interleaved sequence is then passed into a decoder-only large language model ($\Phi_{\text{llm}}$); the self-attention transformer layers in the LLM can thus naturally be reused as multimodal fusion modules:
$$p=\Phi_{\text{llm}}(\text{concat}(\boldsymbol{t}_{1},\boldsymbol{u}_{1},\boldsymbol{t}_{2},\boldsymbol{u}_{2},\boldsymbol{t}_{3},\ldots)), \qquad (7)$$
where $\boldsymbol{t}_i$ and $\boldsymbol{u}_i$ refer to the text and visual embeddings, and $p$ is the probability distribution of the next token.
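The placeholder replacement can be sketched as follows; the token indices and shapes are illustrative, and in practice the 〈image〉/〈/image〉 boundary tokens mentioned in “Implementation” would additionally wrap each block of visual tokens.

```python
import torch

# Sketch of the interleaving step: each <image-i> placeholder position in the
# text-embedding sequence is replaced by that image's aggregated visual tokens
# before the sequence enters the LLM. Shapes and indices are illustrative.
def interleave(text_emb: torch.Tensor, placeholder_positions, visual_embs):
    """text_emb: (L, d); placeholder_positions: indices of <image-i> tokens (sorted);
    visual_embs: list of (P, d) tensors, one per placeholder, in the same order."""
    pieces, cursor = [], 0
    for pos, u in zip(placeholder_positions, visual_embs):
        pieces.append(text_emb[cursor:pos])  # text tokens before the placeholder
        pieces.append(u)                     # visual tokens replace the placeholder
        cursor = pos + 1                     # skip the placeholder token itself
    pieces.append(text_emb[cursor:])         # trailing text tokens
    return torch.cat(pieces, dim=0)          # fused sequence fed to the LLM

fused = interleave(torch.rand(10, 768), [3], [torch.rand(32, 768)])  # (10 - 1 + 32, 768) = (41, 768)
```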
Training procedure
Our training procedure includes two stages, namely, pretraining, and domain-specific finetuning, as shown in Fig. 1b. Note that, all training settings remain identical at two stages, with the only distinction lying in the training data, from generalist to radiologic-specific.
Generally, all the data used for model training are listed in Supplementary Table 2, with citations indicating their sources (entries without citations denote data contributed by this work). For pretraining, all the listed data are employed, while for domain-specific instruction tuning, we further filter out some relatively low-quality data, i.e., generated data without human verification or non-radiology data, focusing more on high-quality question-answering pairs. Next, we describe this in detail.
Pretraining. At this stage, we use all the available data in MedMD, as listed in Supplementary Table 3; the main components are PMC-Inline and PMC-OA31, which are collected from 2.4M PMC papers. These two datasets contain diverse medical vocabularies and images with cutting-edge medical knowledge; however, they are relatively noisy, so we only use them during pretraining, in the hope that the network can accumulate enough knowledge about medical-specific terminologies and images. Additionally, we also include the other VQA, captioning, and diagnosis datasets, as they are much cleaner.
Domain-specific instruction tuning. At this stage, we adopt RadMD for domain-specific instruction tuning, which contains over 3M radiological images accompanied by high-quality language instructions and responses. Notably, we filter out PMC-Inline and PMC-OA, as these datasets are not derived from real clinical scenarios. For the remaining data sources, we primarily filter out non-radiology-related content. Specifically, the filtering process targets the MPx-series, RP3D-series, and PMC-CaseReport datasets. For both the MPx-series and RP3D-series, the filtering is straightforward since the original websites provide the related imaging modalities for each case. For PMC-CaseReport, which is generated from the case-report subset of PMC-Inline using ChatGPT, we rely on the image captions to filter the cases. Only those with captions explicitly mentioning radiology-related terms (such as “MRI”, “CT”, “X-ray”, “ultrasound”, or “mammography”) are retained. We acknowledge that some noisy cases may still remain in the dataset. Therefore, in our evaluation dataset, RadBench, the selected test cases undergo additional manual inspection to further ensure quality.
Training details
Image preprocessing. To reduce the differences between medical images from different modalities, certain preprocessing steps are applied. Specifically, (i) to align the intensity distributions, we apply min-max normalization to all images; (ii) given that medical images can exist in either 3D or 2D formats (such as MRI being 3D and X-ray being 2D), we convert all 2D images to 3D simply by expanding an extra dimension, so that all images, irrespective of their original format, can be processed uniformly as 3D images; (iii) to ensure consistent sizes across all images, we resize them using the torchvision.transforms.Resize function. For the height and width dimensions, we resize to 512 × 512 for 2D images and 256 × 256 for 3D images, because 3D data have more slices and thus require more memory. For the depth dimension, since our visual encoder, a 3D vision transformer (ViT), requires the input image size to be divisible by the patch size of 32 × 32 × 4, we resize the depth dimension to the nearest multiple of 4, not surpassing 64. Please check Supplementary Table 6 for more details.
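A minimal sketch of this preprocessing is given below; the interpolation mode for the depth resizing and the exact handling of channels are assumptions, since only the normalization scheme and target sizes are specified above.

```python
import torch
from torchvision.transforms import Resize

# Sketch of the preprocessing described above: min-max normalization, 2D-to-3D
# expansion, in-plane resizing (512 for 2D, 256 for 3D), and rounding the depth
# to a multiple of 4 capped at 64. The interpolation mode is an assumption.
def preprocess(img: torch.Tensor) -> torch.Tensor:
    """img: (C, H, W) or (C, H, W, D) -> normalized tensor of shape (C, H', W', D')."""
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)     # min-max normalize
    if img.dim() == 3:                                           # 2D image
        img = Resize((512, 512), antialias=True)(img)
        return img.unsqueeze(-1).repeat(1, 1, 1, 4)              # expand depth to 4
    target_d = min(64, max(4, round(img.shape[-1] / 4) * 4))     # nearest multiple of 4, <= 64
    img = torch.nn.functional.interpolate(
        img.permute(0, 3, 1, 2).unsqueeze(0),                    # (1, C, D, H, W)
        size=(target_d, 256, 256), mode="trilinear", align_corners=False,
    )
    return img.squeeze(0).permute(0, 2, 3, 1)                    # back to (C, 256, 256, D')

vol = preprocess(torch.rand(1, 320, 320, 70))   # -> (1, 256, 256, 64)
xray = preprocess(torch.rand(1, 1024, 1024))    # -> (1, 512, 512, 4)
```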
A detailed forward example. To better illustrate our model architecture, we present a simple instruction tuning example: a radiology image paired with the text prompt “Does the case 〈image〉 have pneumonia?”, with the ground truth response “Yes.” The model forward procedure will include three main steps, i.e., visual encoding, text fusion, and loss calculation. Visual encoding: A 2D image is first expanded into a pseudo-3D format by adding an extra dimension of size 4. It is then processed by a 3D Vision Transformer (ViT) to produce visual tokens. These are compressed to a fixed length of 32 using a perceiver module, ensuring consistent input regardless of image size. Text fusion: The text prompt is tokenized using the LLM’s embedding layer, and the “〈image〉 ” placeholder is replaced with the visual tokens. This fused sequence is input to the LLM’s causal self-attention layers for multimodal understanding. Loss calculation: The model predicts the next tokens auto-regressively, and the loss is computed against the ground truth “Yes”. During pretraining, the same forward process is used, but the loss is calculated over all text tokens except the image placeholder, following GPT-style training.
Implementation. For the visual encoder, we adopt a 12-layer 3D ViT with 768 feature dimensions and the perceiver is chosen as a six-layer transformer decoder with a learnable latent array in 32 × 5120 dimensions, so that all images will be embedded as a 32 × 5120 feature embedding after passing visual encoding and perceiver aggregation. When inserting them into the text embedding, we will add two extra special tokens 〈image〉, 〈/image〉 at the beginning and ending, respectively, to distinguish them from common text tokens. For the large language model, we initialize it with the MedLLaMA-13B model introduced by PMC-LLaMA25, which has further finetuned the LLaMA-13B2 model on the medical corpus. Our final model has 14B parameters.
In training, we vary the batch size, i.e., a per-device batch size of one for 3D images and four for 2D images, with four-step gradient accumulation, and the maximum token length is set to 2048. We train the model for eight epochs in total: four epochs for pretraining and four epochs for instruction tuning. In the first epoch, we freeze the language model to align the image embedding space with that of the text; in the following epochs, all parameters are updated. To improve the training speed, we adopt the FSDP acceleration strategy45, together with automatic mixed precision (AMP) and gradient checkpointing46. All models are implemented in PyTorch and trained on 32 NVIDIA A100 GPUs with 80 GB memory.
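The acceleration setup can be sketched roughly as below, assuming the distributed process group is already initialized and the model returns a Hugging-Face-style output with a `.loss` field; the FSDP wrapping policy and gradient-checkpointing hooks are omitted, so this is a sketch rather than the actual training script.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Rough sketch of the acceleration setup (FSDP + AMP + gradient accumulation).
# It assumes torch.distributed is already initialized and that the model
# returns an output object with a `.loss` field; wrapping policies and
# gradient-checkpointing hooks are omitted.
def train(model, loader, accum_steps=4, lr=1e-5):
    model = FSDP(model.cuda())                        # shard parameters across GPUs
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scaler = torch.cuda.amp.GradScaler()
    for step, batch in enumerate(loader):
        with torch.cuda.amp.autocast():               # mixed-precision forward pass
            loss = model(**batch).loss / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:             # four-step gradient accumulation
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```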
Evaluation
In this section, we introduce three evaluation settings, i.e., zero-shot, few-shot and task-specific evaluation, together with the models in comparison. Note that, the first two evaluations require no further training, while the last requires additional finetuning on specific tasks. Afterward, we introduce the automatic metrics and human rating progress.
Zero-shot and few-shot evaluation
The most appealing characteristic of foundation models, as generalist models, is that they can be applied to various tasks with proper prompting strategies, like zero-shot or few-shot prompting, without any task-specific training. In the zero-shot setting, the model is given task-related semantic instructions indicating which task it is expected to perform, while in the few-shot prompting scenario, some similar cases related to the task are given instead. The insight behind both is to use appropriate textual instructions to prompt the model on which task to perform; which one is more suitable for a certain model depends on its training approach.
Baselines. For our RadFM, we mainly adopt zero-shot evaluation, as in the instruction tuning step we focus on training the model to understand diverse zero-shot instructions. For the other baselines, we compare with the following publicly accessible foundation models under these two settings:
- OpenFlamingo13. This is an open-source implementation of the prior state-of-the-art generalist visual-language model Flamingo22, trained on large-scale data from the general visual-language domain. We use the released checkpoint for zero-shot and few-shot evaluation in our study.
- MedVInT8. This is a visual-instruction-tuned visual-language model based on LLaMA2, which was trained on PMC-VQA8. Considering that the PMC-VQA data do not contain any few-shot cases and mainly target zero-shot prompting, we directly use the released checkpoint of the MedVInT-TD model with PMC-LLaMA and PMC-CLIP backbones for zero-shot evaluation.
- LLaVA-Med4. LLaVA-Med is a medical-specific vision-language foundation model trained based on LLaVA47, leveraging a zero-shot instruction-tuning dataset generated from PubMed image-caption pairs. Similar to MedVInT, it also mainly targets zero-shot prompting cases, and we directly use the released checkpoint LLaVA-Med-v1.5 for zero-shot evaluation.
- Med-Flamingo6. This is a multimodal model developed based on OpenFlamingo-9B13, which can handle multi-image inputs interleaved with text. We use the released checkpoint for zero-shot and few-shot evaluation.
- GPT-4V14. GPT-4V is widely considered the most powerful multimodal foundation model, released by OpenAI. Up to our submission, GPT-4V could only take four images as input, which hardly allows few-shot cases with multiple images; thus, we evaluate it in the zero-shot setting. Besides, GPT-4V can only be accessed through the online chatting website; therefore, large-scale auto-evaluation is not feasible. In this paper, we only use it for evaluation under the human rating setting.
For OpenFlamingo and Med-Flamingo, we perform both zero-shot and few-shot evaluations in our study. Specifically, we follow the prompts derived from the official Med-Flamingo repository. The example prompt for zero-shot evaluation is: “You are a helpful medical assistant. Please answer the question about the given image. 〈image〉 Question: [the query question]. Answer:”. In the few-shot setting, we expand upon this format by supplying the models with additional examples to guide their responses. This is structured as follows: “You are a helpful medical assistant. You are being provided with images, a question about the image, and an answer. Follow the examples and answer the last question. 〈image〉 Question: [the first question]. Answer: [the first answer]. 〈 —endofchunk—〉 〈image〉 Question: [the second question]. Answer: [the second answer]. 〈 —endofchunk—〉 〈image〉 Question: [the query question]. Answer:”.
To our knowledge, there are currently no existing foundation models that can effectively handle both 2D and 3D radiology images. For comparison, we adopt strong baseline models that are publicly accessible, for example, OpenFlamingo13, MedVInT8, LLaVA-Med4, and Med-Flamingo6, which have demonstrated efficacy in processing slices and making predictions. In addition, we also compare with GPT-4V(ision)14 using its online chatting website version.
Datasets. We evaluate the above foundation models on RadBench and the nine existing datasets introduced in the section “Radiology evaluation benchmark (RadBench)”. Additionally, we also evaluate them on PadChest48, a labeled large-scale, high-resolution chest X-ray dataset including 160,000 images obtained from 67,000 patients, with 174 different radiographic finding labels. We dismiss the classes with fewer than 10 cases, together with the seen classes appearing in our training set, resulting in 163 totally unseen classes. We thereby ensure that not only the images but also the categories in the texts never appear in training, which demands more generalization ability from the models.
Task-specific evaluation
In addition to directly evaluating different foundation models using zero-shot or few-shot prompting, without any training, our model can also serve as a pretrained model that can be adapted to different specific tasks by further finetuning on the corresponding training sets, giving up the ability to generalize across tasks but achieving better performance on a specific task. In such cases, we compare our final results with the different task-specific state-of-the-art (SOTA) methods for the related datasets. In detail, we use the following datasets; the corresponding SOTAs for comparison are listed in Table 3 with citations:
- VinDr-Mammo49 is a mammography diagnosis dataset comprising 20,000 images (5000 four-view scans). Each scan was manually annotated with a five-level BI-RADS score. We view this as a multi-class classification task with the official split, following BenchMD50.
- CXR1451 is a widely used chest X-ray diagnosis dataset containing 112,120 frontal-view X-ray images of 30,805 unique patients (collected from 1992 to 2015) with 14 finding labels. We follow its official split and evaluate the SOTA52 on this split.
- LDCT53. Low-dose computed tomography (LDCT) is a procedure that uses an X-ray machine linked with a computer to create 3D images of a patient’s tissues and organs. The LIDC-IDRI53 dataset is used here, containing 1018 low-dose lung CTs, where each CT has small/large/no nodule labels. We follow BenchMD50 to set this up as a 3D diagnosis task and use the BenchMD split.
- MosMedData54 is a set of 1110 3D CT cases labeled with COVID-19-related findings, as well as without such findings. We view it as a classification task and split it randomly 8:2 for training and testing, following ref. 54.
- COVID-CT55 is a set of 349 2D CT slices labeled with COVID-19, collected from 216 patients. We split it randomly with an 8:2 ratio for training and testing.
- BraTs201932 is an MRI dataset with four MRI modalities: T1WI, T2WI, T2FLAIR, and T1 contrast-enhanced (T1CE). There are 259 volumes of high-grade glioma (HGG) and 73 volumes of low-grade glioma (LGG). We follow the setting of DSM56, which uses T1CE to diagnose HGG or LGG. Because the original paper did not release its splits, we randomly split the dataset 7:3 for training and testing and re-tested the SOTA on it.
- ADNI (Alzheimer’s Disease Neuroimaging Initiative)57 is a large Alzheimer’s disease data collection with 3D brain MRI scans. We follow the setting introduced in ref. 58 and split it randomly 8:2 for training and testing.
- BTM-17 (Brain-tumor-17)59 is a challenge on classifying an MRI case into 17 tumor types, with 4449 real images. We adopt its official split.
- Lung-PET-CT-Dx60 consists of CT and PET-CT DICOM images of 355 lung cancer subjects. We treat it as a diagnosis dataset to distinguish whether a patient is diagnosed with adenocarcinoma, small cell carcinoma, large cell carcinoma, or squamous cell carcinoma. Considering its limited number of cases, we split it 7:3 (train:test) to ensure enough cases for evaluation.
- VQA-RAD32 is a radiology VQA dataset containing 3515 questions with 517 possible answers. We follow the official dataset split for our evaluation.
- SLAKE33 is an English-Chinese medical VQA dataset composed of 642 images and 14K questions, with 224 possible answers in total. We only use the “English” part and follow the official split.
- PMC-VQA8 is an English medical VQA dataset generated with automatic NLP methods, containing 149K images with 227K questions. Its answers are diverse across questions. Considering that its test set is also auto-generated, we manually cleaned it as mentioned in the section “Radiology evaluation benchmark (RadBench)” and re-tested the SOTA MedVInT8 checkpoint on the cleaned test set.
- MedDiffVQA61 is a large-scale dataset for difference medical VQA (involving historical comparison) on chest X-ray images, with 700,703 question-answer pairs. We follow its official split.
- IU-X-ray62 is a set of chest X-ray images paired with clinical reports. The dataset contains 7470 pairs of images and reports. We follow the setting and split of CDGPT263, where we use a single-view image to generate the reports.
Evaluation metrics
Machine rating. We evaluate on four distinct tasks, i.e., disease diagnosis, visual question answering, report generation, and rationale diagnosis. The details of the four tasks and automatic metrics are introduced in the section “Radiology evaluation benchmark (RadBench)”. To evaluate the model’s performance across a range of tasks, distinct evaluation metrics are employed based on the task type. For tasks with pre-defined answer choices, such as disease diagnosis, we adopt standard metrics developed in the community, for example, F1 (“F1 score”) and ACC (“accuracy”). Conversely, for tasks involving open-ended responses, like report generation, visual question answering (VQA), and rationale diagnosis, alternative evaluation metrics, like BLEU, ROUGE, and BERT-sim, are employed. BLEU stands for “BiLingual Evaluation Understudy”64, ROUGE stands for “Recall-Oriented Understudy for Gisting Evaluation”65, and BERT-sim stands for “BERT similarity score”, the F1 BERT score between the generated answer and the correct answer66. For BLEU and ROUGE, unless otherwise specified, we use 1-gram by default.
In addition, inspired by RadCliQ12, a score designed specifically for evaluating generated chest X-ray reports, we also propose two new metrics, UMLS_Precision and UMLS_Recall, which measure the overlap of medical-related words between the ground truth and the predicted response. Specifically, given a pair of ground truth and prediction, we extract the medical-related words from both using the Unified Medical Language System (UMLS)43 and count the overlapping words as true positives. UMLS_Precision follows the classical precision concept, i.e., the number of true-positive words divided by the total number of generated medical-related words. UMLS_Recall follows the recall concept, i.e., the number of true-positive words divided by the total number of medical-related words in the ground truth.
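A minimal sketch of the two metrics is shown below; `extract_umls_terms` is a stand-in for the UMLS-based medical-term extractor (not shown here), and the set-based overlap is an illustrative simplification of the word-level counting described above.

```python
# Sketch of UMLS_Precision / UMLS_Recall. `extract_umls_terms` stands in for a
# UMLS-based medical-term extractor (not shown); the set-based overlap is an
# illustrative simplification of the word-level counting described above.
def umls_precision_recall(prediction: str, ground_truth: str, extract_umls_terms):
    pred_terms = set(extract_umls_terms(prediction))
    gt_terms = set(extract_umls_terms(ground_truth))
    true_positive = len(pred_terms & gt_terms)             # overlapping medical terms
    precision = true_positive / max(len(pred_terms), 1)
    recall = true_positive / max(len(gt_terms), 1)
    return precision, recall

# Toy extractor standing in for the real UMLS matcher.
toy = lambda text: [w.strip(".,").lower() for w in text.split()
                    if w.strip(".,").lower() in {"edema", "pneumothorax"}]
print(umls_precision_recall("Mild edema is seen.", "Edema and pneumothorax.", toy))  # (1.0, 0.5)
```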
Discussion on automatic metrics. Although these automatic metrics have been widely adopted by the community, they often struggle to capture semantic accuracy in generative tasks, for example, question answering, report generation, and rationale generation. To address these limitations and ensure a more accurate evaluation of system performance, we incorporate human evaluation, leveraging the expertise of radiologists to obtain a professional assessment of the quality of the generated answers.
Human rating. For the sake of clinical utility, we further involve manual checking in the evaluation stage and compute the human rating score. Three radiologists were asked to rate the quality of the generated answers using a 0–5 scale. Each radiologist has five years of clinical experience in radiology departments. One is affiliated with Shanghai General Hospital, and the other two are from Shanghai Sixth People’s Hospital. All three completed their studies in “Medical imaging and nuclear medicine” at Shanghai Jiao Tong University. Here are the specifics of each rating:
- 0 (Garbled) – The content is incomprehensible and lacks any readability.
- 1 (Inaccurate) – While readable, the content is entirely incorrect and lacks meaningful information.
- 2 (Partially informative) – The content holds some reference value, yet its correctness is subpar.
- 3 (Moderately accurate) – The content provides reference points, with approximately half of the information being correct, but containing several errors.
- 4 (Mostly accurate) – The content is almost entirely correct, with only a few omissions or errors present.
- 5 (Completely correct) – The content is accurate in its entirety, without any mistakes.
To facilitate this assessment, we have developed a human evaluation interface, visually presenting the generated instances with images, as depicted in Supplementary Fig. 2. Prior to the full evaluation, we conducted a preliminary exam involving 20 randomly sampled test cases. This exam was designed to ensure that the radiologists understood the evaluation criteria. All three radiologists showed consistent results, with one exception: for one case, one radiologist rated the answer as 2 while the others rated it as 3. This indicates that our 0–5 rating system was sufficiently clear for evaluating the model’s outputs. The exam results were also reviewed by a senior radiologist with over 10 years of experience from the radiology department of Shanghai Sixth People’s Hospital, further confirming the validity of the evaluation process. In the evaluation, raters are provided with the images, the question, the correct answer, and a set of generated responses from different models, arranged in a randomized order. The evaluation scores given by the professional radiologists differ from the automatic evaluation metrics, offering greater accuracy and flexibility. In the context of the report generation example shown in the figure, the raters focus on the most crucial aspects, rather than solely on word matching, recall, or precision.
Note that human rating is only performed for the open-ended tasks, i.e., medical VQA, report generation, and rationale diagnosis. For disease diagnosis, the answers are fixed without ambiguity; thus, the automatic metrics can already reflect the performance well. Considering the cost of human rating, for each open-ended task we randomly sample 400 test cases from RadBench, as they are generally collected from clinical practice across the world and can represent real scenarios, resulting in 1.2K cases for human rating in total.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.