Study selection and baseline characteristics
A comprehensive search of multiple databases identified a total of 524 records, comprising 111 from PubMed, 412 from Embase, and 1 from Cochrane. After removing 9 duplicate entries, 515 studies underwent title and abstract screening. Of these, 457 studies were excluded for not meeting the predefined inclusion criteria, leaving 58 reports for full-text review. Following a detailed assessment, 19 studies were deemed eligible for inclusion in this systematic review [21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39] (Fig. 1: PRISMA Flow Diagram of included studies).
PRISMA flow diagram of included studies
Characteristics of included studies
The 19 included studies comprised a range of study designs: model development (n = 5) [26,27,28, 32, 33], model evaluation and analysis (n = 6) [21, 23, 30, 31, 35, 36], model training (n = 5) [24, 25, 29, 37, 38], and studies spanning both model development and evaluation/training (n = 3) [22, 34, 39]. These studies were conducted across multiple regions, including Africa (n = 7) [21, 22, 25, 30,31,32, 37], Asia (n = 5) [24, 33, 34, 38, 39], and global or multicenter cohorts (n = 7) [23, 26,27,28,29, 35, 36]. Sample sizes varied substantially, ranging from as few as 102 dermatological images [21] to over 1.8 million participants [34].
AI models evaluated included deep learning convolutional neural networks (CNNs) (n = 4) [22, 31, 32, 38], support vector machines (SVMs) (n = 1) [21], and ensemble learning or hybrid approaches (n = 1) [21]. Public datasets such as ISIC, HAM10000, Derm7pt, PH2, and MSLD were widely used across studies, especially for melanoma and monkeypox classification tasks. The dermatological conditions addressed included skin cancer (n = 7) [22, 23, 26,27,28, 35, 37], monkeypox (n = 5) [21, 24, 29, 32, 36], general skin aging and facial analysis (n = 3) [30, 31, 34], and others such as acne, rosacea, and burn depth estimation (n = 4) [25, 33, 38, 39]. Refer to Table 1 for the characteristics of the included studies.
AI performance and diagnostic accuracy
Overall, the AI models demonstrated high accuracy in detecting dermatological conditions, with reported sensitivities ranging from 90 to 98% and specificities from 45 to 99% [28, 32, 35]. In 63% (n = 12) of the studies [21,22,23,24,25, 27, 28, 32, 35,36,37, 39], AI models outperformed dermatologists in diagnostic accuracy, whereas in 26% (n = 5) [29,30,31, 33, 38], dermatologists either matched or slightly exceeded AI performance. A subset of studies (n = 4) [22, 24, 30, 39] highlighted the benefits of AI-assisted diagnosis, demonstrating that dermatologists collaborating with AI systems achieved greater diagnostic precision than with unaided clinical assessment.
Comparative analysis of AI and dermatologists
Among studies comparing AI performance directly with dermatologists, 63% (n = 12) [21,22,23,24,25,26, 28, 32, 34,35,36,37, 39] reported that AI models performed on par with or better than board-certified dermatologists in classifying skin lesions. In contrast, 26% (n = 5) [29,30,31, 33, 38] found that dermatologists outperformed AI, particularly in complex cases requiring clinical judgment beyond image-based analysis. The integration of AI into clinical workflows was associated with reduced diagnostic time and improved triage efficiency in 21% (n = 4) of studies [22, 24, 30, 39].
Quality assessment of included studies
A risk of bias assessment using the PROBAST and QUADAS-2 tools revealed that most of the 19 included studies exhibited a moderate to high risk of bias owing to methodological limitations; refer to Table 2 for the detailed assessment. Key concerns included small or unrepresentative sample sizes [30, 38], lack of external validation [25, 27, 29, 35], and unblinded AI assessments that could compromise objectivity [31]. While several studies employed well-established public datasets such as ISIC, HAM10000, and MSLD, supporting transparency in participant selection, many relied heavily on synthetic or non-diverse data, raising concerns about generalizability and demographic inclusivity [21, 23,24,25, 31, 36].
Additional limitations included unclear label verification processes (e.g., unconfirmed PCR standards in Almufareh 2023) [24], inconsistent reporting on model calibration [27, 29, 35], and the absence of blinded outcome assessment [24]. Notably, only Yuan et al. (2020) linked AI predictions to actual clinical outcomes (e.g., healing time), underscoring a broader gap in translational relevance [38]. Overall, although AI models exhibited promising performance in dermatological diagnostics, enhancements in external validation, demographic representation, model calibration, and integration with clinical outcomes are essential to improve their reliability and applicability in real-world settings.
Summary of findings
This systematic review highlights the promising role of AI in dermatological diagnostics, particularly in resource-limited settings. Overall, the evidence suggests that AI has significant potential to enhance early detection of malignant skin lesions [23, 27, 28, 35, 37] and infectious diseases such as monkeypox [21, 24, 29, 32, 36]. However, variability in study methodologies and the need for further clinical validation underscore the necessity for continued research in this field. Future studies should focus on improving AI generalizability through diverse datasets, standardizing evaluation metrics, and integrating AI tools into real-world clinical practice to optimize patient outcomes in low-resource settings.
Pooled analyses of all studies
The review and synthesis revealed several patterns in the application of AI for dermatological diagnosis across settings. Deep learning models analyzing visual images, specifically convolutional neural networks (CNNs), were the most widely used approach, exhibiting strong performance in both skin cancer detection [27, 35] and infectious disease diagnosis, such as monkeypox [21, 36]. These models were particularly useful for analyzing dermoscopic images, with several studies reporting near-expert performance in melanoma detection when models were trained on sufficiently large datasets.
Transfer learning approaches proved practical in resource-limited settings, enabling effective model development despite smaller local datasets. Studies such as Olusonji and Chunglin (2025) [22] demonstrated that pre-trained models could be adapted to local dermatologic datasets while maintaining diagnostic accuracy exceeding 85%. Moreover, their model could run offline without continuous internet connectivity, making it especially useful in regions where stable internet access is unreliable [22]. This approach substantially reduced computational requirements and training time compared with developing models de novo [30, 39].
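The transfer-learning pattern these studies describe (freeze a pretrained backbone and retrain only a small classification head on the local dataset) can be sketched with a toy example. Here a fixed random matrix stands in for a pretrained feature extractor such as MobileNetV2's convolutional base; all dimensions, data, and names are illustrative and not drawn from any included study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pretrained feature extractor (illustrative only):
# its weights are fixed and never updated during local training.
W_pretrained = rng.normal(size=(64, 16)) / 8.0  # hypothetical 64-dim input -> 16 features

def extract_features(x):
    # Frozen backbone: ReLU projection, no gradient updates touch W_pretrained.
    return np.maximum(x @ W_pretrained, 0.0)

# Tiny "local dataset": 40 images flattened to 64-dim vectors, binary labels.
X = rng.normal(size=(40, 64))
w_true = rng.normal(size=16)
F = extract_features(X)
y = (F @ w_true > 0).astype(float)

# Train only the lightweight classification head (logistic regression).
w_head = np.zeros(16)
b_head = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w_head + b_head)))  # predicted probabilities
    w_head -= lr * (F.T @ (p - y)) / len(y)           # gradient step on head only
    b_head -= lr * np.mean(p - y)

acc = np.mean(((F @ w_head + b_head) > 0) == (y > 0.5))
print(f"head-only training accuracy: {acc:.2f}")
```

Because only the small head is updated while the backbone stays frozen, training cost scales with the head rather than the full network, which is what makes this strategy workable on small local datasets and modest hardware.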
The effectiveness of these technologies varied by application. LA-CapsNet is a hybrid deep learning architecture that uses DeepLabV3+ for precise segmentation of skin lesions and combines three pretrained models (MobileNetV2, EfficientNetB0, and DenseNet201), surpassing the accuracy of any individual model [26]. Vision Transformers implement a self-attention mechanism that lets the model attend to different regions of an image; by dividing images into patches, they capture context and long-range relationships and can handle various image sizes and resolutions [35]. For skin cancer detection, models consistently achieved accuracies between 80% and 99%, with LA-CapsNet [26] and Vision Transformers [35] showing robust performance. In infectious disease diagnosis, approaches combining multiple architectures yielded the most reliable results, with Abdelrahim et al. (2024) [21] reporting 95.45% accuracy for monkeypox detection using SVM-CNN hybrids. However, notable performance differences emerged across demographic groups: Kamulegeya et al. (2023) [31] found substantially lower accuracy for Black patients than for Caucasian patients (17% vs. 69.9%) in their Ugandan sample.
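The patch-based input step that Vision Transformers apply before self-attention can be sketched as follows; the `image_to_patches` helper and its dimensions are illustrative and not taken from the implementation evaluated in [35].

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an H x W x C image into flattened, non-overlapping patches,
    mimicking a Vision Transformer's input tokenization step."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "dimensions must divide evenly"
    n_h, n_w = H // patch, W // patch
    # Reshape into an (n_h, n_w) grid of (patch, patch, C) blocks,
    # then flatten each block into one token vector.
    grid = image.reshape(n_h, patch, n_w, patch, C).swapaxes(1, 2)
    return grid.reshape(n_h * n_w, patch * patch * C)

# A toy 32x32 RGB "image" split into 8x8 patches:
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
tokens = image_to_patches(img, patch=8)
print(tokens.shape)  # (16, 192): 16 patch tokens, each 8*8*3 values
```

Each resulting token is then linearly embedded and fed to the self-attention layers, which is what allows the model to relate distant regions of the lesion image to one another.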
For African healthcare systems, these findings raise several critical considerations. The current lack of diverse and representative datasets (as evidenced by Kamulegeya et al.'s limited sample of 123 images [31]) represents a major barrier to equitable AI implementation. Mobile-optimized models such as MobileNetV2 [24] offer a promising solution for rural areas, combining high accuracy (96%) with modest hardware requirements. The most successful implementations emphasized human-AI collaboration, with Yuan et al. (2022) [39] showing improved diagnostic precision when clinicians used AI outputs as screening decision support and a diagnostic aid rather than a replacement for clinical judgment. These findings underscore the need for context-specific implementations that integrate AI technologies with tailored clinical workflows and ongoing clinician training. Future development should prioritize locally collected datasets, lightweight architectures suitable for mobile deployment, and hybrid diagnostic systems that leverage both algorithmic and clinical expertise.