Research Article

Multimodal Machine Learning Approach for Diagnosing Atopic Dermatitis

[version 1; peer review: awaiting peer review]
PUBLISHED 19 Sep 2025

This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

Background

Atopic dermatitis (AD) is a prevalent, chronic inflammatory skin disease with diverse clinical presentations, often overlapping with other dermatoses. Its diagnosis remains largely dependent on clinical expertise, leading to variability and limited diagnostic accuracy, particularly among general practitioners. This study aimed to develop and evaluate a multimodal artificial intelligence (AI) model that integrates lesion image analysis and structured anamnesis to improve AD diagnosis.

Methods

This diagnostic study was conducted in two phases: Phase 1 used retrospective data from 2021–2024, and Phase 2 involved prospective external validation from multiple hospitals in 2025. Patients with AD or related skin conditions were included, with diagnoses based on AAD 2014 criteria. Multimodal fusion combined ResNet50-extracted image features and MPNet-based anamnesis text features using a late fusion model. This approach mimics clinical reasoning by integrating visual and contextual clinical information to classify cases as AD or non-AD.

Results and Discussion

The multimodal AI model integrating ResNet50 (image) and MPNet (anamnesis) achieved 98.28% accuracy in classifying AD vs non-AD, outperforming image- or text-only models. It offers clinical advantages by mimicking physician reasoning, improving diagnostic consistency, reducing subjectivity, and enabling mass triage. However, real-world generalizability remains a challenge due to limited training diversity, potential language constraints (Bahasa Indonesia), and narrow differential diagnoses. External validation and explainable AI (XAI) are critical for broader application. Despite limitations, the model aligns with emerging literature, showing multimodal AI can approach or surpass expert-level performance in dermatological diagnosis when rigorously validated.

Conclusions

The multimodal ResNet50-MPNet model shows near-perfect accuracy in diagnosing AD by mimicking clinician reasoning. It offers consistent, holistic assessment but requires external validation and improved interpretability for clinical adoption. Continued AI-clinician collaboration is vital to translating this promising technology into real-world dermatological care.

Keywords

Atopic dermatitis, Multimodal Artificial Intelligence (AI), ResNet50, MPNet, Dermatology diagnosis, Clinical decision support, Machine learning, Explainable AI (XAI)

Introduction

Atopic dermatitis (AD) is a chronic and relapsing inflammatory skin disease frequently encountered in both children and adults. Its prevalence is estimated to be 15–20% in children and 10% in adults.1–3 The onset typically occurs before the age of five, underscoring the importance of timely and accurate diagnosis to prevent complications and improve quality of life.4–6 Clinically, AD is characterized by severe pruritus and xerosis, and is often associated with allergic comorbidities such as asthma and allergic rhinitis.5,7 The complexity of its pathogenesis—encompassing genetic, immunologic, and environmental factors—along with diverse clinical presentations can hinder accurate diagnosis.8–10 Morphologically, AD may mimic other dermatoses (e.g., psoriasis vulgaris, contact dermatitis, nummular dermatitis), leading to potential misdiagnosis if not thoroughly evaluated.11–13

Another diagnostic challenge is the high inter-clinician variability. Diagnosis of AD currently relies on the clinician’s expertise through careful anamnesis and physical examination. This conventional approach is subjective, resulting in variable diagnostic accuracy, especially among general practitioners. Previous studies report diagnostic accuracy for skin diseases by general practitioners to range from 24% to 70%, markedly lower than that of dermatologists.14 This variation in expertise and experience leads to diagnostic inconsistencies and may result in inappropriate or prolonged treatment. Even standardized severity scoring tools (e.g., SCORAD or EASI) show interobserver disparities due to the subjectivity in assessing certain clinical elements. This condition underscores the need for a more reliable and consistent diagnostic method for AD.13,15

To address these diagnostic challenges, artificial intelligence (AI) presents a promising solution. Advances in AI, particularly deep learning (DL), offer new opportunities in dermatology to enhance diagnostic accuracy and consistency. Research on AI-based tools for diagnosing inflammatory skin diseases has demonstrated significant potential.16–18 Various Convolutional Neural Network (CNN)-based algorithms have successfully recognized and classified skin lesions with high accuracy, comparable to that of dermatologists.13

Wu et al. (2020) developed a DL model using EfficientNet-b4 to classify psoriasis and AD from lesion images, achieving accuracy, sensitivity, and specificity rates above 90%. Maron et al. (2019) showed that CNNs systematically outperformed 112 dermatologists in multi-class skin lesion classification, highlighting the potential of DL in dermatological diagnostics.19 Dautovic et al. applied an artificial neural network (ANN) using nine clinical parameters to diagnose AD in both healthy individuals and AD patients. Yang et al. also utilized DL to recognize dermoscopic images of psoriasis and inflammatory diseases such as dermatitis, achieving a sensitivity of 73%.20

Despite these promising results, several limitations persist—particularly the lack of clinical context. Most current models are trained on images alone, neglecting additional clinical information that could enhance diagnostic decision-making.21 Such single-modality approaches risk overlooking important patient context that clinicians routinely consider in clinical practice. In fact, anamnesis and clinical information—such as patient age, chief complaints, history of atopy, lesion distribution, and family medical history—contain valuable insights that can aid in distinguishing AD from its differential diagnoses and provide a more comprehensive assessment of the patient’s condition. Fundamentally, physicians establish a diagnosis based on a combination of history taking and physical examination.

The fact that only a few multimodal AI models are currently available in dermatology—especially for inflammatory skin diseases such as AD—reveals a significant research gap. Integrating clinical data into AI models has the potential to bridge this disparity, in line with the current medical trend of leveraging diverse patient data sources to enhance diagnostic accuracy. Multimodal AI approaches represent a highly promising avenue for achieving more comprehensive and accurate diagnoses of complex conditions such as AD (Yu et al., 2025).13,22,23

This study aims to develop and evaluate an AI model with a multimodal approach, combining DL-based clinical image analysis with structured textual anamnesis using a transformer-based language model. Specifically, the architecture employs a 50-layer Residual Network (ResNet50) for image feature extraction and a Masked and Permuted Pre-training for Language Understanding (MPNet) for textual feature extraction. ResNet50 is a widely used CNN model with proven success in both medical and general image classification.24 Its architecture introduces skip connections that help the network learn deep image features without losing critical information. In dermatology, ResNet50 has demonstrated high accuracy—for example, achieving 90% on the ISIC 2018 dataset and 95.8% on the PH2 dataset for skin disease detection.25

MPNet is a transformer-based NLP model designed to produce rich contextual sentence representations. Combining the strengths of BERT and XLNet, MPNet effectively captures complex language patterns and generates high-dimensional text embeddings. In this study, MPNet processes structured anamnesis data (e.g., chief complaints, subjective symptoms, history of allergies, family medical history, and others) into numerical text features. The integration of MPNet ensures that relevant non-visual clinical information is appropriately incorporated into the model.

Methods

This diagnostic study was conducted in two phases. Phase 1 focused on the development of the AI model and comprised several stages. The first stage involved the collection of medical data through retrospective review of hospital medical records from 2021 to 2024. Data included clinical information and skin lesion images of patients diagnosed with AD, psoriasis vulgaris (PV), lichen simplex chronicus (LSC), nummular dermatitis (ND), and contact dermatitis (CD), as diagnosed by board-certified dermatologists. In the second stage, images were pre-labeled with relevant clinical information. The third stage involved training the machine learning (ML) model to identify discriminative features and characteristics that differentiate AD from non-AD (Figure 4).


Figure 1. ResNet50 training and validation accuracy image.


Figure 2. F1 score image.


Figure 3. Multimodal model algorithm flowchart.


Figure 4. Research flow – Phase 1.

Phase 2 was a multicenter validation study designed to evaluate the generalizability of the ML model trained in Phase 1. Clinical data and lesion images were collected from patients diagnosed with AD or non-AD skin conditions who visited board-certified dermatologists at various hospitals between January and May 2025. In this phase, dermatologists performed complete clinical interviews, and all clinical data were recorded using a structured form ( Figure 5).


Figure 5. Research flow – Phase 2.

In both Phases 1 and 2, skin lesion images were captured using mobile phones owned by the examining physicians. All participants were enrolled consecutively and provided written informed consent, agreeing to the use and publication of their anonymized data, including medical history and lesion images. For minor participants, written consent was obtained from their legal guardians. Ethical approval was obtained from the Ethics Committee of the Faculty of Medicine, Universitas Indonesia–Cipto Mangunkusumo Hospital. The diagnosis of AD was established based on the criteria outlined in the American Academy of Dermatology (AAD) 2014 guidelines for the diagnosis and assessment of AD. Two board-certified dermatologists independently annotated all images, achieving 100% agreement. The learning outcome was binary: AD or non-AD. All clinical images and patient data were archived systematically. The sample size was calculated using the formula for diagnostic studies, resulting in a minimum requirement of 346 participants for the AD group and 138 for the non-AD group.

Multimodal fusion was implemented by integrating visual features from ResNet50 and textual features from MPNet prior to final prediction. A late fusion approach was used, in which feature vectors from both modalities were concatenated and passed through a classification layer to determine the diagnosis. This method leverages the strengths of both modalities: images provide morphological and distributional characteristics of lesions, while anamnesis text offers clinical context such as chronic pruritus, personal or family history of atopy, and comorbid asthma. The combined approach is expected to reach above 90% accuracy and emulate the way dermatologists diagnose—by integrating visual inspection with the patient’s clinical information.

Results

The characteristics of subjects with and without AD are presented in Table 1. In Phase 1, a total of 926 AD samples and 697 non-AD samples were collected, while Phase 2 yielded 525 AD samples and 663 non-AD samples. The findings indicate that AD was most frequently observed in infants and children, accounting for 45.45% of cases in Phase 1 and 40% in Phase 2. Its prevalence declined with increasing age, with only 0.21% of cases occurring in individuals over 65 years old. These results are consistent with the known epidemiology of AD, which typically begins in childhood or adolescence. AD most commonly manifests in infancy and early childhood, with approximately 60% of cases developing before the age of one, and nearly all cases occurring before the age of five.26

Table 1. Demographic and clinical characteristics of phase 1 and phase 2 study subjects.

| Characteristic | Phase 1 AD, n (%) | Phase 1 Non-AD, n (%) | Phase 2 AD, n (%) | Phase 2 Non-AD, n (%) |
|---|---|---|---|---|
| Age group | | | | |
| Toddler: 0–5 years | 189 (20.41) | 0 (0.00) | 69 (13.14) | 3 (0.45) |
| Child: 5–11 years | 125 (13.50) | 4 (0.57) | 98 (18.67) | 32 (4.83) |
| Early adolescent: 12–16 years | 107 (11.56) | 20 (2.87) | 43 (8.19) | 11 (1.66) |
| Late adolescent: 17–25 years | 199 (21.49) | 60 (8.61) | 89 (16.95) | 86 (12.97) |
| Young adult: 26–35 years | 107 (11.56) | 192 (27.54) | 77 (14.67) | 128 (19.31) |
| Middle-aged adult: 36–45 years | 109 (11.77) | 115 (16.50) | 67 (12.76) | 84 (12.67) |
| Early elderly: 46–55 years | 73 (7.89) | 232 (33.29) | 41 (7.81) | 142 (21.42) |
| Late elderly: 56–65 years | 15 (1.61) | 74 (10.62) | 12 (2.29) | 109 (16.44) |
| Senior: > 65 years | 2 (0.21) | 0 (0.00) | 29 (5.52) | 68 (10.26) |
| Sex | | | | |
| Male | 460 (49.68) | 256 (36.73) | 242 (46.10) | 313 (47.21) |
| Female | 466 (50.32) | 441 (63.27) | 283 (53.90) | 350 (52.79) |
| Duration of illness | | | | |
| 1 week | 31 (3.35) | 16 (2.30) | 18 (3.43) | 58 (8.75) |
| 2 weeks | 15 (1.62) | 0 (0.00) | 38 (7.24) | 50 (7.54) |
| 3 weeks | 0 (0.00) | 2 (0.29) | 0 (0.00) | 0 (0.00) |
| 4 weeks | 34 (3.67) | 53 (7.60) | 23 (4.38) | 14 (2.11) |
| > 4 weeks | 846 (91.36) | 626 (89.81) | 446 (84.95) | 541 (81.60) |

Image

The dataset used in this study consists of images of skin lesions from patients diagnosed with AD, as well as from patients with other skin conditions whose lesions visually resemble those of AD. To perform a supervised classification task, we categorized the images into two distinct classes: AD (AD lesions) and Non-AD (non-AD lesions).

To enhance model generalization and robustness, we applied extensive data augmentation prior to training. Each original image was augmented five times using the Albumentations library, applying horizontal and vertical flips, random brightness and contrast adjustments, rotations of up to 30 degrees, mild Gaussian blurring, and additive Gaussian noise. After augmentation, all images were resized to a standardized dimension of 224 × 224 pixels, and pixel values were normalized using the ImageNet mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225] to match the input expectations of the pretrained models.

For the classification task, we evaluated two DL architectures: ResNet50 and Vision Transformer (ViT). Each model’s final classification layer was adapted to produce outputs corresponding to the two target classes, AD and Non-AD.
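The normalization step described above can be sketched as follows. This is a minimal illustration in NumPy, not the authors' pipeline (which used Albumentations); the function name is ours, and resizing and augmentation are assumed to have been applied beforehand.

```python
import numpy as np

# Per-channel ImageNet statistics cited in the text
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_image(img_uint8):
    """Scale an HxWx3 uint8 image to [0, 1], then standardize each
    channel with the ImageNet mean and standard deviation (channels
    last). Resizing to 224x224 is assumed to happen earlier."""
    img = img_uint8.astype(np.float32) / 255.0
    return (img - IMAGENET_MEAN) / IMAGENET_STD
```

Pretrained ResNet50 weights were fit on inputs standardized this way, which is why the same statistics must be reused at inference time.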

Training was conducted using the PyTorch DL framework, which is widely used in computer vision research for its flexibility and high performance. Because the dataset in this study is relatively small (a common situation in the medical field, where obtaining labeled images can be challenging), it is crucial to adopt an evaluation strategy that maximizes the use of available data while still providing a reliable estimate of model performance. To address this, we employed K-fold cross-validation.

Rather than splitting the dataset into a single training and testing group, the cross-validation approach divides the dataset into ten equal parts, or “folds.” In each round of training, the model is trained using nine of these folds and evaluated on the remaining one. This process is repeated ten times, with each fold taking a turn as the validation set. At the end of this procedure, the results from all ten rounds are averaged to provide a comprehensive assessment of the model’s performance. This technique reduces the risk of the evaluation being biased by the specific choice of training or testing data and is particularly important when dealing with medical datasets, where the number of samples can be limited. It offers a favorable balance between model accuracy and computational efficiency.

During training, we used the Adam optimizer, which adapts the effective learning rate for each parameter to improve the model’s ability to learn from data. To further refine the training process, we applied learning rate scheduling, gradually reducing the learning rate over time to allow the model to converge more smoothly. In addition, we implemented early stopping, halting training if the model’s performance on validation data does not improve for several consecutive epochs. This prevents overfitting, in which the model memorizes the training data, and helps ensure that it learns patterns that generalize to unseen data.
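The early-stopping rule described above amounts to a patience counter on the validation metric. A minimal sketch (class name and patience value are illustrative, not the paper's):

```python
class EarlyStopping:
    """Signal a stop when the validation metric fails to improve
    for `patience` consecutive epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Record one epoch's validation metric; return True when
        training should halt."""
        if val_metric > self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `if stopper.step(val_acc): break` after each epoch implements the behavior described in the text.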

Throughout the training and evaluation process, we monitored several key metrics to assess model performance: accuracy, which measures how often the model’s predictions matched the correct labels; precision, which indicates how many of the predicted AD cases were actually correct; recall, which measures how many actual AD cases the model successfully identified; and F1-score, which balances precision and recall into a single number summarizing the model’s effectiveness. By considering these multiple aspects, we aimed to evaluate not just whether the model made correct predictions overall, but also whether it correctly identified cases of AD without missing too many or falsely predicting AD where there was none.
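The four metrics above follow directly from the confusion-matrix counts, with AD as the positive class. A minimal sketch (the function name is ours):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for a binary classifier,
    given true/false positive and negative counts with AD treated
    as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1
```

Note that precision penalizes false AD alarms while recall penalizes missed AD cases, which is why both are tracked alongside overall accuracy.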

After completing the full training and evaluation pipeline, ResNet50 emerged as the better-performing of the two architectures tested, achieving an accuracy of 0.8750 compared with 0.6013 for the ViT. ResNet50 consistently demonstrated superior accuracy, precision, recall, and F1-score (Table 2), exhibited strong stability during training, and converged more rapidly. These results suggest that convolutional architectures with residual connections, such as ResNet, are particularly well suited to medical image classification tasks such as skin lesion analysis, where fine-grained texture analysis is paramount.

Table 2. Key performance metrics of ResNet50 vs ViT.

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| ResNet50 | 0.8750 | 0.8809 | 0.8750 | 0.8728 |
| ViT | 0.6013 | 0.5809 | 0.6013 | 0.5487 |

Text

The dataset used in this study consists of anamnesis texts from patients diagnosed with AD, as well as from patients with other skin conditions whose lesions visually resemble those of AD. Clinical text inputs were derived from dermatologist notes within medical records during Phase 1, and from patient anamnesis recorded in structured forms during Phase 2. The texts are written in Bahasa Indonesia. To perform a supervised classification task, we categorized the anamneses into two classes, AD and Non-AD. Each anamnesis contains parameters such as patient age, gender, symptoms, and contact history, which were arranged according to the following template:

Pasien dengan jenis kelamin (patient sex): <gender>

usia (age): <age>

klasifikasi usia (age classification): <age_classification>

Keluhan utama dan onset (chief complaint and onset): <main_onset_symptoms>

Riwayat kontak dengan bahan alergen atau iritan (history of contact with allergens or irritants): <irritant_allergen_contact_history>

Sumber infeksi (source of infection): <infection_source>

Faktor pencetus penyakit saat ini (trigger of the current disease): <current_disease_cause>

Lama sakit (duration of illness): <disease_duration>

Lokasi lesi (lesion location): <lesion_location>

Kriteria mayor (major criteria): <major_criterion>

Kriteria minor (minor criteria): <minor_criterion>

Riwayat penyakit dahulu (past medical history): <old_historical_disease>

Riwayat penyakit keluarga (family medical history): <family_historical_disease>

By filling in this template, we obtained a single string as the model input to be classified by the proposed method. Before being fed to the model, this string was pre-processed with a tokenizer. A tokenizer is a tool used in natural language processing (NLP) to break text down into smaller units, called tokens; these can be words, subwords, or characters, depending on the tokenization method used. The purpose of tokenization is to transform raw text into a more manageable and structured form that can be analyzed or fed into ML models.
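The template-filling step can be sketched as simple string formatting. This is illustrative only: it covers a small subset of the fields above, and the authors' exact schema and rendering code are not published.

```python
# Hypothetical subset of the template fields shown above,
# with Indonesian labels kept as in the source document.
TEMPLATE = (
    "Pasien dengan jenis kelamin: {gender}\n"
    "usia: {age}\n"
    "Lama sakit: {disease_duration}\n"
    "Lokasi lesi: {lesion_location}"
)

def build_anamnesis_text(record):
    """Render one patient's structured anamnesis fields into the
    fixed template string that the text model receives."""
    return TEMPLATE.format(**record)
```

Keeping every record in one fixed layout means the tokenizer always sees field labels in the same positions, which helps the downstream classifier associate values with their clinical meaning.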

In this study we used MPNet, a pre-trained state-of-the-art transformer model that builds on BERT (Bidirectional Encoder Representations from Transformers) and related models such as RoBERTa and ALBERT. MPNet improves upon BERT’s pretraining strategy with a more sophisticated approach to handling masked tokens, learning better dependencies between tokens in a sequence. The tokenizer used in MPNet is based on the WordPiece tokenization method, which is also used in BERT and similar transformer models.

The pre-trained MPNet model outputs 768-dimensional vectors. For the classification task, we appended dense layers downstream of the MPNet encoder to form a single-output classifier, implemented in the PyTorch DL framework. For training, we split the dataset into training and validation sets at a ratio of 8:2. After training, the fine-tuned MPNet model achieved 100% accuracy on both the training and validation data.

Multimodal

The multimodal approach in this study integrates two powerful models: ResNet50 for visual data and MPNet for textual data, to enhance classification performance by leveraging both image and text information. ResNet50, a deep CNN, is employed to process the images of skin lesions. These images are first preprocessed through augmentation, resizing, and normalization, ensuring compatibility with the input requirements of ResNet50. The model then generates a feature vector from the image, capturing the essential visual characteristics of the skin lesion, which represents the image data for subsequent processing.

In parallel, MPNet, a transformer-based model, is applied to the textual data consisting of patients’ anamnesis information. MPNet tokenizes the text and generates a fixed-size vector (768-dimensional), which encodes the relationships and context between the words. This representation is essential for understanding the nuanced medical history of the patients, such as symptoms, age, and other relevant factors. MPNet’s ability to model contextual dependencies and generate more accurate token representations provides an edge over traditional models like BERT, especially when dealing with complex and diverse medical texts.

Once both ResNet50 and MPNet have processed their respective modalities and produced their vectorized outputs, the next step is to combine these two vectors into a single unified feature representation. This fusion of image and text data ensures that the model is utilizing all available information, combining the fine-grained visual analysis from the images with the rich, contextual information from the text. The feature vectors from both modalities are concatenated, creating a comprehensive representation that encapsulates the full scope of the data.

Finally, to refine the combined feature vector and prepare it for classification, several dense layers are applied downstream. These dense layers help the model learn the most relevant patterns from the multimodal data. The final output from the dense layers is a classification label that distinguishes between the two target classes: AD and Non-AD. This multimodal approach effectively leverages both image and text data, improving the model’s robustness and accuracy by considering multiple facets of the data simultaneously, which is crucial in medical classification tasks.
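The fusion head described above can be sketched as a small PyTorch module. This is a schematic under stated assumptions: a 2048-dimensional pooled ResNet50 embedding and a 768-dimensional MPNet embedding are concatenated and passed through dense layers; the class name, hidden width, and single-logit output are illustrative, since the paper does not specify these details.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Late-fusion sketch: concatenate an image embedding (ResNet50
    pooled features, 2048-d) with a text embedding (MPNet sentence
    vector, 768-d), then classify AD vs non-AD via dense layers.
    Layer sizes are illustrative assumptions."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit: AD vs non-AD
        )

    def forward(self, img_feat, txt_feat):
        # fuse the two modalities by simple concatenation
        fused = torch.cat([img_feat, txt_feat], dim=1)
        return self.head(fused)
```

Because fusion happens after each backbone has produced its own embedding, either backbone can be swapped or fine-tuned independently, which is the practical appeal of late fusion over joint (early) fusion.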

Clinical implications, strengths, and limitations of the multimodal AI model

The image-only model achieved an accuracy of 87.50% on the training set and 90.13% on the test set (Figure 1). Key performance metrics, including accuracy, precision, recall, and F1 score, are presented in Figure 2 and Table 2. The text-only model achieved 100% accuracy on both the training and validation sets. Following the fusion of both modalities, the combined model reached an overall accuracy of 98.28% (Figure 3).

The multimodal AI approach employed in this study offers several advantages. First, the AI model can consistently analyze hundreds of microscopic visual features and textual patterns without fatigue, enabling the potential for high accuracy and sensitivity across diverse cases. In fact, numerous studies have demonstrated that AI performance in dermatological diagnosis can match or even surpass that of dermatology specialists for certain specific tasks.27 For instance, a recent meta-analysis by Salinas et al. (2024) on skin cancer diagnosis reported AI algorithms achieving a sensitivity of 87% and specificity of 77%, comparable to expert dermatologists (sensitivity 84%, specificity 74%).28 In the context of AD, the ResNet50 model, when specifically trained, has outperformed conventional models and approached expert-level performance in assessing AD severity.29 By integrating anamnesis data, AI can also account for factors typically gathered through patient interviews, thereby allowing for more clinically informed decision-making.

AI systems can also process multiple cases simultaneously, supporting mass screening or triage workflows, thereby allowing physicians to prioritize complex cases. Moreover, multimodal AI reduces subjectivity: decisions are derived from patterns learned across hundreds of examples, rather than individual intuition, which can vary among physicians. With a reported accuracy of 98.28%, this system has strong potential as a reliable diagnostic aid, offering second opinions to clinicians and enhancing general practitioners’ confidence in correctly identifying AD.30

Nevertheless, several important considerations must be addressed. First, the model’s outstanding performance in a controlled setting may not necessarily reflect its effectiveness in real-world clinical practice. Generalizability remains a common challenge in AI models, as they may learn patterns specific to a particular hospital’s dataset, potentially resulting in reduced accuracy when applied to different patient populations or images captured with other camera devices. While 10-fold cross-validation provides a comprehensive internal evaluation, external validation using data from independent clinics is essential to ensure the model is not overfitting. The high accuracy of 98.28% raises a concern that the model may have been overtrained on a limited dataset. In this study, external validation was conducted using datasets from multiple hospitals to ensure that the model maintains reliable and consistent performance. In future work, explainable AI (XAI) techniques could be employed to enhance interpretability, allowing the model’s decision-making process to be visually explained and verified against clinical reasoning.31,32

Another key issue is interpretability. Although AI can achieve high levels of accuracy, it remains limited in its ability to explain the rationale behind a given prediction. Physicians, by contrast, can justify a diagnosis based on observable clinical features (e.g., “there is excoriation due to intense itching, a typical distribution pattern on the infant’s cheeks, and a positive family history of atopy, consistent with AD”). AI models require XAI techniques to offer similar justifications. Without such interpretability, trust in AI-generated recommendations—both among physicians and patients—may be diminished. Moreover, broader clinical context is often necessary for final decision-making. Physicians may rely on direct physical examination (e.g., palpating the skin texture), additional diagnostic tests, or further clinical history. AI models, on the other hand, can only process the limited input they are provided (i.e., images and structured textual data), making their reasoning less flexible than that of physicians. Clinicians can pose follow-up questions, revise diagnostic hypotheses in real time, and manage atypical or complex cases not represented in the training data—for example, a patient presenting with two concurrent skin conditions.

Finally, physicians possess critical strengths in empathy and clinical ethics—for example, delivering a diagnosis and treatment plan in a psychologically appropriate and compassionate manner. This human element lies beyond the scope of AI models. Therefore, AI should be viewed as a complementary tool that supports physicians, rather than as a full replacement. AI is intended to serve as a decision support system, not a substitute for clinical judgment. Collaborative use of AI and physicians has been shown to improve diagnostic accuracy compared to either working alone.33 In practice, AI can provide a rapid second opinion, while the final decision and any necessary contextual adjustments remain the responsibility of the physician.

Discussion

Accuracy of 98.28% and its relevance in current literature

The final accuracy of 98.28% achieved by the multimodal ResNet50 and MPNet model in distinguishing between AD and non-AD cases represents an exceptionally high performance. To assess its relevance, this result must be compared with recent studies in the field of computer-aided dermatological diagnosis. In general, AI models for skin lesion classification typically achieve accuracies above 80–90%. For example, Wu et al. (2020) developed and validated an image-based DL model, AIDDA, using the EfficientNet-b4 CNN architecture to automate the diagnosis of common inflammatory skin diseases, including psoriasis, eczema, and AD. Trained on a dataset of 4,740 expert-labeled clinical images, the model achieved high diagnostic performance, with an overall accuracy of 95.8%, sensitivity of 94.4%, and specificity of 97.2%.

Therefore, the 98.28% accuracy achieved in this study warrants attention as a highly promising result. It highlights the potential advantage of a multimodal approach. By incorporating additional clinical data, the model may resolve ambiguities that remain when relying on image analysis alone, allowing for near-perfect classification across samples. Other multimodal studies have also reported strong performance. Adebiyi et al. (2024), for instance, integrated lesion images with patient metadata and achieved an area under the curve (AUC) of 0.94 (94%) on the HAM10000 dataset—an improvement over image-only models.34 In addition, recent developments in transformer-based fusion approaches, such as TFormer, have been specifically designed to deeply integrate multimodal information and have demonstrated enhanced diagnostic accuracy across various types of skin lesions.17

Nevertheless, the scientific community emphasizes the necessity of validating AI models in real-world clinical scenarios. Burlando et al. (2024), in a recent meta-analysis, stressed the importance of external validation across diverse populations and collaborative decision-making between AI and clinicians.28 Near-perfect accuracy may indicate model excellence, but also raises the concern of potential data leakage or bias. For instance, if structured anamnesis texts contain inadvertently predictive features (e.g., specific keywords unique to AD cases), the model might overfit to non-clinical artifacts. Thus, transparent methodological reporting—such as details on 10-fold cross-validation and feature importance analyses—is essential for interpreting these results responsibly.

The 98.28% accuracy is consistent with previous studies reporting that advanced DL models, particularly those utilizing multimodal inputs, can achieve diagnostic performance comparable to or exceeding that of experienced clinicians in specific tasks.19 This parallels recent innovations in digital dermatology, such as the PanDerm foundation model, trained on millions of multimodal images, which outperformed dermatologists in early melanoma detection.35 Similarly, generative models like SkinGPT-4 (2024) now integrate vision transformers (ViT) with large language models to generate diagnostic reports that approximate physician reasoning.36 In this context, our findings contribute compelling evidence that multimodal AI has the potential to reshape dermatological diagnostics, provided that it proves consistent in larger-scale studies.

Several limitations warrant consideration. First, the model was trained on anamnesis written in Bahasa Indonesia using a language-specific version of MPNet. Although MPNet has multilingual variants, the one applied here may not generalize effectively to English or other languages. Moreover, differences in documentation style (e.g., brief vs. narrative notes) could also impact model performance. Future implementations should consider language adaptation or employ inherently multilingual models for broader applicability.37

Second, the differential diagnosis in the present study was restricted to four conditions: lichen simplex chronicus, psoriasis vulgaris, nummular dermatitis, and contact dermatitis. In clinical reality, AD mimickers may also include conditions such as scabies, fungal infections, or seborrheic dermatitis, which were not represented in the training data. This raises concerns about generalizability when the model is applied to unfamiliar cases. Expanding the dataset to encompass a broader spectrum of dermatoses, or transitioning to a multiclass classification model, could improve both utility and clinical relevance.
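The transition to a multiclass model suggested above amounts to widening the classifier head from a binary output to one logit per diagnosis. The sketch below is illustrative only: the class list combines AD with the four mimickers named in this study, and the fused-feature dimension of 512 is an assumption, not the paper's architecture.

```python
# Hedged sketch: widening a binary AD / non-AD head into a multiclass head
# over AD and its mimickers. Class list and the 512-d fused-feature size
# are illustrative assumptions.
import torch
import torch.nn as nn

CLASSES = [
    "atopic dermatitis",
    "lichen simplex chronicus",
    "psoriasis vulgaris",
    "nummular dermatitis",
    "contact dermatitis",
]

head = nn.Linear(512, len(CLASSES))  # one logit per candidate diagnosis
probs = torch.softmax(head(torch.randn(1, 512)), dim=-1)
prediction = CLASSES[int(probs.argmax())]
print(prediction, float(probs.max()))
```

Adding further mimickers such as scabies or seborrheic dermatitis would then only require extending the class list and retraining the head, rather than redesigning the fusion pipeline.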

Third, real-world deployment of this system would require regulatory approval, robust documentation of safety and reliability, and integration with electronic health records to ensure a seamless clinical workflow. This would necessitate interdisciplinary collaboration among clinicians, engineers, and policy makers. Successful implementation will also require addressing key barriers: ensuring algorithmic transparency, establishing adequate regulation, and improving clinicians' knowledge of AI to reduce fears of replacement.

Conclusion

The multimodal approach combining ResNet50 and MPNet represents a breakthrough in the automated diagnosis of AD. By integrating image and textual data, the model is able to emulate a holistic diagnostic approach similar to that of clinicians, as demonstrated by its near-perfect internal accuracy. Compared to baseline physician assessments, this model offers the potential for greater consistency in diagnosis. However, its adoption in clinical practice must be supported by robust external validation and improved interpretability to gain trust within the medical community. The reported 98.28% accuracy reflects the promise of multimodal AI in dermatology, echoing broader trends in the field toward increasingly sophisticated diagnostic systems. Nonetheless, it is essential to remain critical of the limitations and challenges involved in real-world implementation. With continued advancement in AI research, collaboration between AI and clinicians holds great promise for improving diagnostic accuracy and the overall quality of care, particularly for patients with AD and other dermatologic conditions.38,39

Ethical considerations

Ethical approval for the study was obtained from the Health Research Ethics Committee, Faculty of Medicine, Universitas Indonesia–Cipto Mangunkusumo Hospital (certificate number KET-1477/UN2.F1/ETIK/PPM.00.02/2024). All participants provided written informed consent prior to data collection; for minor participants, written consent was obtained from their legal guardians. Patient data and images were anonymized before use, and no identifiable information was included in the analysis. This diagnostic study did not involve any interventional procedures.

Reporting guidelines

Figshare: STARD checklist for Multimodal Machine Learning Approach for Diagnosing Atopic Dermatitis

https://doi.org/10.6084/m9.figshare.29925533

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

Source code available from:

Archived software available from:

License: MIT License

How to cite this article:
Widiawaty A, Indriatmi W, Jatmiko W et al. Multimodal Machine Learning Approach for Diagnosing Atopic Dermatitis [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:952 (https://doi.org/10.12688/f1000research.169102.1)