Research Article

How to Achieve Near-Perfect Gender Inference Accuracy in Medicine Using ChatGPT

[version 1; peer review: awaiting peer review]
PUBLISHED 03 Nov 2025


Abstract

Background

Gender inference from names is widely used in bibliometric and epidemiologic research, including in general internal medicine. Traditional tools such as Gender API and NamSor are considered accurate but remain limited by misclassifications and unclassified cases. Recent studies suggest that ChatGPT may perform comparably to these tools. We aimed to test whether a two-step procedure could further improve ChatGPT’s performance.

Methods

We evaluated ChatGPT-5 against Gender API using a random sample of 1,000 Swiss physicians. A two-step one-shot prompt was applied: (1) assign gender directly from the name if reliable; (2) otherwise, verify using the internet. Gender API was applied to the same dataset with no threshold and at probability thresholds of ≥60%, ≥70%, ≥80%, and ≥90%. Confusion matrices, McNemar’s test, and accuracy metrics (errorCoded, errorCodedWithoutNA, naCoded) were computed.

Results

Of 1,000 physicians, 523 (52.3%) were women and 477 (47.7%) were men. ChatGPT-5 achieved 996 correct classifications (99.6%), with 4 errors and no unclassified cases, whereas Gender API (whole sample) achieved 977 correct classifications (97.7%), 18 errors, and 5 unclassified cases (p<0.001). At higher thresholds, Gender API reduced errors but produced up to 6.5% unclassified cases. Overall error rates (errorCoded) were 0.4% for ChatGPT-5 versus 2.3% for Gender API. ChatGPT-5 marked 10.1% of names as “checked” through internet verification, increasing to 69.6% among cases that Gender API misclassified or left unclassified.

Conclusion

ChatGPT-5 substantially outperformed Gender API in gender inference from physicians’ names, achieving near-perfect accuracy without unclassified cases. Its adaptive use of internet verification for difficult names may offer a robust and efficient approach for large-scale research.

Keywords

AI, artificial intelligence, ChatGPT, comparison, gender, Gender API, gender inference, medicine, name, name-to-gender, natural language processing

Introduction

Name-to-gender inference services, which determine gender from names, have become an increasingly common methodological tool in medical, social science, and bibliometric research.1 These tools enable investigators to rapidly assess gender representation across large datasets at relatively low cost, thereby supporting studies on equity, diversity, and inclusion in science and medicine, including in general internal medicine. Applications have included the evaluation of gender disparities in editorial boards,2–5 research funding,6,7 and scientific authorship,8–12 consistently highlighting the persistent underrepresentation of women in leadership and senior authorship roles. Among the best-known services are Gender API, NamSor, and Genderize, with reported misclassification rates of ≤5%.1,13,14 Nevertheless, performance varies across cultural and linguistic contexts, and classification of unisex or rare names remains a challenge. Furthermore, most tools rely on proprietary databases and contextual metadata, which may limit reproducibility and transparency.

The advent of artificial intelligence (AI) and large language models (LLMs) such as ChatGPT opens new possibilities for gender inference. Early studies have suggested that ChatGPT performs comparably to the best existing tools, with the added benefit of reproducibility.15,16 However, few investigations have systematically evaluated its accuracy in this domain.

In the present study, we extend this line of research by using the most recent version of ChatGPT (ChatGPT-5) in a structured two-step “one-shot” query: (1) if gender can be reliably determined from the name alone, it is assigned directly; (2) if not, ChatGPT searches the internet to verify the gender. By combining linguistic knowledge with web-based verification, this approach aims to maximize classification performance. We hypothesize that ChatGPT-5, implemented with this two-step procedure, will achieve higher accuracy than Gender API.

Methods

Study population

This study builds on previous work that evaluated the performance of four name-to-gender inference services: Gender API, NamSor, Genderize, and Wiki-Gendersort.1 It was conducted in Switzerland, a multilingual and multicultural country with four national languages (German, French, Italian, and Romansh) and a high proportion of foreign-trained physicians (36% overall; 33% in outpatient medicine and 40% in hospital medicine).1 The most frequent countries of origin among foreign physicians are Germany (53%), Italy (9%), France (7%), and Austria (6%).

The database used for the current work was described in detail in the primary study.1 In brief, it was compiled by merging several sources: (1) physicians and trainee physicians affiliated with the University Hospital of Geneva, the largest hospital in Switzerland, (2) senior physicians working in Swiss university hospitals, and (3) community-based physicians. After deduplication, the database comprised 6,131 physicians, of whom 50.3% were women.

In our primary study, the probable origin of first names was inferred using nationalize.io, which assigned an origin to 85% of names.1 The majority of names (88%) originated from Western countries or countries where Western languages are widely spoken, while 12% originated from Asian or Arabic-speaking countries, reflecting both the predominance of Western names and the multicultural nature of the physician population in Switzerland.

For the present study, we randomly selected a subsample of 1,000 physicians. For each physician, first name, last name, and gender (as recorded in the source database) were available and used as the reference standard.

Pretesting

Before analyzing this sample, we conducted pretests with an independent random set of 100 physicians, who were not included in the final dataset. To determine the optimal batch size for prompting ChatGPT, we submitted lists of 1, 5, 10, 20, 50, and 100 names. Each test was repeated several times, always in a new chat session. Performance was assessed by comparing ChatGPT’s output against the reference gender recorded in the source database, with discrepancies manually checked. The most consistent and accurate performance was obtained with 10 names per query. This setting was therefore retained for the study, which consisted of 100 queries of 10 names each, with a new chat session for every query.
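For readers who wish to replicate the pretest, the bookkeeping reduces to comparing each batch-size condition against the reference standard. Below is a minimal Python sketch under the assumption that the manually collected ChatGPT outputs have been transcribed into the hypothetical containers `reference` and `predictions_by_batch_size`; neither is part of the study's actual tooling.

```python
# Sketch of the batch-size pretest bookkeeping. In the study, outputs were
# collected manually in fresh chat sessions; the containers below are
# hypothetical placeholders for those transcribed results.

from typing import Dict, List

def accuracy(predicted: List[str], reference: List[str]) -> float:
    """Proportion of predictions that match the reference standard."""
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)

# reference[i]: gender recorded in the source database for pretest name i.
reference: List[str] = []  # 100 entries, e.g. "Female", "Male", ...

# predictions_by_batch_size[k][i]: ChatGPT's answer for name i when the
# 100 pretest names were submitted k names per query.
predictions_by_batch_size: Dict[int, List[str]] = {}

for k in (1, 5, 10, 20, 50, 100):
    if k in predictions_by_batch_size:
        acc = accuracy(predictions_by_batch_size[k], reference)
        print(f"batch size {k:>3}: accuracy {acc:.3f}")
```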

ChatGPT procedure

We used ChatGPT-5 (OpenAI) in a one-shot setting, a choice motivated by previous evidence of its reliability for gender determination15,16 and by feasibility considerations given the large number of queries required. To increase the difficulty of the task and to avoid potential reliance on publicly available registries of Swiss physicians, individuals were referred to as “researchers” rather than “physicians” in the prompt. This wording was also consistent with the Swiss context, in which physicians are expected to publish scientific articles as part of their certification process.

The exact prompt used was: “Here is a list of 10 researchers (first name + last name). Your task is to determine their gender (Female / Male). Instructions: (1) If the gender can be reliably determined from the researcher’s name alone, assign it directly. (2) If the gender cannot be reliably determined from the name, search the internet to verify it. (3) Output exactly 10 lines, one per researcher, in the following format: gender (Female/Male), checked (Yes/No), first name, last name. (4) Do not include explanations, reasoning, or tables. Only output the 10 lines”.

All queries were conducted manually in a new chat window to avoid memory effects or contamination from previous prompts. Internet verification was allowed only when the model indicated that the name alone was insufficient for reliable classification.
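Although all queries were run manually in the ChatGPT interface, the prompt assembly and output parsing are mechanical. The following Python sketch builds the exact prompt quoted above for a batch of 10 names and parses the expected four-field output lines; the function names (`build_prompt`, `parse_output`) are illustrative, not part of the study's workflow.

```python
# Sketch: assemble the study's batched prompt and parse the expected output.
# In the study, each prompt was pasted into a fresh ChatGPT session by hand.

from typing import List, Tuple

PROMPT_TEMPLATE = (
    "Here is a list of 10 researchers (first name + last name). "
    "Your task is to determine their gender (Female / Male). Instructions: "
    "(1) If the gender can be reliably determined from the researcher's name "
    "alone, assign it directly. (2) If the gender cannot be reliably "
    "determined from the name, search the internet to verify it. "
    "(3) Output exactly 10 lines, one per researcher, in the following "
    "format: gender (Female/Male), checked (Yes/No), first name, last name. "
    "(4) Do not include explanations, reasoning, or tables. "
    "Only output the 10 lines.\n\n{names}"
)

def build_prompt(batch: List[Tuple[str, str]]) -> str:
    """Assemble the one-shot prompt for a batch of 10 (first, last) names."""
    assert len(batch) == 10, "the study used exactly 10 names per query"
    names = "\n".join(f"{first} {last}" for first, last in batch)
    return PROMPT_TEMPLATE.format(names=names)

def parse_output(text: str) -> List[dict]:
    """Parse lines of the form 'Female, Yes, Marie, Dupont' (hypothetical)."""
    records = []
    for line in text.strip().splitlines():
        gender, checked, first, last = [f.strip() for f in line.split(",", 3)]
        records.append({"gender": gender, "checked": checked == "Yes",
                        "first_name": first, "last_name": last})
    return records
```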

Comparator: Gender API

As a comparator, we used Gender API (gender-api.com), identified as the most accurate name-to-gender inference tool in our primary study.1 An Excel file containing the first and last names of the 1,000 physicians was uploaded. For each case, Gender API returned an assigned gender (female, male, or unknown), a probability score, and the sample size underlying the inference. Analyses were conducted on the full dataset as well as at multiple probability thresholds (≥60%, ≥70%, ≥80%, and ≥90%). When the probability score fell below the defined threshold, the case was coded as “unclassified” for the purposes of analysis.
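The study used Gender API's file-upload interface rather than its API. For a programmatic alternative, the sketch below queries the service's classic REST endpoint and applies the same threshold coding; the endpoint path and response fields (`gender`, `accuracy`) reflect the public documentation as we understand it and should be verified against the current API before use.

```python
# Sketch: query Gender API per first name and apply a probability threshold.
# Endpoint and field names follow the service's classic interface as publicly
# documented; treat them as assumptions and check the current docs. The study
# itself used the Excel-upload interface, not this API.

import requests

API_KEY = "YOUR_GENDER_API_KEY"  # placeholder

def classify(first_name: str, threshold: int = 0) -> str:
    """Return 'female', 'male', or 'unclassified' for a given first name."""
    resp = requests.get(
        "https://gender-api.com/get",
        params={"name": first_name, "key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    gender = data.get("gender")          # 'female', 'male', or 'unknown'
    accuracy = data.get("accuracy", 0)   # probability score, 0-100
    if gender not in ("female", "male") or accuracy < threshold:
        return "unclassified"
    return gender

# Replicating the study's analyses at the five threshold settings:
# for t in (0, 60, 70, 80, 90):
#     labels = [classify(name, threshold=t) for name in first_names]
```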

Statistical analysis

For both ChatGPT-5 and Gender API, we constructed confusion matrices and compared performance using McNemar’s test. For each tool, we calculated the number and proportion of correct, incorrect, and unclassified assignments. Incorrect assignments are also referred to as misclassifications, and unclassified assignments as nonclassifications. We also computed standard performance metrics used in accuracy studies of gender inference1,13,17,18 (a brief computational sketch follows these definitions):

  • errorCoded (overall error rate), calculated as the number of incorrect plus unclassified assignments divided by the total number of cases.

  • errorCodedWithoutNA (error rate excluding unclassified cases), calculated as the number of incorrect assignments divided by the number of cases with a classification.

  • naCoded (proportion of unclassified cases), calculated as the number of unclassified assignments divided by the total number of cases.
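Each metric is a simple ratio over the assignment counts. Although the study computed them in Stata, the following Python sketch makes the definitions concrete, checked against the figures reported in the Results (977 correct, 18 incorrect, 5 unclassified for Gender API on the whole sample):

```python
# Sketch: the three accuracy metrics used in the study, computed from the
# counts of correct, incorrect, and unclassified assignments.

def error_coded(incorrect: int, unclassified: int, total: int) -> float:
    """Overall error rate: incorrect plus unclassified over all cases."""
    return (incorrect + unclassified) / total

def error_coded_without_na(incorrect: int, unclassified: int, total: int) -> float:
    """Error rate among classified cases only."""
    return incorrect / (total - unclassified)

def na_coded(unclassified: int, total: int) -> float:
    """Proportion of unclassified cases."""
    return unclassified / total

# Gender API, whole sample: 977 correct, 18 incorrect, 5 unclassified.
print(error_coded(18, 5, 1000))             # 0.023
print(error_coded_without_na(18, 5, 1000))  # 0.0181
print(na_coded(5, 1000))                    # 0.005
```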

All analyses were conducted using Stata version 15.1 (StataCorp, College Station, TX, USA). A two-sided p-value <0.05 was considered statistically significant.

Ethical considerations

This study was conducted in accordance with the principles of the Declaration of Helsinki. It was based exclusively on publicly available information (physicians’ names and genders from institutional websites and registries) and did not involve the collection of personal health-related data. According to Swiss law, such studies do not require approval from a local ethics committee. Therefore, formal approval from an Institutional Review Board or ethics committee was not sought.

Because the study relied solely on publicly available information and did not involve direct contact with individuals, the requirement for informed consent did not apply.

Results

Of the 1,000 physicians included in the study, 523 (52.3%) were women and 477 (47.7%) were men. Table 1 presents the confusion matrices for ChatGPT-5 and Gender API. ChatGPT-5 correctly classified 521 of 523 women (99.6%) and 475 of 477 men (99.6%). Only 4 physicians were misclassified (0.4%), and no cases were unclassified. By contrast, Gender API correctly classified 510 women (97.5%) and 467 men (97.9%). A total of 18 physicians (1.8%) were misclassified, and 5 (0.5%) remained unclassified. The difference in classification performance between ChatGPT-5 and Gender API was statistically significant (McNemar’s test, p<0.001).

Table 1. Confusion matrices for ChatGPT-5 and Gender API in gender classification (n = 1,000 physicians).

Gender detection tool | Classified as women, n (%) | Classified as men, n (%) | Unclassified, n (%)
ChatGPT-5
  Female physicians | 521 (99.6) | 2 (0.4) | 0
  Male physicians | 2 (0.4) | 475 (99.6) | 0
Gender API
  Female physicians | 510 (97.5) | 10 (1.9) | 3 (0.6)
  Male physicians | 8 (1.7) | 467 (97.9) | 2 (0.4)

Overall, 101 of 1,000 names (10.1%) were marked as “checked” by ChatGPT-5, indicating that the model searched the internet rather than assigning gender directly from the name. The proportion of checked names was substantially higher among cases misclassified or unclassified by Gender API: 16 of 23 (69.6%) in this subgroup were checked by ChatGPT-5.

Table 2 summarizes the number of correct, incorrect, and unclassified assignments. ChatGPT-5 achieved 996 correct classifications (99.6%), with 4 errors (0.4%) and no unclassified cases. Gender API, when applied to the whole sample without thresholds, achieved 977 correct classifications (97.7%), 18 errors (1.8%), and 5 unclassified cases (0.5%). When increasing the probability threshold, the number of errors decreased but at the cost of a growing number of unclassified cases: from 11 errors and 20 unclassified at the ≥60% threshold to only 1 error but 65 unclassified at the ≥90% threshold.

Table 2. Number of correct, incorrect, and unclassified assignments for ChatGPT-5 and for Gender API at different probability thresholds (n = 1,000 physicians).

Gender detection tool | Correct, n (%) | Incorrect, n (%) | Unclassified, n (%)
ChatGPT-5 | 996 (99.6) | 4 (0.4) | 0
Gender API (whole sample) | 977 (97.7) | 18 (1.8) | 5 (0.5)
Gender API (≥60% threshold) | 969 (96.9) | 11 (1.1) | 20 (2.0)
Gender API (≥70% threshold) | 967 (96.7) | 8 (0.8) | 25 (2.5)
Gender API (≥80% threshold) | 954 (95.4) | 6 (0.6) | 40 (4.0)
Gender API (≥90% threshold) | 934 (93.4) | 1 (0.1) | 65 (6.5)

Table 3 shows the performance metrics. For ChatGPT-5, the overall error rate (errorCoded) was 0.004, identical to the error rate excluding unclassified cases (errorCodedWithoutNA), as no cases were unclassified. For Gender API, the overall error rate was higher, ranging from 0.023 for the whole sample to 0.066 at the ≥90% threshold. The error rate excluding unclassified cases decreased progressively with stricter thresholds (from 0.018 to 0.001), but this was offset by a higher proportion of unclassified cases (naCoded), increasing from 0.005 for the whole sample to 0.065 at the ≥90% threshold.

Table 3. Performance metrics for ChatGPT-5 and for Gender API at different probability thresholds (n = 1,000 physicians).

Gender detection tool | errorCoded | errorCodedWithoutNA | naCoded
ChatGPT-5 | 0.0040 | 0.0040 | 0
Gender API (whole sample) | 0.0230 | 0.0181 | 0.0050
Gender API (≥60% threshold) | 0.0310 | 0.0112 | 0.0200
Gender API (≥70% threshold) | 0.0330 | 0.0082 | 0.0250
Gender API (≥80% threshold) | 0.0460 | 0.0063 | 0.0400
Gender API (≥90% threshold) | 0.0660 | 0.0011 | 0.0650

Discussion

Main findings

In this study based on 1,000 physicians practicing in Switzerland, ChatGPT-5 outperformed Gender API in inferring gender from names. ChatGPT-5 achieved an overall error rate (errorCoded) of 0.4% with no unclassified cases, compared with 2.3% for Gender API in the whole sample. While applying probability thresholds to Gender API reduced misclassifications, it also led to a substantial increase in unclassified cases. Importantly, ChatGPT-5 relied on internet verification (“checked” names) in only 10% of cases overall, but in 70% of the names that Gender API either misclassified or could not classify. We did not test reproducibility in this study, as this aspect has already been addressed in two previous investigations, both of which found ChatGPT to provide highly consistent results.15,16

Comparison with existing literature

At least three prior studies have benchmarked or contextualized ChatGPT for gender inference. First, using ChatGPT-3.5 and ChatGPT-4 on the full sample of 6,131 Swiss physicians, we found ≤1.5% misclassifications, 0% nonclassifications, and almost perfect agreement across two runs for both versions (κ>0.98), with errorCoded values of 0.012 and 0.014 across the two runs for ChatGPT-3.5, and 0.015 in both runs for ChatGPT-4.16 Second, Goyanes et al. published a methodological procedure describing how to implement gender inference with ChatGPT, NamSor, and Gender API; this paper offers practical guidance but does not present head-to-head accuracy outcomes.18 Third, Domínguez-Díaz et al. compared ChatGPT-3.5 and ChatGPT-4o with NamSor and Gender API on 5,779 names, reporting low misclassification rates for all tools: errorCoded 0.070 for NamSor, 0.072 for Gender API, 0.058 for ChatGPT-3.5, and 0.043 for ChatGPT-4o (for ChatGPT, based on the mean of twenty runs).15 Stability was assessed across the twenty runs, yielding a mean κ of 0.87 for ChatGPT-3.5 and 0.91 for ChatGPT-4o.

Our results extend this literature: ChatGPT-5 reached 99.6% accuracy with 0% unclassified cases (errorCoded 0.004) and outperformed Gender API across thresholds, while resorting to web verification in 10% of all names and in 70% of cases that Gender API misclassified or left unclassified, suggesting that ChatGPT-5, combined with the two-step procedure, achieves further gains over earlier ChatGPT implementations. Importantly, the performance metrics for Gender API were almost identical in our study and in the primary study, as expected given that we randomly sampled 1,000 names from the same dataset. Without applying a probability threshold, the errorCoded, errorCodedWithoutNA, and naCoded values were 0.023, 0.018, and 0.005, compared with 0.018, 0.015, and 0.003 in the primary study.1

Implications for practice

The findings of this study suggest that ChatGPT-5 can serve as a highly effective tool for gender inference in bibliometric and epidemiologic research. Compared with Gender API, ChatGPT-5 offers the combined advantage of higher accuracy and the absence of unclassified cases, which are often the most resource-intensive to resolve manually. The “checked” feature is of particular interest: in practice, the model resorted to internet-based verification primarily for the most difficult cases, precisely where traditional tools are most likely to fail. This adaptive strategy reflects a strength of LLMs and could reduce the need for additional post-processing or the use of multiple tools in tandem. However, in the present implementation, queries were conducted in batches of 10 names at a time, which is time-consuming for large-scale datasets. For practical adoption, future applications should integrate automated pipelines allowing the entire dataset to be processed at once.

In general internal medicine, the use of accurate gender inference tools is particularly relevant for both workforce monitoring and research. General internal medicine plays a central role in ensuring equitable access to healthcare, and reliable data on the gender distribution of general practitioners are essential to identify disparities in recruitment, retention, career progression, and leadership opportunities. In research, gender analyses increasingly inform the study of authorship, funding, and participation in general internal medicine. Inaccurate or incomplete data may bias analyses, obscure inequities, and weaken the evidence base. By providing near-perfect accuracy without unclassified cases, ChatGPT-5 offers a robust solution that can strengthen workforce monitoring, support diversity and inclusiveness, and advance general internal medicine research and policy.

Limitations

This study has several limitations. First, it was conducted in a single country, albeit one that is linguistically and culturally diverse, with a substantial proportion of non-Swiss physicians. Generalizability to contexts with higher proportions of Asian, Middle Eastern, or other non-Western names remains to be confirmed. Second, the approach tested here relied on a specific prompt design and one-shot querying of ChatGPT-5; whether alternative designs would yield similar or better results remains unknown. In addition, the method required queries to be run in batches of 10 names, which limits scalability and increases processing time. Future studies should evaluate automated solutions capable of handling entire datasets in a single run. Third, the method, by construction, imposes a binary view of gender (female/male), which does not capture the complexity of gender identity and risks marginalizing non-binary or transgender individuals. Although this limitation applies equally to existing gender inference tools, it is important to emphasize that gender detection from names should not replace self-identification whenever feasible.

Conclusion

In summary, this study shows that ChatGPT-5 substantially outperforms Gender API in inferring gender from physicians’ names in Switzerland, achieving near-perfect accuracy with no unclassified cases. The ability of the model to selectively seek external verification for ambiguous names is a key advantage that addresses the main weakness of existing tools, supporting its use as a reliable method for gender inference in large-scale datasets. However, because the current implementation required querying names in small batches, future work should focus on automated solutions that can process entire datasets efficiently while also validating performance in more diverse populations.
