Keywords
AI, artificial intelligence, ChatGPT, comparison, gender, Gender API, gender inference, medicine, name, name-to-gender, natural language processing
Gender inference from names is widely used in bibliometric and epidemiologic research, including in general internal medicine. Traditional tools such as Gender API and NamSor are considered accurate but remain limited by misclassifications and unclassified cases. Recent studies suggest that ChatGPT may perform comparably to these tools. We aimed to test whether a two-step procedure could further improve ChatGPT’s performance.
We evaluated ChatGPT-5 against Gender API using a random sample of 1,000 Swiss physicians. A two-step one-shot prompt was applied: (1) assign gender directly from the name if it can be reliably determined; (2) otherwise, verify using an internet search. Gender API was applied to the same dataset with no threshold and at probability thresholds of ≥60%, ≥70%, ≥80%, and ≥90%. Confusion matrices, McNemar’s test, and accuracy metrics (errorCoded, errorCodedWithoutNA, naCoded) were computed.
Of 1,000 physicians, 523 (52.3%) were women and 477 (47.7%) were men. ChatGPT-5 achieved 996 correct classifications (99.6%), with 4 errors and no unclassified cases, whereas Gender API (whole sample) achieved 977 correct classifications (97.7%), 18 errors, and 5 unclassified cases (p<0.001). At higher thresholds, Gender API reduced errors but produced up to 6.5% unclassified cases. Overall error rates (errorCoded) were 0.4% for ChatGPT-5 versus 2.3% for Gender API. ChatGPT-5 marked 10.1% of names as “checked” through internet verification, increasing to 69.6% among cases that Gender API misclassified or left unclassified.
ChatGPT-5 substantially outperformed Gender API in gender inference from physicians’ names, achieving near-perfect accuracy without unclassified cases. Its adaptive use of internet verification for difficult names may offer a robust and efficient approach for large-scale research.
Name-to-gender inference services, which determine gender from names, have become an increasingly common methodological tool in medical, social science, and bibliometric research.1 These tools enable investigators to rapidly assess gender representation across large datasets at relatively low cost, thereby supporting studies on equity, diversity, and inclusion in science and medicine, including in general internal medicine. Applications have included the evaluation of gender disparities in editorial boards,2–5 research funding,6,7 and scientific authorship,8–12 consistently highlighting the persistent underrepresentation of women in leadership and senior authorship roles. Among the best-known services are Gender API, NamSor, and Genderize, with reported misclassification rates of ≤5%.1,13,14 Nevertheless, performance varies across cultural and linguistic contexts, and classification of unisex or rare names remains a challenge. Furthermore, most tools rely on proprietary databases and contextual metadata, which may limit reproducibility and transparency.
The advent of artificial intelligence (AI) and large language models (LLMs) such as ChatGPT opens new possibilities for gender inference. Early studies have suggested that ChatGPT performs comparably to the best existing tools, with the added benefit of reproducibility.15,16 However, few investigations have systematically evaluated its accuracy in this domain.
In the present study, we extend this line of research by using the most recent version of ChatGPT (ChatGPT-5) in a structured two-step “one-shot” query: (1) if gender can be reliably determined from the name alone, it is assigned directly; (2) if not, ChatGPT searches the internet to verify the gender. By combining linguistic knowledge with web-based verification, this approach aims to maximize classification performance. We hypothesize that ChatGPT-5, implemented with this two-step procedure, will achieve higher accuracy than Gender API.
This study builds on previous work that evaluated the performance of four name-to-gender inference services: Gender API, NamSor, Genderize, and Wiki-Gendersort.1 It was conducted in Switzerland, a multilingual and multicultural country with four national languages (German, French, Italian, and Romansh) and a high proportion of foreign-trained physicians (36% overall; 33% in outpatient medicine and 40% in hospital medicine).1 The most frequent countries of origin among foreign physicians are Germany (53%), Italy (9%), France (7%), and Austria (6%).
The database used for the current work was described in detail in the primary study.1 In brief, it was compiled by merging several sources: (1) physicians and trainee physicians affiliated with the University Hospital of Geneva, the largest hospital in Switzerland, (2) senior physicians working in Swiss university hospitals, and (3) community-based physicians. After deduplication, the database comprised 6,131 physicians, of whom 50.3% were women.
In our primary study, the probable origin of first names was inferred using nationalize.io, which assigned an origin to 85% of names.1 The majority of names (88%) originated from Western countries or countries where Western languages are widely spoken, while 12% originated from Asian or Arabic-speaking countries, reflecting both the predominance of Western names and the multicultural nature of the physician population in Switzerland.
For the present study, we randomly selected a subsample of 1,000 physicians. For each physician, first name, last name, and gender (as recorded in the source database) were available and used as the reference standard.
Before analyzing this sample, we conducted pretests with an independent random set of 100 physicians, who were not included in the final dataset. To determine the optimal batch size for prompting ChatGPT, we submitted lists of 1, 5, 10, 20, 50, and 100 names. Each test was repeated several times, always in a new chat session. Performance was assessed by comparing ChatGPT’s output against the reference gender recorded in the source database, with discrepancies manually checked. The most consistent and accurate performance was obtained with 10 names per query. This setting was therefore retained for the study, which consisted of 100 queries of 10 names each, with a new chat session for every query.
We used ChatGPT-5 (OpenAI) in a one-shot setting, a choice motivated by previous evidence of its reliability for gender determination15,16 and by feasibility considerations given the large number of queries required. To increase the difficulty of the task and to avoid potential reliance on publicly available registries of Swiss physicians, individuals were referred to as “researchers” rather than “physicians” in the prompt. This wording was also consistent with the Swiss context, in which physicians are expected to publish scientific articles as part of their certification process.
The exact prompt used was: “Here is a list of 10 researchers (first name + last name). Your task is to determine their gender (Female / Male). Instructions: (1) If the gender can be reliably determined from the researcher’s name alone, assign it directly. (2) If the gender cannot be reliably determined from the name, search the internet to verify it. (3) Output exactly 10 lines, one per researcher, in the following format: gender (Female/Male), checked (Yes/No), first name, last name. (4) Do not include explanations, reasoning, or tables. Only output the 10 lines”.
All queries were conducted manually in a new chat window to avoid memory effects or contamination from previous prompts. Internet verification was allowed only when the model indicated that the name alone was insufficient for reliable classification.
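For illustration, the batching scheme and prompt construction described above can be sketched in Python. The helper names (`make_batches`, `build_prompt`) and the placeholder names are hypothetical; in the study, each prompt was pasted manually into a fresh chat session rather than generated or submitted by code.

```python
# Sketch of the study's batching scheme: names are grouped into batches
# of 10 and each batch is embedded in the two-step one-shot prompt.
# Helper names are illustrative; prompts were submitted manually.

PROMPT_TEMPLATE = (
    "Here is a list of {n} researchers (first name + last name). "
    "Your task is to determine their gender (Female / Male). "
    "Instructions: (1) If the gender can be reliably determined from the "
    "researcher's name alone, assign it directly. (2) If the gender cannot "
    "be reliably determined from the name, search the internet to verify it. "
    "(3) Output exactly {n} lines, one per researcher, in the following "
    "format: gender (Female/Male), checked (Yes/No), first name, last name. "
    "(4) Do not include explanations, reasoning, or tables. "
    "Only output the {n} lines.\n\n{names}"
)

def make_batches(names, batch_size=10):
    """Split a list of (first, last) tuples into batches of batch_size."""
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]

def build_prompt(batch):
    """Format one batch of names into the study's one-shot prompt."""
    listing = "\n".join(f"{first} {last}" for first, last in batch)
    return PROMPT_TEMPLATE.format(n=len(batch), names=listing)
```

With 1,000 physicians and a batch size of 10, this yields the 100 queries described above.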
As a comparator, we used Gender API (gender-api.com), identified as the most accurate name-to-gender inference tool in our primary study.1 An Excel file containing the first and last names of the 1,000 physicians was uploaded. For each case, Gender API returned an assigned gender (female, male, or unknown), a probability score, and the sample size underlying the inference. Analyses were conducted on the full dataset as well as at multiple probability thresholds (≥60%, ≥70%, ≥80%, and ≥90%). When the probability score fell below the defined threshold, the case was coded as “unclassified” for the purposes of analysis.
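The threshold coding applied to the Gender API output can be sketched as follows. The record field names (`gender`, `accuracy`) are assumptions loosely mirroring the variables in the shared dataset, not the tool's exact response schema.

```python
# Sketch of the threshold coding described above: a case keeps its
# assigned gender only if its probability score meets the threshold;
# otherwise it is coded as "unclassified". Field names are illustrative.

def apply_threshold(records, threshold=None):
    """Return gender labels, coding low-confidence cases as 'unclassified'.

    records: list of dicts with 'gender' ('female'/'male'/'unknown')
             and 'accuracy' (probability score in percent).
    threshold: minimum probability in percent, or None for no threshold.
    """
    labels = []
    for rec in records:
        if rec["gender"] == "unknown":
            labels.append("unclassified")
        elif threshold is not None and rec["accuracy"] < threshold:
            labels.append("unclassified")
        else:
            labels.append(rec["gender"])
    return labels
```

Running this once with `threshold=None` and once per threshold (60, 70, 80, 90) reproduces the five analysis conditions.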
For both ChatGPT-5 and Gender API, we constructed confusion matrices and compared performance using McNemar’s test. For each tool, we calculated the number and proportion of correct, incorrect, and unclassified assignments. Incorrect assignments are also referred to as misclassifications, and unclassified assignments as nonclassifications. We also computed standard performance metrics used in accuracy studies of gender inference1,13,17,18:
- errorCoded (overall error rate), calculated as the number of incorrect plus unclassified assignments divided by the total number of cases.
- errorCodedWithoutNA (error rate excluding unclassified cases), calculated as the number of incorrect assignments divided by the number of cases with a classification.
- naCoded (proportion of unclassified cases), calculated as the number of unclassified assignments divided by the total number of cases.
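The three metrics defined above can be computed as in the following sketch, where an unclassified case is represented as `None` and the function name is illustrative.

```python
# Sketch of the three benchmark metrics defined above.
# truth: list of reference genders; predicted: list of assigned genders,
# with None marking an unclassified case.

def gender_metrics(truth, predicted):
    """Compute errorCoded, errorCodedWithoutNA, and naCoded."""
    total = len(truth)
    unclassified = sum(1 for p in predicted if p is None)
    incorrect = sum(1 for t, p in zip(truth, predicted)
                    if p is not None and p != t)
    classified = total - unclassified
    return {
        # incorrect plus unclassified, over all cases
        "errorCoded": (incorrect + unclassified) / total,
        # incorrect over classified cases only
        "errorCodedWithoutNA": incorrect / classified if classified else 0.0,
        # unclassified over all cases
        "naCoded": unclassified / total,
    }
```

For example, 18 misclassifications and 5 unclassified cases out of 1,000 yield errorCoded = 0.023, errorCodedWithoutNA = 18/995 ≈ 0.018, and naCoded = 0.005.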
All analyses were conducted using Stata version 15.1 (StataCorp, College Station, TX, USA). A two-sided p-value <0.05 was considered statistically significant.
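The analyses were run in Stata; purely for illustration, the exact form of McNemar's test on paired classifications can be sketched in Python. Only discordant pairs (cases where exactly one tool is correct) inform the test; the function name and the binomial formulation are a sketch, not the Stata implementation.

```python
# Sketch of McNemar's exact test for paired binary outcomes.
# b: cases where tool A is correct and tool B is not.
# c: cases where tool B is correct and tool A is not.
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # two-sided binomial tail probability with p = 0.5
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

With heavily one-sided discordant counts, as observed here, the resulting p-value falls well below 0.001.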
This study was conducted in accordance with the principles of the Declaration of Helsinki. It was based exclusively on publicly available information (physicians’ names and genders from institutional websites and registries) and did not involve the collection of personal health-related data. According to Swiss law, such studies do not require approval from a local ethics committee. Therefore, formal approval from an Institutional Review Board or ethics committee was not sought.
Because the study relied solely on publicly available information and did not involve direct contact with individuals, the requirement for informed consent did not apply.
Of the 1,000 physicians included in the study, 523 (52.3%) were women and 477 (47.7%) were men. Table 1 presents the confusion matrices for ChatGPT-5 and Gender API. ChatGPT-5 correctly classified 521 of 523 women (99.6%) and 475 of 477 men (99.6%). Only 4 physicians were misclassified (0.4%), and no cases were unclassified. By contrast, Gender API correctly classified 510 women (97.5%) and 467 men (97.9%). A total of 18 physicians (1.8%) were misclassified, and 5 (0.5%) remained unclassified. The difference in classification performance between ChatGPT-5 and Gender API was statistically significant (McNemar’s test, p<0.001).
Overall, 101 of 1,000 names (10.1%) were marked as “checked” by ChatGPT-5, indicating that the model searched the internet rather than assigning gender directly from the name. The proportion of checked names was substantially higher among cases misclassified or unclassified by Gender API: 16 of 23 (69.6%) in this subgroup were checked by ChatGPT-5.
Table 2 summarizes the number of correct, incorrect, and unclassified assignments. ChatGPT-5 achieved 996 correct classifications (99.6%), with 4 errors (0.4%) and no unclassified cases. Gender API, when applied to the whole sample without thresholds, achieved 977 correct classifications (97.7%), 18 errors (1.8%), and 5 unclassified cases (0.5%). When increasing the probability threshold, the number of errors decreased but at the cost of a growing number of unclassified cases: from 11 errors and 20 unclassified at the ≥60% threshold to only 1 error but 65 unclassified at the ≥90% threshold.
Table 3 shows the performance metrics. For ChatGPT-5, the overall error rate (errorCoded) was 0.004, identical to the error rate excluding unclassified cases (errorCodedWithoutNA), as no cases were unclassified. For Gender API, the overall error rate was higher, ranging from 0.023 for the whole sample to 0.066 at the ≥90% threshold. The error rate excluding unclassified cases decreased progressively with stricter thresholds (from 0.018 to 0.001), but this was offset by a higher proportion of unclassified cases (naCoded), increasing from 0.005 for the whole sample to 0.065 at the ≥90% threshold.
In this study based on 1,000 physicians practicing in Switzerland, ChatGPT-5 outperformed Gender API in inferring gender from names. ChatGPT-5 achieved an overall error rate (errorCoded) of 0.4% with no unclassified cases, compared with 2.3% for Gender API in the whole sample. While applying probability thresholds to Gender API reduced misclassifications, it also led to a substantial increase in unclassified cases. Importantly, ChatGPT-5 relied on internet verification (“checked” names) in only 10% of cases overall, but in 70% of the names that Gender API either misclassified or could not classify. We did not test reproducibility in this study, as this aspect has already been addressed in two previous investigations, both of which found ChatGPT to provide highly consistent results.15,16
At least three prior studies have benchmarked or contextualized ChatGPT for gender inference. First, using ChatGPT-3.5 and 4 on the full sample of 6,131 Swiss physicians, we found ≤1.5% misclassifications, 0% nonclassifications, and almost perfect agreement across two runs for both versions (κ>0.98), with errorCoded values of 0.012 and 0.014 for ChatGPT-3.5 and 0.015 for ChatGPT-4 in both runs.16 Second, Goyanes et al. published a methodological procedure describing how to implement gender inference with ChatGPT, NamSor, and Gender-API; this paper offers practical guidance but does not present head-to-head accuracy outcomes.18 Third, Domínguez-Díaz et al. compared ChatGPT-3.5 and 4o with NamSor and Gender API on 5,779 names, reporting low misclassification rates for all tools: errorCoded 0.070 for NamSor, 0.072 for Gender API, 0.058 for ChatGPT-3.5, and 0.043 for ChatGPT-4o (for ChatGPT, based on the mean of twenty runs).15 Stability was assessed across the twenty runs, yielding a mean κ of 0.87 for ChatGPT-3.5 and 0.91 for ChatGPT-4o.
Our results extend this literature: ChatGPT-5 reached 99.6% accuracy with 0% unclassified cases (errorCoded 0.004) and outperformed Gender API across thresholds, while resorting to web verification in 10% of all names and in 70% of cases that Gender API misclassified or left unclassified. This suggests that ChatGPT-5, combined with the two-step procedure, achieves further gains over earlier ChatGPT implementations. Importantly, the performance metrics for Gender API were almost identical in our study and in the primary study, as expected given that we randomly sampled 1,000 names from the same dataset. Without applying a probability threshold, the errorCoded, errorCodedWithoutNA, and naCoded values were 0.023, 0.018, and 0.005, compared with 0.018, 0.015, and 0.003 in the primary study.1
The findings of this study suggest that ChatGPT-5 can serve as a highly effective tool for gender inference in bibliometric and epidemiologic research. Compared with Gender API, ChatGPT-5 offers the combined advantage of higher accuracy and the absence of unclassified cases, which are often the most resource-intensive to resolve manually. The “checked” feature is of particular interest: in practice, the model resorted to internet-based verification primarily for the most difficult cases, precisely where traditional tools are most likely to fail. This adaptive strategy reflects a strength of LLMs and could reduce the need for additional post-processing or the use of multiple tools in tandem. However, in the present implementation, queries were conducted in batches of 10 names at a time, which is time-consuming for large-scale datasets. For practical adoption, future applications should integrate automated pipelines allowing the entire dataset to be processed at once.
In general internal medicine, the use of accurate gender inference tools is particularly relevant for both workforce monitoring and research. General internal medicine plays a central role in ensuring equitable access to healthcare, and reliable data on the gender distribution of general practitioners are essential to identify disparities in recruitment, retention, career progression, and leadership opportunities. In research, gender analyses increasingly inform the study of authorship, funding, and participation in general internal medicine. Inaccurate or incomplete data may bias analyses, obscure inequities, and weaken the evidence base. By providing near-perfect accuracy without unclassified cases, ChatGPT-5 offers a robust solution that can strengthen workforce monitoring, support diversity and inclusiveness, and advance general internal medicine research and policy.
This study has several limitations. First, it was conducted in a single country, albeit one that is linguistically and culturally diverse, with a substantial proportion of non-Swiss physicians. Generalizability to contexts with higher proportions of Asian, Middle Eastern, or other non-Western names remains to be confirmed. Second, the approach tested here relied on a specific prompt design and one-shot querying of ChatGPT-5; whether alternative designs would yield similar or better results remains unknown. In addition, the method required queries to be run in batches of 10 names, which limits scalability and increases processing time. Future studies should evaluate automated solutions capable of handling entire datasets in a single run. Third, the method, by construction, imposes a binary view of gender (female/male), which does not capture the complexity of gender identity and risks marginalizing non-binary or transgender individuals. Although this limitation applies equally to existing gender inference tools, it is important to emphasize that gender detection from names should not replace self-identification whenever feasible.
In summary, this study shows that ChatGPT-5 substantially outperforms Gender API in inferring gender from physicians’ names in Switzerland, achieving near-perfect accuracy with no unclassified cases. The ability of the model to selectively seek external verification for ambiguous names is a key advantage that addresses the main weakness of existing tools, supporting its use as a reliable method for gender inference in large-scale datasets. However, because the current implementation required querying names in small batches, future work should focus on automated solutions that can process entire datasets efficiently while also validating performance in more diverse populations.
The dataset underlying this study is openly available in the Open Science Framework (OSF) repository under a CC-BY 4.0 license: https://doi.org/10.17605/OSF.IO/6KH3A.19
The file includes the following variables: gender_real (reference gender from the source database), gender_ga (gender classification returned by Gender API), accuracy_ga (accuracy of Gender API classification), gender_chatgpt (gender classification returned by ChatGPT-5), and checked (whether ChatGPT-5 performed internet verification: Yes/No).