Research Article

How to Achieve Near-Perfect Gender Inference Accuracy in Medicine Using ChatGPT

[version 1; peer review: awaiting peer review]
PUBLISHED 03 Nov 2025


Abstract

Background

Gender inference from names is widely used in bibliometric and epidemiologic research, including in general internal medicine. Traditional tools such as Gender API and NamSor are considered accurate but remain limited by misclassifications and unclassified cases. Recent studies suggest that ChatGPT may perform comparably to these tools. We aimed to test whether a two-step procedure could further improve ChatGPT’s performance.

Methods

We evaluated ChatGPT-5 against Gender API using a random sample of 1,000 Swiss physicians. A two-step one-shot prompt was applied: (1) assign gender directly from the name if reliable; (2) otherwise, verify using the internet. Gender API was applied to the same dataset with no threshold and at probability thresholds of ≥60%, ≥70%, ≥80%, and ≥90%. Confusion matrices, McNemar’s test, and accuracy metrics (errorCoded, errorCodedWithoutNA, naCoded) were computed.

Results

Of 1,000 physicians, 523 (52.3%) were women and 477 (47.7%) were men. ChatGPT-5 achieved 996 correct classifications (99.6%), with 4 errors and no unclassified cases, whereas Gender API (whole sample) achieved 977 correct classifications (97.7%), 18 errors, and 5 unclassified cases (p<0.001). At higher thresholds, Gender API reduced errors but produced up to 6.5% unclassified cases. Overall error rates (errorCoded) were 0.4% for ChatGPT-5 versus 2.3% for Gender API. ChatGPT-5 marked 10.1% of names as “checked” through internet verification, increasing to 69.6% among cases that Gender API misclassified or left unclassified.

Conclusion

ChatGPT-5 substantially outperformed Gender API in gender inference from physicians’ names, achieving near-perfect accuracy without unclassified cases. Its adaptive use of internet verification for difficult names may offer a robust and efficient approach for large-scale research.

Keywords

AI, artificial intelligence, ChatGPT, comparison, gender, Gender API, gender inference, medicine, name, name-to-gender, natural language processing

Introduction

Name-to-gender inference services, which determine gender from names, have become an increasingly common methodological tool in medical, social science, and bibliometric research.1 These tools enable investigators to rapidly assess gender representation across large datasets at relatively low cost, thereby supporting studies on equity, diversity, and inclusion in science and medicine, including in general internal medicine. Applications have included the evaluation of gender disparities in editorial boards,2–5 research funding,6,7 and scientific authorship,8–12 consistently highlighting the persistent underrepresentation of women in leadership and senior authorship roles. Among the best-known services are Gender API, NamSor, and Genderize, with reported misclassification rates of ≤5%.1,13,14 Nevertheless, performance varies across cultural and linguistic contexts, and classification of unisex or rare names remains a challenge. Furthermore, most tools rely on proprietary databases and contextual metadata, which may limit reproducibility and transparency.

The advent of artificial intelligence (AI) and large language models (LLMs) such as ChatGPT opens new possibilities for gender inference. Early studies have suggested that ChatGPT performs comparably to the best existing tools, with the added benefit of reproducibility.15,16 However, few investigations have systematically evaluated its accuracy in this domain.

In the present study, we extend this line of research by using the most recent version of ChatGPT (ChatGPT-5) in a structured two-step “one-shot” query: (1) if gender can be reliably determined from the name alone, it is assigned directly; (2) if not, ChatGPT searches the internet to verify the gender. By combining linguistic knowledge with web-based verification, this approach aims to maximize classification performance. We hypothesize that ChatGPT-5, implemented with this two-step procedure, will achieve higher accuracy than Gender API.

Methods

Study population

This study builds on previous work that evaluated the performance of four name-to-gender inference services: Gender API, NamSor, Genderize, and Wiki-Gendersort.1 It was conducted in Switzerland, a multilingual and multicultural country with four national languages (German, French, Italian, and Romansh) and a high proportion of foreign-trained physicians (36% overall; 33% in outpatient medicine and 40% in hospital medicine).1 The most frequent countries of origin among foreign physicians are Germany (53%), Italy (9%), France (7%), and Austria (6%).

The database used for the current work was described in detail in the primary study.1 In brief, it was compiled by merging several sources: (1) physicians and trainee physicians affiliated with the University Hospital of Geneva, the largest hospital in Switzerland, (2) senior physicians working in Swiss university hospitals, and (3) community-based physicians. After deduplication, the database comprised 6,131 physicians, of whom 50.3% were women.

In our primary study, the probable origin of first names was inferred using nationalize.io, which assigned an origin to 85% of names.1 The majority of names (88%) originated from Western countries or countries where Western languages are widely spoken, while 12% originated from Asian or Arabic-speaking countries, reflecting both the predominance of Western names and the multicultural nature of the physician population in Switzerland.

For the present study, we randomly selected a subsample of 1,000 physicians. For each physician, first name, last name, and gender (as recorded in the source database) were available and used as the reference standard.

Pretesting

Before analyzing this sample, we conducted pretests with an independent random set of 100 physicians, who were not included in the final dataset. To determine the optimal batch size for prompting ChatGPT, we submitted lists of 1, 5, 10, 20, 50, and 100 names. Each test was repeated several times, always in a new chat session. Performance was assessed by comparing ChatGPT’s output against the reference gender recorded in the source database, with discrepancies manually checked. The most consistent and accurate performance was obtained with 10 names per query. This setting was therefore retained for the study, which consisted of 100 queries of 10 names each, with a new chat session for every query.
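For readers who wish to replicate the pretest, the bookkeeping reduces to comparing each batch-size condition against the reference standard. Below is a minimal Python sketch under the assumption that the manually collected ChatGPT outputs have been transcribed into the hypothetical containers `reference` and `predictions_by_batch_size`; neither is part of the study's actual tooling.

```python
# Sketch of the batch-size pretest bookkeeping. In the study, outputs were
# collected manually in fresh chat sessions; the containers below are
# hypothetical placeholders for those transcribed results.

from typing import Dict, List

def accuracy(predicted: List[str], reference: List[str]) -> float:
    """Proportion of predictions that match the reference standard."""
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)

# reference[i]: gender recorded in the source database for pretest name i.
reference: List[str] = []  # 100 entries, e.g. "Female", "Male", ...

# predictions_by_batch_size[k][i]: ChatGPT's answer for name i when the
# 100 pretest names were submitted k names per query.
predictions_by_batch_size: Dict[int, List[str]] = {}

for k in (1, 5, 10, 20, 50, 100):
    if k in predictions_by_batch_size:
        acc = accuracy(predictions_by_batch_size[k], reference)
        print(f"batch size {k:>3}: accuracy {acc:.3f}")
```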

ChatGPT procedure

We used ChatGPT-5 (OpenAI) in a one-shot setting, a choice motivated by previous evidence of its reliability for gender determination15,16 and by feasibility considerations given the large number of queries required. To increase the difficulty of the task and to avoid potential reliance on publicly available registries of Swiss physicians, individuals were referred to as “researchers” rather than “physicians” in the prompt. This wording was also consistent with the Swiss context, in which physicians are expected to publish scientific articles as part of their certification process.

The exact prompt used was: “Here is a list of 10 researchers (first name + last name). Your task is to determine their gender (Female / Male). Instructions: (1) If the gender can be reliably determined from the researcher’s name alone, assign it directly. (2) If the gender cannot be reliably determined from the name, search the internet to verify it. (3) Output exactly 10 lines, one per researcher, in the following format: gender (Female/Male), checked (Yes/No), first name, last name. (4) Do not include explanations, reasoning, or tables. Only output the 10 lines”.

All queries were conducted manually in a new chat window to avoid memory effects or contamination from previous prompts. Internet verification was allowed only when the model indicated that the name alone was insufficient for reliable classification.
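Although all queries were run manually in the ChatGPT interface, the prompt assembly and output parsing are mechanical. The following Python sketch builds the exact prompt quoted above for a batch of 10 names and parses the expected four-field output lines; the function names (`build_prompt`, `parse_output`) are illustrative, not part of the study's workflow.

```python
# Sketch: assemble the study's batched prompt and parse the expected output.
# In the study, each prompt was pasted into a fresh ChatGPT session by hand.

from typing import List, Tuple

PROMPT_TEMPLATE = (
    "Here is a list of 10 researchers (first name + last name). "
    "Your task is to determine their gender (Female / Male). Instructions: "
    "(1) If the gender can be reliably determined from the researcher's name "
    "alone, assign it directly. (2) If the gender cannot be reliably "
    "determined from the name, search the internet to verify it. "
    "(3) Output exactly 10 lines, one per researcher, in the following "
    "format: gender (Female/Male), checked (Yes/No), first name, last name. "
    "(4) Do not include explanations, reasoning, or tables. "
    "Only output the 10 lines.\n\n{names}"
)

def build_prompt(batch: List[Tuple[str, str]]) -> str:
    """Assemble the one-shot prompt for a batch of 10 (first, last) names."""
    assert len(batch) == 10, "the study used exactly 10 names per query"
    names = "\n".join(f"{first} {last}" for first, last in batch)
    return PROMPT_TEMPLATE.format(names=names)

def parse_output(text: str) -> List[dict]:
    """Parse lines of the form 'Female, Yes, Marie, Dupont' (hypothetical)."""
    records = []
    for line in text.strip().splitlines():
        gender, checked, first, last = [f.strip() for f in line.split(",", 3)]
        records.append({"gender": gender, "checked": checked == "Yes",
                        "first_name": first, "last_name": last})
    return records
```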

Comparator: Gender API

As a comparator, we used Gender API (gender-api.com), identified as the most accurate name-to-gender inference tool in our primary study.1 An Excel file containing the first and last names of the 1,000 physicians was uploaded. For each case, Gender API returned an assigned gender (female, male, or unknown), a probability score, and the sample size underlying the inference. Analyses were conducted on the full dataset as well as at multiple probability thresholds (≥60%, ≥70%, ≥80%, and ≥90%). When the probability score fell below the defined threshold, the case was coded as “unclassified” for the purposes of analysis.
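The study used Gender API's file-upload interface rather than its API. For a programmatic alternative, the sketch below queries the service's classic REST endpoint and applies the same threshold coding; the endpoint path and response fields (`gender`, `accuracy`) reflect the public documentation as we understand it and should be verified against the current API before use.

```python
# Sketch: query Gender API per first name and apply a probability threshold.
# Endpoint and field names follow the service's classic interface as publicly
# documented; treat them as assumptions and check the current docs. The study
# itself used the Excel-upload interface, not this API.

import requests

API_KEY = "YOUR_GENDER_API_KEY"  # placeholder

def classify(first_name: str, threshold: int = 0) -> str:
    """Return 'female', 'male', or 'unclassified' for a given first name."""
    resp = requests.get(
        "https://gender-api.com/get",
        params={"name": first_name, "key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    gender = data.get("gender")          # 'female', 'male', or 'unknown'
    accuracy = data.get("accuracy", 0)   # probability score, 0-100
    if gender not in ("female", "male") or accuracy < threshold:
        return "unclassified"
    return gender

# Replicating the study's analyses at the five threshold settings:
# for t in (0, 60, 70, 80, 90):
#     labels = [classify(name, threshold=t) for name in first_names]
```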

Statistical analysis

For both ChatGPT-5 and Gender API, we constructed confusion matrices and compared performance using McNemar’s test. For each tool, we calculated the number and proportion of correct, incorrect, and unclassified assignments. Incorrect assignments are also referred to as misclassifications, and unclassified assignments as nonclassifications. We also computed standard performance metrics used in accuracy studies of gender inference1,13,17,18 (a brief computational sketch follows these definitions):

  • errorCoded (overall error rate), calculated as the number of incorrect plus unclassified assignments divided by the total number of cases.

  • errorCodedWithoutNA (error rate excluding unclassified cases), calculated as the number of incorrect assignments divided by the number of cases with a classification.

  • naCoded (proportion of unclassified cases), calculated as the number of unclassified assignments divided by the total number of cases.
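Each metric is a simple ratio over the assignment counts. Although the study computed them in Stata, the following Python sketch makes the definitions concrete, checked against the figures reported in the Results (977 correct, 18 incorrect, 5 unclassified for Gender API on the whole sample):

```python
# Sketch: the three accuracy metrics used in the study, computed from the
# counts of correct, incorrect, and unclassified assignments.

def error_coded(incorrect: int, unclassified: int, total: int) -> float:
    """Overall error rate: incorrect plus unclassified over all cases."""
    return (incorrect + unclassified) / total

def error_coded_without_na(incorrect: int, unclassified: int, total: int) -> float:
    """Error rate among classified cases only."""
    return incorrect / (total - unclassified)

def na_coded(unclassified: int, total: int) -> float:
    """Proportion of unclassified cases."""
    return unclassified / total

# Gender API, whole sample: 977 correct, 18 incorrect, 5 unclassified.
print(error_coded(18, 5, 1000))             # 0.023
print(error_coded_without_na(18, 5, 1000))  # 0.0181
print(na_coded(5, 1000))                    # 0.005
```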

All analyses were conducted using Stata version 15.1 (StataCorp, College Station, TX, USA). A two-sided p-value <0.05 was considered statistically significant.

Ethical considerations

This study was conducted in accordance with the principles of the Declaration of Helsinki. It was based exclusively on publicly available information (physicians’ names and genders from institutional websites and registries) and did not involve the collection of personal health-related data. According to Swiss law, such studies do not require approval from a local ethics committee. Therefore, formal approval from an Institutional Review Board or ethics committee was not sought.

Because the study relied solely on publicly available information and did not involve direct contact with individuals, the requirement for informed consent did not apply.

Results

Of the 1,000 physicians included in the study, 523 (52.3%) were women and 477 (47.7%) were men. Table 1 presents the confusion matrices for ChatGPT-5 and Gender API. ChatGPT-5 correctly classified 521 of 523 women (99.6%) and 475 of 477 men (99.6%). Only 4 physicians were misclassified (0.4%), and no cases were unclassified. By contrast, Gender API correctly classified 510 women (97.5%) and 467 men (97.9%). A total of 18 physicians (1.8%) were misclassified, and 5 (0.5%) remained unclassified. The difference in classification performance between ChatGPT-5 and Gender API was statistically significant (McNemar’s test, p<0.001).

Table 1. Confusion matrices for ChatGPT-5 and Gender API in gender classification (n = 1,000 physicians).

Gender detection tool | Classified as women, n (%) | Classified as men, n (%) | Unclassified, n (%)
ChatGPT-5
  Female physicians | 521 (99.6) | 2 (0.4) | 0
  Male physicians | 2 (0.4) | 475 (99.6) | 0
Gender API
  Female physicians | 510 (97.5) | 10 (1.9) | 3 (0.6)
  Male physicians | 8 (1.7) | 467 (97.9) | 2 (0.4)

Overall, 101 of 1,000 names (10.1%) were marked as “checked” by ChatGPT-5, indicating that the model searched the internet rather than assigning gender directly from the name. The proportion of checked names was substantially higher among cases misclassified or unclassified by Gender API: 16 of 23 (69.6%) in this subgroup were checked by ChatGPT-5.

Table 2 summarizes the number of correct, incorrect, and unclassified assignments. ChatGPT-5 achieved 996 correct classifications (99.6%), with 4 errors (0.4%) and no unclassified cases. Gender API, when applied to the whole sample without thresholds, achieved 977 correct classifications (97.7%), 18 errors (1.8%), and 5 unclassified cases (0.5%). When increasing the probability threshold, the number of errors decreased but at the cost of a growing number of unclassified cases: from 11 errors and 20 unclassified at the ≥60% threshold to only 1 error but 65 unclassified at the ≥90% threshold.

Table 2. Number of correct, incorrect, and unclassified assignments for ChatGPT-5 and for Gender API at different probability thresholds (n = 1,000 physicians).

Gender detection tool | Correct, n (%) | Incorrect, n (%) | Unclassified, n (%)
ChatGPT-5 | 996 (99.6) | 4 (0.4) | 0
Gender API (whole sample) | 977 (97.7) | 18 (1.8) | 5 (0.5)
Gender API (≥60% threshold) | 969 (96.9) | 11 (1.1) | 20 (2.0)
Gender API (≥70% threshold) | 967 (96.7) | 8 (0.8) | 25 (2.5)
Gender API (≥80% threshold) | 954 (95.4) | 6 (0.6) | 40 (4.0)
Gender API (≥90% threshold) | 934 (93.4) | 1 (0.1) | 65 (6.5)

Table 3 shows the performance metrics. For ChatGPT-5, the overall error rate (errorCoded) was 0.004, identical to the error rate excluding unclassified cases (errorCodedWithoutNA), as no cases were unclassified. For Gender API, the overall error rate was higher, ranging from 0.023 for the whole sample to 0.066 at the ≥90% threshold. The error rate excluding unclassified cases decreased progressively with stricter thresholds (from 0.018 to 0.001), but this was offset by a higher proportion of unclassified cases (naCoded), increasing from 0.005 for the whole sample to 0.065 at the ≥90% threshold.

Table 3. Performance metrics for ChatGPT-5 and for Gender API at different probability thresholds (n = 1,000 physicians).

Gender detection tool | errorCoded | errorCodedWithoutNA | naCoded
ChatGPT-5 | 0.0040 | 0.0040 | 0
Gender API (whole sample) | 0.0230 | 0.0181 | 0.0050
Gender API (≥60% threshold) | 0.0310 | 0.0112 | 0.0200
Gender API (≥70% threshold) | 0.0330 | 0.0082 | 0.0250
Gender API (≥80% threshold) | 0.0460 | 0.0063 | 0.0400
Gender API (≥90% threshold) | 0.0660 | 0.0011 | 0.0650

Discussion

Main findings

In this study based on 1,000 physicians practicing in Switzerland, ChatGPT-5 outperformed Gender API in inferring gender from names. ChatGPT-5 achieved an overall error rate (errorCoded) of 0.4% with no unclassified cases, compared with 2.3% for Gender API in the whole sample. While applying probability thresholds to Gender API reduced misclassifications, it also led to a substantial increase in unclassified cases. Importantly, ChatGPT-5 relied on internet verification (“checked” names) in only 10% of cases overall, but in 70% of the names that Gender API either misclassified or could not classify. We did not test reproducibility in this study, as this aspect has already been addressed in two previous investigations, both of which found ChatGPT to provide highly consistent results.15,16

Comparison with existing literature

At least three prior studies have benchmarked or contextualized ChatGPT for gender inference. First, using ChatGPT-3.5 and ChatGPT-4 on the full sample of 6,131 Swiss physicians, we found ≤1.5% misclassifications, 0% nonclassifications, and almost perfect agreement across two runs for both versions (κ>0.98), with errorCoded values of 0.012 and 0.014 across the two runs for ChatGPT-3.5, and 0.015 in both runs for ChatGPT-4.16 Second, Goyanes et al. published a methodological procedure describing how to implement gender inference with ChatGPT, NamSor, and Gender API; this paper offers practical guidance but does not present head-to-head accuracy outcomes.18 Third, Domínguez-Díaz et al. compared ChatGPT-3.5 and ChatGPT-4o with NamSor and Gender API on 5,779 names, reporting low misclassification rates for all tools: errorCoded 0.070 for NamSor, 0.072 for Gender API, 0.058 for ChatGPT-3.5, and 0.043 for ChatGPT-4o (for ChatGPT, based on the mean of twenty runs).15 Stability was assessed across the twenty runs, yielding a mean κ of 0.87 for ChatGPT-3.5 and 0.91 for ChatGPT-4o.

Our results extend this literature: ChatGPT-5 reached 99.6% accuracy with 0% unclassified cases (errorCoded 0.004) and outperformed Gender API across thresholds, while resorting to web verification in 10% of all names and in 70% of cases that Gender API misclassified or left unclassified, suggesting that ChatGPT-5, combined with the two-step procedure, achieves further gains over earlier ChatGPT implementations. Importantly, the performance metrics for Gender API were almost identical in our study and in the primary study, as expected given that we randomly sampled 1,000 names from the same dataset. Without applying a probability threshold, the errorCoded, errorCodedWithoutNA, and naCoded values were 0.023, 0.018, and 0.005, compared with 0.018, 0.015, and 0.003 in the primary study.1

Implications for practice

The findings of this study suggest that ChatGPT-5 can serve as a highly effective tool for gender inference in bibliometric and epidemiologic research. Compared with Gender API, ChatGPT-5 offers the combined advantage of higher accuracy and the absence of unclassified cases, which are often the most resource-intensive to resolve manually. The “checked” feature is of particular interest: in practice, the model resorted to internet-based verification primarily for the most difficult cases, precisely where traditional tools are most likely to fail. This adaptive strategy reflects a strength of LLMs and could reduce the need for additional post-processing or the use of multiple tools in tandem. However, in the present implementation, queries were conducted in batches of 10 names at a time, which is time-consuming for large-scale datasets. For practical adoption, future applications should integrate automated pipelines allowing the entire dataset to be processed at once.

In general internal medicine, the use of accurate gender inference tools is particularly relevant for both workforce monitoring and research. General internal medicine plays a central role in ensuring equitable access to healthcare, and reliable data on the gender distribution of general practitioners are essential to identify disparities in recruitment, retention, career progression, and leadership opportunities. In research, gender analyses increasingly inform the study of authorship, funding, and participation in general internal medicine. Inaccurate or incomplete data may bias analyses, obscure inequities, and weaken the evidence base. By providing near-perfect accuracy without unclassified cases, ChatGPT-5 offers a robust solution that can strengthen workforce monitoring, support diversity and inclusiveness, and advance general internal medicine research and policy.

Limitations

This study has several limitations. First, it was conducted in a single country, albeit one that is linguistically and culturally diverse, with a substantial proportion of non-Swiss physicians. Generalizability to contexts with higher proportions of Asian, Middle Eastern, or other non-Western names remains to be confirmed. Second, the approach tested here relied on a specific prompt design and one-shot querying of ChatGPT-5; whether alternative designs would yield similar or better results remains unknown. In addition, the method required queries to be run in batches of 10 names, which limits scalability and increases processing time. Future studies should evaluate automated solutions capable of handling entire datasets in a single run. Third, the method, by construction, imposes a binary view of gender (female/male), which does not capture the complexity of gender identity and risks marginalizing non-binary or transgender individuals. Although this limitation applies equally to existing gender inference tools, it is important to emphasize that gender detection from names should not replace self-identification whenever feasible.

Conclusion

In summary, this study shows that ChatGPT-5 substantially outperforms Gender API in inferring gender from physicians’ names in Switzerland, achieving near-perfect accuracy with no unclassified cases. The ability of the model to selectively seek external verification for ambiguous names is a key advantage that addresses the main weakness of existing tools, supporting its use as a reliable method for gender inference in large-scale datasets. However, because the current implementation required querying names in small batches, future work should focus on automated solutions that can process entire datasets efficiently while also validating performance in more diverse populations.
