Background

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.173732.1

Research Article

Articles

A Decade of the Turkish Pediatric Endocrinology Subspecialty Board Examination: Structure, Outcomes, and Candidate Perspectives

[version 1; peer review: 1 approved with reservations]

Çalişkan

S. Ayhan

Conceptualization Data Curation Formal Analysis Investigation Methodology Writing – Original Draft Preparation Writing – Review & Editing https://orcid.org/0000-0001-9714-6249 a 1 Demir

Korcan

Conceptualization Data Curation Methodology Project Administration Writing – Original Draft Preparation Writing – Review & Editing 2 Darcan

Şükran

Conceptualization Data Curation Methodology Project Administration Supervision Writing – Review & Editing 3 ANIK

Ahmet

Conceptualization Data Curation Investigation Methodology Project Administration Writing – Review & Editing 4 Darendeliler

Feyza

Conceptualization Data Curation Methodology Project Administration Supervision Writing – Review & Editing 5 Turkish Board of Pediatric Endocrinology Exam Committee BÖBER

Ece

DÖNERAY

Hakan

SAVAŞ ERDEVE

Şenay

GÖKŞEN

Damla

TÜTÜNCÜLER KÖKENLİ

Filiz

ÖZEN

Samim

ÖZÖN

Alev

TÜRKKAHRAMAN

Doğa

1Medical Education, United Arab Emirates University College of Medicine and Health Sciences, Al Ain, Abu Dhabi, United Arab Emirates 2Pediatric Endocrinology, Dokuz Eylül University Faculty of Medicine, İzmir, Turkey 3Pediatric Endocrinology, Ege University Faculty of Medicine, İzmir, Turkey 4Pediatric Endocrinology, Adnan Menderes University Faculty of Medicine, Aydın, Turkey 5Pediatric Endocrinology, Istanbul University Istanbul Faculty of Medicine, İstanbul, Turkey

a ayhanca@gmail.com

No competing interests were disclosed.

8 12 2025

2025

1371

27 11 2025

2025

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

To present a decade-long evaluation of the Turkish Pediatric Endocrinology Subspecialty Board Examination (TPEBE), focusing on its structural evolution, psychometric performance, and candidates’ perceptions.

Methods

This cross-sectional study analyzed examination data from 2015–2025, encompassing 263 sittings (261 eligible candidates) and post-exam survey responses from 217 participants. Examination metrics included mean scores, pass rates, and reliability coefficients (Cronbach’s α) for multiple-choice question (MCQ) and key feature problem (KFP) components. Survey items assessed perceptions of exam difficulty, fairness, relevance, and organization using a 9-point Likert scale. Quantitative data were analyzed using descriptive and inferential statistics.

Results

Mean total scores declined following the 2019 inclusion of KFPs (x̄=52.9±9.05 in 2025), while reliability improved progressively (MCQ α=0.53–0.90; KFP α=0.31–0.85). Pass rates varied from 22.6% to 85.0%. Male candidates scored higher on MCQs and total scores (p<0.05), but gender differences in KFP performance and overall pass rates were not statistically significant. Candidates rated the examination highly for organization (x̄=7.75±1.54) and clinical relevance of KFPs (x̄=7.35±1.59), though exam duration received the lowest satisfaction (x̄=5.14±2.97). Qualitative feedback emphasized the educational value of KFPs and recommended extended testing time.

Conclusions

Over ten years, the TPEBE has evolved into a psychometrically robust and educationally valuable certification process. The balanced integration of MCQs and KFPs has strengthened construct validity and candidate engagement. These examinations are expected to gain broader recognition by institutions and regulators as a benchmark for educational and professional achievement.

Graduate Medical Education Pediatric Endocrinology Subspecialty Training Board Certification Board Examination Medical Education Assessment Key Feature Problems

The author(s) declared that no grants were involved in supporting this work.

Introduction

Pediatric endocrinology was formally recognized as a subspecialty in Türkiye in 1973, alongside pediatric metabolic diseases. It subsequently attained the status of an independent subspecialty in 2002. ^{1,2} The establishment of the Turkish Society for Pediatric Endocrinology and Diabetes (TSPED) in 1994 marked a critical milestone in the institutional development of the field. Since its inception, TSPED has played a central role in advancing pediatric endocrinology and diabetes care through the promotion of professional collaboration, standard-setting initiatives, and a broad array of educational activities (including conferences, workshops, and training programs) designed to enhance the competencies of healthcare professionals. ³

In alignment with international practices for professional certification, the Turkish Pediatric Endocrinology Subspecialty Board Examination (TPEBE) was introduced in 2015. Administered by the Turkish Board of Pediatric Endocrinology Exam Committee (TBPEC) under TSPED, the TPEBE serves as a formal mechanism to evaluate the clinical knowledge and competencies of pediatric endocrinologists in Türkiye. Board certification examinations are globally recognized as essential instruments for safeguarding the quality of clinical practice and maintaining high professional standards across medical specialties. ⁴ The establishment of the TPEBE therefore represents a significant step in harmonizing Turkish pediatric endocrinology training and credentialing processes with international norms in medical education and assessment.

During the first four years (2015–2018), the examination exclusively employed multiple-choice questions (MCQs). MCQs are recognized as a highly effective tool for evaluating knowledge across broad content areas, as they facilitate extensive content coverage and contribute to strong content validity. ⁵ This approach supports making valid inferences about the entire content domain. Furthermore, MCQs are widely used because they offer high reliability and are easy to score, ensuring precision, uniformity, and efficiency. ⁶ However, poorly designed MCQs may inadvertently target superficial content and fail to measure higher-order cognitive processes. ^{7,8}

Consequently, the selection of item formats for any given assessment should be guided by a clear understanding of their respective strengths and limitations. A robust assessment strategy integrates diverse methods, each tailored to meet specific evaluative objectives. ⁹ While MCQs ensured coverage and reliability, concerns about assessing higher-order reasoning prompted the integration of new formats.

In response to these considerations, the TPEBE incorporated key feature problems (KFPs) into its format starting in 2019, with the objective of enhancing the assessment of candidates’ clinical decision-making skills. ¹⁰ KFPs are designed to simulate real-life clinical scenarios that require the integration of complex data to make clinically meaningful decisions. ^{9,11} The format focuses on pivotal points in case management, referred to as “key features”, which represent the most essential and error-prone aspects of clinical problems. ^{12,13} Originally introduced at the Cambridge Conference in 1984, the key features format was adopted by the Medical Council of Canada in 1992 as part of the MCC Qualifying Examination (MCCQE) Part I. This innovation aimed to replace the older Patient Management Problems (PMPs) format and reduce the overreliance on MCQs in licensure examinations. ^{11,12,14} The adoption of KFPs by the TPEBE reflects a parallel intent: to augment the assessment of clinical competence beyond the limits of traditional MCQs.

This manuscript presents a comprehensive 10-year review of the Turkish Pediatric Endocrinology Subspecialty Board Examination, focusing on its structural evolution, aggregated examination outcomes, and candidates’ perspectives on the examination process.

Methods

Study design and setting

This cross-sectional study analyzed data from the TPEBE conducted between 2015 and 2025. Examination records were combined with candidate survey responses collected immediately after each examination, except in the first year (2015), when no survey was administered. A total of ten examination sessions were included in the analysis; the 2020 session was cancelled due to the COVID-19 pandemic.

Participants

Eligible candidates were pediatric endocrinologists and subspecialty residents who met the requirements defined by the TBPEC. Eligibility was verified through documentation review, and only approved applicants were permitted to sit for the exam. Across the study period, there were 263 examination sittings corresponding to 186 unique candidates, 58 of whom attempted the examination more than once. Of the 263 sittings, 193 (73.4%) were by female candidates and 70 (26.6%) by male candidates.

Exam sets

Exam questions were developed by faculty members from academic institutions across Türkiye, each submitting items within their subspecialty areas to an online item bank organized by predefined subject categories. Depending on the year, seven or eight TBPEC members reviewed these items during face-to-face structured meetings, revised them as needed, and selected the final questions for each examination. All examinations were administered in paper–pencil format, at a single venue, under TBPEC members’ supervision.

Each exam set consisted of up to two sections:

1. MCQ section: In the first four years (2015–2018), the exams consisted exclusively of MCQs, with 75 items in 2015 and 100 items in each year from 2016 to 2018. The MCQ section was then reduced to 85 items in 2019 and 80 items in 2021–2025. All MCQs had five options with one correct answer.

2. KFP section: Beginning in 2019, the exam included a second section of KFPs. Each exam contained five KFP cases, with 2–4 items per case (13–14 items in total). Most items required short written responses; only five were multiple-response items allowing more than one correct option.

The maximum achievable score for each examination was 100 points. To determine the cut scores, two standard-setting methods were employed: the Nedelsky method for MCQs (applied to all exams except 2015) and the Angoff method for KFPs, each selected for its suitability to the respective item format. For exams that included both MCQ and KFP sections, the final cut score represented the sum of the two section-specific cut scores. The evolution of the examination structure across the study period is summarized in Table 1.

Table 1. Structure of the Turkish pediatric endocrinology subspecialty board examination by year, 2015–2025.

Year(s) *	MCQ items n	KFP cases (items) n	Exam format
2015	75	–	MCQ
2016–2018	100	–	MCQ
2019	85	5 cases (13–14)	MCQ + KFP
2021–2025	80	5 cases (13–14)	MCQ + KFP

* The 2020 exam was cancelled due to COVID-19.
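The two standard-setting methods named above can be illustrated with a minimal sketch. It assumes the textbook forms of the Nedelsky and Angoff procedures and a simple average across judges; the input layout, the function names, and the toy judgments are hypothetical, and the article does not describe how raw section cut scores were scaled to the 100-point total.

```python
from statistics import mean


def nedelsky_cut(eliminated, n_options=5):
    """Nedelsky cut score, in raw items, averaged over judges.

    eliminated[j][i] is the number of distractors that judge j believes a
    borderline candidate could rule out on item i (hypothetical layout).
    """
    judge_totals = []
    for judge in eliminated:
        total = 0.0
        for k in judge:
            if not 0 <= k <= n_options - 1:
                raise ValueError("the correct option cannot be eliminated")
            # The borderline candidate guesses among the remaining options.
            total += 1.0 / (n_options - k)
        judge_totals.append(total)
    return mean(judge_totals)


def angoff_cut(ratings):
    """Angoff cut score, in raw items, averaged over judges.

    ratings[j][i] is judge j's estimated probability that a borderline
    candidate answers item i correctly.
    """
    return mean(sum(judge) for judge in ratings)


# Illustrative judgments only; these are not TPEBE data.
mcq_cut = nedelsky_cut([[3, 4, 0], [3, 4, 0]])
kfp_cut = angoff_cut([[0.5, 0.8], [0.7, 0.6]])
combined_cut = mcq_cut + kfp_cut  # sum of the section-specific cut scores
```

In this toy case the combined cut score is simply the sum of the two section values, mirroring the rule stated above for exams that contained both sections.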

Post-exam review: After each examination, board members reviewed the questions with the candidates and received any appeals. Following the evaluation of these appeals, some MCQs were removed from the exam sets over the years for various reasons: 2016 (4 items), 2017, 2018, and 2021 (2 items each), and 2024 and 2025 (1 item each). Omitted questions were scored as correct for all examinees.

Survey instrument

Candidate feedback was collected using a paper-based questionnaire administered at the examination venue immediately after candidates completed the examination, in every year except 2015. The instrument consisted of items on demographic characteristics; 11 structured items assessing perceptions of exam difficulty, relevance, fairness, and organization, each rated on a 9-point Likert scale (1: Strongly disagree/Very poor, 9: Strongly agree/Very good); and open-ended questions exploring the most useful aspects, least useful aspects, suggestions for improvement, and comments. The survey was completed anonymously and voluntarily. Informed consent was obtained verbally from all participants prior to data collection. Verbal consent was obtained because the survey was anonymous, posed minimal risk, and did not involve the collection of any identifying or sensitive personal information.

Data analysis

Data were analyzed using IBM SPSS Statistics version 29.0 (IBM Corp., Armonk, NY, USA). The normality of continuous variables was examined using the Kolmogorov-Smirnov test, and continuous variables were presented as mean and standard deviation (x̄ ± SD). Comparisons between two groups were conducted using the independent-samples t-test or the Mann-Whitney U test, as applicable. Categorical variables were presented as numbers and percentages, and relationships between them were examined using Pearson’s chi-square test or Fisher’s exact test. Cronbach’s alpha coefficient was calculated to assess the reliability of the tests. A 95% confidence level was adopted, and statistical significance was set at p < 0.05.
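For readers unfamiliar with the reliability coefficient used here, the following is a minimal sketch of Cronbach’s alpha computed from a candidates-by-items score matrix. The study itself used SPSS; the function name and the toy matrix below are hypothetical and serve only to show the arithmetic.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for scores[c][i] = score of candidate c on item i."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item across candidates.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the candidates' total scores.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data: four candidates, three dichotomously scored items (not TPEBE data).
toy_scores = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha = cronbach_alpha(toy_scores)
```

For this toy matrix the item variances sum to 5/6 and the total-score variance is 5/3, so alpha works out to 0.75.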

Results

Exam performance

Between 2015 and 2025, a total of 263 examination sittings (193 by women, 70 by men) were recorded. Two sittings in which the candidate did not respond to any exam items were excluded, leaving 261 sittings for statistical analysis.

Mean total exam scores, cut scores, pass rates, and reliability coefficients (Cronbach’s α) for both the MCQ and KFP components are summarized in Table 2. Overall, mean total scores showed a gradual decline after 2018, with the lowest average observed in 2023 (x̄ = 52.33 ± 12.18). Pass rates fluctuated across years, ranging from 22.6% (2023) to 85.0% (2016). Reliability coefficients for the MCQ component remained moderate to high (α = 0.53–0.90), while those for the KFP component improved over time, from 0.31 in 2021 to 0.85 in 2024.

Table 2. Examination performance metrics by year (2015–2025).

Year ^a	Candidate n	Exam total score Mean (SD)	Cut score (%)	Pass rate (%)	MCQ ^b (α)	KFP ^b (α)
2015	26	70.45 (5.97)	70.0	73.1	0.529	-
2016	20	74.00 (7.06)	65.0	85.0	0.686	-
2017	26	64.04 (10.33)	60.0	69.2	0.831	-
2018	18	61.83 (13.65)	60.0	61.1	0.899	-
2019	22	57.64 (10.40)	50.0	72.7	0.813	0.425
2021	23	56.69 (8.43)	60.0	39.1	0.677	0.312
2022	29	57.98 (10.77)	60.0	41.4	0.765	0.648
2023	31	52.33 (12.18)	60.0	22.6	0.854	0.711
2024	29	53.99 (11.36)	58.0	37.9	0.807	0.854
2025	37	52.90 (9.05)	58.0	29.7	0.675	0.662

^a The 2020 exam was cancelled due to the COVID-19 pandemic.

^b Cronbach’s α.

Male candidates achieved significantly higher mean scores in both the MCQ and exam total scores compared with females, while no statistically significant gender difference was observed for KFP scores (Table 3).

Table 3. Comparison of examination scores by gender.

Score type	Female (mean ± SD)	Male (mean ± SD)	t (df )	p ^a
MCQ	52.28 ± 12.49	57.90 ± 16.08	–2.65 (≈101)	0.009 ^b
KFP	8.50 ± 3.45	8.89 ± 3.23	–0.60 (167)	0.459 ^c
Exam Total	58.24 ± 11.49	62.35 ± 13.28	–2.45 (259)	0.015 ^b

^a Independent t-test.

^b Statistically significant (p < 0.05).

^c Mann-Whitney U test.
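The t statistics in Table 3 can be approximated from the reported means and standard deviations alone. The sketch below is illustrative and rests on two assumptions that the article does not state: that the 261 analyzed sittings split into 191 by female and 70 by male candidates, and that the MCQ comparison used the unequal-variance (Welch) form, as its fractional degrees of freedom suggest. The function names are hypothetical.

```python
from math import sqrt


def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's t with pooled variance; returns (t, df)."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2)), df


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t with Satterthwaite degrees of freedom; returns (t, df)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Assumed group sizes (not stated in the article); they total 261 sittings.
N_FEMALE, N_MALE = 191, 70

# Means and SDs are taken from Table 3 (female first, then male).
t_total, df_total = pooled_t(58.24, 11.49, N_FEMALE, 62.35, 13.28, N_MALE)
t_mcq, df_mcq = welch_t(52.28, 12.49, N_FEMALE, 57.90, 16.08, N_MALE)
```

Under these assumptions the sketch yields t of about –2.45 with 259 degrees of freedom for the exam total and t of about –2.65 with roughly 101 degrees of freedom for the MCQ section, in line with the tabulated values.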

Throughout the study period, male examinees achieved an overall pass rate of 60.0%, compared with 46.6% among female examinees. This difference was not statistically significant (χ ² = 3.68, p = 0.055).
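The pass-rate comparison above can be checked with a short worked example. The counts used below (42 of 70 male and 89 of 191 female sittings passing) are not reported in the article; they are inferred from the stated percentages and the 261 analyzed sittings, and should be read as an assumption. The sketch applies Pearson’s chi-square test without continuity correction to the resulting 2×2 table.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p), where p is the upper-tail probability with 1 df.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(X > x) equals erfc(sqrt(x / 2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Inferred counts (assumption): male pass/fail, then female pass/fail.
chi2, p = chi_square_2x2(42, 28, 89, 102)
```

With these inferred counts the statistic is about 3.68 and the p-value about 0.055, matching the values reported in the text.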

Candidate feedback

Of the 235 examinees across the 2016–2025 examination years, 217 (92.3%) completed the post-examination feedback questionnaire. The mean and standard deviation scores for each structured statement, aggregated across examination years, are presented in Table 4.

Table 4. Candidate evaluation of examination components by year: Mean (SD) Scores.

	2016 (n = 19)	2017 (n = 25)	2018 (n = 18)	2019 (n = 19)	2021 (n = 22)	2022 (n = 27)	2023 (n = 30)	2024 (n = 23)	2025 (n = 34)	Overall (n = 217)
Item	Mean (SD)
1. The MCQs were difficult.	5.22 (1.93)	6.76 (1.83)	7.11 (1.49)	6.00 (1.87)	6.71 (1.71)	6.33 (1.62)	5.79 (1.54)	5.59 (1.89)	7.24 (1.33)	6.36 (1.77)
2. The KFP questions were difficult.	–	–	–	4.59 (1.58)	5.38 (1.60)	6.27 (1.78)	5.86 (2.03)	5.67 (1.85)	7.21 (1.57)	6.01 (1.90)
3. The KFP questions were consistent with my current clinical practice.	–	–	–	7.41 (1.33)	7.48 (1.29)	6.96 (2.01)	7.34 (1.65)	7.72 (1.13)	7.35 (1.72)	7.35 (1.59)
4. I believe the KFP questions measured my clinical problem-solving skills.	–	–	–	7.47 (1.33)	7.29 (1.93)	6.92 (1.98)	7.24 (1.81)	7.42 (1.22)	7.00 (1.79)	7.18 (1.72)
5. I appreciated the inclusion of KFP questions in the exam.	7.33 (2.06)	6.36 (2.55)	5.06 (2.80)	7.29 (1.83)	7.48 (2.38)	7.46 (1.63)	7.52 (2.37)	7.35 (2.16)	6.88 (2.11)	7.31 (2.09)
6. The exam duration was sufficient.	7.22 (1.96)	7.24 (1.74)	7.67 (1.50)	5.71 (2.85)	5.14 (2.85)	5.11 (2.81)	4.89 (3.26)	3.82 (2.61)	3.91 (3.18)	5.14 (2.97)
7. The exam was well-organized.	6.44 (1.85)	7.12 (1.64)	7.28 (2.05)	6.82 (2.24)	8.24 (0.94)	8.11 (1.15)	8.07 (1.19)	8.23 (1.11)	7.76 (1.60)	7.75 (1.54)
8. The exam was designed to effectively differentiate between knowledgeable and less knowledgeable candidates.	6.67 (1.64)	6.56 (1.76)	7.33 (1.68)	7.00 (1.70)	7.33 (2.01)	7.37 (1.69)	7.24 (1.30)	7.91 (1.11)	7.26 (1.76)	7.24 (1.68)
9. The distribution of questions across topics was balanced throughout the exam.	6.59 (1.54)	6.40 (1.58)	6.78 (1.90)	6.88 (1.54)	6.95 (1.53)	7.41 (1.95)	7.17 (1.71)	7.55 (1.01)	7.26 (1.78)	7.11 (1.66)
10. The exam was suitable for pediatric endocrinology subspecialty qualification assessment.	6.88 (1.69)	5.84 (2.29)	7.44 (1.62)	7.00 (1.70)	6.52 (2.09)	7.11 (1.87)	7.31 (1.39)	7.41 (1.30)	7.47 (1.56)	7.00 (1.67)
11. The exam content was consistent with the scope of my pediatric endocrinology subspecialty training.	–	–	–	7.12 (2.03)	6.81 (1.99)	7.30 (1.84)	7.52 (1.50)	7.64 (1.09)	7.56 (1.44)	7.15 (1.79)

Overall, participants rated the MCQ section as slightly more difficult (x̄ = 6.36 ± 1.77) than the KFP section (x̄ = 6.01 ± 1.90). Most respondents agreed that the KFP questions reflected real clinical practice and effectively assessed problem-solving ability. The inclusion of KFPs received high satisfaction scores, indicating broad approval of this format.

Among all items, exam organization achieved the highest overall average rating (x̄ = 7.75 ± 1.54), suggesting that participants were highly satisfied with the administration and logistical arrangements. Conversely, the adequacy of exam duration received the lowest rating (x̄ = 5.14 ± 2.97), representing a level only slightly above neutrality on the satisfaction scale and highlighting persistent time-related concerns.

Open-ended feedback from the examinations revealed diverse yet coherent themes reflecting participants’ evaluation of exam quality, content relevance, and organizational logistics. Respondents appreciated the exam’s comprehensive coverage across all topics and its alignment with clinical practice, particularly valuing the case-based (KFP) section for fostering problem-solving and reflective learning. Several participants noted that the test effectively highlighted their knowledge gaps and motivated further study.

Conversely, time constraints emerged as the most prominent source of dissatisfaction. Participants described the allotted duration as insufficient, citing long and complex questions that limited their ability to complete the exam. Suggestions included extending the time or dividing the test into two parts.

Regarding content, opinions were mixed. While most found the questions well prepared, a few perceived the MCQs as overly difficult or focused on unnecessary details—particularly in genetic and metabolic topics. Participants recommended increasing the proportion of clinical and case-based questions and providing post-exam answer booklets to enhance learning. Despite the stress associated with the process, many expressed gratitude to the organizing committee and acknowledged the exam’s educational value in guiding professional self-assessment and development.

Discussion

This 10-year evaluation of the TPEBE provides the first comprehensive overview of its evolution, psychometric performance, and candidates’ perceptions since its inception in 2015. The findings reveal a trajectory of increasing structural improvement and reliability, particularly following the integration of KFPs in 2019. The transition from a purely MCQ format to a mixed MCQ–KFP design has progressively strengthened the exam’s ability to assess higher-order clinical reasoning while maintaining fairness and organizational quality. Pass rates fluctuated over time, likely reflecting the combined influence of item difficulty calibration, evolving training quality, and candidate preparedness. Importantly, reliability coefficients for both sections improved steadily, indicating the maturation of item-writing processes and enhanced internal consistency.

Throughout the decade, mean total scores and pass rates varied between exam years, with notably higher performance during the early years and lower outcomes from 2021 onward. Several factors may explain this trend. The early sessions (2015–2018) consisted exclusively of MCQs, a format known for high reliability and broad content coverage but limited depth in assessing clinical judgment. ^{5,6} As the examination evolved to include KFPs, which emphasize reasoning and decision-making, overall scores declined, an expected phenomenon also observed in other board examinations that introduced performance-oriented formats. ^{4,11} The temporary disruption of training schedules and reduced clinical exposure during the COVID-19 pandemic may have further contributed to decreased performance in 2021–2023, a concern also raised in other studies of postgraduate training and assessment. ^{15–18}

The introduction of KFPs in 2019 represents a major pedagogical advancement in the TPEBE. Designed to target “key decision points” in clinical management, KFPs enable a more valid assessment of problem-solving and integrative reasoning than MCQs alone. ^{12,13} The progressive increase in KFP reliability, from α = 0.312 in 2021 to α = 0.854 in 2024, indicates improved case design, examiner calibration, and standardization procedures. This pattern suggests growing psychometric robustness and aligns with findings from international studies reporting that mixed-format examinations achieve better construct validity. ^{7,8,19,20} The balanced use of two complementary item types (MCQs for breadth and KFPs for depth) reflects an evidence-informed approach to assessment design consistent with contemporary medical education principles. ^{9,21}

In this study, although male candidates achieved slightly higher MCQ and total scores, gender differences in pass rates were not statistically significant. Comparable findings have been reported in other nationwide postgraduate assessments, with several studies showing no overall gender differences in examination outcomes, and in some cases, female candidates performing significantly better in certain domains or clinical components. ^{22,23} The observed score gap may reflect differential exposure to standardized testing environments or self-perceived exam confidence. Importantly, the absence of significant disparities in KFP performance suggests that case-based, reasoning-oriented formats may reduce gender-related score variance, supporting their inclusion as a more equitable assessment component.

Candidates’ feedback provides valuable qualitative insight into the examination’s educational value and perceived validity. The majority of participants rated the exam highly for its organization, content balance, and reflection of real clinical practice. The strong endorsement of the KFP format underscores its perceived authenticity and relevance to day-to-day decision-making in pediatric endocrinology. Participants consistently reported that the exam highlighted their knowledge gaps and motivated targeted self-study, confirming the dual function of certification assessments as both evaluative and formative instruments. These findings echo international reports emphasizing the educational impact of well-designed board examinations in promoting reflective practice and continuing professional development. ^{4,24}

After the introduction of KFPs, the most consistent area of dissatisfaction concerned the perceived inadequacy of examination duration (x̄ = 5.14 ± 2.97). The relatively large standard deviation indicates considerable variability in respondents’ perceptions, suggesting diverse experiences or expectations regarding the sufficiency of the allotted time. This concern likely arises from the inherent cognitive load of the KFP section, which requires interpretive reasoning and the formulation of written responses. Similar challenges have been reported in previous studies, noting that constructing answers under open-ended conditions is cognitively demanding and time-consuming, as it involves articulating reasoning and justifying decisions rather than merely recognizing correct options. ^{11,25,26} Addressing time constraints, either through adaptive scheduling or improved question pacing, could enhance candidate experience without compromising assessment validity.

The TBPEC’s iterative review of item performance and appeals, coupled with scoring adjustments for omitted questions, reflects a transparent and learner-centered quality assurance framework. The continuous monitoring of reliability and cut-score consistency over time has strengthened the exam’s credibility and accountability. Looking ahead, the transition to digital or hybrid exam delivery may offer opportunities for enhanced item analysis, automated scoring, and secure remote administration, aligning the TPEBE with global advancements in high-stakes assessment.

Limitations and future directions

This analysis has some limitations. The cross-sectional design precludes longitudinal tracking of individual progress or causal inference regarding changes in performance. Survey data were self-reported and may be influenced by response bias. Furthermore, the absence of pre-pandemic comparator data limits the interpretation of COVID-19–related effects. Future research could explore predictive validity (e.g., relationship between board scores and subsequent clinical performance), longitudinal reliability, and the psychometric behavior of KFP items across specialties.

Conclusion

Over the past decade, the TPEBE has evolved from a traditional knowledge-based test into a multidimensional assessment aligned with international best practices in medical education. The integration of KFPs, progressive enhancement of reliability indices, and strong candidate endorsement collectively indicate a maturing and credible certification process. Continued investment in psychometric monitoring, examiner training, and technological innovation will further enhance the examination’s role in ensuring clinical competence and fostering excellence in pediatric endocrinology practice in Türkiye. It is anticipated that these examinations will be increasingly recognized and utilized by educational institutions and regulatory authorities as a benchmark for achievement in both education and professional practice.

Ethical considerations

The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ege University Faculty of Medicine Medical Research Ethics Board Ref. Date: 25.05.2023 #23-5.1T/46. Informed consent was obtained verbally from all participants prior to data collection. Verbal consent was obtained because the survey was anonymous, posed minimal risk, and did not involve the collection of any identifying or sensitive personal information.

Data availability statement

figshare: A Decade of the Turkish Pediatric Endocrinology Subspecialty Board Examination: Structure, Outcomes, and Candidate Perspectives.

Dataset: https://doi.org/10.6084/m9.figshare.30628319 ²⁷

This project contains the following data:

• Data File: TPEBE-Data-2015-2025.xlsx

Data are available under the terms of the CC BY 4.0 license.

Acknowledgements

We express our sincere appreciation to the TBPEC members (Ece Böber, Hakan Döneray, Şenay Savaş Erdeve, Damla Gökşen, Filiz Tütüncüler Kökenli, Samim Özen, Alev Özön, and Doğa Türkkahraman) for their invaluable efforts and contributions throughout the entire TPEBE process.

We are also grateful to our academic colleagues who developed and submitted the MCQ and KFP items for the examination, whose contributions were essential to the success of this work. In addition, we extend our heartfelt thanks to all TPEBE participants for their engagement in the examination and for their time, commitment, and willingness to contribute to this study.

References 1

Türkiye Cumhuriyeti Tıpta Uzmanlık Tüzüğü - Republic of Türkiye Medical Specialization Regulation.

2002. Reference Source

Türkiye Cumhuriyeti Tababet Uzmanlık Tüzüğü - Republic of Türkiye Medical Specialization Regulation.

1973. Reference Source

Tarihçe – Çocuk Endokrinolojisi ve Diyabet Derneği. Accessed July 24, 2025. Reference Source

Staudenmann

Waldner

Lörwald

: Medical specialty certification exams studied according to the Ottawa Quality Criteria: a systematic review. BMC Med. Educ. 2023;23(1):619–620. 37649019

10.1186/S12909-023-04600-X

PMC10466740

Shumway

Harden

: Medical Teacher AMEE Guide No. 25: The assessment of learning outcomes for the competent and reflective physician. 2009. 10.1080/0142159032000151907

Yudkowsky

Soo

Downing

: Assessment in Health Professions Education Editado Por Rachel Yudkowsky, Yoon Soo Park, Steven M. Downing. Routledge;2020. Accessed December 20, 2024. Reference Source

Renes

Vleuten

CPM

van der Collares

: Utility of a multimodal computer-based assessment format for assessment with a higher degree of reliability and validity. Med. Teach. 2023;45(4):433–441. 36306368

10.1080/0142159X.2022.2137011

Wijk

van Janse

Ruijter

: Use of very short answer questions compared to multiple choice questions in undergraduate medical students: An external validation study. PLoS One. 2023;18(7):e0288558. 37450485

10.1371/JOURNAL.PONE.0288558

PMC10348524

Schuwirth

LWT

Van Der Vleuten

CPM

: Different written assessment methods: what can be said about their strengths and weaknesses? Med. Educ. 2004;38(9):974–979. 15327679

10.1111/J.1365-2929.2004.01916.X

Yılmaz

Çalışkan

Darcan

: Flipped learning in faculty development programs: opportunities for greater faculty engagement, self-learning, collaboration and discussion. Turk. J. Biochem. 2022;47(1):127–135. 10.1515/TJB-2021-0071

Page

Bordage

Allen

: Developing key-feature problems and examinations to assess clinical decision-making skills. Acad. Med. 1995;70(3):194–201. Accessed January 11, 2025. Reference Source

Bordage

Gordon

: An Alternative to PMPs: The “Key Features” Concept. Further Developments in Assessing Clinical Competence, 2nd Ottawa Conference, 1987, 59-75. An Alternative to PMPs: The “Key Features” Concept. Further Developments in Assessing Clinical Competence, 2nd Ottawa Conference. 1987;59–75.

Medical Council of Canada: Guidelines for the Development of Key Feature Problems & Test Cases. 2012. Accessed December 21, 2024. Reference Source

Page

Bordage

: The Medical Council of Canada’s key features project: a more valid written examination of clinical decision-making skills. Acad. Med. 1995;70(2):104–110. Accessed March 8, 2019. 10.1097/00001888-199502000-00012

Reference Source

Ryan

Holmboe

Chandra

: Competency-Based Medical Education: Considering Its Past, Present, and a Post-COVID-19 Era. Acad. Med. 2022;97:S90–S97. 34817404

10.1097/ACM.0000000000004535

PMC8855766

Sneyd

Mathoulin

O’Sullivan

: Impact of the COVID-19 pandemic on anaesthesia trainees and their training. Br. J. Anaesth. 2020;125(4):450–455. 32773215

10.1016/j.bja.2020.07.011

PMC7377727

Patil

Ranjan

Kumar

: Impact of COVID-19 Pandemic on Post-Graduate Medical Education and Training in India: Lessons Learned and Opportunities Offered. Adv. Med. Educ. Pract. 2021;12:809–816. 34345196

10.2147/AMEP.S320524

PMC8325012

Exam Pass Rates | The American Board of Pediatrics. Accessed October 13, 2025.

Farmer

Page

: A practical guide to assessing clinical decision-making skills using the key features approach. Med. Educ. 2005;39(12):1188–1194. 16313577

10.1111/J.1365-2929.2005.02339.X

McNamara

Scott

Boyd

: Constructing validity evidence from a pilot key-features assessment of clinical decision-making in cerebral palsy diagnosis: application of Kane’s validity framework to implementation evaluations. BMC Med. Educ. 2023;23(1):1–19. 10.1186/S12909-023-04631-4

Bird

Olvet

Willey

: Patients don’t come with multiple choice options: essay-based assessment in UME. Med. Educ. Online. 2019;24(1). 31438809

10.1080/10872981.2019.1649959

PMC6720218

Sulistio

Khera

Squiers

: Effects of gender in resident evaluations and certifying examination pass rates. BMC Med. Educ. 2019;19(1):1–7. 10.1186/S12909-018-1440-7

Ellis

Knapton

Cannon

: A multivariate analysis examining the relationship between sociodemographic differences and UK graduates’ performance on postgraduate medical exams. Med. Teach. 2025;1–15. 40512226

10.1080/0142159X.2025.2513426

Bhanji

Naik

Skoll

: Competence by Design: The Role of High-Stakes Examinations in a Competence Based Medical Education System. Perspect. Med. Educ. 2024;13(1):68–74. 38343558

10.5334/PME.965

PMC10854425

Huwendiek

Reichert

Duncker

: Electronic assessment of clinical reasoning in clerkships: A mixed-methods comparison of long-menu key-feature problems with context-rich single best answer questions. Med. Teach. 2017;39(5):476–485. 28281369

10.1080/0142159X.2017.1297525

Çalışkan

Taşdelen Teker

Mavioğlu

: Digital transformation of the Turkish national neurology board examination: Implementation and candidates’ feedback. Turkish Journal of Neurology. 2025;31(3):270–277. 10.55697/TND.2025.383

Çalışkan

: Turkish Pediatric Endocrinology Subspecialty Board Examination 2015-2025. Research Data. 10.6084/m9.figshare.30628319

10.5256/f1000research.191571.r448667

Reviewer response for version 1

Mansoor

Masab

1 Referee https://orcid.org/0009-0007-4501-7016 1Edward Via College of Osteopathic Medicine, Blacksburg, Virginia, USA

Competing interests: No competing interests were disclosed.

10 January 2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve-with-reservations

Peer Review Report

Summary

This cross-sectional study presents a comprehensive 10-year evaluation (2015-2025) of the Turkish Pediatric Endocrinology Subspecialty Board Examination (TPEBE), analyzing examination data from 263 sittings (261 eligible candidates) and post-examination survey responses from 217 participants. The manuscript documents the structural evolution from exclusively multiple-choice questions (MCQs) to a mixed format incorporating key feature problems (KFPs) beginning in 2019, and evaluates psychometric performance metrics and candidate perspectives. The authors demonstrate progressive improvements in reliability coefficients, variable pass rates (22.6%-85.0%), and generally positive candidate feedback regarding exam organization and clinical relevance.

Detailed Assessment

Is the work clearly and accurately presented and does it cite the current literature?

Answer: Yes

The manuscript is well-structured, logically organized, and clearly written with appropriate scientific terminology. The literature review adequately contextualizes the study within the broader framework of medical education assessment, citing relevant international standards and contemporary assessment theory. The references span classical assessment literature (Bordage, Page, Downing) and contemporary validation frameworks, appropriately supporting the methodological choices and interpretive context.

Minor recommendations:

Consider citing more recent literature on post-pandemic effects on medical education outcomes (the manuscript mentions COVID-19 impacts but could strengthen this with additional 2023-2024 references)

The discussion of gender differences in examination performance could benefit from more recent systematic reviews or meta-analyses on this topic in medical education

Is the study design appropriate and is the work technically sound?

Answer: Yes

The cross-sectional design is appropriate for the research objectives. The 10-year timeframe provides sufficient data for trend analysis and evaluation of structural changes. The combination of quantitative examination metrics with qualitative candidate feedback creates a robust mixed-methods approach that enhances the validity of conclusions.

Strengths:

Comprehensive dataset spanning a full decade

High survey response rate (92.3%)

Appropriate psychometric analyses (Cronbach's α, pass rates, mean scores)

Transparent reporting of examination structure evolution

Minor considerations:

The exclusion of two candidates who did not respond to any items is reasonable and well-documented

The cancellation of the 2020 examination due to COVID-19 creates a minor gap but is unavoidable and well-explained

Are sufficient details of methods and analysis provided to allow replication by others?

Answer: Partly

The manuscript provides substantial methodological detail, but several areas require clarification or expansion:

Strengths:

Clear description of examination structure evolution (Table 1)

Specification of standard-setting methods (Nedelsky for MCQs, Angoff for KFPs)

Description of item development and review processes

Statistical methods appropriately described

Areas requiring clarification:

KFP Scoring Details: The manuscript states that KFPs "mostly required short written responses" with five being multiple-response items, but the specific scoring rubrics, rater training procedures, and inter-rater reliability assessments are not described. This is critical for replication and interpretation of the α coefficients for KFPs.

Standard-Setting Procedures: While the methods (Nedelsky, Angoff) are named, the specific implementation details are lacking:

How many judges participated?

What was the judge calibration process?

How were discrepancies resolved?

Were modified or traditional versions of these methods used?

Item Bank Management: The process for organizing, selecting, and ensuring content validity of questions from the item bank needs more detail:

What were the predefined subject categories?

How was content balance ensured across examination domains?

What quality control procedures were applied during item selection?

Survey Instrument: The 9-point Likert scale items are presented, but the development and validation of the survey instrument itself are not discussed. Was it pilot-tested? Was content validity established?

Missing Data: The manuscript should clarify whether there were any missing survey responses and how these were handled in analysis.

Recommendations:

Add a supplementary methods section or appendix detailing KFP scoring procedures

Provide more operational detail on standard-setting implementation

Clarify the content blueprint or specification table used for examination construction

Is the statistical analysis and its interpretation appropriate?

Answer: Yes

The statistical methods are appropriate for the research questions and data types:

Appropriate analyses:

Descriptive statistics (means, standard deviations, frequencies, percentages)

Independent samples t-tests and Mann-Whitney U tests for group comparisons

Chi-square and Fisher's exact tests for categorical associations

Cronbach's α for internal consistency reliability

Appropriate significance threshold (p < 0.05)

Strengths:

Proper selection of parametric vs. non-parametric tests based on distribution assessment (Kolmogorov-Smirnov)

Clear presentation of results with appropriate measures of central tendency and dispersion

Transparent reporting of statistical significance

Minor considerations:

Effect Sizes: While statistical significance is reported, effect sizes (e.g., Cohen's d for t-tests) would enhance interpretation of the practical significance of gender differences.

Multiple Comparisons: With multiple statistical tests conducted, consideration of family-wise error rate correction (e.g., Bonferroni adjustment) might be warranted, though given the exploratory nature of some analyses, the current approach is defensible.

Trend Analysis: Given the longitudinal nature of the data, formal trend analysis (e.g., linear regression of scores over time, joinpoint regression to identify structural breaks) could strengthen the conclusions about score trajectories and the impact of format changes.

Reliability Confidence Intervals: Presenting confidence intervals for Cronbach's α values would enhance interpretation of reliability estimates, particularly for smaller sample sizes in individual years.
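To make the multiple-comparisons point concrete, the following is a minimal sketch of the Bonferroni adjustment and the less conservative Holm step-down procedure. The function names and the p-values are hypothetical illustrations and are not taken from the manuscript.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the ordered p-values against alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values from four group comparisons (not from the manuscript).
example_p = [0.01, 0.02, 0.03, 0.005]
print(bonferroni(example_p))
print(holm(example_p))
```

Both procedures control the family-wise error rate; Holm never rejects fewer hypotheses than Bonferroni, which is why it is often preferred when a correction is applied at all.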

Recommendations:

Include effect sizes for key comparisons

Consider adding trend analysis to formally test temporal patterns

Provide confidence intervals for reliability coefficients
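As an illustration of the effect-size recommendation, a minimal sketch of Cohen's d computed from summary statistics is given here, together with the usual small-sample correction (Hedges' g). The function names and the input values are hypothetical; the actual group means, standard deviations, and sample sizes would come from Table 3.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def hedges_g(d, n1, n2):
    """Approximate small-sample bias correction applied to Cohen's d."""
    return d * (1 - 3 / (4 * (n1 + n2) - 9))


# Hypothetical summary statistics for two groups (not from the manuscript).
d = cohens_d(mean1=70.0, sd1=10.0, n1=60, mean2=66.0, sd2=9.0, n2=200)
print(round(d, 2), round(hedges_g(d, 60, 200), 2))
```

By the conventional benchmarks, values near 0.2, 0.5, and 0.8 are read as small, medium, and large, although these thresholds are rules of thumb rather than fixed cut-offs.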

Are all source data underlying the results available to ensure full reproducibility?

Answer: Yes

The authors have made their dataset publicly available through figshare (DOI: 10.6084/m9.figshare.30628319) under CC BY 4.0 license, which is commendable and facilitates transparency and potential replication or secondary analysis.

Recommendation:

Consider also depositing the survey instrument itself as supplementary material to enable full methodological transparency

Are the conclusions drawn adequately supported by the results?

Answer: Yes

The conclusions are generally well-supported by the presented data and appropriately qualified:

Well-supported conclusions:

Progressive improvement in psychometric robustness, particularly reliability

Successful integration of KFPs enhanced assessment of clinical reasoning

High candidate satisfaction with examination organization and relevance

Time constraints as a persistent challenge requiring attention

Appropriately qualified interpretations:

The authors appropriately acknowledge multiple potential explanations for score declines (format change, COVID-19 disruption, item difficulty)

Gender differences are interpreted cautiously given non-significant pass rate disparities

Limitations are transparently discussed

Minor considerations:

Causality: While the manuscript appropriately avoids strong causal claims, the discussion of score declines following KFP introduction could more explicitly acknowledge confounding (e.g., changes in candidate cohort characteristics, training program evolution, concurrent curricular changes).

Generalizability: The conclusions could be more explicit about the context-specific nature of findings (Turkish medical education system, pediatric endocrinology subspecialty) and what aspects might generalize to other contexts.

Predictive Validity: The manuscript acknowledges as a limitation the lack of data on downstream clinical performance, but this could be emphasized more strongly as it affects interpretation of "validity" claims.

Recommendations:

Add a brief statement acknowledging that observed relationships are associative rather than causal

Explicitly discuss which findings are likely context-specific vs. generalizable

Consider moderating claims about "validity" to focus on content and construct validity rather than broader validity claims without predictive validity data

Specific Technical Issues

Table 2: Examination Performance Metrics

Issue: The pass rate calculation denominator is unclear. Are candidates who were administratively ineligible or withdrew counted in the denominator?

Recommendation: Clarify in the table notes whether pass rates are calculated as passes/eligible examinees or passes/actual examinees

Table 3: Gender Comparison

Issue: The degrees of freedom for the MCQ comparison (df ≈ 101) suggest that unequal variances were assumed, but this is not explicitly stated

Recommendation: Specify whether equal or unequal variances were assumed and report Levene's test results if applicable
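The reasoning behind this inference can be sketched as follows: a pooled-variance t-test has df = n1 + n2 − 2, whereas Welch's test uses the Welch-Satterthwaite approximation, which is smaller when group sizes and variances differ. The function names and the input values below are hypothetical and serve only to show the contrast.

```python
import math


def pooled_df(n1, n2):
    """Degrees of freedom for Student's t-test with equal variances assumed."""
    return n1 + n2 - 2


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical group sizes and spreads (not from the manuscript).
t, df = welch_t(mean1=70.0, sd1=12.0, n1=60, mean2=66.0, sd2=8.0, n2=200)
print(pooled_df(60, 200), round(df, 1))
```

If all 261 eligible candidates entered the comparison, a pooled test would give df close to 259, so a reported df near 101 is consistent with a Welch-type correction. This is an inference from the reported value, not a confirmed detail of the authors' analysis.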

Table 4: Longitudinal Survey Data

Strength: Excellent comprehensive presentation

Minor issue: The progressive decline in time adequacy ratings is dramatic (7.67 in 2018 to 3.91 in 2025) but not explicitly highlighted or statistically tested

Recommendation: Consider a formal trend analysis or ANOVA across years for this critical item
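One simple form such a trend test could take is an ordinary least-squares fit of the yearly mean rating on examination year. The sketch below is an assumption about how this might be done, not the authors' method; `ols_trend` is a hypothetical helper, and only the two endpoint values quoted above (7.67 in 2018, 3.91 in 2025) come from the report.

```python
def ols_trend(years, means):
    """Least-squares slope, intercept, and R^2 for yearly mean ratings."""
    n = len(years)
    mx = sum(years) / n
    my = sum(means) / n
    sxx = sum((x - mx) ** 2 for x in years)
    sxy = sum((x - mx) * (y - my) for x, y in zip(years, means))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, means))
    ss_tot = sum((y - my) ** 2 for y in means)
    return slope, intercept, 1 - ss_res / ss_tot


# Average yearly change implied by the two reported endpoints alone.
avg_change_per_year = (3.91 - 7.67) / (2025 - 2018)
print(round(avg_change_per_year, 2))

# With the full per-year means from Table 4, the call would be:
# slope, intercept, r2 = ols_trend(years, time_adequacy_means)
```

The endpoints alone imply a drop of roughly 0.54 points per year. A regression on yearly means ignores within-year variability, so an ANOVA or a regression on individual responses would be the more complete test if respondent-level data are available.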

Figure Absence

Observation: The manuscript contains no figures

Recommendation: Visual representations would enhance accessibility:

Figure 1: Line graph showing mean scores, pass rates, and reliability coefficients over time

Figure 2: Box plots comparing MCQ vs. KFP scores, potentially stratified by gender

Figure 3: Bar chart of candidate satisfaction ratings across survey domains

Substantive Content Issues

1. KFP Implementation Fidelity

The manuscript states KFPs target "key decision points" but doesn't provide evidence that the implemented items actually focus on high-stakes, discriminating clinical decisions rather than routine information gathering. Given the initial low reliability (α = 0.31 in 2021), were blueprinting procedures adequate?

Recommendation: Add a brief description of the KFP development process, including how "key features" were identified and how items were validated to target these features.

2. Standard-Setting Defensibility

Using different standard-setting methods for different item types (Nedelsky for MCQ, Angoff for KFP) is reasonable, but the manuscript doesn't explain why these specific methods were chosen or whether combined standard-setting procedures were used when integrating the two components.

Recommendation: Add a brief justification for method selection and explain how component cut scores were combined into the overall pass standard.
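For readers less familiar with the two methods, the sketch below shows how component cut scores are conventionally derived in their textbook forms. It is an assumption about the general procedure, not a description of what the examination committee actually did; the function names and judge ratings are hypothetical.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """ratings[j][i]: judge j's estimated probability that a borderline
    candidate answers item i correctly. Cut score = mean of judges' sums."""
    return mean(sum(judge) for judge in ratings)


def nedelsky_cut_score(remaining_options):
    """remaining_options[j][i]: number of options that judge j believes a
    borderline candidate could NOT eliminate on item i. Each item contributes
    1 / remaining options; cut score = mean of judges' sums."""
    return mean(sum(1.0 / k for k in judge) for judge in remaining_options)


# Hypothetical ratings from two judges on three items (not from the manuscript).
kfp_cut = angoff_cut_score([[0.6, 0.5, 0.7], [0.5, 0.5, 0.8]])
mcq_cut = nedelsky_cut_score([[2, 4, 3], [2, 3, 3]])
print(round(kfp_cut, 2), round(mcq_cut, 2))
```

How the two component values are merged (for example, a simple sum in a compensatory model versus separate hurdles for each component) is exactly the detail the manuscript would need to state.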

3. Gender Analysis Interpretation

The finding that male candidates scored higher on MCQs and total scores (p<0.05) but not on KFPs or overall pass rates requires more nuanced interpretation. The manuscript briefly mentions "differential exposure" but doesn't explore potential mechanisms or implications.

Recommendation: Expand the discussion of gender differences, considering:

Whether MCQ vs. KFP performance patterns suggest format-related bias

Implications for assessment equity

Comparison to international literature on gender and assessment format

4. COVID-19 Impact

The manuscript attributes post-2021 score declines partly to pandemic disruptions but provides limited evidence. The 2021 exam was the first after cancellation, so reduced clinical exposure is plausible, but scores continued declining through 2025.

Recommendation: Either strengthen this interpretation with additional evidence (e.g., candidate survey data on clinical exposure, comparisons with other subspecialty exams in Turkey) or moderate the claim.

5. Reliability Progression

The dramatic improvement in KFP reliability from α = 0.31 (2021) to α = 0.85 (2024) suggests major changes in item quality, scoring procedures, or both. This deserves more explicit discussion.

Recommendation: Discuss what specific quality improvement initiatives led to enhanced KFP reliability. Were rater training procedures enhanced? Were poorly performing items systematically revised?
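Because yearly cohorts are small, point estimates of α can move considerably by chance alone. The sketch below computes Cronbach's α from a candidates-by-items score matrix and attaches a percentile bootstrap interval. It is one possible approach under stated assumptions (an analytic interval such as Feldt's would be an alternative); the function names and the score matrix are hypothetical.

```python
import random
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of candidates, each a list of item scores."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def bootstrap_alpha_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling candidates with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(n_boot):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(cronbach_alpha(sample))
        except ZeroDivisionError:
            continue  # resample had zero variance in total scores
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lower, upper


# Hypothetical 0/1 item scores for eight candidates on three items.
demo = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 1],
        [0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
print(round(cronbach_alpha(demo), 2), bootstrap_alpha_ci(demo, n_boot=500))
```

Reporting such intervals for each year would show whether the rise from 0.31 to 0.85 exceeds what sampling variability alone could produce.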

Minor Editorial Issues

Language and Clarity

The manuscript is generally well-written, but minor issues exist:

Page 3: "TSPED has played a central role in advancing pediatric endocrinology and diabetes care through the promotion of professional collaboration, standard-setting initiatives, and a broad array of educational activities—including conferences, workshops, and training programs—designed to enhance the competencies of healthcare professionals."

Issue: Slightly verbose

Suggestion: "TSPED has advanced pediatric endocrinology and diabetes care through professional collaboration, standard-setting, and educational programs including conferences, workshops, and training."

Page 4: "Omitted questions were scored as correct for all examinees."

Issue: Could be clearer about when this occurred

Suggestion: "Questions subsequently identified as flawed through candidate appeals were retroactively scored as correct for all examinees (2016: 4 items; 2017-2018, 2021: 2 items each; 2024-2025: 1 item each)."

Abbreviations: Each abbreviation should be spelled out at first use:

"PMPs" (Patient Management Problems) appears on page 3 without definition

Internal Consistency

Page 3: The manuscript states eligibility was "verified through documentation review" but doesn't specify what documentation (completion of residency? specific training requirements?)

Page 4: "seven or eight TBPEC members" – clarify why this varied

Ethical Considerations

The ethical approval and consent procedures are appropriate for this type of educational research. The justification for verbal rather than written consent is reasonable given the anonymous, minimal-risk nature of the survey. The use of existing examination data for quality improvement and research is appropriately covered by institutional ethics approval.

Recommendation: Consider explicitly stating whether candidates were informed during registration that de-identified examination data might be used for research purposes.

Overall Assessment and Recommendations for Revision

This manuscript presents valuable longitudinal data on subspecialty board examination evolution and provides useful insights for medical education stakeholders internationally. The core findings are sound and the conclusions are appropriately supported. However, several methodological details require clarification, and the statistical analysis could be strengthened with effect sizes and formal trend analysis.

Required Revisions (Essential for Scientific Soundness):

Expand Methods – KFP Scoring: Provide detailed description of scoring procedures, rater training, and reliability assessment for written responses

Expand Methods – Standard Setting: Provide operational details of standard-setting procedures including judge selection, number, calibration, and decision rules

Statistical Enhancement: Add effect sizes for key comparisons and consider formal trend analysis

Clarify Data Analysis: Specify handling of missing data and assumption checking for statistical tests

Recommended Revisions (Would Substantially Strengthen Manuscript):

Add Figures: Visual representation of trends, score distributions, and candidate feedback

Expand Discussion: More thorough interpretation of reliability improvements, gender differences, and COVID-19 impacts

Content Validity Evidence: Describe the examination blueprint and content coverage procedures

Supplementary Materials: Include survey instrument and additional methodological details

Minor Revisions (Would Improve Clarity):

Editorial refinements as noted above

Explicit discussion of generalizability and context-specific findings

More conservative framing of validity claims pending predictive validity studies

Conclusion

Recommendation: Revise (Minor Revision)

This manuscript makes a solid contribution to the medical education literature by providing transparent, longitudinal data on subspecialty board examination evolution. The integration of KFPs represents an important pedagogical advancement, and the psychometric data demonstrate continuous quality improvement. With clarification of scoring procedures, enhanced methodological detail, and strengthened statistical analysis, this work will serve as a valuable reference for other medical education systems implementing or refining board certification processes.

The study is fundamentally sound, the data are robust, and the conclusions are appropriate. The required revisions are primarily matters of methodological transparency rather than fundamental scientific concerns. I recommend acceptance following minor revision to address the methodological clarifications outlined above.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Pediatric board exam performance research

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.