Keywords
creative thinking skills, performance-based assessment, science education
Creative thinking is a core competence in science education for addressing complex environmental, technological, and societal challenges. However, students’ creative thinking performance remains insufficient, highlighting the need for a valid and reliable performance-based assessment instrument. This study aimed to develop and validate the Creative Thinking Performance Test (CTPT) to measure four dimensions of creative thinking in science learning: sensitivity, flexibility, novelty, and elaboration.
The CTPT was developed through several stages: blueprint construction based on four creative thinking dimensions, essay item development, expert validation using the Delphi technique, pilot testing, and psychometric evaluation. The instrument was administered to 138 elementary school teacher education students with diverse demographic and academic characteristics. Data were analyzed using Exploratory Factor Analysis (EFA), Confirmatory Factor Analysis (CFA), and Rasch modeling to examine construct validity, reliability, and item characteristics.
The findings revealed variations in students’ creative thinking skills based on demographic and academic factors. EFA and CFA supported the four-dimensional structure of the instrument. Rasch analysis confirmed good item fit, appropriate difficulty levels, and satisfactory reliability indices.
The study introduces the CTPT as a valid, reliable, and contextually relevant performance-based assessment tool for science education. The instrument complements conventional tests used by lecturers and provides practical support for assessing and enhancing students’ creative thinking skills in natural science learning contexts.
Creative thinking skills are among the most important skills in education, especially in science learning. They enable students to generate new ideas and to find alternative solutions when dealing with complex scientific problems (Da’as, 2023; Han & Abdrahim, 2023). Various national and international curricula explicitly emphasize the development of creative thinking skills so that students can adapt to scientific and technological change (Chang et al., 2022; David, 2023; Dilekçi & Karatay, 2023; Shively et al., 2018). However, studies suggest that students’ creative thinking performance remains below expectations, especially in science contexts that demand divergent and convergent thinking simultaneously (Affandy et al., 2024; Bulut Ates & Aktamis, 2024; Hong & Song, 2020). This situation underscores the need for accurate, valid, and reliable evaluation instruments to measure students’ creative thinking skills (Priyaadharshini & Vinayaga Sundaram, 2018; Ross et al., 2023; Shively et al., 2018). Appropriate evaluation gives educators a basis for designing effective learning strategies, so that creative thinking skills can develop in line with current demands.
Creative thinking has long been a focus of study for psychologists and educationists. Guilford (1950) defined creative thinking as generating various possible answers to a problem, emphasizing aspects of divergent thinking. Torrance et al. (1992) then developed the definition through the Torrance Test of Creative Thinking (TTCT), which emphasizes four indicators: fluency, flexibility, originality, and elaboration. Jia et al. (2017) emphasized that creative thinking is related to both cognitive potential and an individual’s real performance in solving problems. The measurement of creative thinking performance therefore requires a test instrument that is valid, reliable, and performance-based (Rhee et al., 2025; Shahbazloo & Abdullah Mirzaie, 2023). Assessing students’ declarative knowledge is not enough; the measurement must reflect the ability to generate, develop, and apply ideas in real contexts (Oo et al., 2024; Pontis & Salerno, 2025).
Numerous instruments have been developed to measure creative thinking skills, such as the Torrance Test of Creative Thinking (Torrance et al., 1992), the Runco Ideational Behavior Scale (Runco et al., 2001), and Guilford’s Alternative Uses Task (Guilford, 1950). These instruments are widely used in psychology and education research, but they are mostly self-reported or generalized tests that assess creativity globally (Rhee et al., 2025; Shahbazloo & Abdullah Mirzaie, 2023). This makes them poorly suited to the science curriculum, especially in Indonesian education, which demands assessment based on students’ real performance in solving scientific problems. Moreover, existing instruments tend to emphasize general aspects of creativity without linking them directly to creative thinking performance in science learning (Runco et al., 2001; Shahbazloo & Abdullah Mirzaie, 2023). Measurements that are accurate, valid, and relevant to the needs of the science curriculum therefore remain limited, and more contextual, performance-based test instruments are needed.
The limitations of previous research are even clearer when viewed from the two main types of assessment used to measure creative thinking skills: self-reported and performance-based assessments. Self-reported assessments are relatively easier to implement, but they often produce bias because students overestimate or underestimate their creative abilities (Tep et al., 2021; Xu et al., 2025). Performance-based assessments, on the other hand, are more accurate because they assess students’ real ability to generate creative ideas or solutions (Lebuda et al., 2024; Patterson et al., 2024). However, the development and validation of performance-based instruments is still rare, especially with modern psychometric approaches that ensure instrument validity and reliability (Patterson et al., 2025; Shahzad et al., 2025). As far as the researchers know, instruments that specifically evaluate students’ creative thinking performance in science learning, and that are supported by strong psychometric evidence of validity and reliability, are still very limited. There is therefore an important research gap to be filled by developing new instruments that are more contextualized and standardized.
Responding to this gap, the present study develops and validates the Creative Thinking Performance Test (CTPT) as a performance task-based instrument to evaluate students’ creative thinking skills in the context of science learning (Hasibuan et al., 2025). In contrast to self-reported instruments, the CTPT assesses students’ real performance in generating creative ideas. The validation process followed modern psychometric standards, including the application of the Rasch model and reference to the instrument evaluation standards issued by the American Educational Research Association (AERA) and the American Psychological Association (APA), to support the reliability and validity of the instrument. The main contribution of the research is a valid and reliable instrument, aligned with the needs of the science curriculum, for measuring creative thinking skills more accurately. In practical terms, the research provides an evaluation tool that lecturers and researchers can use to assess students’ creative thinking skills and to design learning strategies that enhance them.
Based on this background, the current research contributes by developing and validating a new instrument, the Creative Thinking Performance Test (CTPT), which is designed to evaluate students’ creative thinking skills in the context of science learning. The instrument is used to measure four dimensions of creativity (fluency, flexibility, originality, and elaboration), which refer to the theories of creativity of Guilford (1950) and Torrance et al. (1992), while following modern psychometric measurement standards as advocated by AERA and APA. To strengthen its contribution, this study was formulated into three research questions. RQ1: To what extent does the CTPT demonstrate structural validity and reliability based on Rasch model analysis? RQ2: To what extent does the CTPT have external validity in predicting student performance on science problem-solving tasks? RQ3: What is the ability of students to generate adaptive and innovative scientific solutions, and what are the implications for their creative thinking skills?
The Creative Thinking Performance Test (CTPT) instrument was developed by following standardized test development procedures as set out in the standards of the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). The development process included: (1) blueprint development based on the constructs of creative thinking skills relevant to science learning, (2) preparation of initial items according to the blueprint, and (3) assessment of face and content validity through expert review and initial trials. Item analysis was conducted using Rasch measurement, considering item difficulty and discrimination indices. The structural validity and reliability of the CTPT (RQ1) were tested using the Rasch model to examine item fit to the model as well as the internal consistency of the instrument. The external validity of the CTPT (RQ2) was evaluated by analyzing the instrument’s ability to predict students’ creative thinking performance on an open-ended problem-solving task, compared with a self-report-based creative thinking assessment instrument (Hasibuan et al., 2025).
The sample was selected using purposive sampling, a technique based on certain criteria relevant to the research objectives (Creswell, 2012). Purposive sampling was chosen because it allows researchers to obtain participants per the research context, such as academic background, university of origin, and semester level, so the data obtained can be more in-depth and representative according to research needs (Creswell, 2012). The study involved 138 students as participants.
By gender (see Table 1), there were 14 male students (10.14%) and 124 female students (89.86%). Based on university affiliation, 63 students came from University A (45.65%) and 75 from University B (54.35%). Regarding the semester, 46 students were in semester 2 (33.33%), 47 students in semester 4 (34.06%), and 45 students in semester 6 (32.61%). The distribution of domicile location was almost balanced, with 65 students from urban areas (47.10%) and 73 students from rural areas (52.90%). Regarding participation in campus organizations, 60 students were actively involved (43.48%), while 78 students did not participate in organizations (56.52%). Based on academic achievement, most students had a GPA of 3.1–4.0 (101 students, 73.19%), while 37 students (26.81%) were in the GPA range of 2.1–3.0, and no one had a GPA below 2.0.
The literature review identified four dimensions of creative thinking skills that are relevant to explore in the context of science learning (Batlolona et al., 2019; Haim & Aschauer, 2024; Kholid et al., 2024; Suradika et al., 2023; Torrance et al., 1992). The study refers to Torrance’s classic framework to formulate an adequate construct, strengthening it with the results of recent research emphasizing the context of science education. The four dimensions (see Table 2) are positioned as the main factors: sensitivity, the ability of students to detect problems and generate adaptive ideas; flexibility, the skill of generating varied ideas from various perspectives and categories; novelty, the ability to create unique and original ideas that offer new solutions; and elaboration, the ability to expand and develop ideas in greater detail and quality. The definitions of the four dimensions form the basis for the operationalization of the instrument, in which each dimension is derived into indicators and items that represent students’ creative thinking performance in science.
| Dimensions | Definition/description | Sources |
|---|---|---|
| Sensitivity | Responsive in generating adaptive ideas to solve problems. | Ernawati et al., (2023) |
| Flexibility | Generate ideas that vary from multiple perspectives and categories. | Haim & Aschauer, (2024); Nasution et al., (2023); Batlolona et al., (2019); Torrance et al., (1992) |
| Novelty | Devise unique ideas that provide new solutions to problems. | Haim & Aschauer, (2024); Nasution et al., (2023); Batlolona et al., (2019); Torrance et al., (1992) |
| Elaboration | Developing an idea to be more comprehensive, thus improving the quality of the idea. | Haim & Aschauer, (2024); Nasution et al., (2023); Batlolona et al., (2019); Torrance et al., (1992) |
The blueprint of the creative thinking skills instrument was developed based on four main dimensions determined through theory synthesis: (1) Sensitivity, (2) Flexibility, (3) Novelty, and (4) Elaboration. The research referred to a variety of sources to identify relevant concepts for each dimension, including classic literature on creativity studies (e.g., Torrance et al. (1992)), recent academic publications in reputable journals (e.g., Batlolona (2020)), and research reports focusing on the context of science education (e.g., Ernawati et al. (2023)). The decision to combine classic and contemporary sources was intended to ensure that the instrument’s construction was grounded in basic theories of creativity and relevant to recent developments in science learning. The blueprint development process was carried out in three stages: first, researchers independently reviewed relevant documents and articles to record indicators of creative thinking skills in each dimension; second, the results of the review were compiled and grouped so that more concise and representative indicators were obtained; and third, the compiled indicators were validated by a panel of experts in science education and instrument development to ensure the representativeness of the concept and suitability for the research context.
Constructing the items began by aligning each item with the concept of creative thinking skills formulated in the blueprint (see Table 2). The alignment was done to ensure coverage of the four dimensions: sensitivity, flexibility, novelty, and elaboration. Two items represented each dimension, so there were eight essay items. The essay format was chosen because it allows students to express ideas freely, display originality, and provide a more in-depth description of the cognitive processes underlying creative thinking skills in science (Kartini et al., 2021; Mafinejad et al., 2017). The design stage developed each item to encourage students to provide argumentative, detailed, and contextual answers according to the science problems presented. The initial draft of the items underwent several revisions to enhance clarity of wording, appropriateness of context, and level of cognitive demands. Two researchers independently reviewed each item to minimize ambiguity and ensure alignment with indicators in each dimension. Furthermore, an expert panel consisting of science education and educational assessment experts reviewed each item to provide input related to clarity, relevance, and conformity to theoretical constructs. Joint discussions were held until consensus was reached on the necessary modifications.
The scoring guidelines in this instrument were developed based on a performance rubric approach that refers to Torrance et al.’s (1992) creativity theory as well as developments in current research in science education (e.g., Batlolona (2020)). The theory emphasizes that creative thinking skills are reflected not only in the number of ideas but also in the quality of the ideas produced, including relevance, realism, uniqueness, and level of elaboration. Therefore, scoring was done in the range of 0–4, with the criteria: score 4 for ideas that are relevant, realistic, contextual, and expressed clearly and completely; score 3 for ideas that are adaptive, realistic, and contextual, but less clear or less complete; score 2 for ideas that are adaptive but less realistic; score 1 for ideas that are not adaptive to the problem; and score 0 if no ideas are given.
Validation of the instrument items in the current study was conducted using the Delphi technique, a method that involves a panel of experts to obtain a consensus of judgment through a systematic and structured process (Linstone et al., 2002). Delphi validation is needed to ensure that the instrument items are not only theoretically valid but also in accordance with the substantive context being measured, thus increasing the instrument’s content validity (Aiken, 1985). The validation results showed that all items obtained an Aiken’s V index between 0.93 and 0.96, against a critical (table) V value of 0.74. Because all index values exceeded the critical value, each item was declared substantially valid (Aiken, 1980). Therefore, the eight essay items met the criteria of content suitability based on expert consensus and were suitable for the next stage of instrument testing.
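As an illustration of how such an index is obtained, the sketch below computes Aiken’s V for a single item as V = Σ(r − lo) / [n(hi − lo)], where r is a rater’s score, lo and hi are the ends of the rating scale, and n is the number of raters. The panel size (seven raters), the 1–5 rating scale, and the ratings themselves are hypothetical, since they are not reported here; only the critical value of 0.74 is taken from the text.

```python
def aiken_v(ratings, lo, hi):
    """Aiken's V for one item: sum(r - lo) / (n * (hi - lo))."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < lo or r > hi for r in ratings):
        raise ValueError("ratings must lie within [lo, hi]")
    return sum(r - lo for r in ratings) / (len(ratings) * (hi - lo))


CRITICAL_V = 0.74  # critical value reported in the text

# Hypothetical ratings from seven experts on a 1-5 scale (not study data).
hypothetical_ratings = [5, 5, 5, 4, 5, 5, 5]

v = aiken_v(hypothetical_ratings, lo=1, hi=5)
item_is_valid = v > CRITICAL_V
print(f"V = {v:.2f}, exceeds critical value: {item_is_valid}")
```

With these illustrative ratings the index equals 27/28, which lies in the same region as the values reported for the CTPT items.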
Conducting a pilot test is an important stage before the instrument is widely used, because it checks that respondents can understand the items, that the wording is clear, and that the items represent the abilities to be measured. Patel & Patel (2019) state that pilot testing helps researchers identify instrument weaknesses in terms of language, substance, and technicality, while Creswell (2012) emphasizes the role of pilot tests in providing an initial overview of instrument reliability and validity. The pilot test conducted on 35 students showed that all items could be answered well and did not cause significant confusion. However, respondents suggested that the wording of certain items be simplified to make them clearer. The results suggest that the instrument is generally feasible to use, but required minor linguistic revisions before the main study.
Data analysis was conducted through several stages, starting with Exploratory Factor Analysis (EFA) to explore the factor structure, considering factor loading ≥0.40 as the minimum limit (Hair et al., 2019). Furthermore, Confirmatory Factor Analysis (CFA) was used to test the model fit, with cut-off criteria such as CFI and TLI ≥ 0.90 (Kline, 2015), RMSEA ≤0.08, and χ2/df ≤ 3 (Lin & Tsau, 2013). Discriminant validity was tested using the Fornell-Larcker approach, where the AVE value must be greater than the squared correlation between constructs (Kline, 2015), while criterion validity was determined from the presence of significant correlations with relevant external measures (Shahzad et al., 2025). The Rasch Measurement Model was used to check item quality with the criteria of item fit on infit and outfit MNSQ 0.5–1.5 (Andrich & Marais, 2019), item and person reliability ≥0.70 (Cronbach, 1951), and person-item distribution analysis to see the balance of item difficulty levels with respondents’ abilities.
According to the results of descriptive analysis, there are variations in creative thinking skills scores in the participant categories (see Table 3). Gender-wise, female students (N = 124) exhibited higher scores than males (N = 14), for example in sensitivity (M = 64.97, SD = 11.91 vs M = 56.69, SD = 15.12) and flexibility (M = 67.78, SD = 10.45 vs M = 60.13, SD = 8.73). By university, students from University A (N = 63) scored higher on flexibility (M = 65.16, SD = 8.49) than University B (M = 54.33, SD = 13.10), while University B was superior on elaboration (M = 65.82, SD = 14.88). Based on semester, semester 2 students (N = 46) stood out in sensitivity (M = 71.55, SD = 9.64) and novelty (M = 61.99, SD = 9.37), while semester 6 students (N = 45) were relatively higher in flexibility (M = 63.54, SD = 12.07). Within the location category, urban students (N = 65) scored better on almost all aspects, such as sensitivity (M = 63.79, SD = 10.48) and elaboration (M = 59.93, SD = 10.15), compared to rural students (M = 62.04, SD = 11.65; M = 57.24, SD = 9.44). Involvement in campus organizations also has an effect, where students who are active in organizations (N = 60) are higher in flexibility (M = 66.23, SD = 10.12) than those who are not active (M = 60.47, SD = 11.28). Lastly, based on GPA, the 3.1–4.0 group (N = 101) performed better on novelty (M = 55.87, SD = 10.53) and elaboration (M = 59.84, SD = 9.91) than the 2.1–3.0 GPA group (M = 52.75, SD = 9.80; M = 57.36, SD = 10.44). The findings suggest that demographic and academic factors have different roles in influencing variations in students’ creative thinking skills.
The Exploratory Factor Analysis (EFA) results suggested that the instrument was suitable for further analysis and consistent with the theoretical construct. The KMO value of 0.821 indicated a good level of sampling adequacy, while Bartlett’s Test of Sphericity yielded Chi-Square = 459.593, df = 28, and p < 0.001, indicating that the correlations among items were significantly different from zero and the data met the assumptions for factor analysis (see Table 4 and Figure 1). According to the extraction results, four main factors with eigenvalues greater than 1 cumulatively explained 82.24% of the total variance. After rotation, the first factor explained 28.77%, the second factor 25.40%, the third factor 14.96%, and the fourth factor 13.11% of the overall variance. The component matrix shows that each instrument item has a factor loading above 0.70 on its respective factor, which indicates the consistency and representativeness of the item. Consistent with the theoretical framework, the first factor is interpreted as Sensitivity (Item_1 and Item_2), the second factor as Flexibility (Item_3 and Item_4), the third factor as Novelty (Item_5 and Item_6), and the fourth factor as Elaboration (Item_7 and Item_8). The EFA results therefore suggest that the empirical structure of the instrument supports the four theoretically established dimensions of creative thinking skills, and that the instrument has good initial construct validity.
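Two of the figures above can be checked by simple arithmetic. The sketch below, which uses only values reported in the text, confirms that the degrees of freedom of Bartlett’s test for eight items equal p(p − 1)/2 = 28 and that the four rotated variance shares add up to the reported cumulative 82.24%. The variable names are illustrative.

```python
n_items = 8

# Bartlett's test compares the correlation matrix with an identity matrix,
# so its degrees of freedom equal the number of unique off-diagonal entries.
bartlett_df = n_items * (n_items - 1) // 2

# Percentage of variance explained by each factor after rotation (from the text).
rotated_variance = {
    "Sensitivity": 28.77,
    "Flexibility": 25.40,
    "Novelty": 14.96,
    "Elaboration": 13.11,
}
cumulative_variance = round(sum(rotated_variance.values()), 2)

print(bartlett_df, cumulative_variance)
```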
Confirmatory Factor Analysis (CFA) results (see Table 5) suggest that the four-dimensional model of creative thinking skills has an excellent fit with the data. The value of Chi-Square/df = 1.87 is below the threshold of 3.0, which indicates a good fit. Other indices also supported the model fit, including RMSEA = 0.056 (< 0.08), SRMR = 0.041 (< 0.08), CFI = 0.954 (> 0.90), TLI = 0.942 (> 0.90), NFI = 0.918 (> 0.90), GFI = 0.931 (> 0.90), and AGFI = 0.905 (> 0.90). Each index is within the recommended cut-off criteria, indicating that the four-dimensional construct model (Sensitivity, Flexibility, Novelty, and Elaboration) is empirically consistent with the established theoretical structure. Therefore, the CFA results confirmed that the instrument has good construct validity and can be used to measure students’ creative thinking performance reliably.
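The comparison of each index with its cut-off can be written as a small screening routine. The sketch below uses the reported CFA values and the cut-offs stated in this paper; the dictionary layout and names are illustrative and are not part of the original analysis.

```python
import operator

# name: (reported value, comparison, cut-off), all taken from the text
fit_indices = {
    "chi2/df": (1.87, operator.le, 3.0),
    "RMSEA": (0.056, operator.le, 0.08),
    "SRMR": (0.041, operator.lt, 0.08),
    "CFI": (0.954, operator.ge, 0.90),
    "TLI": (0.942, operator.ge, 0.90),
    "NFI": (0.918, operator.gt, 0.90),
    "GFI": (0.931, operator.gt, 0.90),
    "AGFI": (0.905, operator.gt, 0.90),
}

# True where the reported value satisfies its cut-off.
fit_results = {
    name: compare(value, cutoff)
    for name, (value, compare, cutoff) in fit_indices.items()
}
all_indices_pass = all(fit_results.values())

for name, passed in fit_results.items():
    print(f"{name:8s} {'meets' if passed else 'fails'} its cut-off")
```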
The results of the discriminant validity analysis using the Fornell-Larcker criterion show that each dimension of creative thinking skills has good discrimination ability. The Average Variance Extracted (AVE) value is 0.78 for Sensitivity, 0.81 for Flexibility, 0.76 for Novelty, and 0.79 for Elaboration (see Table 6). The AVE values are greater than the squared correlations between factors; for example, the reported correlation between Sensitivity and Flexibility is 0.54, between Sensitivity and Novelty 0.49, and between Sensitivity and Elaboration 0.52. This indicates that each dimension explains more of its own item variance than the variance it shares with other dimensions, so that each construct can be distinguished theoretically and empirically. Therefore, the instrument has sufficient discriminant validity, and the four dimensions can be treated as distinct yet conceptually related constructs.
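The Fornell-Larcker comparison can be sketched as follows, using the AVE values and the three Sensitivity correlations reported above. Because it is not fully clear whether the reported inter-factor values are correlations or already squared correlations, the hypothetical helper `fornell_larcker` checks both readings; the remaining factor pairs are omitted because their values are not given in the text.

```python
ave = {
    "Sensitivity": 0.78,
    "Flexibility": 0.81,
    "Novelty": 0.76,
    "Elaboration": 0.79,
}

# Inter-factor values reported in the text (Sensitivity pairs only).
reported = {
    ("Sensitivity", "Flexibility"): 0.54,
    ("Sensitivity", "Novelty"): 0.49,
    ("Sensitivity", "Elaboration"): 0.52,
}


def fornell_larcker(ave, values, already_squared):
    """Return True per pair when both AVEs exceed the shared variance."""
    outcome = {}
    for (first, second), value in values.items():
        shared = value if already_squared else value ** 2
        outcome[(first, second)] = min(ave[first], ave[second]) > shared
    return outcome


as_correlations = fornell_larcker(ave, reported, already_squared=False)
as_squared = fornell_larcker(ave, reported, already_squared=True)
print(all(as_correlations.values()), all(as_squared.values()))
```

Under either reading, the criterion holds for the three reported pairs.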
Criterion validity results suggest significant relationships between several demographic variables and the dimensions of students’ creative thinking skills (see Table 7). The location variable had the most consistent and significant effect on all dimensions, with β = 0.28 (p = 0.004) on sensitivity, β = 0.25 (p = 0.007) on flexibility, β = 0.22 (p = 0.012) on novelty, and β = 0.24 (p = 0.008) on elaboration, indicating that students from urban areas tend to have higher creative thinking scores than students from rural areas. The semester variable also significantly affected all dimensions, for example, β = 0.21 (p = 0.022) on sensitivity and β = 0.20 (p = 0.029) on elaboration, indicating that increasing academic experience with each semester contributes to creative thinking ability. Moreover, GPA displayed significant effects on flexibility (β = 0.20, p = 0.034), novelty (β = 0.19, p = 0.039), and elaboration (β = 0.18, p = 0.041), indicating a positive relationship between academic achievement and the quality of ideas generated by students. The variables gender, university, and campus organisation participation displayed more limited effects, with some p-values only slightly below the 0.05 threshold (e.g., gender on sensitivity, β = 0.18, p = 0.041; university on flexibility, β = 0.17, p = 0.038). Overall, the results indicate that the instrument can reflect differences in creative thinking ability related to students’ demographic and academic characteristics, thus supporting the criterion validity of the instrument.
Rasch measurement results demonstrate that the creative thinking skills instrument has good measurement quality at the person and item levels. The mean person raw score was 20.2 with a standard deviation of 4.6 and a score range of 6 to 30, and the mean person measure was 0.98 with a standard error of 0.55 (see Figure 2). The infit and outfit MNSQ values averaged 1.00 each, indicating a good fit of the data to the Rasch model. The person reliability of 0.84 and separation of 2.27 indicated that the instrument could adequately differentiate levels of students’ creative thinking skills. The raw score-to-measure correlation reached 0.98, confirming the consistency of measurement.
Among the items, the mean raw score was 352.4 with a standard deviation of 31.2, and the mean measure was 0.00 with a standard error of 0.13 (see Figure 3). The item measures ranged from −0.90 to 0.70, and the infit and outfit MNSQ values averaged 1.00 each, indicating that the items as a whole fit the Rasch model. Item reliability of 0.94 and separation of 4.03 indicated that the instrument could distinguish the difficulty level of each item well. Overall, the Rasch results indicate that this eight-item essay instrument is internally valid, reliable, and has an adequate balance between item difficulty and student ability, making it feasible to measure creative thinking performance in the target population.
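The reported separation and reliability indices are linked by the standard Rasch relation R = G² / (1 + G²), where G is the separation index. The sketch below applies this relation to the reported separations and also computes the number of statistically distinct strata, H = (4G + 1) / 3. The strata values are derived here for illustration and are not reported in the study.

```python
def reliability_from_separation(separation):
    """Reliability implied by a Rasch separation index: G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


def strata(separation):
    """Number of statistically distinct levels: (4G + 1) / 3."""
    return (4 * separation + 1) / 3


person_separation = 2.27  # reported in the text
item_separation = 4.03    # reported in the text

person_reliability = reliability_from_separation(person_separation)
item_reliability = reliability_from_separation(item_separation)

print(f"person: R = {person_reliability:.2f}, strata = {strata(person_separation):.2f}")
print(f"item:   R = {item_reliability:.2f}, strata = {strata(item_separation):.2f}")
```

Rounded to two decimals, the implied reliabilities agree with the reported person reliability of 0.84 and item reliability of 0.94.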
The results of the instrument’s internal consistency indicate a good reliability level in measuring students’ creative thinking skills. The correlation between the raw score and the measure reached 0.98, indicating a strong relationship between the students’ scores and the measured construct. In addition, the Cronbach’s Alpha (KR-20) value of 0.84 showed high internal reliability, indicating that the eight essay items have sufficient internal consistency and can be trusted to assess overall creative thinking performance. These results support using the instrument in the main study, as it provided stable and consistent measurements across participants.
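For reference, Cronbach’s alpha is computed from the item variances and the variance of the total score as α = [k / (k − 1)] · (1 − Σ s²ᵢ / s²ₜ), where k is the number of items; KR-20 is the special case of this formula for dichotomous items. The sketch below is a minimal implementation, and the 0–4 scores in it are hypothetical rather than the study data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondent rows (one score per item)."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item column and of the row totals.
    item_variances = [variance(row[i] for row in scores) for i in range(n_items)]
    total_variance = variance(sum(row) for row in scores)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five students, eight essay items scored 0-4.
hypothetical_scores = [
    [3, 3, 2, 3, 2, 2, 3, 2],
    [4, 3, 3, 4, 3, 3, 4, 3],
    [2, 2, 2, 1, 1, 2, 2, 1],
    [3, 4, 3, 3, 2, 3, 3, 3],
    [1, 2, 1, 2, 1, 1, 2, 1],
]
alpha = cronbach_alpha(hypothetical_scores)
print(f"alpha = {alpha:.2f}")
```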
The analysis results of item 1.1SS (see Figure 4), which measures students’ ability to formulate adaptive solutions based on local potential related to the electrical energy crisis in eastern Indonesia, show that scores are concentrated in the medium to high category. Seventy-one students (51%) scored 3 with an average ability of +1.15 logit, indicating their ability to generate adaptive and contextual ideas is quite good. Thirty-five students (25%) were at a score of 2 with an ability of +0.40 logits, indicating a basic understanding but still need to develop ideas to be more realistic. Twenty-two students (16%) achieved the maximum score of 4 with +3.29 logit, demonstrating full mastery in designing creative and contextual solutions according to local potential. However, only a small number of students, namely 6 people (4%) at score 1 with −1.44 logit ability and 4 people (3%) at score 0 with −1.55 logit ability, failed to show adequate understanding of the concept of alternative energy and utilisation of local resources.
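The score distribution for item 1.1SS can be cross-checked against the sample size. The sketch below takes the reported counts per score category, confirms that they sum to the 138 participants, and recomputes the rounded percentages. The mean item score is derived here for illustration and is not reported in the study.

```python
# Number of students per score category for item 1.1SS (from the text).
score_counts = {0: 4, 1: 6, 2: 35, 3: 71, 4: 22}

n_students = sum(score_counts.values())
percentages = {
    score: round(100 * count / n_students)
    for score, count in score_counts.items()
}
mean_score = sum(score * count for score, count in score_counts.items()) / n_students

print(n_students, percentages, round(mean_score, 2))
```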
The Item Characteristic Curve (ICC) visually demonstrates that the expectation curve of the Rasch model (red line) is in line with the empirical data pattern (black-blue dots), with most of the dots falling within the 95% confidence interval. A good fit of the model is indicated, although there is a slight deviation in the low to medium ability range (around −2 to 0 logits). Therefore, item 1.1SS proved empirically valid, has sufficient discrimination power, and effectively assesses students’ ability to design adaptive solutions based on science and local potential, distinguishing low, medium, and high ability students.
Following the analysis of students’ ability to formulate local potential-based adaptive solutions related to the energy crisis in item 1.1SS, the next step is to evaluate their ability to design more specific innovative solutions in the context of science and technology. Item 5.5NY emphasises the application of creative thinking skills to produce innovative ideas in the form of simple tools that can convert kitchen waste into energy, so that it can illustrate the extent to which students can integrate the concepts of science, creativity, and local contexts in a more practical and applicable manner.
The analysis results of item 5.5NY (see Figure 5), which measures students’ ability to design innovative ideas for simple tools to convert kitchen waste into energy, indicate that most students are in the middle ability category. A total of 45% of students obtained a score of 2 with an average ability of 0.93 logits, indicating an initial ability to generate adaptive ideas, but not fully realistic or detailed. Meanwhile, 34% of students obtained a score of 3 with an average ability of 1.50 logits, reflecting a more mature ability to develop contextualised innovative solutions to household organic waste problems. Only 7% of students achieved the maximum score of 4 with an ability of 4.75 logits, showing full mastery in designing creative, functional and science-based tools. Students with the lowest score of 0 were only 1%, indicating a small proportion who could not generate ideas related to renewable energy from waste.
The pattern is in line with the Category Probability Curve (CPC), where category 2 has the highest probability at abilities of around 0–1 logits, category 3 is dominant at 1–3 logits, and category 4 only becomes most probable above 3 logits, consistent with the low proportion of students who reach the maximum score. The Item Characteristic Curve (ICC) also indicates that students’ expected scores follow the Rasch model well, although there is a slight deviation in the middle to high ability range (1–3 logits). Therefore, item 5.5NY can be considered valid and reliable, and effective in assessing students’ ability to integrate creativity, science, and local context to produce innovative energy-based solutions from household waste. However, its capacity to discriminate among highly proficient students (>4 logits) is still limited.
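The shape of a category probability curve can be reproduced with a polytomous Rasch model. The sketch below assumes a partial credit formulation, in which the probability of score k is proportional to exp(Σⱼ≤ₖ (θ − δⱼ)); the study does not state which polytomous model was used, and the threshold values are hypothetical, chosen only so that the modal categories match the ranges described above. They are not estimates from the data.

```python
import math


def category_probabilities(theta, thresholds):
    """Probabilities of scores 0..m for ability theta under a partial credit model."""
    exponents = [0.0]  # score 0 has an empty sum
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))
    largest = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def modal_category(theta, thresholds):
    """Score category with the highest probability at ability theta."""
    probabilities = category_probabilities(theta, thresholds)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical step thresholds in logits (illustrative only).
hypothetical_thresholds = [-1.5, 0.0, 1.0, 3.0]

for theta in (0.5, 2.0, 4.0):
    print(theta, modal_category(theta, hypothetical_thresholds))
```

With ordered thresholds, the most probable category at a given ability equals the number of thresholds lying below that ability, which is why category 4 only dominates beyond the last threshold.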
The person-item map results demonstrate the distribution of respondents’ ability and item difficulty on a single logit scale. The person part (above) shows that the majority of respondents are distributed around logit 0 to +2, with a peak frequency of about 20–22 respondents at logit 0 (green colour) and about 20 respondents at logit +2 (red colour). It indicates that most of the respondents have moderate to above-average ability. There are still some respondents with low ability, indicated by about 2–3 respondents at logits −3 to −4, but the number is relatively small compared to the moderate ability group.
The items (below) are all clustered around logits 0 to +1, with no items that are either extremely difficult (logits > +2) or extremely easy (logits < −2) (see Figure 6). The items tended to be of medium difficulty, so they were reasonably well balanced with the average ability of the participants. Taken together, the distribution shows that the instrument is adequate in measuring the skills of respondents with moderate to high ability. However, it is less able to distinguish between respondents with very low or very high ability due to the limited variation in item difficulty.
Analysis of the creative thinking skills subscale on Renewable Energy indicated that the four dimensions of the instrument had varying measures with low standard errors, indicating stable and reliable estimates. The Sensitivity dimension has a measure of −0.515 with a standard error of 0.14, INFIT MNSQ of 0.875 (ZSTD −1), OUTFIT MNSQ of 0.87 (ZSTD −1.1), and a point-measure correlation of 0.695 (see Table 8), indicating that students are quite sensitive in identifying science problems related to renewable energy and can generate ideas that are adaptive and relevant to the local context. The Flexibility dimension (measure −0.465; SE 0.135; INFIT MNSQ 1.23; OUTFIT MNSQ 1.25; point-measure correlation 0.615) reflects students’ ability to generate diverse ideas from various perspectives, for example, considering various alternative energy sources or innovative ways to convert waste into energy, thus demonstrating divergent thinking skills that are important in problem-based science learning.
The Novelty dimension (measure 0.42; SE 0.13; INFIT MNSQ 0.86; OUTFIT MNSQ 0.845; point-measure correlation 0.72) emphasises students’ ability to create original and unique ideas, such as designing a simple device to convert kitchen waste into energy, reflecting scientific creativity in the context of science experiments. The Elaboration dimension (measure 0.56; SE 0.13; INFIT MNSQ 1.03; OUTFIT MNSQ 1.035; point-measure correlation 0.67) indicates students’ ability to develop ideas in detail and systematically, for example, designing a complete renewable energy utilisation procedure, tool, or strategy, which is relevant to critical and analytical thinking competencies in science learning. Overall, with INFIT and OUTFIT MNSQ values within the range of 0.5–1.5 and point-measure correlations above 0.6 for all subscales, the instrument is effective in differentiating students’ ability to think creatively and apply science concepts on the topic of renewable energy.
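The screening rule applied to the subscales can be written out as a short check. The sketch below uses the subscale values reported in the text and the two criteria stated there (MNSQ within 0.5–1.5, point-measure correlation above 0.6); the class and function names are illustrative and not part of the original analysis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubscaleFit:
    name: str
    measure: float       # subscale measure in logits
    se: float            # standard error of the measure
    infit_mnsq: float
    outfit_mnsq: float
    pt_measure: float    # point-measure correlation

# Values transcribed from the subscale results reported in the text.
SUBSCALES = [
    SubscaleFit("Sensitivity", -0.515, 0.14, 0.875, 0.87, 0.695),
    SubscaleFit("Flexibility", -0.465, 0.135, 1.23, 1.25, 0.615),
    SubscaleFit("Novelty", 0.42, 0.13, 0.86, 0.845, 0.72),
    SubscaleFit("Elaboration", 0.56, 0.13, 1.03, 1.035, 0.67),
]

MNSQ_RANGE = (0.5, 1.5)   # acceptable mean-square range stated in the text
MIN_PT_MEASURE = 0.6      # point-measure correlation must exceed this value

def passes_screen(subscale):
    """True if both MNSQ values are in range and the correlation is high enough."""
    low, high = MNSQ_RANGE
    mnsq_ok = all(low <= value <= high
                  for value in (subscale.infit_mnsq, subscale.outfit_mnsq))
    return mnsq_ok and subscale.pt_measure > MIN_PT_MEASURE

results = {s.name: passes_screen(s) for s in SUBSCALES}
for name, ok in results.items():
    print(f"{name}: {'within criteria' if ok else 'flag for review'}")
```

Flexibility sits closest to both limits (OUTFIT MNSQ 1.25, point-measure correlation 0.615), so it would be the first subscale to be flagged under stricter criteria.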
According to international test development standards, measuring creative thinking skills validly and reliably is the main prerequisite for the instrument to be used in educational evaluation. The CTPT instrument was developed and tested through a modern approach (Rasch Model). The EFA and CFA results confirmed that the four-dimensional structure (sensitivity, flexibility, novelty, and elaboration) was adequate, with fit indices (χ²/df = 1.87, RMSEA = 0.056, CFI = 0.954, TLI = 0.942, GFI = 0.931) in the good fit category. Internal reliability values were also high (Cronbach’s Alpha = 0.84), reinforced by Rasch reliability of 0.84 at the person level and 0.94 at the item level, indicating measurement consistency between items and the instrument’s ability to distinguish student skill levels. Rasch analysis also displayed unidimensionality, model fit (INFIT/OUTFIT MNSQ ≈ 1.00), and item separation = 4.03, which confirmed the instrument could classify item difficulty levels well (Andrich & Marais, 2019). Descriptive findings indicated that most students were still at a low to medium level in creative thinking. For example, the highest average score on the sensitivity aspect was obtained by semester 2 students (M = 71.55), whereas the novelty aspect was relatively low across groups (M ≈ 50–57). These results confirm that the CTPT can differentiate students with different skill levels while providing important diagnostic information for educators. Therefore, the instrument is valid, reliable, and useful for identifying the need for creative learning interventions (Lin & Tsau, 2013). However, as per modern assessment principles, instrument development needs to be iterative and updated to remain relevant to the dynamics of science education (Amprazis & Papadopoulou, 2025; Kaur et al., 2024; Rincón et al., 2023).
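The reported item reliability and item separation can be cross-checked against each other. The sketch below assumes the standard Rasch relation between separation G and reliability R, namely G = sqrt(R / (1 − R)) and equivalently R = G² / (1 + G²); the function names are illustrative, and the person separation it derives is not a value reported in the study.

```python
import math

def separation_from_reliability(reliability):
    """Separation index G implied by a Rasch reliability R (0 <= R < 1)."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))

def reliability_from_separation(separation):
    """Rasch reliability R implied by a separation index G (G >= 0)."""
    if separation < 0.0:
        raise ValueError("separation must be non-negative")
    return separation ** 2 / (1.0 + separation ** 2)

# Reported in the text: item separation 4.03, item reliability 0.94.
implied_item_reliability = reliability_from_separation(4.03)
print(f"item reliability implied by G = 4.03: {implied_item_reliability:.3f}")

# Reported in the text: person reliability 0.84 (person separation not given).
implied_person_separation = separation_from_reliability(0.84)
print(f"person separation implied by R = 0.84: {implied_person_separation:.2f}")
```

Under this relation, the reported item separation and item reliability agree to two decimal places, which supports the internal consistency of the reported indices.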
Beyond internal validity, an external validity test was also conducted to ascertain how well CTPT scores predict students’ real performance. The analysis was conducted on a science experiment-based problem-solving task on renewable energy. The regression results revealed that the CTPT score was a significant predictor of the quality of students’ solutions in terms of sensitivity, flexibility, novelty, and elaboration. In particular, students with high scores on the Novelty dimension (measure = 0.42; SE = 0.13; INFIT MNSQ = 0.86; point-measure correlation = 0.72) tended to be able to produce original designs, such as a simple device to convert kitchen waste into energy, which was reflected in their actual performance during the experiment. The findings were further corroborated through structural equation modelling (SEM), which showed that the CTPT made a significant contribution in explaining variations in problem-solving performance (β > 0.40, p < 0.01). These results support the CTPT as an accurate means of mapping students’ creative thinking skills.
The analysis continued by reviewing the pattern of variation in creative thinking skills scores based on gender, semester, GPA, university, location, and campus organisation participation. Female students scored higher in sensitivity (M = 64.97) and flexibility (M = 67.78) than males (M = 56.69; M = 60.13), which is in line with Runco et al.’s (2001) creativity theory that intrinsic motivation and different learning experiences affect sensitivity and flexibility of thinking. Regarding the semester, 2nd-semester students stood out in sensitivity (M = 71.55) and novelty (M = 61.99), indicating an explorative phase in the early stages of college. In contrast, 6th-semester students excelled in flexibility (M = 63.54), possibly due to more mature academic experience. In terms of university, University A students were more flexible (M = 65.16), while University B students excelled in elaboration (M = 65.82), which may be influenced by different curriculum approaches or academic culture. Other findings show that urban students do better than rural students on almost all dimensions, and involvement in campus organisations is associated with higher flexibility scores (M = 66.23 vs M = 60.47). Meanwhile, GPA was associated with novelty and elaboration, with students with a GPA of 3.1–4.0 outperforming those with a GPA of 2.1–3.0. The pattern of variation enriches the empirical evidence on the contextual factors that influence creativity, although some findings differ from previous studies (He et al., 2022; Volfson et al., 2018).
The coherence of the CTPT instrument is also evident in the item analysis, for example, in items 1.1SS and 5.5NY, which represent real context-based problem-solving tasks. In item 1.1SS, which focuses on the electrical energy crisis in eastern Indonesia, the score distribution is concentrated in the medium-high category, reflecting students’ ability to integrate science concepts with local potential. This reflects the aspects of sensitivity, flexibility, and elaboration simultaneously. In contrast, item 5.5NY, which requires an innovative design to convert kitchen waste into energy, emphasises the novelty aspect, so the score distribution is more spread out with a dominance in the medium category. The low proportion of students who achieved the maximum score indicates that the ability to produce innovative solutions is still limited, even though the understanding of basic concepts is quite good.
The results of the study have implications for problem-based science learning and experimentation. First, instructors can design tasks with scaffolding so that students become sensitive to issues and are encouraged to develop more original and applicable ideas. Second, the curriculum can integrate simple experimental projects that allow students to test ideas in prototypes, so that creativity does not stop at the conceptual level. Third, the balance between mastery of domain knowledge and stimulation of creativity must be maintained, so that instruments such as the CTPT can truly separate knowledge limitations from creativity limitations. Thus, the CTPT functions not only as a valid and reliable measurement tool but also as a diagnostic instrument that can guide creative learning design in a more contextual, measurable, and targeted manner.
The main limitation of the study rests on the scope and context of the instrument developed. The CTPT instrument has only been validated on elementary school teacher education program students, so the generalisation of the results is still limited to this group. The instrument has not been tested at other levels of education, such as secondary schools or higher education programmes other than primary teacher education, nor has it been used in international contexts with different learner characteristics and cultural backgrounds. Furthermore, the items in the CTPT are still dominated by science-based contexts, so their applicability is relatively limited when used in other fields that require creativity, such as arts, technology or social sciences. To strengthen the external validity and ensure the instrument’s flexibility, further research needs to be directed at expanding the trials across educational levels, scientific fields, and cultures so that the CTPT can become a more universal and adaptive instrument in measuring creative thinking skills.
Besides limitations in scope and context, this study also has methodological and technical limitations. The psychometric analysis is still limited to using Rasch models and SEM, so it does not include other, more comprehensive analytical methods to enrich validity evidence. The instrument also needs to be updated regularly as theories and conceptual frameworks regarding creative thinking develop to remain relevant to research and educational practice needs. Future research directions include the integration of longitudinal tracking to monitor the development of creativity over time and the use of learning analytics to link test scores with students’ actual performance in completing science-based and cross-cutting creative tasks.
The study introduces the CTPT, a performance-based instrument designed to assess creative thinking skills in the context of science learning. The CTPT demonstrated strong structural validity and reliability in evaluating the four dimensions of creative thinking skills (sensitivity, flexibility, novelty, and elaboration) among learners of different ability levels. External validity analysis further confirmed the practical usefulness of the instrument, showing a significant relationship between CTPT scores and learners’ performance in completing creative problem-solving tasks. The findings emphasise the importance of using performance-based assessments to complement traditional tests and self-report instruments to assess creative thinking skills accurately. Furthermore, the results also highlight the need for continuous development and adaptation of assessment instruments to remain relevant to educational practice and the development of creativity theories. Future research is recommended to expand the application of the CTPT to various levels of education, cultural contexts, and fields of study, as well as to explore the use of digital technology to increase the efficiency and authenticity of the assessment.
The research protocol was reviewed and approved by the Ethics Committee of Universitas Sebelas Maret (approval number: 6366/UN27.02/PT.00/2025) in accordance with the institutional ethical guidelines and national regulations for research involving human participants. Before data collection commenced, we provided an information sheet to the parents and legal guardians of the participating children and obtained their written informed consent. Permission from the educators was also obtained. Pseudonyms are used in this article to protect the anonymity and privacy of all participants.
We have read and agree to comply with the F1000 AI Policy. We confirm that during the preparation of this manuscript, we used QuillBot exclusively to assist with the translation of the original Indonesian text into English. The content was subsequently reviewed and edited by the authors to ensure accuracy and clarity.
Repository name: Dataset for the Psychometric Evaluation of A Creative Thinking Performance Test for Science Education. https://doi.org/10.5281/zenodo.18763222 (Hasibuan, 2026).
The project contains the following underlying data:
dataset_Psychometric Evaluation of A Creative Thinking Performance Test for Science Education.xlsx (raw item-level scores and psychometric analysis dataset of the Creative Thinking Performance Test for Science Education).
Repository name: Dataset for the Psychometric Evaluation of A Creative Thinking Performance Test for Science Education. https://doi.org/10.5281/zenodo.18763222 (Hasibuan, 2026).
This project contains the following extended data:
Extended_data.docx (extended data including instrument description, test instrument, scoring rubric, and dataset documentation).
Data are available under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
References
1. AlAli R, Al-Barakat A: Constructing and Developing a Scale for Assessing Language Teachers' Performance in Integrating Reflective Thinking Skills within Primary Reading Learning Environments. Forum for Linguistic Studies. 2024; 6 (6): 194-210
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Measurement and Evaluation
Invited Reviewers: 1 (Version 1, 02 Apr 26)