Keywords
Center-embedding, generative linguistics enterprise, non-native speakers of English, Large Language Models, ChatGPT, competence
This study aimed to answer the following major question: Who thinks better, non-native speakers of English or ChatGPT? It provides evidence from the processing and interpretation of center-embedded English constructions that the human brain surpasses ChatGPT and that ChatGPT cannot be regarded as a theory of language.
Fifteen non-native English speakers were recruited as participants. A center-embedded English sentence was presented to both the study participants and ChatGPT. ChatGPT's ability to predict and remember was also tested.
The study findings reveal that the human brain is still far ahead of Large Language Models, specifically ChatGPT, even in the case of non-native speakers of English. The findings also show ChatGPT's inability to predict and remember.
The study concludes that the human brain’s ability to process and interpret natural language data and to predict and remember is unique and that ChatGPT still lags behind this unique human ability.
In Version 2, we have first expanded the literature review to provide a more balanced discussion, incorporating both supporting and opposing viewpoints on language models. Second, we have added studies on NNSs' abilities to process second-language data, such as syntactic complexity, and on LLMs' role in academic writing. Third, we have clarified that a control group of native speakers was not needed, because this study examines the ability of a homogeneous group of advanced NNSs to process a center-embedded sentence and compares it to that of ChatGPT. Fourth, we have employed a single center-embedded sentence to maintain consistency in error patterns and avoid redundancy. Fifth, we have pointed out that the decision to use ChatGPT-3.5 Turbo was due to limited access to more advanced models at the time the study was conducted, and this has been acknowledged as one of the study's limitations. Sixth, we have expanded the methodological details, including the participant selection criteria and the WhatsApp-based data collection process. Finally, a new section (4.3. Methods of Analysis) has been added to clarify the approach adopted for analyzing both NNSs' and ChatGPT's responses.
Center-embedding, as in (1), is a syntactic phenomenon in which a matrix clause contains several other relative (embedded) clauses. Put differently, center-embedding occurs when clauses are nested within other clauses, creating significant demands on working memory and sentence processing.
(1) a. The man that the soldier that the thief slapped deceived died.
b. Men women children dogs bite like marry hate pets.
(1a) presents a triple center-embedding structure, and (1b) a quadruple one ((1b) is taken from Karlsson, 2007, p. 8; see also Frazier & Rayner, 1988). Such constructions, investigated since Chomsky and Miller (1963), are considerably difficult for humans because of their structural complexity: they tax human working memory, giving rise to processing difficulty, perhaps due to the load placed on the Faculty of Language (FL) while processing them. Additionally, (1b) is more difficult than (1a) because it involves more embedded clauses, indefinite nouns/subjects, and the absence of relative pronouns such as that. Because of the absence of relative pronouns, clauses like those in (1b) are said to be reduced relative clauses (Shormani, 2013).
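To make the memory-load intuition concrete, the following minimal Python sketch (our own illustration, with hand-listed word classes and applied only to (1a)) treats each subject noun as an item that must be held in working memory until its verb arrives; the maximum number of simultaneously pending subjects approximates the processing load.

```python
# Minimal sketch: model each subject noun as an item that must stay in
# working memory (a stack) until its verb arrives. The maximum stack
# depth approximates the memory load of a center-embedded sentence.
def max_memory_load(tokens):
    subjects = {"man", "soldier", "thief", "men", "women", "children", "dogs"}
    verbs = {"slapped", "deceived", "died", "bite", "like", "marry", "hate"}
    stack, max_depth = [], 0
    for word in tokens:
        w = word.lower().strip(".")
        if w in subjects:
            stack.append(w)                  # a new subject awaits its verb
            max_depth = max(max_depth, len(stack))
        elif w in verbs and stack:
            stack.pop()                      # the innermost pending subject is resolved
    return max_depth

# (1a): three subjects are pending before any verb appears.
sentence_1a = "The man that the soldier that the thief slapped deceived died"
print(max_memory_load(sentence_1a.split()))  # 3
```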
Artificial intelligence (AI) has aroused much controversy among linguists and AI specialists over the use of AI models, such as ChatGPT, and their capabilities, and debate on this issue is ongoing. At one extreme, some scholars view AI models as able to perform processing tasks as well as humans do (see e.g., Piantadosi, 2023; Ambridge & Blything, 2024). For instance, Piantadosi (2023) claims that Large Language Models (LLMs) like ChatGPT are good theories of language and that they can even "refute" Chomsky's generative approaches. At the other extreme, several scholars reject this claim and argue that, although these models perform impressive tasks, they still fall short of the human brain (Katz, 2012; Shormani, 2024a & c). Some scholars provide evidence from natural language processing tasks that LLMs, including ChatGPT, cannot be considered theories of language because they still lag behind the human brain (see e.g., Zhong et al., 2023; Katzir, 2023).
The first idea behind the inception of AI was how to implement "human intelligence" in computers, making them think like humans. AI is defined as "making a machine behave in ways that would be called intelligent if a human were so behaving" (McCarthy et al., 1955, p. 11). It is a computer's underlying ability "to interpret external data correctly, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation" (Haenlein & Kaplan, 2019, p. 5). The first definition follows from scientific questions raised in the 1950s, perhaps beginning with Turing's (1950) stimulating question "Can machines think?". The second definition pertains to the modern use of LLMs, such as ChatGPT, in processing, computing, and interpreting natural language data.
The term "artificial" in "artificial intelligence" implies that LLMs' "intelligence" is not real and that they do not think like humans (see also Haenlein & Kaplan, 2019). However, some scholars (Piantadosi, 2023; Ambridge & Blything, 2024) hold that these models think, or process language data, even better than humans. Thus, in this study, we aim to determine the extent to which this is true by examining ChatGPT's ability to process center-embedded English sentences. We also aim to find out who is better at processing these constructions: our participants, non-native speakers of English (NNSs), or ChatGPT. Our study recruited 15 NNSs, male and female, all advanced learners of English as a second language. Fourteen of them were writing their MA or PhD theses, and one was a level-four student. We asked them a question involving a center-embedding structure, and the same question was put to ChatGPT.
The remainder of this paper is organized as follows. Section 2 briefly presents the Language Faculty and center-embedding structures along with their salient syntactic properties. Section 3 discusses the current situation of LLMs and center-embedding, tackling studies in which AI specialists develop neural algorithms in LLMs to understand center-embedding structures, and studies examining these models’ abilities to understand, process, and interpret such structures. Section 4 describes the methods used in this study. Section 5 presents and discusses the results. Section 6 outlines ChatGPT’s inability to predict or remember things. Section 7 concludes the paper and provides some limitations and suggestions for future research.
The generative enterprise has adopted a biological and cognitive approach to the study of language, the language faculty, and language acquisition (Chomsky, 1995; Jenkins, 2000; Hauser et al., 2002; Shormani, 2016, 2017). For example, Hauser et al. (2002) discuss the unique features of human language and its evolution, proposing a framework for understanding the language faculty, the organ responsible for language production and perception, in terms of biological and cognitive mechanisms. They distinguish two states of the faculty: the Faculty of Language in the Broad Sense (FLB) and the Faculty of Language in the Narrow Sense (FLN). FLB includes a combination of systems that support language: i) the sensory-motor system, which is responsible for speech and auditory processing; ii) the conceptual-intentional system, which is involved in meaning and intention; and iii) recursion, which allows the generation of infinite combinations of expressions from a finite set of elements (cf. Chomsky, 2005). FLN, by contrast, refers specifically to the core computational mechanism of recursion, which, they argue, is unique to humans. This mechanism enables the generation of hierarchical structures, including the embedding of clauses within matrix clauses, and is central to human linguistic capability. Hauser et al. (2002) emphasize the interdisciplinary nature of studying language evolution, involving fields such as linguistics, evolutionary biology, and neuroscience, and focus on how the FLB components evolved independently, exploring the possible evolutionary origins and genetic basis of FLN. They also note that the FL sometimes fails to perform certain linguistic tasks, including comprehending complex structures, mainly because of the load placed on it or on working memory. The FL is, in principle, an intact organ genetically wired into the human genome, and humans, and only humans, possess it.
Given the limits of the FL and of working memory, humans encounter difficulty in processing center-embedding structures, whose syntactic complexity places a heavy load on working memory. Center-embedding has been investigated since Chomsky and Miller (1963) and has been taken up in several later works. For example, Frazier (1985) found that the processing of these structures breaks down. Dickey (1995) reported reading-time experiments showing that inserting an ungrammatical resumptive pronoun into the second of three noun gaps led to faster reading times. Thomas (1995) investigated the cognitive processes involved in understanding sentences with center-embedding and self-embedding structures, both types of recursive sentence formation in which clauses are nested within one another, creating complex hierarchical structures: center-embedding involves the insertion of subordinate clauses into a main clause, while self-embedding occurs when multiple embeddings of the same type are nested within each other, leading to more complex and often harder-to-process sentences. Thomas also explored why deeply embedded sentences challenge human working memory and comprehension, identifying memory limitations and working memory capacity as key factors that make these structures difficult to process (see also Uehara & Bradley, 2002).
Karlsson (2007) studied constraints on multiple center-embedding and their syntactic peculiarities, examining why sentences with multiple center-embedded clauses like (1), despite being grammatically correct, are challenging for humans to process. Karlsson introduced the concept of a center-embedding ceiling, beyond which human cognition struggles to parse sentences with more than two levels of embedding. He further argues, with cross-linguistic examples, that the difficulty arises not merely from cognitive limitations but also from structural constraints inherent to language, and that even though multiple center-embeddings are theoretically possible, they are rarely found in actual use because of both cognitive and communicative pressures. He concludes that the limits on center-embedding follow from working memory constraints and principles of processing efficiency.
Karlsson (2010) explored the limitations of human working memory when processing sentences with complex syntactic structures, such as multiple center-embedded clauses. The account draws on theories such as Syntactic Prediction Locality Theory (see e.g., Gibson, 1998), which holds that increased distance between syntactic predictions and their resolutions leads to higher memory and integration costs. These costs become particularly problematic in cases of multiple center-embedding, as in (1). Triple center-embedding structures, in particular, show significant processing difficulty owing to the high memory demands of maintaining unresolved syntactic predictions while simultaneously introducing new referents and dependencies. Karlsson argues that beyond a certain threshold, such structures exceed working memory capacity, rendering them impossible to process in real time. The study concludes that working memory imposes a hard constraint on sentence complexity, highlighting the importance of syntactic simplicity and local dependencies in human language comprehension.
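The locality idea can be sketched very roughly as follows (this is our own illustrative approximation, not Gibson's exact cost metric): each verb's integration cost is counted as the number of referent- or verb-bearing words intervening between it and the subject it must be linked back to in (1a), so the cost grows with embedding depth.

```python
# Illustrative sketch: approximate the integration cost of each verb as the
# number of content words introduced between the verb and the subject it
# must be linked back to. Positions are word indices in sentence (1a).
def integration_costs(pairs, positions):
    """pairs: list of (subject, verb); positions: word -> index in the sentence."""
    costs = {}
    for subject, verb in pairs:
        lo, hi = positions[subject], positions[verb]
        intervening = [w for w, i in positions.items() if lo < i < hi]
        costs[verb] = len(intervening)
    return costs

# "The(0) man(1) that(2) the(3) soldier(4) that(5) the(6) thief(7) slapped(8) deceived(9) died(10)"
positions = {"man": 1, "soldier": 4, "thief": 7, "slapped": 8, "deceived": 9, "died": 10}
# Dependencies to be resolved: thief-slapped, soldier-deceived, man-died.
print(integration_costs(
    [("thief", "slapped"), ("soldier", "deceived"), ("man", "died")], positions))
# {'slapped': 0, 'deceived': 2, 'died': 4}: cost grows with embedding depth.
```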
As stated above, center-embedding is a linguistic construction in which relative clauses (full or reduced) are inserted into the middle of a sentence, creating a highly nested structure. In this section, we focus on LLMs and their involvement with center-embedding. The relevant natural language processing (NLP) studies can be classified into two types: i) studies concerning the development of neural network algorithms (NNAs) for embedding sentences, and ii) studies testing LLMs' abilities to process and interpret these constructions. Concerning the first type, Jiang et al. (2023) developed PromptEOL, a prompt-based method with an explicit one-word limitation that leverages prompts to derive sentence embeddings, and explored its efficacy both with and without fine-tuning. PromptEOL integrates in-context learning by providing specific prompts to LLMs, allowing them to generate sentence embeddings without additional parameter updates. When fine-tuning is applied, PromptEOL significantly enhances the performance of LLMs on various sentence-level tasks; even without fine-tuning, it surpasses state-of-the-art methods such as SimCSE on semantic textual similarity benchmarks. The method benefits from model scaling, with embeddings improving as model size increases. However, there are a number of challenges: i) diminishing returns, since the performance plateau for very large models raises questions about the scalability limits of LLMs on certain tasks; and ii) task-specific optimization, since tailored approaches are needed to optimize embeddings for different downstream applications. They concluded that PromptEOL provides a strong case for the scalability and adaptability of LLMs in sentence-embedding tasks, offering a pathway to efficiently harness the power of large models and bridging the gap between raw model size and practical performance.
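As a rough illustration of the PromptEOL idea, the sketch below wraps a sentence in a one-word-summary prompt and takes the final hidden state as the embedding; the model name ("gpt2") and the exact prompt wording are stand-ins chosen for illustration, not the configuration used by Jiang et al. (2023).

```python
# Illustrative sketch of prompt-based sentence embedding: force the model
# toward a one-word summary and read off the hidden state at the position
# where that word would be produced.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def prompt_embedding(sentence: str) -> torch.Tensor:
    prompt = f'This sentence : "{sentence}" means in one word:"'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Last layer, last token position serves as the sentence embedding.
    return outputs.hidden_states[-1][0, -1, :]

emb = prompt_embedding("The man that the soldier slapped died.")
print(emb.shape)  # torch.Size([768]) for gpt2
```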
Harris et al. (2024) explored a method for improving text-embedding performance by preprocessing the input text with LLMs, specifically ChatGPT 3.5. They consider text embeddings crucial for various NLP tasks but note that factors such as limited vocabulary, lack of context, and grammatical errors in the input can limit their quality. Their approach involves enriching the text with context, correcting grammatical errors, disambiguating terms, and including relevant metadata before generating embeddings, the aim being to enhance the performance of embedding models on downstream tasks such as classification and clustering. The study evaluated the method on three datasets: Banking77 Classification, TwitterSemEval 2015, and Amazon Counterfactual Classification, using metrics such as cosine similarity and accuracy. The results show notable performance improvements, particularly on the TwitterSemEval dataset, where the proposed technique achieved a significant leap over the previous best performance (85.34 vs. 81.52 on the Massive Text Embedding Benchmark). However, improvements on the other datasets were mixed, underscoring the dependence on dataset characteristics.
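For readers unfamiliar with the metric mentioned above, cosine similarity simply measures the angle between two embedding vectors; a minimal sketch with made-up vectors follows.

```python
# Minimal sketch of the cosine-similarity metric used to compare two
# text embeddings; the score ranges from -1 to 1.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.7, 0.1])    # embedding of text A (toy values)
b = np.array([0.25, 0.65, 0.05]) # embedding of text B (toy values)
print(round(cosine_similarity(a, b), 3))  # ~0.994, i.e. highly similar
```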
The second type of study examines LLMs' ability to process and interpret center-embedding structures. For example, Kodner et al. (2023) reply to Piantadosi's assertion that modern LLMs challenge Chomsky's linguistic theories. The authors defend the relevance of generative linguistics and argue for its continued importance in understanding human language, focusing on four issues: i) the data gap: they emphasize the disparity between the vast data requirements of LLMs and the minimal exposure young children need to acquire language, which highlights the unique mystery of human language acquisition that generative linguistics seeks to explain; ii) artificial vs. natural insights: they draw an analogy between LLMs and airplanes, suggesting that while airplanes reveal much about engineering, they offer little insight into natural avian flight; similarly, the functioning of LLMs may not elucidate the cognitive mechanisms underlying human language; iii) the limits of LLMs as scientific theories: scientific theories require interpretable explanations, not just predictive accuracy, and since LLMs lack explicit theoretical frameworks, they cannot replace linguistic theories; and iv) the necessity of independent linguistic theories: evaluating LLMs' capabilities still depends on understanding human linguistic capacities, and generative linguistics provides a robust framework for such evaluations, underscoring its indispensable role in the linguistic sciences. Kodner et al. conclude that generative linguistics will remain crucial to advancing our understanding of language despite technological progress in computational models.
Dentella et al. (2024) asserted that language is not an attribute that can be ascribed to LLMs. In their experiment, seven LLMs failed to answer simple questions based on textual input, such as (2).
(2) a. “John deceived Mary and Lucy was deceived by Mary.”
b. “In this context, did Mary deceive Lucy?”
While the human participants succeeded in this task, although they sometimes erred, the LLMs not only failed but also kept making the same errors, which is not a human pattern: if a human makes a mistake once, he or she does not repeat it once his or her attention is drawn to the error. The study recruited 400 native English speakers and tested models including GPT-3 and GPT-3.5. The authors concluded that LLMs lack a compositional operator that integrates and regulates grammatical and semantic information.
Katzir (2023) criticizes the claim that LLMs such as GPT-3 and similar models serve as robust theories of human linguistic cognition, arguing against Piantadosi's (2023) claim that LLMs outperform generative linguistics in explaining human language cognition. He raises objections concerning three issues: i) competence vs. performance: Katzir emphasizes that LLMs lack the distinction between linguistic competence and performance. The former refers to the native speaker's underlying knowledge of his or her language, whereas the latter refers to his or her use of language in actual situations (Chomsky, 1965). In Chomsky's words, there is "a fundamental distinction between competence (the speaker-hearer's knowledge of his language) and performance (the actual use of language in concrete situations)" (p. 4). In this sense, there is a distinction between humans and machines: competence is a human attribute, specifically a characteristic of the human brain, whose "linguistic performance" indicates the underlying linguistic competence. Computers, or LLMs specifically, can produce a similar phrase, sentence, or even text based on the data (i.e., the corpus) they have been trained on, but this does not entail that they have "competence" like humans (see also Kaufer, 1979). If this is on the right track, it follows that the distinction between competence and performance is central to understanding human linguistic behavior, as humans often struggle with sentences due to processing limitations, not a lack of competence; in contrast, LLMs' errors reflect deficits in their statistical learning mechanisms, not resource constraints; ii) likelihood vs. grammaticality: Katzir notes that humans can discern grammatical but unlikely sentences from likely but ungrammatical ones, perhaps owing to the neurological mechanisms on which the brain's processing is based; in terms of connectionist models of the FL, the human brain identifies the functional task through which it comprehends or produces speech (Arbib & Caplan, 1979; Nelson, 1978; Arbib, 1982); and iii) typological universals: LLMs do not inherently explain cross-linguistic typological universals, which generative linguistics seeks to address, and they may be insufficiently biased towards these universals, making them implausible models for understanding human linguistic diversity.
Zhong et al. (2023) evaluated ChatGPT's natural language understanding (NLU) capabilities against fine-tuned BERT models (BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large) using the GLUE benchmark. They found that ChatGPT's performance is comparable to that of BERT-base (78.7% vs. 79.2% average score) but lags behind more advanced models such as RoBERTa-large (87.8%) (see also Ettinger, 2020). In particular, ChatGPT struggles with paraphrase detection and semantic similarity, underperforming BERT-base by as much as 24% in some cases. The study suggests that while ChatGPT is versatile, it still lags behind highly specialized fine-tuned models on specific NLU tasks, underscoring the complementary nature of task-specific fine-tuning and the broad generalization abilities of LLMs, including ChatGPT. This can be contrasted with the human brain, a species-specific property with a unique working mechanism (see also Chomsky, 2009; Berwick & Chomsky, 2016).
However, one should also elaborate on studies that regard LLMs as capable of performing linguistic tasks as humans do, including Piantadosi (2023). Piantadosi proclaims that LLMs refute Chomsky's generative approach to the study of language, arguing that the success of LLMs undermines the necessity of an innate UG. He asserts that LLMs, which acquire linguistic competence by being trained on massive amounts of data rather than through preprogrammed grammatical structures, demonstrate that statistical learning and pattern recognition may be sufficient for language acquisition. This perspective aligns with usage-based and emergentist models, which emphasize learning from experience rather than innate rules. However, critics argue that while LLMs can process and generate human-like language, they do so in ways fundamentally different from human cognition (see e.g., Katzir, 2023). Unlike children, who learn language efficiently from limited input, LLMs require immense datasets and lack explicit rule-based interpretability. Additionally, they may exhibit impressive linguistic performance without true competence; that is, they generate coherent language without understanding it in a human-like way.
Another recent study that can be addressed here is Ambridge and Blything (2024). Like Piantadosi (2023), they claim that generative linguistics often relies on verbal accounts that may lack precise quantitative predictions or depend on ancillary assumptions. In contrast, they suggest that LLMs, trained on extensive linguistic data, can effectively learn and represent verb argument structures, offering more accurate predictions without the need for additional assumptions. This leads them to propose that LLMs might serve as superior models for understanding certain linguistic phenomena compared with traditional theoretical approaches. In their study, they tested LLMs' ability to predict verb argument structures, focusing on English causative constructions. They highlight the phenomenon whereby certain verbs can appear in both intransitive-inchoative forms (as in The ball rolled) and transitive-causative forms (as in Someone rolled the ball), whereas others are restricted to intransitive forms (as in The man laughed but not Someone laughed the man). However, as discussed above, this perspective has sparked debate within the linguistic community. Linguists have criticized it, arguing that while LLMs excel at pattern recognition and prediction based on large datasets, they do not necessarily provide explanatory theories of linguistic competence. For instance, Fox and Katzir (2024) caution against equating the predictive success of LLMs with theoretical understanding, emphasizing the need for models that offer explanatory adequacy beyond empirical data fitting. Similarly, Reiss and Volenec (2024) assert that adopting LLMs as theories of language overlooks the distinction between modeling linguistic behavior and explicating the underlying cognitive mechanisms. These discussions underscore the ongoing discourse on the role of LLMs in theoretical linguistics, balancing their empirical successes against the quest for deeper explanatory frameworks.
As for non-native speakers' ability to process syntactically complex structures, Lee and Phillips (2023) investigate why non-native English speakers sometimes outperform native speakers in processing subject-verb agreement, particularly in contexts involving agreement attraction. Agreement attraction occurs when a nearby noun erroneously influences verb agreement, leading to the acceptance of ungrammatical sentences. They found that native speakers are more susceptible to such errors, while L2 learners, especially those whose first language lacks number agreement, exhibit greater accuracy. This difference is attributed to native speakers' reliance on implicit linguistic cues, which makes them more prone to interference from misleading elements; L2 learners, in contrast, depend more on explicit processing strategies, which may enhance their accuracy in these situations. The study suggests that the cognitive mechanisms underlying language processing differ between native and non-native speakers, leading to situations where L2 learners' analytical strategies result in fewer agreement errors.
Additionally, Wu et al. (2020) studied how syntactic complexity is manifested in academic writing by users of English as a Lingua Franca (ELF). Using a corpus-based approach, the study analyzes sentence structures, clause embedding, and phrasal complexity in ELF writing, often comparing it with native English academic texts. It finds that ELF writers employ varying levels of syntactic complexity, influenced by their linguistic backgrounds, yet still achieve communicative effectiveness. The study challenges the notion that native-like complexity is essential for academic rigor, emphasizing instead the importance of clarity and coherence.
There are also studies on enhancing the research and academic writing abilities of non-native speakers of English through LLMs. For example, Li et al. (2024) examined the potential of ChatGPT to enhance the academic writing skills of non-native English-speaking medical students. The study involved 27 third-year medical students from Sichuan University, all of whom were tasked with writing a mini-paper in English. After the initial writing, the students used ChatGPT over a two-week period to revise and improve their papers, the aim being to assess the AI's effectiveness in enhancing aspects of writing such as grammar, coherence, and structure. The results indicated significant improvements in these areas, with students reporting that ChatGPT provided valuable real-time feedback and suggestions that helped refine their work. Participants expressed positive feedback about the tool, noting that it helped them learn and apply academic writing conventions more effectively. However, the study also emphasized the importance of using AI tools like ChatGPT as a supplement to traditional teaching methods rather than a replacement, encouraging students to evaluate AI-generated suggestions critically. This research suggests that AI, particularly ChatGPT, can play a supportive role in improving academic writing skills for non-native English speakers, especially in complex fields like medical education, and provides a promising direction for integrating AI into educational strategies.
Another study was conducted by Hwang et al. (2023), who explored the potential of LLMs like ChatGPT to assist non-native English-speaking researchers in academic writing. They suggest that ChatGPT can function as a personal, round-the-clock English tutor, aiding in the composition of challenging sections such as the Introduction and Discussion of research papers, and they draw an analogy to the mythological gift of fire, proposing that ChatGPT offers a transformative tool for enhancing writing proficiency. However, they also warn against over-reliance on AI: while ChatGPT can support language improvement, it should not replace the essential skills that researchers must develop themselves, and critical engagement with its output is important to ensure academic integrity and to avoid a decline in personal writing ability over time.
Thus, considering the above review, the present study seeks to answer the following major question: who processes and interprets center-embedded English sentences better, advanced non-native speakers of English or ChatGPT?
The study data consisted of a sentence involving the center-embedding phenomenon, namely, The man that the soldier that the thief slapped deceived died. The same sentence was used as the instrument for the judgement task; that is, we asked ChatGPT to judge whether the sentence was grammatical. We used only one center-embedded sentence as the study data for two reasons: i) if human participants err on one sentence, they will commit the same errors on similar sentences, and the same can be said of ChatGPT; and ii) to avoid the redundancy that would ensue from discussing and interpreting further data containing the same errors, whether by the human participants or by ChatGPT.
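For clarity, the intended reading of the stimulus comprises three agent-action pairs (the thief slapped the soldier; the soldier deceived the man; the man died). The sketch below, our own illustration rather than part of the study materials, encodes this gold reading and shows how a response could be scored against it.

```python
# Gold reading of "The man that the soldier that the thief slapped deceived died."
GOLD = {
    "the thief": "slapped the soldier",
    "the soldier": "deceived the man",
    "the man": "died",
}

def score_response(response: dict) -> float:
    """response: mapping of each entity to the action attributed to it."""
    correct = sum(response.get(k, "").strip().lower() == v for k, v in GOLD.items())
    return correct / len(GOLD)

# Simplified rendering of ChatGPT's answer (Section 5): only "the man died"
# matches the gold reading, so the score is 1/3.
print(score_response({
    "the man": "died",
    "the soldier": "slapped the thief",
    "the thief": "was slapped",
}))  # 0.333...
```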
This study involved 15 male and female participants. They are NNSs who are advanced L2 English students. Their L1 is Arabic, specifically Yemeni Arabic, and their advanced level is indicated by their academic stage: two are PhD students writing their dissertations, twelve are MA students writing their theses, and one is a level-four student. They have studied English for more than 10 years: six years at school and four or more years at university. In their study of English as a second language, they have taken knowledge-based courses including linguistics, literature, and translation, in addition to skill-based courses such as writing, reading, and speaking. The MA and PhD participants are now writing their theses, specializing in areas such as syntax, language acquisition, morphology, and translation. These characteristics were our recruitment criteria, and we believe they equip the participants to deal with the syntactic complexity of center-embedding structures. Additionally, we focused on a homogeneous group of advanced non-native speakers of English to ensure a controlled investigation and avoid additional complexity, hence excluding a control group of native speakers. For the NNSs' responses, see Shormani (2025).
The English sentence was sent to the participants via WhatsApp. The question was:
“What each did in the following sentence:
“The man that the soldier that the thief slapped deceived died.”
The man:
The soldier:
The thief:
I am doing a research and I need your answer.”
The participants were given 24 hours to answer the question, which was sent to them on November 7, 2024. Submission times differed: some answered immediately, some took about two hours, and two submitted their answers the next day, within the time allotted; those who took a long time had not seen our message immediately. Some participants asked for further contextual clues, to which we replied: "We cannot give you any further clarification, because doing so will affect the study results." They were instructed to answer the study question individually and were explicitly informed that their responses should reflect their own understanding of the center-embedded sentence. To refer to the participants, we did not use their real names; instead, we used numbers (1-15). The same question was put to ChatGPT. We used ChatGPT-3.5 because it was the only version freely available to us while conducting the study; its answer was recorded on October 3, 2024. The model was accessed through the official ChatGPT client, ensuring that it was the standard version available to us at the time of the study. Figures 1 and 2 display ChatGPT's answers for the processing and judgement tasks.
The study adopts both quantitative and qualitative approaches to the analysis of the data, employing descriptive and comparative methods. The quantitative approach consists in providing statistical metrics (i.e., frequency and percentage) for the results of both the human and ChatGPT answers. Human participants' responses are divided into: i) correct responses, ii) partially correct responses, and iii) incorrect responses (categorized by error type); the same procedure was applied to ChatGPT's responses. The analysis covers syntax (i.e., grammaticality) and semantics (i.e., interpretation and comprehension failure, mainly ChatGPT's inability to reconstruct sentence meaning correctly). The qualitative analysis, in turn, describes and accounts for the quantitative results by linking them to theoretical aspects of the generative approach to linguistics and L2 acquisition, and identifies recurring error patterns in the human and ChatGPT responses. The comparative method consists in comparing and evaluating the human and ChatGPT responses in light of AI and generative linguistics, through a structured comparison based on a systematic coding scheme.
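The descriptive statistics themselves are straightforward; the minimal sketch below, using the counts reported in Section 5, only illustrates how the manually coded frequencies translate into the percentages reported (the coding of responses was done by hand).

```python
# Convert coded response frequencies into percentages.
from collections import Counter

def percentages(codes):
    counts = Counter(codes)
    total = len(codes)
    return {code: round(100 * n / total, 1) for code, n in counts.items()}

# 45 entity-level answers (15 participants x 3 entities): 33 correct, 12 incorrect.
print(percentages(["correct"] * 33 + ["incorrect"] * 12))
# {'correct': 73.3, 'incorrect': 26.7}

# Participant level: 10 fully correct, 5 with at least one error or blank answer.
print(percentages(["correct"] * 10 + ["incorrect"] * 5))
# {'correct': 66.7, 'incorrect': 33.3}
```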
In this section, we tabulate the NNSs' and ChatGPT's responses. Table 1 lists these responses, displaying the answers of the 15 participants for the three entities, the man, the soldier, and the thief, together with ChatGPT's responses for the same entities.
As Table 1 shows, almost all the participants judged the stimulus sentence correctly and stated what the man, the soldier, and the thief each did, while ChatGPT failed to do so. We present only those who got it wrong. Participant 2 made two mistakes, giving incorrect answers for the soldier and the thief. Participant 7 got the soldier right but confused the man and the thief. Participant 9 got the man right but confused what the soldier and the thief did. Participants 8 and 12 left their answers blank, which we counted as incorrect. Thus, there were 12 incorrect answers out of the 45 total answers for the three entities: 26.7% of the human answers were incorrect and 73.3% were correct. At the participant level, five participants (including those who left answers blank) made mistakes (33.3%), while 10 participants (66.7%) provided fully correct answers. ChatGPT provided only one correct answer, namely for the man.
The way ChatGPT answers the question is strange, because it adds information that muddles the interpretation. The human participants also added information, for instance stating the object that receives the action, but without confusing the relations. Put simply, although the question made clear that only what each subject did was required, regardless of the action received by the object, the humans added reasonable information, whereas ChatGPT provided nonsensical information; that is, ChatGPT errs even in this (extra) information. The following is ChatGPT's exact answer:
The man was deceived and died.
The soldier slapped the thief.
The thief was slapped by the soldier and indirectly involved in deceiving the man, leading to his death.
ChatGPT's answer shows that it does not understand center-embedding structures, even though, as we have seen, algorithms have been developed for embedding and processing sentences (see e.g., Jiang et al., 2023; Harris et al., 2024; and references therein); it still fails to process and interpret the sentence. In The man was deceived and died, we know that the man died, but he deceived no one. ChatGPT's answer regarding the soldier, The soldier slapped the thief, is completely incorrect, and the third part of the answer is nonsense. This is further shown in Figure 3 below.
ChatGPT's failure to answer this question prompted us to examine its grammaticality judgement as well. Put simply, given such an answer, one may wonder how it judges the sentence's grammaticality. We therefore asked the following question:
User: Is the following sentence grammatical?
“The man that the soldier that the thief slapped deceived died.”
The sentence you have provided is not grammatically correct because of its structure and lack of clarity. It seems to be a complex sentence with multiple nested clauses that makes it confusing. Here, is a breakdown of the sentence.
“The man that the soldier that the thief slapped deceived died.”
It appears that there are multiple nested relative clauses in this sentence, which can be difficult to understand. A clearer and more grammatically correct version of this sentence is as follows.
“The man whom the soldier that the thief slapped deceived has died.”
The revised version maintains the basic structure of the original sentence while making it more intelligible.
This is also schematized in Figure 4 below:
This finding makes our study partly contrast with Katzir (2023): while ChatGPT judged Katzir's sentence as "grammatically correct," in our study it judged the sentence as "not grammatically correct," which provides another piece of evidence that ChatGPT is inconsistent. All NNSs, however, treated the sentence as grammatical. Although we did not ask them about this directly, we inferred their grammaticality judgements indirectly, because none of them said the sentence was "wrong/ungrammatical". Regarding ChatGPT's misunderstanding of center-embedding structures, our study is in line with Katzir's (2023) findings: in both studies, ChatGPT lagged behind the expected performance, and both document its poor performance, which calls for further specialized development. Our study also supports Dentella et al.'s (2024) study, in which human participants outperformed ChatGPT; the difference between our study and theirs concerns the type of participants, ours being NNSs and theirs native speakers of English.
The fact that LLMs such as ChatGPT fail to process center-embedding structures can be accounted for once we consider the basis on which they function: they depend largely on statistics and statistical procedures. Put differently, LLMs' working mechanism is based on probabilities; they "guess" or "predict" the next word, n-gram style, which is not always error-free. This deficiency points to a lack of "competence" in the technical linguistic sense, that is, an underlying ability to determine the correct word rather than merely predict or guess it. In contrast, when native speakers of a language fail to perform a linguistic task, as in Dentella et al. (2024), their failure is not ascribed to a "deficiency" in their competence but to performance, in particular to psychological factors such as slips of the tongue, fatigue, or inattention, which are all nonlinguistic (see also Chomsky, 2009). Likewise, if we assume that NNSs have built a "perfect" linguistic system, i.e., that they have mastered the L2 linguistic system, then their failure on a linguistic task cannot be ascribed to a "deficiency" in their linguistic competence but rather to their performance, that is, their use or judgement of a piece of language. Additionally, NNSs have been reported to outperform native speakers, specifically in agreement phenomena (Lee & Phillips, 2023).
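The following toy sketch illustrates the purely statistical "guess the next word" mechanism referred to here: a bigram model trained on a tiny hand-made corpus simply returns the most frequent continuation it has seen, with no notion of grammaticality or meaning.

```python
# Toy bigram "next word" predictor: frequency in, guess out.
from collections import Counter, defaultdict

def train_bigrams(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    return counts

def predict_next(counts, word):
    following = counts[word.lower()]
    return following.most_common(1)[0][0] if following else None

toy_corpus = ["the soldier slapped the thief",
              "the thief deceived the man",
              "the man died"]
model = train_bigrams(toy_corpus)
print(predict_next(model, "the"))  # the most frequent continuation of "the" in the toy corpus
```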
These facts have long been observed in the generative enterprise. For instance, Chomsky (1975) points out that the generative approach to the study of language "contrasts with a statistical approach that leads to an ordering of sequences from more to less probable, rather than a sharp division into two classes within which no such gradations are marked." His famous nonsense sentence is a good case in point here: Colorless green ideas sleep furiously and its reversed sequence furiously sleep ideas green colorless, as he describes them, "are not distinguished by their assigned probabilities. If probability is to be based on an estimate of frequency in some English corpus, then this probability will be so in both cases" (p. 145). Additionally, corpus-based probabilities lack explanatory adequacy because they are merely "predictional," reflecting, for instance, whether a sentence happens to appear in a linguistic corpus (Kaufer, 1979). We can conclude that our analysis supports studies that critique LLMs: LLMs, though powerful tools, fail to account for the dynamic interplay between competence and performance, which is central to human linguistic behavior (see also Katzir, 2023).
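Chomsky's point can be made concrete with a small worked example: if sentence probability is estimated from raw frequency in a corpus that contains neither string, the grammatical sequence and its reversal receive exactly the same (zero) estimate.

```python
# Frequency-based sentence "probability": both word orders score zero.
def corpus_frequency_probability(sentence, corpus):
    matches = sum(1 for s in corpus if s.lower() == sentence.lower())
    return matches / len(corpus)

corpus = ["the man died", "the soldier slapped the thief"]  # any corpus lacking both strings
s1 = "colorless green ideas sleep furiously"
s2 = "furiously sleep ideas green colorless"
print(corpus_frequency_probability(s1, corpus),
      corpus_frequency_probability(s2, corpus))  # 0.0 0.0
```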
Out of curiosity, we held a further conversation with ChatGPT, asking it to tell us the date and day of "tomorrow" and "yesterday." The conversation took place on a Thursday, the 7th, 2024. The whole conversation with ChatGPT is shown in Figures 5 and 6.
When we asked ChatGPT to name tomorrow's day, it gave us a date. If today is Thursday the 7th, then tomorrow is Friday the 8th. Although the date was correct, ChatGPT failed to name the day, giving us Thursday instead of Friday. Its ability to remember is likewise questionable: ChatGPT also failed to name yesterday's day, providing Tuesday instead of Wednesday (see also Shormani, 2025). Had we asked a human (say, one of our participants), they would certainly have given us the correct answer. In fact, in this respect Google seems to do better than ChatGPT: when we asked Google for the date and name of the day of both tomorrow and yesterday, it provided the correct answers. Thus, this conversation gives us a clear clue that ChatGPT's ability to predict is still far from the human ability to predict and remember. This also supports our argument that ChatGPT still lags behind the human brain state and that humans possess a unique ability not only to process and comprehend linguistic data but also to carry out other mental operations such as prediction and recall.
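By contrast, "tomorrow" and "yesterday" are a matter of deterministic calculation rather than prediction; a few lines of ordinary code (assuming the Thursday the 7th mentioned above refers to 7 November 2024) get the answer right every time.

```python
# Tomorrow and yesterday as a calculation, not a guess.
from datetime import date, timedelta

today = date(2024, 11, 7)  # assumed conversation date: Thursday, 7 November 2024
print((today + timedelta(days=1)).strftime("%A"))  # Friday
print((today - timedelta(days=1)).strftime("%A"))  # Wednesday
```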
To conclude, we examined the ability of both NNSs and ChatGPT to process and interpret a center-embedded English sentence. We recruited 15 NNSs as participants: advanced L2 English students at the BA, MA, and PhD levels. They outperformed ChatGPT in both processing/interpreting and judging the grammaticality of the center-embedded sentence (cf. also Lee & Phillips, 2023). Thus, our study highlights the fact that the human brain of (advanced) NNSs, like that of native speakers of English (as in Dentella et al., 2024), processes and interprets a complex English structure, namely a center-embedded sentence, far better than LLMs such as ChatGPT, and that these LLMs cannot be considered powerful theories of language, as recently claimed by some researchers (Piantadosi, 2023; Ambridge & Blything, 2024). Another conclusion concerns NNSs and their linguistic competence: given the percentage of correct answers to the center-embedded sentence presented to them (73.3%), we can conclude that their linguistic system of English is similar to that of native speakers. As for those who erred (including those who did not provide answers), if we assume that NNSs have built a "perfect" linguistic system, i.e., that they have mastered the L2 linguistic system, then their failure on a linguistic task cannot be ascribed to a "deficiency" in their linguistic competence, but rather to their performance, i.e., their use or judgement of a piece of language.
The study findings revealed that NNSs performed better than ChatGPT, which indicates that: i) LLMs need further development, and ii) the human brain (even in the case of NNSs) surpasses LLMs. Although center-embedding strains the working memory of native speakers' FL (see e.g., Dickey, 1995), the fact that our participants, who are NNSs, performed better than ChatGPT has several implications for both the generative linguistics enterprise and AI technology, ChatGPT in particular. The most prominent are: i) for the generative linguistics enterprise, NNSs' competence mirrors that of native speakers, which indicates that once a learner develops and masters the linguistic system of an L2, here English, this system does not differ much from that of a native speaker of the L2 (see e.g., Cook, 1983; White, 2003; Shormani 2014a & b, 2015, 2016, 2023); this is further evidence for Chomsky's conceptions of the Language Faculty, Universal Grammar, genetic endowment, and language innateness (Chomsky, 1957, 2001, 2008, 2021; see also Shormani, 2016, 2023, 2024b & c); and ii) for AI technology, LLMs still require further development to overcome these challenges. The study also challenges Piantadosi's (2023) argument that LLMs are good theories of language, showing that these models cannot refute Chomsky's generative enterprise. It further supports Dentella et al.'s (2024) finding that language is an attribute of humans and only humans, a species-specific property, and cannot be attributed to LLMs.
The study findings also provide insights into LLMs' working mechanisms. In our study, ChatGPT erred in both the processing and judgement tasks. This is not surprising, given that LLMs base their predictions purely on statistical likelihood, often favoring plausible continuations that may be grammatically incorrect, which demonstrates a fundamental limitation compared with human cognition (see also Katzir, 2023). While LLMs may be used as engineering tools, their design and functionality fall short of providing a scientific model of human linguistic cognition. Our aim in this study was to pinpoint the actual state of the human brain (even in NNSs) and that of LLMs, using a center-embedded sentence as the stimulus to highlight the fundamental differences between human linguistic cognition and the capabilities of LLMs, which are basically statistical. The fact that humans often struggle with center-embedded sentences like (1) above cannot be ascribed to a deficiency in their genetic linguistic knowledge; it is simply due to the fact that these structures tax working memory, and the more complex the center-embedded sentence, the greater the load placed on the FL. Human failures are therefore attributed to performance limitations rather than to a lack of understanding of syntactic rules (i.e., competence) (see also Shormani, 2012). LLMs' success or failure, by contrast, rests on their statistical training and inherent model structure: their behavior directly reflects their "competence," and their errors are due not to transient resource constraints but to the limitations of their learned representations (Katzir, 2023). According to Katzir, although humans may initially struggle with center-embedded sentences, they can often parse them correctly given additional time or contextual clues; such adaptability is a hallmark of human cognition and is absent in LLMs, which lack resource-based recovery mechanisms. LLMs' inability to distinguish between competence and performance limits their usefulness as models of human cognition; hence, they cannot be regarded as powerful theories of language (cf. Shormani, 2024b & c). These findings are in line with those of Katzir (2023). Like Katzir, we used center-embedding to examine whether LLMs can capture the nuances of how humans process, interpret, and judge complex linguistic structures.
However, this study has some limitations. The first concerns the type of sentence involved: a more comprehensive study might include other types of complex English sentences, such as those involving DP islands, multiple wh-questions, anaphora, and weak/strong crossovers, to assess both NNSs' and ChatGPT's capabilities in a wider context. The second limitation is the version of ChatGPT used, namely 3.5; a broader study could utilize ChatGPT-4, which is said to be more developed in functionality and features, and whose use could widen the scope and purpose of further research.
All participants provided consent to publish and replicate their data. Consent was obtained through a WhatsApp message which read, in translation: "Please note that participating in this study is voluntary, and that your data will be made available for publishing and replicating, if needed." The study was also approved by the Department of English Studies, Ibb University, the body responsible for ethical approval in such cases (the ethical approval letter, Ref: EPP/172/11/24, is enclosed).
The data underlying the results of this study are available on figshare, entitled Project_AI_NNSs, DOI: 10.6084/m9.figshare.28270367 (Shormani, 2025). The project contains two types of data: participants' responses and ChatGPT's responses.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
The author would like to thank the participants of the study for their time and valuable contributions.
References
1. Li J, Zong H, Wu E, Wu R, et al.: Exploring the potential of artificial intelligence to enhance the writing of English academic papers by non-native English-speaking medical students: the educational application of ChatGPT. BMC Med Educ. 2024; 24(1): 736.