Translating dialects between ChatGPT and DeepSeek: Yemeni San’ani Arabic terms as a case-in-point

Mohammed Q. Shormani; Alia. Ali Al-Samki

doi:10.12688/f1000research.165879.2



Research Article

Revised


[version 2; peer review: 1 approved, 1 approved with reservations, 1 not approved]

Mohammed Q. Shormani¹, Alia. Ali Al-Samki¹

PUBLISHED 16 Jan 2026


¹ Department of English Studies, Ibb University, Ibb, Ibb Governorate, Yemen

Mohammed Q. Shormani
Roles: Conceptualization, Formal Analysis, Methodology, Software, Writing – Original Draft Preparation, Writing – Review & Editing

Alia. Ali Al-Samki
Roles: Data Curation, Resources, Validation, Writing – Original Draft Preparation, Writing – Review & Editing


This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

Background

This study assesses the efficiency of two Artificial Intelligence (AI) models, ChatGPT and DeepSeek, in translating Yemeni San’ani Arabic (YSA) dialectal terms into English. Because dialectal Arabic exhibits significant linguistic variability and cultural specificity, accurate translation remains a major challenge for ChatGPT and DeepSeek, and perhaps for other AI models.

Methods

Fifty San’ani Arabic terms were translated by both models, and the outputs were assessed for semantic fidelity, cultural relevance, and contextual accuracy.

Results

The study findings reveal that, while both models demonstrate a foundational understanding of Standard Arabic (SA), their performance diminishes considerably when faced with the nuances and idiomatic expressions of the San’ani Arabic dialect. ChatGPT performs relatively better in certain cases, particularly when translating terms with dialectal connotations. However, both models exhibit limitations, such as literal translation, misinterpretation, or a complete failure to capture the intended meaning.

Conclusions

The study concludes by highlighting the critical need for dialect-aware AI development and provides recommendations for improving the dialectical accuracy and cultural sensitivity of AI model translation.

Keywords

Dialect translation, ChatGPT, DeepSeek, San’ani Arabic, dialectal and cultural nuances

Corresponding author: Mohammed Q. Shormani

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2026 Shormani MQ and Al-Samki AA. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Shormani MQ and Al-Samki AA. Translating dialects between ChatGPT and DeepSeek: Yemeni San’ani Arabic terms as a case-in-point [version 2; peer review: 1 approved, 1 approved with reservations, 1 not approved]. F1000Research 2026, 14:694 (https://doi.org/10.12688/f1000research.165879.2) First published: 16 Jul 2025, 14:694 (https://doi.org/10.12688/f1000research.165879.1) Latest published: 16 Jan 2026, 14:694 (https://doi.org/10.12688/f1000research.165879.2)

Revised Amendments from Version 1

This revised version of the article substantially extends and refines the previously published study in terms of theoretical grounding, methodological rigor, and analytical depth. The literature review has been extended to focus more explicitly on Arabic dialect machine translation and now includes updated references on DeepSeek. Methodologically, the study now provides a more transparent account of data selection criteria, evaluation criteria, and reproducibility, including clearer definitions of (in)correctness and appropriateness. It also offers specific recommendations for AI developers concerning YSA, expands the limitations to cover the use of isolated words in the study, and suggests further research to address these limitations.

See the authors' detailed response to the review by Liang Ding
See the authors' detailed response to the review by Cao-Tuong DINH

1. Introduction

In an era of global communication and rapid technological and digital advancement, and with a view to easing exchange and building cultural relationships among nations that speak different languages, conventional manual translation has been revolutionized by the integration of automated machine translation and artificial intelligence (AI) tools, which now play a significant part in translation procedures (Alafnan, 2024; Koehn, 2009; Ali et al., 2023; Çetin & Duran, 2024; Puppel & Borg, 2024). Several AI tools have been employed in translation, the most prominent of which include ChatGPT, DeepSeek, DeepL, Google Translate, Babylon, Amazon Translate, Yandex Translate, Systran, and Bing Microsoft Translator. There is considerable debate about how efficiently such tools can be used and whether they can replace humans in the translation process, and millions of people use them daily without scrutiny or evaluation (Çetin & Duran, 2024; Postigo, 2024). The output of such tools, however, is not error-free; many studies have therefore examined their effectiveness, not to obtain perfect translation models but to reduce inadequacy in translation outcomes and improve their outputs.

Building on such studies, this study focuses on the efficiency of two notable AI models, viz. ChatGPT and DeepSeek (cf. Çetin & Duran, 2024), and compares their performance in translating 50 San’ani dialectal terms into English, since dialects are of great importance in linguistic studies and in cross-cultural global communication (Bamunusinghe & Bamunusinghe, 2014). The study also aims to identify the strengths and weaknesses of both models, and its insights are expected to help designers of AI translation tools, as well as human translators, optimize the efficiency of these tools. The significance of the study lies in this comparative evaluation: to the best of our knowledge, this is the first academic study to address the efficiency of the DeepSeek model and compare it to ChatGPT with respect to the San’ani dialect. San’ani Arabic is spoken in Sana’a, the capital of Yemen, and its governorate. Like other Yemeni Arabic varieties, San’ani Arabic is an understudied language variety whose distinct linguistic features call for closer attention from linguists (cf. Shormani, 2019).

The remainder of this paper is organized as follows. Section 2 sets out the conceptual framework, giving a historical overview of machine translation and of two of its currently notable AI models. Section 3 outlines the translation of dialects and reviews related studies. Section 4 presents the study methodology. Section 5 analyzes and discusses the results. Section 6 presents the study conclusions, recommendations, and limitations.

2. Conceptual framework

2.1 Machine translation systems

The implementation of technology in translation has a long history; however, it has become especially prominent in recent decades (Postigo, 2024). Machine Translation (MT) systems “have been at the forefront of translation technology since the 1950s” and have undergone major developments (Postigo, 2024: 1; Alafnan, 2024; Çetin & Duran, 2024). Over time, they have moved toward more closely mirroring natural language processing, a striking shift in translation technology (Koehn, 2009; Postigo, 2024). MT is “one of the applications studied in computational linguistics” (Alafnan, 2024, p. 21). Its processes are based on codes, encoding, and decoding.

MT has three main approaches, categorized according to how they function: Rule-based Machine Translation (RMT), which extended from roughly the 1960s to the early 1990s; Statistical Machine Translation (SMT), which extended from the early 1990s to the 2010s; and Neural Machine Translation (NMT), which took shape around 2010 and continues to the present (Hutchins, 1986; Koehn et al., 2003; Hofmann et al., 2010; Wu et al., 2016; Castellani, 2017; Vaswani et al. 2017; Elkaffash, 2020; Shormani, 2024a & b; Alafnan, 2024; Çetin & Duran, 2024; Postigo, 2024; Sindhuja, 2021; Shormani & Alfahd, 2025). The first is based on predefined forms, rules, and lexemes posited by expert linguists, according to which translations are generated. It revolves around mapping word forms and phrase structure rules in the source language (SL) onto counterpart word forms and phrase structures in the target language (TL). This approach is best represented by Systran and is suitable for translating simple texts rather than complex linguistic constructions and idioms (Shormani & AlSohbani, 2025). The second approach is based on statistics drawn from corpora of phrase strings rather than just word forms: large amounts of text in two languages are analyzed, patterned, and modeled for use in translating forthcoming texts (Shormani & AlSohbani, 2025). However, it still has grammatical and semantic weaknesses, as well as contextual limitations. The third approach, the most widely used today, differs from the other two in that it uses deep neural networks and is more contextualized. That is, when translating a text, it considers the context of the text as a whole, providing better translations than RMT and SMT (Shormani & AlSohbani, 2025). The three approaches are discussed further in the following paragraphs.

Rule-based machine translation was the earliest approach to MT; it emerged in the 1960s as a solution to linguistic barriers in cross-border communication. Initially, RMT was developed to assist the US Air Force in translating Russian documents during the Cold War, with the aim of facilitating intelligence gathering, diplomatic efforts, and communication among people (Hutchins, 1986). The term “rule-based” refers to the application of grammatical rules, syntactic parsing, and predefined linguistic structures to convert text from one language to another (see e.g., Hutchins, 1986; Koehn et al., 2003; Shormani & AlSohbani, 2025). RMT relies on explicit human-crafted rules rather than learning from data. Linguists and computational experts manually define the syntax, morphology, and semantics of both source and target languages, creating structured frameworks for MT (cf. Shormani & AlSohbani, 2025; Shormani, 2024b). This approach requires extensive linguistic expertise, making it both resource-intensive and time-consuming. However, the deterministic nature of RMT ensures predictable translations for specific language pairs, making it a valuable tool for structured and well-defined text types such as legal and technical documents (Hutchins, 1986).

RMT systems perform well when translating between typologically similar languages with similar syntactic structures, as in the case of English and German, where word order and syntax follow relatively similar patterns. However, these systems struggle with languages that have significant structural differences, such as English and Korean, owing to variations in sentence formation and morphological complexity (see also Castellani, 2017). Additionally, RMT finds it challenging to process idiomatic and proverbial expressions (cf. Shormani, 2020), which often do not have direct equivalents in the target language. Cultural nuances, figurative language, and polysemous words (i.e., words with multiple meanings) pose major obstacles, leading to awkward or inaccurate translations. For example, the English idiom kick the bucket, meaning to die, would be translated into Arabic by RMT literally as يركل الدلو, which is far from “to die” and thus loses its intended meaning in Arabic. Another structure that causes difficulty for RMT is center-embedding, where clauses are nested within other clauses, resulting in a complex sentence (cf. Shormani, 2025a). This nesting often results in poor translations because RMT has difficulty processing such structures. These issues highlight the limitations of a purely rule-based approach and demonstrate the need for more flexible translation methods.
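The literal rendering of the idiom described above can be illustrated with a minimal sketch. The two-entry lexicon and the function name `translate_word_for_word` are hypothetical and invented purely for illustration; they do not reproduce any actual RMT system, and the Arabic glosses simply follow the example in the text.

```python
# Toy word-for-word lexicon (hypothetical; not taken from any real RMT system).
LEXICON = {
    "kick": "يركل",
    "the bucket": "الدلو",
}

def translate_word_for_word(phrase: str) -> str:
    """Greedy longest-match dictionary lookup with no idiom handling."""
    words = phrase.lower().split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest span first so multi-word entries such as "the bucket" match.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in LEXICON:
                out.append(LEXICON[chunk])
                i = j
                break
        else:
            out.append(words[i])  # Unknown word: pass it through unchanged.
            i += 1
    return " ".join(out)

# The idiom is rendered piece by piece, so the sense "to die" is lost.
literal = translate_word_for_word("kick the bucket")
```

A system of this kind could only recover the idiomatic sense if the whole phrase were listed as a lexicon entry of its own, which is one way to picture why hand-crafted rule sets scale poorly.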

Another major problem with RMT is its scalability. Because linguistic rules had to be manually crafted and refined for each language pair, developing RMT for multiple languages was both effort- and time-consuming, and costly. Maintaining and updating rule sets requires continuous input from expert linguists, which makes it difficult to adapt to new linguistic changes or variations (see also Castellani, 2017; Elkaffash, 2020). Additionally, RMT is unable to effectively handle the vast diversity of natural language, as language evolves with time, context, and usage. Despite these limitations, RMT has laid the groundwork for future advancements in machine translation by emphasizing the importance of linguistic structure. Recognizing the inefficiencies of RMT, researchers have sought alternative approaches that can handle translation tasks more dynamically and efficiently (Brown et al., 1993; Elkaffash, 2020; Shormani & AlSohbani, 2025). This has led to the emergence of Statistical Machine Translation (SMT), which has shifted from manually defined rules to a probabilistic, data-driven approach to translation.

Statistical machine translation emerged in the 1990s as a probabilistic alternative to RMT, leveraging large bilingual corpora and statistical models to improve translation accuracy (see e.g., Hutchins, 1986; Shormani, 2024a & b). Unlike RMT, which relies on predefined linguistic rules, SMT introduces a probabilistic approach using vast amounts of parallel texts to predict the most accurate translations (Brown et al., 1993; Elkaffash, 2020). At its core, SMT operates by analyzing bilingual corpora, where sentences in one language are aligned with their corresponding translations in another. The system then applies statistical models to determine the probability of a given target language sentence based on source language input. One of the earliest and most influential SMT frameworks was IBM’s model series, which introduced word-alignment techniques and phrase-based translation methods (Koehn et al., 2003). Another major innovation in SMT is the noisy-channel model (see e.g. Neubig et al., 2010; Hofmann et al., 2010; Saito et al., 2012), which treats translation as a decoding process in which the most probable output is selected based on statistical similarity (Koehn et al., 2003). These statistical methods significantly improve translation fluency compared to RMT by allowing the system to adapt dynamically to large datasets.
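In its standard textbook formulation, the noisy-channel model mentioned above can be written as follows. The symbols $f$ (source sentence) and $e$ (candidate target sentence) are introduced here for illustration and are not taken from the studies cited.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

Here $P(f \mid e)$ is the translation model estimated from aligned bilingual corpora, and $P(e)$ is the target-language model that favors fluent output; the second equality follows from Bayes' rule, since $P(f)$ is constant for a given input.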

However, SMT faces several challenges, particularly in handling complex syntactic structures and long-range dependencies (Hofmann et al., 2010). Since SMT relies on phrase-level probabilities rather than deep linguistic understanding, it often struggles with sentence coherence and grammatical correctness. Thus, while SMT can accurately translate isolated words or short phrases, it often fails to maintain the logical flow of longer sentences, resulting in disjointed or incoherently structured outputs. Additionally, SMT models require extensive training data, so low-resource languages such as Hindi, Marathi, and Irish are often inadequately represented (Shormani & AlSohbani, 2025; Shormani, 2024b & c). As a result, translations of these languages tend to be less reliable, with significant errors in syntax and semantics. Early versions of Google Translate, which originally relied on SMT (Elkaffash, 2020), demonstrated these weaknesses by frequently generating grammatically inconsistent and contextually inaccurate translations. Although SMT improved translation accuracy compared to RMT, it was still far from achieving human-level accuracy.

One of the major limitations of SMT is its dependence on large, high-quality bilingual corpora. The availability of these corpora varies greatly between languages, with high-resource languages such as English, Spanish, and French benefiting from better-trained models, while low-resource languages remain underrepresented. Moreover, as noted by Elkaffash (2020), SMT requires significant human intervention to refine translations, known as the post-editing process (cf. Groves & Dag, 2009; Krings, 2001; de Almeida & O’Brien, 2010), which is often necessary to correct errors in grammar and meaning. Recognizing these shortcomings, researchers have sought more advanced techniques that can better capture the nuances of natural language and improve contextual accuracy. In response to these challenges, neural AI translation was introduced, as we will see below, marking a significant leap forward in machine translation through the use of deep learning and artificial neural networks.

NMT aims to overcome the limitations of SMT by modeling entire sentences as continuous representations, thereby allowing for greater contextual understanding and fluency. NMT revolutionized the field of MT in the 2010s by introducing deep learning and artificial neural networks. Unlike SMT, which translates text at the phrase level using probabilistic models, NMT considers entire sentences and their broader contextual relationships. This approach allows for more fluent, coherent, and natural-sounding translations, reducing the disjointed outputs that are common in SMT. The fundamental shift brought about by NMT was its ability to process words not as isolated units but as parts of a continuous representation, using deep learning techniques to capture the semantic and syntactic structures of a sentence; this became particularly evident with the later introduction of transformers and vector representations. Early NMT models were based on Neural Networking Algorithms (NNAs) and Recurrent Neural Networks (RNNs), which helped maintain sequential dependencies during translation. However, RNNs struggle with long-range dependencies, meaning that the quality of translation decreases for longer and more complex sentences. This limitation prompted the search for more effective architectures that could process language in a nonsequential, context-aware manner.

A major breakthrough came in 2017 when Vaswani and colleagues introduced the transformer architecture, which replaced RNNs in many NMT models and underlies systems such as ChatGPT (see also Lee, 2023; Siu, 2023; Kumar et al., 2024; Shormani, 2024a & b). Unlike RNNs, transformers process entire sequences of words, phrases, and sentences simultaneously, making them significantly more efficient and capable of handling long-range dependencies (Vaswani et al., 2017). The self-attention mechanism in transformers allows models to weigh the importance of different words in a sentence, helping translations preserve their meaning across languages. This innovation has led to the rise of Large Language Models (LLMs), such as OpenAI’s Generative Pre-trained Transformer (GPT), which leverage transformers to produce more contextually aware and fluent translations. NMT dramatically improved translation accuracy, minimized common issues such as word-for-word errors, and enhanced contextual understanding. As a result, major translation platforms, including Google Translate and DeepL, have adopted NMT to replace older SMT-based models. The continued refinement of transformer-based architectures has brought machine translation closer to human-level translation output (see also Jiao et al., 2023; Lee, 2023; Siu, 2023; Kumar et al., 2024). Another major advancement in the AI translation industry is the incorporation of neural vectors.
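To make the idea of weighing the importance of different words concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It rests on simplifying assumptions: the function names and the toy two-dimensional vectors are invented for illustration, and the learned query, key, and value projections used in real transformers are omitted, so the inputs serve as all three.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs.

    Real transformers first apply learned projection matrices; they are
    left out here to keep the sketch short.
    """
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Similarity of this token to every token in the sequence, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all token vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights

# Three toy 2-dimensional "word" vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to one and records how strongly one token attends to every other token. Because all rows are computed from the full sequence at once, nothing in the computation depends on processing the tokens in order, which is the contrast with RNNs drawn above.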

Despite its advancements, however, NMT still faces challenges, particularly when dealing with culture- and religion-based texts (see e.g., Shormani & AlSohbani, 2025; Shormani, 2025b). Unlike other types of texts, such as technical and scientific texts, culture- and religion-based texts require more than simply rendering wordings (Shormani, 2020); they demand a deep understanding of cultural, religious, and historical contexts (Shormani, 2025b). Many NMT models, including ChatGPT, struggle to accurately translate idiomatic expressions, culturally specific texts, and religious terminology (cf. Shormani, 2020). This is because AI models primarily learn from vast amounts of (Internet) data, which may not include data containing these cultural nuances or the sensitivities of religious discourse. Additionally, ethical concerns arise when translating sensitive materials, as different cultures have varying interpretations of words and concepts (Shormani & AlSohbani, 2025). Addressing these limitations requires further advancements in AI translation training data, methods, dataset diversification, and post-editing to refine NMT’s ability to handle complex cultural and religious texts effectively (see e.g., Shormani, 2024a & b).

ChatGPT was introduced in 2022 by OpenAI. It is based on Large Language Models, which result from the syntactic, morphological, logical, and algorithmic processing of a large number of language samples, patterns, dictionaries, and information; this enables it to generate new outputs and to adhere more closely to contextual background when performing new tasks (Gill & Kaur, 2023; Çetin & Duran, 2024; Puppel & Borg, 2024). Built on Generative Pre-Trained Transformer (GPT) systems, specifically GPT-3.5 and GPT-4, ChatGPT can perform many tasks, including transforming data, translating texts, post-editing, translation evaluation, summarizing, and responding to users’ inquiries and conversations (Gill & Kaur, 2023; Jiang et al., 2024; Macken, 2024; Puppel & Borg, 2024). It can generate written, visual, or auditory outputs that closely resemble those produced by humans. It essentially relies on deep learning and Natural Language Processing, a branch of AI that teaches machines to comprehend human language and produce similar output (Gill & Kaur, 2023).

DeepSeek, an accessible open-source model, emerged in January 2025 as a product of sustained technological advancement and as a novel alternative that competes with other AI tools on the market through its high scalability and efficient performance (Guo et al., 2024; Joshi, 2025; Peng et al., 2025; Wang & Kantarcioglu, 2025). Like ChatGPT, DeepSeek is a transformer-based Large Language Model (LLM) that performs various tasks in areas such as mathematical and structural reasoning, language processing, problem solving, finance, and healthcare diagnosis. It is comprehensively pretrained on high-quality linguistic and syntactic corpora of code (Guo et al., 2024). It features a Mixture of Experts architecture and Multi-Head Latent Attention (Joshi, 2025; Peng et al., 2025; Wang & Kantarcioglu, 2025). Despite facing challenges in innovative tasks and user safety, DeepSeek has many advantages and a promising future that are reported to exceed those of other AI tools, including ChatGPT and Codex (Guo et al., 2024; Joshi, 2025). Some of its major advantages are grammatical accuracy and contextual evaluation (Joshi, 2025).

3. Translating dialects

Translating nonstandard dialects presents significant challenges for human translators because of their lack of formal codification, regional variation, and deep cultural embedding. Unlike standardized languages, which have established grammatical rules and extensive documentation, non-standard dialects often rely on oral traditions and are shaped by local customs, idioms, and phonetic shifts that may not have direct equivalents in the target language (Federici, 2011; Kong, 2013). Additionally, dialects often carry the connotations of social class, regional identity, and cultural heritage. Translating these aspects requires more than linguistic proficiency; cultural fluency is required to ensure that the translation resonates with the target audience while preserving the source’s authenticity. For example, when translating Swedish dialects into English, preserving local identity and authenticity is crucial, as dialects are expressions of local identity and community (Federici, 2011).

Thus, if translating dialects is difficult for human translators, MT tools can be expected to face even greater difficulty because these dialects are largely absent from their training data. Put simply, these tools struggle with non-standard dialects: their training data are typically standard-language corpora, so they may not process vernacular speech accurately, leading to errors and loss of meaning (see also Puppel & Borg, 2024). Arabic is a diglossic language with several dialects and vernaculars (see also Ferguson, 1959). In this study, we examine this issue using both ChatGPT and DeepSeek to determine the extent to which they can translate Yemeni San’ani Arabic (YSA).

Over the past decade or so, a number of studies have examined and evaluated AI translation tools, seeking to identify their efficiency and usefulness, as well as their weaknesses and limitations. Taking the English-German pair as a case study, Puppel and Borg (2024) evaluated the performance of ChatGPT and identified its strengths and weaknesses based on prompts. The most prominent strength of ChatGPT translation is the coherence of the translated output, whereas its prevailing limitation is related to style; other limitations include fluency and accuracy. The study highlighted the need for post-editing of ChatGPT output to detect and correct errors. Therefore, “the intervention of trained translators is required to correct errors and fine-tune the machine-translated text” (Puppel & Borg, 2024, p. 22).

Working with three short stories machine-translated from English into Dutch, Macken (2024) evaluated the ability of ChatGPT-4o to post-edit machine-translated literary texts automatically, compared with post-editing by experienced professional translators of literary works from English into Dutch. The study showed that the automatic changes made by ChatGPT were at the word level and that it made more lexical changes than the human editors did. In contrast, the professional editors made changes not only at the word level but also at the level of style. Overall, the study concluded that, although ChatGPT could correct a number of errors, its edited texts still contained more problems than texts post-edited by professional translators.

In a comparative study of the advantages and limitations of conventional MT systems, new chatbots, and ChatGPT in translation from English into Spanish and Portuguese, Postigo (2024) stated that there are differences between the translation output obtained from MT systems and that obtained from AI tools. Nevertheless, she found that all MT and AI tools present several grammatical and semantic challenges. Additionally, Jiang et al. (2024) conducted a mixed quantitative and qualitative study on the potential of automated evaluation of machine translation from Chinese into Portuguese by means of two ChatGPT models, i.e., 3.5 and 4.0, alongside five human raters for the sake of analytic comprehensiveness. The sample consisted of 20 sentences translated from Chinese into Portuguese. The study concluded that the capability of ChatGPT, especially the 4.0 model, reached the efficiency level of conventional human evaluation and that it has a more promising future if it is enriched with further balanced capabilities.

To examine how AI tools handle culture in their translations, Cao et al. (2024) addressed the adaptation of nuanced cultural contexts by translating Chinese recipes into English. These recipes are both automatically formed and human-enriched. The study concluded that, although GPT-4 showed impressive capability in adapting culturally specific recipes, it still fell short of human ability and expertise. Additionally, Deilen et al. (2023) examined intralingual machine translation from German into Easy Language using ChatGPT. The analysis was based on readability, correctness, and syntactic complexity. The results showed that the content was not always translated correctly and that some content was missing, while syntactic constructions were simplified but not to the extent required. The paper concluded that professional human translators are still needed to pre-instruct the tool and to edit its translated output.

3.1 Translating Arabic dialects

Early work on translating Arabic dialects into foreign languages has predominantly relied on Standard Arabic (SA), also referred to as Modern Standard Arabic (MSA), as a pivot language and on hybrid or statistical machine translation approaches. For example, Sawaf (2010) proposed a hybrid MT system combining rule-based and statistical methods, showing that normalizing dialectal input into MSA significantly improves translation quality across multiple Arabic dialect groups. Similar normalization-based strategies were adopted to reduce out-of-vocabulary (OOV) rates and enhance BLEU scores, particularly for noisy web and broadcast data (Salloum & Habash, 2012). These studies collectively demonstrate that character-level transformations, morphological analysis, and dialect-to-MSA mapping are effective for mitigating lexical and orthographic variation in dialectal Arabic.
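As a rough illustration of why dialect-to-MSA normalization lowers the OOV rate, consider the minimal sketch below. It is built on assumptions: the function name `oov_rate`, the placeholder tokens, and the one-entry mapping are all hypothetical and are not taken from the systems cited above.

```python
def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are absent from the training vocabulary."""
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unseen / len(test_tokens)

# Placeholder tokens: "d1" and "d2" stand for dialectal forms, while
# "s1".."s3" stand for standard forms known to the MT system.
vocabulary = {"s1", "s2", "s3"}
dialect_input = ["d1", "s2", "d2", "s3"]

# Hypothetical dialect-to-MSA mapping used for normalization.
normalization = {"d1": "s1"}
normalized = [normalization.get(tok, tok) for tok in dialect_input]

before = oov_rate(dialect_input, vocabulary)  # 2 of 4 tokens are unseen
after = oov_rate(normalized, vocabulary)      # 1 of 4 tokens is unseen
```

In this toy setting, every dialectal form that the mapping covers stops counting as unknown, while forms without a mapping, such as `d2`, remain out of vocabulary.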

Subsequent research moved beyond single-system approaches by explicitly addressing dialectal variation and system complementarity. Salloum and Habash (2012) showed that generating multiple MSA paraphrases for dialectal input and using them in SMT improves translation quality, while Salloum and Habash (2012) introduced sentence-level dialect identification to dynamically select the most appropriate MT system. Large-scale efforts under DARPA’s BOLT program further confirmed that combining MSA and dialectal data, applying morphological segmentation, and carefully balancing training corpora yield superior performance for informal Arabic texts that mix registers and dialects (Zbib et al., 2012; Aransa, 2015). These findings underscore the importance of treating Arabic dialects and SA as related but distinct linguistic domains in MT.

More recent studies have focused on morphological segmentation, OOV handling, domain adaptation, and non-standard writing systems. Unsupervised segmentation techniques were shown to improve translation quality for dialect-to-English and English-to-dialect MT (Al-Mannai et al., 2014), while targeted OOV normalization using dialect identification and morphological tools yielded consistent gains (Durrani et al., 2014). In addition, research has highlighted the growing challenge posed by Arabizi, a non-standard Latin-script representation of Arabic dialects widely used in social media, which introduces substantial orthographic variability and further complicates MT for dialectal Arabic. Thus, the literature converges on the view that effective Arabic dialect MT requires dialect-aware preprocessing, adaptive modeling strategies, and robust handling of linguistic and orthographic variation.

Alafnan (2024) examined the effectiveness of ChatGPT and Google Translate in translating selected Arabic and English speeches of King Abdullah II of Jordan, from Arabic into English and from English into Arabic. The study revealed that Google Translate’s outputs were inadequate and required major editing. ChatGPT’s outputs, especially from Arabic into English, were acceptable, though they needed some adjustment. However, the study emphasized that machine translation complements professional human translators and does not replace them, since human mediation is indispensable for translation adequacy.

Thus, the study has three questions to answer:

1. Can AI models, specifically ChatGPT and DeepSeek, capture the YSA dialectical nuances?
2. Which is better in translating YSA terms, ChatGPT or DeepSeek?
3. What are the problematic nuances that ChatGPT and DeepSeek face in translating such terms?

4. Methods

4.1 Data collection

We collected the study data from different sources on the web, such as Wikipedia, and from the first author’s relatives, who are native speakers of San’ani Arabic. The dataset contained 50 terms. Our criteria for selecting these 50 terms were as follows: i) they should be representative, viz., belonging to places, clothes, animals, stuff, and households; ii) they should give us enough room to have a representative sample of aspects of YSA culture; iii) they should represent all lexical categories, viz., nouns, verbs, adjectives, and adverbs; and iv) they should allow for both linguistic and cultural analyses. After collecting the data, the terms were classified into categories. The results are presented in Table 1.

Table 1. Categories of Yemeni San’ani Arabic terms.

Nominals					Nonnominals
Places	Clothes	Animals	Stuff	Households	Verbs	Adj & Advs
شاقوص	مُغْمُق	دِمِّة	شِرْكِة	طاسة	نِبْدَع	مطنن
صومعة	مَحْزَق	حلباني	جَعالة	بالدي	وخّر	شوعة
دَيْمِة	صُماطة	تيس	قِرِّيْح	بُورِي	قوٌّى	حالي
قاع	أدوان	معزة	تـُتنْ	مَعْشَرَة	كوزر	بحين
غُرقة	مشمع	شقري	بيعة	مَدَق	سير	أيحين
دِرجان	بَرْدِة			مَلَتْ	بيشقص	فيسع
حانوت	نصلة			قوطي	يلغّج	لَلْمه
	زنة			المدعي	سمخ	سَعْما
				محواش

As shown in Table 1, the data were divided into seven types: Places, Clothes, Animals, Stuff, Households, Verbs, and Adj & Advs. The idea behind this categorization is twofold: (i) to make the data easy to present and (ii) to clarify the nature of each category. The category Places contained 7 terms, Clothes 8, Animals 5, Stuff 5, Households 9, Verbs 8, and Adj & Advs 8.

4.2 Procedure

The study passed through three stages. Stage 1 concerned data collection and classification, in which we collected the data from different sources, as alluded to above, and classified them. Stage 2 consisted in translating the collected data into English using ChatGPT-4o and DeepSeek v3. After translating the data using both AI models, we translated the terms ourselves. Our translation depended heavily on two factors: i) our knowledge of YSA and ii) the first author’s relatives who are native speakers of YSA. If we did not know what a term meant, we asked them to give us the meaning in SA or to explain it, for instance, by giving us an example. Once we understood the meaning, the translation task became easier. We then tabulated the results in terms of ChatGPT, DeepSeek, and human translations. Stage 3 dealt with the data analysis and discussion.

4.3 Methods of analysis

The methodology combined qualitative and quantitative approaches to evaluate the accuracy and contextual appropriateness of translations. For the quantitative analysis, we used simple counts and percentages. For the qualitative part, we adopted a linguistic and sociocultural analysis framework, known as micro and macro analysis (see e.g. Fairclough, 2003). We assessed the (in)correctness and appropriateness of the translations of dialectical terms produced by the two AI models, ChatGPT and DeepSeek, against the human translation. We considered the translation of a term correct if both its cultural and linguistic aspects were maintained. If one of these aspects was violated, we considered the translation appropriate, and if both were violated, we considered it incorrect. We set the following evaluation criteria: i) dialectal and cultural subtleties, in which we examined whether both AI models capture these nuances and which is more successful; ii) linguistic subtleties, in which we assessed whether both models capture linguistic subtleties that lie within morphophonology/orthography and syntax; iii) contextual subtleties, in which we investigated whether the models’ translations reflect the intended meaning in different contexts; and iv) Standard Arabic (SA) biases, in which we evaluated whether the translations produced by both models are biased towards SA.
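The three-way labelling rule described above can be sketched in a few lines of Python. This is only an illustration of the decision rule as stated in the text, not the procedure the authors actually ran; the function name `classify_translation` and the two boolean inputs are hypothetical, and the judgments passed in would still have to come from a human rater.

```python
def classify_translation(cultural_ok: bool, linguistic_ok: bool) -> str:
    """Label a translation from two human judgments (hypothetical helper).

    correct     : both the cultural and the linguistic aspect are maintained
    appropriate : exactly one of the two aspects is violated
    incorrect   : both aspects are violated
    """
    if cultural_ok and linguistic_ok:
        return "correct"
    if cultural_ok or linguistic_ok:
        return "appropriate"
    return "incorrect"


# Illustrative calls; the boolean judgments here are assumptions, not study data.
print(classify_translation(True, True))    # correct
print(classify_translation(False, True))   # appropriate
print(classify_translation(False, False))  # incorrect
```

Keeping the rule in one small function makes the boundary between "appropriate" and "incorrect" explicit, which is where rater disagreement is most likely to arise.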

5. Results and discussion

5.1 Results

We first present the results in the form of ChatGPT, DeepSeek, and human translations. Table 2 presents these findings.

Table 2. AI Translation (ChatGPT and DeepSeek) and Human translation.

No	Term	ChatGPT	DeepSeek	Human
1	شاقوص	Hook (or a type of tool with a hooked end)	A type of fish (likely a local or dialectal term)	A small and narrow window
2	صومعة	Silo/Minaret/Monastery	Silo or minaret	minaret
3	دَيْمِة	Continuous rain (light drizzle)	Light rain or drizzle	Kitchen
4	قاع	Valley/Lowland/Open land	Bottom or base	Bottom
5	غُرقة	Room	Drowning or flood	hole
6	دِرجان	Stairs	Stairs or steps	Stairs
7	حانوت	Shop/Store	Shop or small store	shop
8	مُغْمُق	Dark/Dim	Confused or muddled	Veil
9	مَحْزَق	Tight place/Constricted area	Narrow or tight spot	belt
10	صُماطة	Leftover food/Small food scraps	A small bag or pouch	shawl
11	أدوان	Enemies	Tools or utensils	clothes
12	مشمع	Waxed cloth/Plastic cover	Oiled cloth or tarpaulin	Plastic cover
13	بَرْدِة	Curtain	A type of cloak or garment	Curtain
14	نصلة	Blade/Tip of a knife or sword	Arrowhead or sharp point	Djambia
15	زنة	Weight	Weight or measure	garment
16	دِمِّة	Silence/Stillness	A type of plant or herb (context-dependent)	cat
17	حلباني	Milkman	Related to milk or milking	Millipede
18	تيس	Male goat	Male goat	Male goat
19	معزة	Female goat	Female goat	Female goat
20	شقري	Blond/Fair-skinned	Chestnut-colored (often used for horses)	Cock
21	شِرْكِة	Company/Partnership	Partnership or company	meat
22	جَعالة	A small reward or treat	Reward or compensation	Sweet presents
23	قِرِّيْح	Wound/Injury	A type of plant or herb	firecrackers
24	تـُتنْ	Tobacco	Tobacco	Tobacco
25	بيعة	Sale/Transaction	Oath of allegiance or sale	leek
26	نِبْدَع	We start/We begin	We innovate or create	We start
27	وخّر	Move away/Step aside	He delayed or postponed	Step aside
28	قوٌّى	Strengthened/Made stronger	He strengthened or empowered	welcome/please
29	كوزر	Drank excessively	A type of plant or herb	sit
30	سير	Go/Walk/Travel	Walk or journey	go
31	بيشقص	Cuts into pieces/Trims	He cuts or slices	oversee
32	يلغج	Chews noisily/Mumbles while eating	He combs or untangles	Talkative
33	طاسة	Bowl/Metal cup	A small bowl or cup	bowl
34	بالدي	Cold/Chilled	Cold or chilly	Bucket
35	بُورِي	Pipe/Exhaust pipe	A type of mat or straw	Tobacco container
36	مَعْشَرَة	Gathering/Assembly	Companionship or group	Big and wide bowl
37	مَدَق	Mortar and pestle (used for grinding)	He pounded or crushed	pestle
38	مَلَتْ	Got bored/Fed up	She tilted or inclined	Wooden pot used to mix tobacco
39	قوطي	Can/Tin	A type of plant or herb	Can
40	مدعي	Pretender/Faker	Claimant or plaintiff	hookah
41	محواش	Farmyard/Enclosure	He erased or removed	Small wooden mixer used for Aseed
42	مطنن	Distracted/Not paying attention	He exaggerated or embellished	upset
43	شوعة	Bright light/Flash	A small branch or twig	ugly
44	حالي	Sweet/Delicious	My condition or state	beautiful
45	بحين	When/At the time	At the time or meanwhile	fast
46	أيحين	Right now/At this moment	When or at what time	when
47	فيسع	Quickly/Fast	He expands or makes room	quickly
48	لَلْمه	Gather it up/Collect it	A type of plant or herb	why
49	سَعْما	Sometimes	A type of plant or herb	like
50	سمخ	Jumped/Leaped	He raised or elevated	brave

In Table 2, there are several types of translations produced by ChatGPT and DeepSeek. Both models provided correct, incorrect, and appropriate translations. For some terms, they provided correct and incorrect translations simultaneously. As for ChatGPT, its incorrect translations include term 1, viz. شاقوص , which ChatGPT translated as ‘Hook (or a type of tool with a hooked end)’. The translation is incorrect because ‘hook’ is far from the meaning of the Yemeni San’ani Arabic term, whereas the human translation, ‘a small and narrow window,’ is correct. The شاقوص is found in old houses, specifically those in the Old Sana’a City. It is located on the stairs (or even in rooms) and is used to look outside without being noticed. For example, if someone knocks on the door and only women are in the house, they use the شاقوص to see who is knocking before opening the door. To exemplify the correct translations by ChatGPT, take the term دِرجان , which corresponds to SA دِرج , a plural form of دِرجة ‘stair.’ As for appropriate translations by ChatGPT, take, for example, the term جَعالة , which is a mixed sweet present, including biscuits and sweets, brought by a father, mother, or older brother for children. ChatGPT translates this term as ‘a small reward or treat,’ which is acceptable but not fully correct; the full sense is conveyed by the human translation.

Additionally, DeepSeek translations result in several types of renderings. We obtained seven correct translations, 34 incorrect translations, and eight appropriate translations. Like ChatGPT, DeepSeek provides both correct and incorrect translations simultaneously.

DeepSeek has several correct translations. For example, the term أيحين was translated as ‘When or at what time.’ أيحين is in fact a wh-word in YSA which is used for asking about time (see also Shormani et al., 2025). To exemplify the incorrect translations by DeepSeek, the term شاقوص was translated as ‘a type of fish.’ This translation is far from the correct translation, which is ‘a small and narrow window.’ As for appropriate translations by DeepSeek, take, for instance, the term بَرْدِة , which means ‘curtain’ as in the human translation, but which DeepSeek translates as ‘a type of cloak or garment.’ This translation is acceptable because بَرْدِة is a piece of cloth used to cover windows. DeepSeek also provides both correct and incorrect translations, as in the case of صومعة , which was translated as ‘Silo or minaret,’ the former of which is incorrect while the latter is correct. ‘Silo’ is incorrect because a silo is used for storing crops after harvesting, while ‘minaret’ (of a mosque) is what is meant here.

Table 3 displays the frequency and percentage of correct, incorrect, and appropriate translations by ChatGPT and DeepSeek. It also presents the frequency and percentage of terms for which a model provides correct and incorrect translations simultaneously. For ChatGPT, there were 17 (34%) correct translations. It translated 29 terms incorrectly, that is, 58% of the total number of terms involved. There are three appropriate translations by ChatGPT (6%). ChatGPT provides correct and incorrect translations at the same time for only one term, namely صومعة , as discussed above (Table 3). DeepSeek differs from ChatGPT. For example, it translated only seven terms correctly, fewer than ChatGPT, amounting to 14% of the total number of terms involved in the study. There are 34 terms that were incorrectly translated by DeepSeek (68%). Appropriate translations amounted to eight terms, that is, 16%. Finally, DeepSeek provides both correct and incorrect translations for one term, namely صومعة , as just noted with regard to ChatGPT. Interestingly, both AI models provided correct and incorrect translations for the same term.

Table 3. Summary of the results.

ChatGPT			DeepSeek
Tra. category	Freq	%	Freq	%
Correct	17	34	7	14
Incorrect	29	58	34	68
Appropriate	3	6	8	16
Co&inc	1	2	1	2
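The percentages in Table 3 follow directly from the frequencies and the total of 50 terms. The short Python sketch below reproduces that arithmetic and checks that each model’s frequencies add up to 50; the variable and function names are illustrative only and are not taken from the study.

```python
TOTAL_TERMS = 50  # size of the dataset reported in Section 4.1

# Frequencies copied from Table 3.
table3 = {
    "ChatGPT": {"Correct": 17, "Incorrect": 29, "Appropriate": 3, "Co&inc": 1},
    "DeepSeek": {"Correct": 7, "Incorrect": 34, "Appropriate": 8, "Co&inc": 1},
}


def to_percentages(freqs: dict, total: int = TOTAL_TERMS) -> dict:
    """Convert category frequencies to percentages of the full term list."""
    if sum(freqs.values()) != total:
        raise ValueError("category frequencies do not add up to the total")
    return {category: 100 * n / total for category, n in freqs.items()}


for model, freqs in table3.items():
    print(model, to_percentages(freqs))
```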

5.2 Discussion

5.2.1 ChatGPT translations

As shown in Table 2, ChatGPT yielded 17 correct translations. These correct translations include دِرجان , translated as ‘stairs.’ This is accurate because in Standard Arabic, دِرجان is the plural form of دِرجة , meaning ‘stair.’ This term is widely used in YSA to refer to a set of stairs leading to another floor of a multifloored building. Another accurate translation is that of قاع , rendered as ‘valley/lowland/open land’. In YSA, قاع refers to a flat, low-lying area of land, often used for agriculture. The term is commonly used in regions where such landscapes are found, so ChatGPT translation is correct. Similarly, حانوت was correctly translated as ‘shop/store’. This word, originating from “older” Yemeni Arabic, remains in use in YSA to describe a small store or shop selling various goods. This aligns well with ChatGPT translation. Another well-rendered term is تـُتنْ , translated by ChatGPT as ‘tobacco.’ تـُتنْ is a commonly used word in YSA for the substance smoked in a hookah, making this translation highly accurate. The term بَرْدِة was also correctly translated by ChatGPT as ‘curtain.’ The term is widely used in YSA; it belongs to household items and refers to a piece of cloth covering a window or doorway.

Additionally, the term نِبْدَع was accurately translated by ChatGPT as ‘we start/we begin’. In YSA, it is used to indicate the initiation of an action or event, such as starting work or a journey. Another correct translation is سير , translated as ‘go/walk/travel,’ though ‘go’ is the best translation, as this is what San’anis mean when using it. ‘Walk’ is expressed by another term in YSA, namely إخطى . In YSA, سير is a commonly used verb for telling somebody ‘to go,’ specifically by walking. Also, طاسة was correctly translated by ChatGPT as ‘bowl/metal cup’. In YSA, it refers to a small, often metallic container used for drinking (e.g., soup) or eating. By contrast, the term بُورِي was incorrectly rendered as ‘pipe/exhaust pipe’; it is a word for the container of تـُتنْ that is put on the hookah. مَدَق was correctly translated as ‘mortar and pestle’ (used for grinding). It is often made of copper or another metal and is an essential kitchen tool in Yemen for grinding spices or grains. The term قوطي was also correctly translated as ‘can/tin’.

We now turn to ChatGPT’s incorrect translations. This category included 29 terms. For example, ChatGPT translated شاقوص as ‘Hook (or a type of tool with a hooked end),’ which is incorrect. In YSA, شاقوص refers to a small window found in traditional houses, particularly in the Old Sana’a City. These windows are strategically placed on stairs or in rooms, allowing residents to see outside without being noticed. For example, if someone knocks on the door and only women are home, they use the شاقوص to identify the visitor before opening the door, as noted above. This highlights the cultural and architectural significance of the term that ChatGPT missed. ChatGPT translated دَيْمِة as ‘Continuous rain (light drizzle),’ which is incorrect. The human translation, ‘kitchen,’ is accurate. Though the term مطبخ is used in other parts of Yemen, there are also old Yemeni terms for ‘kitchen’ in some regions, such as سقيفة , which is used in the Ibb region. The term دَيْمِة is widely used in YSA to mean ‘kitchen’. This term is deeply tied to daily life in YSA, and ChatGPT’s translation is unrelated to its actual meaning. ChatGPT translated غُرقة as ‘room,’ which is incorrect. The human translation, ‘hole,’ is accurate because غُرقة refers to a hole in YSA. ChatGPT’s translation of this term reflects a lack of understanding of its true meaning. It translated مُغْمُق as ‘dark/dim,’ which is incorrect compared to the human translation, ‘veil.’ مُغْمُق refers to a piece of cloth that San’ani women use to cover their faces, and it has cultural and religious significance in Yemeni society. ChatGPT’s translation of this term does not capture these cultural and religious connotations. ChatGPT translated مَحْزَق as ‘tight place/constricted area,’ which is incorrect. 
The incorrectness of this translation lies in capturing neither the cultural nor the dialectical nuances. In YSA, the term مَحْزَق refers to a belt decorated with gun bullets’ holes. In Sana’a, and northern places in Yemen, مَحْزَق is used as a sign of “manhood,” courage and a signal of fighting. مَحْزَق is often worn with a gun. All these features of مَحْزَق are reflected in the human translation.

ChatGPT translated شِرْكِة as ‘company/partnership,’ which is far from the denotation of this term in YSA. The term شِرْكِة simply means ‘meat,’ a staple of Yemeni cuisine and daily life. ChatGPT translation reflects a misunderstanding of the term, likely due to the word’s similarity to the Arabic term for ‘partnership’ ( شركة ), but with orthographical differences. The latter is written as شَرِكَة ; note the different ‘harakat’, as we will see shortly.

Finally, the appropriate (acceptable) translations by ChatGPT included only three terms. In this category, ChatGPT provides a translation that is somewhat acceptable but not fully accurate as in translating the term جَعالة , translated by ChatGPT as ‘a small reward or treat’. While this conveys a general idea, the specific meaning in YSA culture refers to a mix of sweets, biscuits, and treats brought by a father, mother, or older sibling for children as a gesture of care and love. It is a traditional and cultural practice rather than just a generic ‘treat.’ Another example is بحين , translated as ‘when/at the time’. While this is an acceptable translation, the term in YSA more precisely means ‘as soon as’ or even ‘fast’. Similarly, أيحين was translated as ‘right now/at this moment’, which is mostly correct. However, in YSA, أيحين carries a sense of questioning, meaning ‘when’ as can be observed in human translation.

5.2.2 DeepSeek translations

Recall that DeepSeek has seven correct translations, 34 incorrect translations, eight appropriate translations, and one term with both correct and incorrect translations. We exemplify some of the correct translations here. For example, DeepSeek translated the term قاع as ‘bottom or base.’ This is accurate because in YSA, قاع also refers to the lowest part of something, such as the bottom of a container. However, as we have seen in ChatGPT translation, the term قاع also refers to a valley that is fertile for agriculture. There are many well-known قيعان (the plural of قاع ) in Yemen, such as قاع البون، قاع جهران (Bawn valley and Jahran valley, respectively), in which almost all types of crops are grown. Another term translated correctly by DeepSeek is دِرجان ‘stairs.’ DeepSeek adds ‘steps’ as an alternative translation, which does not reflect the actual context as well as ‘stairs.’ حانوت was also accurately translated as ‘shop or small store,’ reflecting its meaning in the local dialect. Note that this translation aligns with ChatGPT translation. DeepSeek adds ‘small,’ which is true; in San’ani dialect, حانوت is a small shop in a traditional old market like Souq almilħ, a famous Souq in Sana’a Old City. Additionally, تيس was correctly rendered as ‘male goat,’ and معزة as ‘female goat,’ both of which match the human translations. Another accurate translation is تُتن , which DeepSeek translated as ‘tobacco,’ a common term in YSA for tobacco products.

However, several incorrect translations were made. The term شاقوص , which DeepSeek translated as ‘a type of fish (likely a local or dialectal term),’ is a good case in point. This is incorrect, as شاقوص in YSA refers to a small and narrow window in old houses, particularly in Old Sana’a City, as we have noted earlier in relation to ChatGPT translation. In addition, the term دَيْمِة was mistranslated as ‘light rain or drizzle,’ whereas in YSA it means ‘kitchen’ (Table 2). Another significant mistranslation is مُغْمُق , which was rendered as ‘confused or muddled,’ whereas in YSA it actually refers to a veil.

Unlike ChatGPT, DeepSeek translated the term صُماطة as ‘a small bag or pouch,’ which is not correct. It refers to a kind of ‘shawl,’ a traditional Yemeni piece of clothing that covers men’s heads or is put on their shoulders. Another incorrect translation concerns the term زنة , which DeepSeek rendered as ‘weight or measure,’ while its actual meaning in YSA is ‘garment,’ like a robe. دِمِّة was mistranslated by DeepSeek as ‘a type of plant or herb,’ but it actually means ‘cat,’ as in the human translation of the term. The term بُورِي was translated as ‘a type of mat or straw,’ which is not correct. The correct translation is the container of تـُتنْ that is put on the hookah, as we have noted above. DeepSeek also mistranslated يلغج as ‘he combs or untangles,’ whereas it actually means to repeat something several times, or simply ‘talkative,’ in YSA. بالدي was translated as ‘cold or chilly,’ but it means ‘bucket’ in San’ani dialect. مَعْشَرَة was mistranslated as ‘companionship or group,’ but its actual meaning is ‘big and wide bowl,’ which is used for serving Aseed in San’ani dialect. مَدَق was translated as ‘he pounded or crushed,’ but it refers to a small wooden or copper mixer used for grinding, especially spices, as indicated in the human translation. مَلَتْ was rendered as ‘she tilted or inclined,’ whereas its correct meaning is a wooden or aluminum container in which tobacco is mixed.

Regarding appropriate translations, DeepSeek translated غُرقة as ‘drowning or flood,’ which is somewhat close but not fully accurate, as its proper meaning is ‘hole.’ أدوان was translated as ‘tools or utensils,’ which is acceptable but not entirely precise, as its actual meaning is ‘clothes.’ مشمع was translated as ‘oiled cloth or tarpaulin,’ which is reasonable but slightly different from the correct meaning of ‘plastic cover.’ بَرْدِة was translated as ‘a type of cloak or garment,’ while it actually means ‘curtain.’ نصلة was translated as ‘arrowhead or sharp point,’ which is somewhat close but does not fully convey its actual meaning, Djambia (the traditional Yemeni dagger). جَعالة was translated as ‘reward or compensation,’ but it should be understood as a mixture of sweets, biscuits, and treats given to children. سير was translated as ‘walk or journey,’ which is close but does not capture its everyday use in YSA as ‘go.’ بحين was acceptably translated as ‘at the time or meanwhile,’ though the human translation conveys the meaning more naturally.

6. ChatGPT and DeepSeek: Convergence and divergence

6.1 Dialectal and cultural subtleties

Language reflects culture and vice versa (see e.g. Newmark, 1988; Bassnett, 2013; Shormani, 2020), and given that dialect is a regional variety of a language, one could argue that a dialect reflects culture more precisely than the standard language because it represents a regional identity (Arzu & Issa, 2014; Daulay, 2017). A dialect carries regional variations mirroring everyday affairs, people’s needs, and emotions. Thus, a dialect is closer to people than the standard language is. In our case, YSA is closer than SA to San’anis, as it is their mother tongue. A dialect may also include lexical items that are not found in the standard language. For example, the term جَعالة belongs to YSA, but not to SA. Given these dialectical peculiarities, we examined how ChatGPT and DeepSeek deal with such dialectical subtleties and whether they both understand them. For instance, the term جَعالة refers to a gift of sweets for children brought by a father, mother, or relatives. ChatGPT translates it as ‘a small reward or treat,’ which is close to the meaning, whereas DeepSeek renders it as ‘reward or compensation,’ losing the cultural sense. The term نِبْدَع was accurately translated by ChatGPT as ‘we start/we begin’. However, DeepSeek was not able to capture this dialectical aspect, hence providing an incorrect translation, ‘we innovate or create.’

ChatGPT and DeepSeek incorrectly translated حلباني as ‘milkman’ and ‘related to milk or milking,’ respectively, which shows their failure to capture the cultural and dialectical nuances of this term. The term حلباني in YSA refers to an insect commonly found in Yemen, specifically in rainy seasons, whose English equivalent is ‘millipede.’ ChatGPT and DeepSeek also translated شقري as ‘blond/fair-skinned’ and ‘chestnut-colored,’ respectively. Neither rendering reflects the dialectical meaning or the San’ani Yemeni culture. شقري refers to a living cock, specifically when sold or bought in the market. Another term that has been mistranslated by both models is مدعي . ChatGPT translated it as ‘pretender/faker,’ while DeepSeek translated it as ‘claimant or plaintiff.’ Neither translation is correct or appropriate, as can be observed from contrasting them with the human translation, viz., hookah. Another example is محواش , which manifests both YSA dialect and culture. As for dialect, the term محواش refers to a wooden tool used for mixing Aseed. However, the same tool is referred to as مجحي ‘majhi’ in Ibbi dialect, for instance. Here lies the concept of dialectical difference. Additionally, this word may not belong to SA because, as far as we can tell, there is nothing called “Aseed” in SA. “Aseed” is a Yemeni dish; no other Arab country has it, and here lies the cultural peculiarity of the term “Aseed,” in general, and محواش , in particular. A further term reflecting San’ani dialect and culture is لَلْمه , which is a San’ani Arabic-specific term for which San’anis are sometimes mocked by Yemenis from other regions. This term is an adverb, specifically a wh-word meaning ‘why.’ A final term that reflects both San’ani dialect and culture is مغمق . As we have discussed above, the term مغمق simply means ‘veil’; though the “veil” is used all over Yemen, مغمق has a specific (dialectical) cultural connotation. 
In Sana’a, the ‘veil’ is different from any ‘veil’ used in other Yemeni regions; it is a piece of cloth worn by women only in Sana’a governorate and some parts of Amran, a governorate which was formerly part of Sana’a and has recently become an independent governorate.

6.2 Linguistic subtleties

In this category, we examine the ability of both models to capture linguistic nuances in terms of syntax and morphophonology/orthography. Regarding the former, there are four syntactic categories in our data: nouns, verbs, adjectives, and adverbs. The category nouns includes most of the terms involved, such as Place, Clothes, Animals, Stuff, and Households. The latter is discussed in terms of the morphophonology/orthography features that these terms involve.

6.2.1 Syntax

In our corpus, verbs, a category that includes eight terms, seem to be the most difficult syntactic category for both models. For example, while ChatGPT translated three verbal terms (out of eight) correctly, viz., وخّر , نِبْدَع , and سير , DeepSeek translated no verbs correctly. It has only one verb that was translated appropriately, namely سير . Additionally, the category translated most successfully by both models seems to be nouns, although ChatGPT scores more correct translations than DeepSeek. While the former translated 15 nouns, one adjective, and one adverb correctly (as marked in black), the latter translated six nouns and only one adverb correctly.

6.2.2 Morphophonology/orthography

To correctly provide the dialectical representation of some words in our data, we provided diacritics, known in Arabic as harakat, to differentiate them from their SA equivalents or to mark the San’ani-specific peculiarities. These include َ , ِ , ُ , and ْ (fatha, kasrah, dhumah, and sukun, respectively) (see e.g. Shormani, 2013). These harakat, except sukun, can be equated with the English short vowels a, i, and u, respectively. They are placed on letters to indicate how they should be pronounced. Words having these harakat include تـُتنْ , شِرْكِة , بَرْدِة , مُغْمُق and مَعْشَرَة . For example, in SA, we have the term نُبْدِع , which means ‘we innovate’ and is different from the YSA نِبْدَع . Linguistically, the two words have different pronunciations. Most of these words were mistranslated by both models (Table 2). However, ChatGPT seems to capture the morphophonology/orthography features of these words more than DeepSeek.
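The role of harakat in separating the YSA form from its SA look-alike can be illustrated with a minimal Python sketch. It is offered only as an illustration and is not part of the study’s procedure: the helper name `strip_harakat` is hypothetical, and removing every Unicode nonspacing mark (category `Mn`) is a simplifying assumption that covers fatha, kasrah, dhumah, sukun, and shadda.

```python
import unicodedata

# YSA "we start": noon + kasrah, baa + sukun, dal + fatha, ain
YSA_WE_START = "\u0646\u0650\u0628\u0652\u062f\u064e\u0639"
# SA "we innovate": noon + dhumah, baa + sukun, dal + kasrah, ain
SA_WE_INNOVATE = "\u0646\u064f\u0628\u0652\u062f\u0650\u0639"


def strip_harakat(word: str) -> str:
    """Drop all nonspacing marks, leaving only the consonantal skeleton."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# With diacritics the two words differ; without them they are the same string.
print(YSA_WE_START == SA_WE_INNOVATE)
print(strip_harakat(YSA_WE_START) == strip_harakat(SA_WE_INNOVATE))
```

If a system discards or underweights the diacritics, the two words become indistinguishable, which is one possible route to the SA reading ‘we innovate or create’.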

6.3 Contextual subtleties

The SL context in which a word is used is a crucial issue in translation, as it conveys the intended meaning to the TL audience. When a model treats a term as usable in more than one context, it signals this with the operator “or” or “/”. ChatGPT used “or” or “/” 39 times, whereas DeepSeek used “or” 47 times (Table 2). Considering this aspect, in addition to the (in)correctness of the translations provided by both models, there seem to be two contradictory aspects: i) regardless of the (in)correctness of translations, both models seem to be “aware” of the context of terms, providing more than one translation for a considerable number of terms, that is, 39 vs. 47. In this respect, DeepSeek seems to capture context better than ChatGPT does. And ii) if, however, we consider the (in)correctness of the translations, ChatGPT seems to capture dialectical and cultural nuances more than DeepSeek. The number of correct (and incorrect) translations by ChatGPT and DeepSeek gives us a clear clue that ChatGPT demonstrates a stronger grasp of the cultural and dialectical contexts of YSA than DeepSeek, as it has 17 correct translations, while DeepSeek has only seven.
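Counting the alternation operators can be sketched as follows. The exact counting procedure used in the study is not described, so this is one plausible reading under stated assumptions: “or” is counted only as a whole word, each “/” is counted once, and the function name `count_alternatives` is hypothetical. The sample strings are model outputs taken from Table 2.

```python
import re

# Whole-word "or" (case-insensitive) or a slash separating alternatives.
ALTERNATIVE_MARKER = re.compile(r"\bor\b|/", flags=re.IGNORECASE)


def count_alternatives(translation: str) -> int:
    """Count "or" and "/" operators in one translated output."""
    return len(ALTERNATIVE_MARKER.findall(translation))


samples = [
    "Silo/Minaret/Monastery",                      # ChatGPT, term 2
    "Silo or minaret",                             # DeepSeek, term 2
    "Hook (or a type of tool with a hooked end)",  # ChatGPT, term 1
    "Mortar and pestle (used for grinding)",       # ChatGPT, term 37
]
for text in samples:
    print(count_alternatives(text), text)
```

The word-boundary anchors keep words such as “Mortar” or “for” from being counted as operators.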

6.4 Standard Arabic biases

Given that both models produce more incorrect translations than correct ones, although ChatGPT succeeds in capturing YSA nuances to some extent, it seems that both models are SA-biased. Both models often seem to default to SA or broad literal meanings (cf. Alwagieh & Shormani, 2024). For instance, both models translate دَيْمِة incorrectly. While DeepSeek translates it as ‘light rain or drizzle,’ ChatGPT translates it as ‘continuous rain,’ both missing the YSA meaning ‘kitchen’ (Table 2). The term شقري , which means ‘cock’ in YSA, is mistranslated by DeepSeek as ‘chestnut-colored,’ a more generic SA meaning, perhaps from the SA term أشقر ‘blond’. ChatGPT makes a similar mistake with ‘blond/fair-skinned’. Both models retain a link to the color-based descriptions of the SA meaning, showing a clear SA bias. Another term showing SA bias is قوّى , translated as ‘strengthened/made stronger’ and ‘he strengthened or empowered’ by ChatGPT and DeepSeek, respectively. Instead of its dialectical meaning ‘welcome,’ both models’ translations, though somewhat different, lean heavily on the SA meaning of this term. Note that the term قوّى can also mean ‘please’ in YSA, as reflected in the human translation. The SA term قوي ‘strong’ seems to influence both models’ translations, again indicating their SA bias. Thus, this SA bias could be ascribed to the training data, that is, both models appear to be trained predominantly on SA data.

7. Conclusions, recommendations and limitations

The findings demonstrate a clear disparity in the models’ abilities, with ChatGPT consistently outperforming DeepSeek in capturing the YSA-specific dialectical features. Several conclusions could be drawn from this study. First, dialectal and cultural subtleties pose considerable challenges. Both models struggle significantly with dialectal and culturally embedded terms, although ChatGPT demonstrates a better approximation of meaning in several instances. This is most evident in culturally loaded terms such as نِبْدَع , جَعالة , and محواش , where ChatGPT translations are closer to the intended meanings, while DeepSeek often fails to capture cultural connotations. However, in several cases such as مدعي , شقري , and مغمق , both models fail to provide culturally and contextually appropriate translations, highlighting a systemic challenge in handling regionally specific lexicon. Difficulty is particularly centered around the “stuff” category, which includes highly dialectical and culturally specific items. Culture-based terms are reported as difficult to translate, even for advanced English students (see Alshawsh & Shormani, 2025). Of the five terms in this category, only تـُتنْ ‘tobacco’ was translated correctly by both models (Table 2), emphasizing a significant gap in the ability of current AI models to handle deeply localized cultural elements (cf. Lilli, 2023; Datta, 2023). Second, it is clear that ChatGPT outperforms DeepSeek in linguistic subtleties. For example, in terms of syntactic categorization, both models struggled most with verbs, which are often morphologically complex and context dependent. While ChatGPT translated three of the eight verbs correctly, DeepSeek translated none correctly and only one, سير , appropriately, showing a stark contrast in performance. Third, ChatGPT outperforms DeepSeek concerning the morphophonological and orthographic distinctions. 
The use of diacritics (harakat) in the dataset helped distinguish YSA terms from their SA counterparts (e.g., نِبْدَع vs. نُبْدِع ), yet most of these marked terms were still mistranslated by both models. Nevertheless, ChatGPT handles these distinctions with relatively higher accuracy, suggesting a more nuanced internalization of orthographic and morphophonological cues. Fourth, while both models exhibit some degree of contextual awareness, as seen in their use of multiple translation options (i.e., the use of “or” and “/”: 39 instances for ChatGPT and 47 for DeepSeek), this does not necessarily yield correct translations. DeepSeek’s higher frequency of “or” usage suggests greater surface-level awareness of polysemy or ambiguity, but it does not correlate with translation accuracy. The final conclusion concerns the SA bias exhibited by both models. This is particularly visible in the mistranslations of terms such as قوّى , شقري , and صومعة , where both models defaulted to SA meanings that do not reflect the YSA usage. This SA bias can be attributed to the composition of the models’ training data, which are likely dominated by SA sources. Consequently, dialectical terms that deviate from SA norms are either misinterpreted or forcibly aligned with their SA counterparts, leading to semantic distortions and a lack of cultural and contextual fidelity.
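A tally of multiple-option markers such as the one reported above could be reproduced along the following lines. This is only a minimal sketch: the function name and the matching rule (the standalone word “or” or a slash) are illustrative assumptions, the paper does not state exactly how its instances were counted, and the sample lists contain only the few outputs quoted in the discussion, not the full 50-term dataset.

```python
import re

# Assumed matching rule: the standalone word "or", or a slash between alternatives.
OPTION_MARKER = re.compile(r"\bor\b|/", flags=re.IGNORECASE)


def count_option_markers(translations):
    """Return the total number of "or" and "/" markers across all translations."""
    return sum(len(OPTION_MARKER.findall(text)) for text in translations)


# Only the model outputs quoted in the discussion; not the full dataset.
chatgpt_outputs = ["continuous rain", "blond/fair-skinned", "strengthened/made stronger"]
deepseek_outputs = ["light rain or drizzle", "chestnut-colored", "he strengthened or empowered"]

print("ChatGPT markers:", count_option_markers(chatgpt_outputs))
print("DeepSeek markers:", count_option_markers(deepseek_outputs))
```

The word-boundary pattern avoids counting the letters “or” inside words such as “colored” or “stronger”. A count of this kind measures only how often alternatives are offered, not whether any of them is correct.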

These aspects require careful attention from AI developers. Developers should expand the dialectal coverage of the training data to improve the performance of both models in translating dialects. For the San’ani dialect, for example, data should be collected from various sources such as recorded conversations, social media content, and dialect-specific literature or oral history (see e.g. Morano et al., 2025). AI developers are also advised to incorporate cultural and social contexts into training data. Many dialectal expressions are deeply rooted in the local culture and social norms; AI models should therefore be designed to consider these contexts, so as to avoid literal or inaccurate translations that miss the implied meanings or connotations. Additionally, AI models, here ChatGPT and DeepSeek, should include or improve features that automatically detect the dialect used in the input text. This would enable a more precise adaptation of translation strategies specific to the YSA dialect (or other dialects across several languages), improving the overall output quality. AI developers should consider enabling community- or researcher-led fine-tuning of models on niche dialects. They could also encourage (and perhaps fund) studies that tackle dialectical varieties (see also Kadaoui et al., 2023). We thus propose a “multi-dialectal” pre-training approach (see e.g. Zan et al., 2022) incorporating data from YSA as well as other Yemeni dialects. This multi-dialectal pre-training could be extended to other Arabic dialects, such as Egyptian, Gulf, and Moroccan Arabic. Another well-documented method that could be proposed here is the Bidirectional Training (BiT) approach (see e.g. Ding et al., 2021). Models could be fine-tuned with this approach using both YSA-to-English and English-to-YSA data to enhance LLMs’ ability to translate YSA terminology into English while grasping YSA’s linguistic and cultural nuances.
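To make the bidirectional idea concrete, the sketch below shows one way YSA-to-English pairs might be turned into training examples in both directions. It is an illustration under stated assumptions, not the procedure of Ding et al. (2021) or of this study: the function name, the direction tags, and the dictionary layout are all hypothetical, and the three pairs are simply terms and glosses quoted in the discussion above.

```python
def build_bidirectional_examples(pairs, forward_tag="<ysa2en>", backward_tag="<en2ysa>"):
    """Return one forward and one reversed training example per term pair.

    The direction tags are an illustrative assumption; they are not taken
    from Ding et al. (2021).
    """
    examples = []
    for ysa_term, english_gloss in pairs:
        # Forward direction: YSA term in, English gloss out.
        examples.append({"source": f"{forward_tag} {ysa_term}", "target": english_gloss})
        # Reverse direction: English gloss in, YSA term out.
        examples.append({"source": f"{backward_tag} {english_gloss}", "target": ysa_term})
    return examples


# Terms and YSA glosses quoted in the discussion; a real dataset would be far larger.
ysa_pairs = [("صومعة", "minaret"), ("شقري", "cock"), ("قوّى", "welcome")]

for example in build_bidirectional_examples(ysa_pairs):
    print(example["source"], "->", example["target"])
```

Each pair yields two examples, so the reversed direction adds training signal without any extra annotation effort.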

However, this study has some limitations. First, it focuses exclusively on the San’ani dialect. While this dialect is widely spoken, the results may not generalize to other Yemeni dialects, such as Ibbi, Adeni, and Hadhrami, or to Arabic dialects from other regions, which may exhibit distinct lexical, morphophonological, or syntactic features. Second, the sample of dialectal terms was relatively small. A larger and more diverse dataset would improve the reliability of the findings and better reflect the full linguistic complexity of the dialect. Third, the study involved only two AI translation models, ChatGPT and DeepSeek. While these models are prominent and state-of-the-art, other platforms, such as Grok 3, Felo, and Meta, could also be used in future studies to widen the comparative analysis. Fourth, we used isolated words only, which may limit the models’ ability to contextualize the meaning. Specifically, LLMs are reported to be more accurate when translating words in context than in isolation. Thus, the two models’ performance may change if phrases and sentences are used. Fifth, our prompt asking ChatGPT and DeepSeek to translate the 50 terms could have been more explicit, for example by explaining the task in more detail, and the evaluation could have employed a metric such as COMET (see e.g. Peng et al., 2023). Finally, AI models such as ChatGPT and DeepSeek are frequently updated; hence, the performance observed in this study may change over time, potentially affecting the reproducibility or relevance of the results for future users.
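As an illustration of what a more explicit prompt with optional sentence context might look like, the following sketch assembles one as plain text. It is hypothetical throughout: the function name and the wording are assumptions, it is not the prompt used in this study, and no claim is made here that it would improve either model’s output.

```python
def build_translation_prompt(term, context_sentence=None):
    """Assemble a task-specific prompt for one YSA term (hypothetical wording)."""
    lines = [
        "You are a machine translation system translating the Yemeni San'ani "
        "Arabic dialect into English.",
        "Give the meaning the term has in San'ani Arabic, not its Standard Arabic meaning.",
        f"Term: {term}",
    ]
    if context_sentence:
        # Optional context to help the model disambiguate an isolated term.
        lines.append(f"Context sentence: {context_sentence}")
    lines.append("English translation:")
    return "\n".join(lines)


print(build_translation_prompt("صومعة"))
```

Passing a `context_sentence` would allow the same terms to be tested in isolation and in context, which bears directly on the fourth limitation.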

Ethics and consent

No ethics and consent statements are required for this study.

Data availability statement

The data underlying the results of this study are available on figshare.com, entitled “Translating dialects”, DOI: https://doi.org/10.6084/m9.figshare.29251088 (Shormani & Al-Samki, 2025).

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

References

Alafnan M: Large Language Models as Computational Linguistics Tools: A Comparative Analysis of ChatGPT and Google Machine Translations. J. Artif. Intell. Technol. 2024; 5: 20–32.
Ali G, Ali N, Syed K: Understanding Shifting Paradigms of Translation Studies in 21st Century. 2023.
Al-Mannai K, Sajjad H, Khader A, et al.: Unsupervised word segmentation improves dialectal Arabic to English machine translation. Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP). 2014; pp. 207–216.
de Almeida G, O’Brien S: Analysing post-editing performance: correlations with years of translation experience. Proceedings of the 14th Annual Conference of the European Association for Machine Translation, St. Raphaël, France. 2010; pp. 27–28.
Alshawsh H, Shormani MQ: (Un)translatability of Yemeni (Ibbi) Zawaamil and Ballads into English: Ibb University Students as a Case Study. Int. J. Linguist. Lit. Transl. 2025; 8(3): 188–205.
Alwagieh N, Shormani MQ: Translating Arabic free poetry texts into English by ChatGPT: Success and Failure. Int. J. Linguist. Lit. Transl. 2024; 7(9): 183–198.
Aransa W: Statistical Machine Translation of the Arabic Dialect. Ph.D. thesis. University of Maine, doctoral school STIM, 2015.
Arzu A, Issa T: An effect on cultural identity: Dialect. Procedia Soc. Behav. Sci. 2014; 143: 555–562.
Bamunusinghe K, Bamunusinghe S: The Importance of the Knowledge on Dialects for a Translator. 2014.
Bassnett S: Translation studies. Routledge; 2013.
Brown PF, Della Pietra SA, Della Pietra VJ, et al.: The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 1993; 19(2): 263–311.
Cao Y, Kementchedjhieva Y, Cui R, et al.: Cultural Adaptation of Recipes. Trans. Assoc. Comput. Linguist. 2024; 12: 80–99.
Castellani B: Automatic generation of morpheme level reordering rules for Korean to English machine translation. MA thesis, Seoul National University; 2017.
Çetin Ö, Duran A: A Comparative Analysis of the Performances of ChatGPT, DeepL, Google Translate and a Human Translator in Community Based Settings. Amasya Universitesi Sosyal Bilimler Dergisi. 2024; 9(15): 120–173.
Datta S: Investigating English-Language Dialect-Adjusted Models. Computer Science Senior Theses. 2023; 11.
Daulay E: The social meaning of language and dialect. VISION. 2017; 12(12).
Deilen S, Garrido S, Lapshinova-Koltunski E, et al.: Using ChatGPT as a CAT Tool in Easy Language Translation. Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability Associated with RANLP. 2023; pp. 1–10.
Ding L, Wu D, Tao D: Improving neural machine translation by bidirectional training. arXiv preprint arXiv:2109.07780. 2021.
Durrani N, Al-Onaizan Y, Ittycheriah A: Improving Egyptian-to-English SMT by mapping Egyptian into MSA. International Conference on Intelligent Text Processing and Computational Linguistics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2014; pp. 271–282.
Elkaffash SM: Corpus-Based Quality Evaluation of Ar-En Neural Machine Translation: Google Translate as a Case Study. Master’s thesis, Hamad Bin Khalifa University (Qatar); 2020.
Fairclough N: Analysing discourse. London: Routledge; 2003.
Federici FM: Translating Dialects and Languages of Minorities: Challenges and Solutions. Die Deutsche Nationalbibliothek; 2011.
Ferguson CA: Diglossia. Word. 1959; 15: 325–340.
Gill S, Kaur R: ChatGPT: Vision and Challenges. TCPS. Elsevier; 2023.
Groves D, Dag S: Identification and analysis of post-editing patterns for MT. Proceedings of the Twelfth Machine Translation Summit, August 26–30, Ottawa. 2009; pp. 429–436.
Guo D, Zhu Q, Yang D, et al.: DeepSeek-Coder: When the Large Language Model Meets Programming - the Rise of Code Intelligence. 2024.
Hofmann H, Sakti S, Isotani R, et al.: Sequence-based pronunciation modeling using a noisy-channel approach. Spoken Dialogue Systems for Ambient Environments: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010, Gotemba, Shizuoka, Japan, October 1-2, 2010. Proceedings. Berlin Heidelberg: Springer; 2010; pp. 156–162.
Hutchins WJ: Machine translation: past, present, future. Chichester: Ellis Horwood; 1986.
Jiang L, Jiang Y, Han L: The Potential of ChatGPT in Translation Evaluation: A Case Study of the Chinese-Portuguese Machine Translation. Cadernos de Tradução. 2024; 44: 1–22.
Jiao W, Wang W, Huang J, et al.: Is ChatGPT a good translator? Yes with GPT-4 as the engine. 2023.
Joshi S: A Comprehensive Review of DeepSeek: Performance, Architecture and Capabilities. 2025.
Kadaoui K, Magdy SM, Waheed A, et al.: Tarjamat: Evaluation of bard and chatgpt on machine translation of ten arabic varieties. arXiv preprint arXiv:2308.03051. 2023.
Koehn P: Statistical Machine Translation. Cambridge: Cambridge University Press; 2009.
Koehn P, Och FJ, Marcu D: Statistical phrase-based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL). 2003.
Kong JW: Review of Federici FM (ed.), Translating Dialects and Languages of Minorities: Challenges and Solutions (2011). Babel. 2013; 59(1): 121–124.
Krings H: Repairing texts: empirical investigations of machine translation post-editing processes. Kent, OH: The Kent State University Press; 2001.
Kumar Y, Gordon Z, Alabi O, et al.: ChatGPT Translation of Program Code for Image Sketch Abstraction. Appl. Sci. 2024; 14(3): 992.
Lee TK: Artificial intelligence and Posthumanist translation: ChatGPT vs the translator. Appl. Linguist. Rev. 2023; 15: 2351–2372.
Lilli S: ChatGPT-4 and Italian dialects: assessing linguistic competence. Umanistica Digit. 2023; 16: 235–263.
Macken L: Machine Translation Meets Large Language Models: Evaluating ChatGPT’s Ability to Automatically Post-Edit Literary Texts. Proceedings of the 1st Workshop on Creative-Text Translation and Technology. 2024; pp. 65–81.
Neubig G, Akita Y, Mori S, et al.: Improved statistical models for SMT-based speaking style transformation. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; March 2010; pp. 5206–5209.
Newmark P: A textbook of translation. New York: Prentice Hall; 1988.
Peng K, Ding L, Zhong Q, et al.: Towards making the most of chatgpt for machine translation. arXiv preprint arXiv:2303.13780. 2023.
Peng Y, Chen Q, Shih G: DeepSeek is Open-Access and the Next AI Disrupter for Radiology. Radiol. Adv. 2025; 2: 1.
Postigo M: ChatGPT and MT-Systems: Advantages and Limitations when Translating English to Spanish and Portuguese. Lengua y Habla. 2024; 28.
Puppel M, Borg C: Evaluating ChatGPT’s Performance in Creative Text Translation for Communication: A Case Study from English into German. Media and Intercultural Communication: A Multidisciplinary Journal. 2024; 3(1): 1–27.
Saito D, Watanabe S, Nakamura A, et al.: Statistical voice conversion based on noisy channel model. IEEE Trans. Audio Speech Lang. Process. 2012; 20(6): 1784–1794.
Salloum W, Habash N: Elissa: A dialectal to standard Arabic machine translation system. Proceedings of COLING 2012: Demonstration papers. 2012; pp. 385–392.
Sawaf H: Arabic dialect handling in hybrid machine translation. Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA). Denver, Colorado; 2010.
Shormani MQ: An introduction to English syntax. A generative approach. LAP Lambert Academic Publishing; 2013.
Shormani MQ: Vocatives in Yemeni (Ibbi) Arabic: Functions, types and approach. J. Semit. Stud. 2019; 64(1): 221–250.
Shormani MQ: Does culture translate? Evidence from translating proverbs. Babel, John Benjamins. 2020; 66(6): 902–927.
Shormani MQ: Can ChatGPT capture swearing nuances? Evidence from translating Arabic oaths. 2024a.
Shormani MQ: Linguistics contribution to artificial intelligence: Where this contribution lies. 2024b.
Shormani MQ: Introducing minimalism: A parametric variation. Lincom Europa Press; 2024c.
Shormani MQ: Non-native speakers of English or ChatGPT: Who thinks better? F1000Res. 2025a; 14.
Shormani MQ: AI translation and culture-based expressions. A lecture given at AlQalam University, held on 16/01/2025. 2025b.
Shormani MQ, Al-Samki AA: Translating dialects between ChatGPT and DeepSeek: Yemeni San’ani Arabic terms as a case-in-point. 2025.
Shormani MQ, Alfahd A: Artificial Intelligence or Human: The use of ChatGPT in the academic translation for religious texts (To appear in Sage Open). 2025.
Shormani MQ, AlSohbani YA: Artificial intelligence contribution to translation industry: looking back and forward. Discov. Artif. Intell. 2025; 5: 389.
Shormani MQ, Watson JC, Dickins J: Poems from Ibb and Hadramawt. In Morano R, Watson J, Dickins J, editors. Yemeni Poetry on the Frontline: love and conflict. Routledge; 2025; pp. 14–31.
Sindhuja R: Translation Theory and Practice. 2021.
Siu SC: ChatGPT and GPT-4 for professional translators: exploring the potential of large language models in translation. Preprint. 2023; 1–36.
Vaswani A, Shazeer N, Parmar N, et al.: Attention is all you need. Adv. Neural Inf. Proces. Syst. 2017; 30.
Wang C, Kantarcioglu M: A Review of DeepSeek Models’ Key Innovative Techniques. arXiv:2503.11486v1. 2025.
Wu Y, Schuster M, Chen Z, et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. 2016.
Zan C, Peng K, Ding L, et al.: Vega-mt: The jd explore academy translation system for wmt22. arXiv preprint arXiv:2209.09444. 2022.
Zbib R, Malchiodi E, Devlin J, et al.: Machine translation of Arabic dialects. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2012; pp. 49–59.


Competing interests

No competing interests were disclosed.

Grant information

The author(s) declared that no grants were involved in supporting this work.

Article Versions (2)

version 2

Revised

Published: 16 Jan 2026, 14:694

https://doi.org/10.12688/f1000research.165879.2

version 1

Published: 16 Jul 2025, 14:694

https://doi.org/10.12688/f1000research.165879.1

© 2026 Shormani MQ and Al-Samki AA. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


how to cite this article

Shormani MQ and Al-Samki AA. Translating dialects between ChatGPT and DeepSeek: Yemeni San’ani Arabic terms as a case-in-point [version 2; peer review: 1 approved, 1 approved with reservations, 1 not approved]. F1000Research 2026, 14:694 (https://doi.org/10.12688/f1000research.165879.2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.


Open Peer Review


Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Version 2


Reviewer Report 06 Mar 2026

Hend S Al-Khalifa, King Saud University, Riyadh, Saudi Arabia

Not Approved

https://doi.org/10.5256/f1000research.194488.r460322

The introduction should clearly and explicitly state the research objectives and research questions at the beginning of the manuscript, rather than introducing them later in the paper. Presenting these elements upfront would help readers better understand the scope and motivation of the study from the outset.
Section 2 contains only one subsection (2.1), making this subdivision unnecessary in the absence of additional subsections. Furthermore, Section 2.1 is overly verbose and includes repeated information. Much of this content could be streamlined and more effectively presented using figures and tables. Similarly, Section 3 also contains only one subsection (3.1), which is unnecessary given the lack of further subdivisions. In addition, although this section is titled “Translating Dialects,” its content primarily focuses on translation between languages rather than dialectal translation. This mismatch reduces the clarity and coherence of the section. It would be more appropriate to include a dedicated section that specifically addresses interlanguage translation.
The exclusive use of ChatGPT and DeepSeek in the study is not sufficiently justified. Evaluating a broader range of large language models would strengthen the empirical foundation of the work and enhance its contribution. Moreover, the dataset used in the study is very limited in size and does not appear to support a substantial scientific contribution. The criteria for selecting the categories and lexical items are also not clearly explained, which further weakens the methodological rigor.
The methodological procedure lacks sufficient detail and transparency. In particular, the prompts used for generating translations are not described. It remains unclear whether zero-shot or few-shot prompting strategies were employed and whether the prompts were formulated in English or Arabic. In addition, the evaluation process is insufficiently documented. The number of human evaluators is not reported, and no information is provided regarding inter-annotator agreement, making it difficult to assess the reliability of the results.
The discussion section largely reiterates the findings already presented in the Results section, rather than providing deeper interpretation, critical analysis, or theoretical engagement. Similarly, the section on “Linguistic Subtleties” is superficial and would benefit from a more rigorous engagement with established linguistic theories and relevant scholarly literature.
The conclusion of the paper is largely generic and reflects observations that could apply to most studies on low-resource or dialectal Arabic. As a result, it does not sufficiently highlight the specific contributions or implications of the present study.
Overall, the manuscript is excessively verbose, with repeated details appearing across multiple sections and subsections that often stand alone. This structural imbalance negatively affects the coherence, readability, and flow of the paper. Greater concision and more careful organization would substantially improve the quality and impact of the manuscript.

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and is the work technically sound?
No

Are sufficient details of methods and analysis provided to allow replication by others?
No

If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Arabic NLP

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.


Reviewer Report 20 Jan 2026

Liang Ding, The University of Sydney, Sydney, Australia

Approved

https://doi.org/10.5256/f1000research.194488.r450599

Thank the authors for ...


Version 1


Reviewer Report 25 Sep 2025

Liang Ding, The University of Sydney, Sydney, Australia

Approved with Reservations

https://doi.org/10.5256/f1000research.182657.r403515

This study evaluates the performance of two large language models, ChatGPT-4o and DeepSeek v3, on the task of translating 50 culturally specific dialectical terms from Yemeni Sana'ani Arabic (YSA) into English. The authors perform a qualitative and quantitative analysis, comparing the models' outputs against a human-provided translation. The core findings indicate that both models struggle significantly with the nuances of the YSA dialect, frequently defaulting to Standard Arabic (SA) interpretations or producing literal, contextually incorrect translations. The study concludes that ChatGPT performs marginally better than DeepSeek, but both are inadequate for reliable dialect translation, highlighting a critical need for dialect-aware data and model development.

Pros:
1) The translation of low-resource and non-standard dialects is a critical frontier for machine translation. This work addresses a significant gap by focusing on YSA, an understudied variety, providing valuable initial insights into the limitations of current state-of-the-art LLMs.
2) To my knowledge, this is one of the first academic papers to benchmark the DeepSeek model against ChatGPT on a dialectal translation task, making a timely contribution to the comparative analysis of emerging LLMs.
3) The paper excels in its qualitative discussion. The term-by-term breakdown in Table 2 and the subsequent discussion of cultural, linguistic, and contextual subtleties (Sections 5.2 and 6) provide clear, compelling evidence of the models' failures and the reasons behind them (e.g., mistaking شركة for 'company' instead of 'meat').

Cons:
The paper, while valuable for its topic, is constrained by significant experimental and contextual limitations that temper its conclusions.
1) The analysis relies solely on the authors' classification of translations as "correct," "incorrect," or "appropriate". This is highly subjective and lacks the rigor of standard MT evaluation. State-of-the-art research has moved beyond lexical-overlap metrics like BLEU to model-based metrics such as COMET, which better capture semantic fidelity. The absence of such metrics makes the quantitative claims (e.g., 34% correct for ChatGPT) difficult to verify and compare against other work.
2) The study's methodology does not reflect the current best practices for evaluating LLMs in translation tasks.

No Prompt Engineering: The paper fails to specify the prompts used, but the results suggest a simple, zero-shot directive (e.g., "Translate X"). This is a critical flaw. It is well-documented that LLM performance is highly sensitive to prompting strategies. The models were not given a fair chance to perform well.
Term-Level Translation: The evaluation is conducted on isolated terms. This is not representative of real-world usage and ignores the crucial role of context in disambiguation. Translating the terms within full sentences would have provided a more realistic and challenging test.

The recommendations for AI developers are generic (e.g., "expand the dialectal data coverage"). The paper would be substantially strengthened by engaging with more specific, high-impact research in machine translation to offer more sophisticated solutions.

To address these shortcomings, the authors should incorporate and cite the following highly relevant papers:

Peng, K. et al. (2023) [1]. Towards Making the Most of ChatGPT for Machine Translation: This paper is essential reading. It explicitly demonstrates that simple prompting limits ChatGPT's translation ability. The authors should have experimented with the Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP) proposed in this work to see if defining the task more clearly ("You are a machine translation system translating the Yemeni Sana'ani dialect") improves performance. Furthermore, this paper advocates for using COMET as a primary metric, a practice this study should adopt.
Zan, C. et al. (2022) [2]. Vega-MT: The JD Explore Academy Translation System for WMT22: While focused on a traditional NMT system, this paper offers a blueprint for building high-quality multilingual models. Its concept of "multi-directional pretraining" (using data from all language pairs to exploit common knowledge) is a concrete technical strategy that goes beyond the paper's generic recommendation to simply "add more data". The authors could propose a "multi-dialectal" pre-training approach inspired by this work as a specific path forward.
Ding, L. et al. (2021) [3]. Improving Neural Machine Translation by Bidirectional Training: This paper introduces Bidirectional Training (BiT), a simple yet effective strategy of pre-training a model on both src -> tgt and tgt -> src data simultaneously. This method was shown to improve performance, especially in low-resource settings. Suggesting a BiT-based fine-tuning approach (using both YSA-to-English and English-to-YSA data) would be a specific, testable hypothesis for improving the models' grasp of YSA's linguistic structure and improving bilingual alignment.

Ref:
[1] Peng et al. Towards Making the Most of ChatGPT for Machine Translation. In Findings of EMNLP 2023.
[2] Zan et al., Vega-MT: The JD Explore Academy Translation System for WMT22. In WMT 2022.
[3] Ding et al., Improving Neural Machine Translation by Bidirectional Training. In EMNLP 2021.

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
No source data required

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: natural language processing, machine learning, large language models, machine translation

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


Author Response 02 Jan 2026

Mohammed Q. Shormani, English Studies, Ibb University, Ibb, Yemen

Dear reviewer,

First, let me express my sincere appreciation and thanks for the insightful comments you have provided, which have improved the paper in form and content. Please find our responses to your comments point-by-point as follows.

This study evaluates the performance of two large language models, ChatGPT-4o and DeepSeek v3, on the task of translating 50 culturally specific dialectical terms from Yemeni Sana'ani Arabic (YSA) into English. The authors perform a qualitative and quantitative analysis, comparing the models' outputs against a human-provided translation. The core findings indicate that both models struggle significantly with the nuances of the YSA dialect, frequently defaulting to Standard Arabic (SA) interpretations or producing literal, contextually incorrect translations. The study concludes that ChatGPT performs marginally better than DeepSeek, but both are inadequate for reliable dialect translation, highlighting a critical need for dialect-aware data and model development.

Response
Thank you very much for your valuable remark.

1) The translation of low-resource and non-standard dialects is a critical frontier for machine translation. This work addresses a significant gap by focusing on YSA, an understudied variety, providing valuable initial insights into the limitations of current state-of-the-art LLMs.

Response
Thank you very much for your valuable remark.

2) To my knowledge, this is one of the first academic papers to benchmark the DeepSeek model against ChatGPT on a dialectal translation task, making a timely contribution to the comparative analysis of emerging LLMs.

Response
Thank you very much for your valuable remark.

3) The paper excels in its qualitative discussion. The term-by-term breakdown in Table 2 and the subsequent discussion of cultural, linguistic, and contextual subtleties (Sections 5.2 and 6) provide clear, compelling evidence of the models' failures and the reasons behind them (e.g., mistaking شركة for 'company' instead of 'meat').

Response
Thank you very much for your valuable remark.

The analysis relies solely on the authors' classification of translations as "correct," "incorrect," or "appropriate". This is highly subjective and lacks the rigor of standard MT evaluation. State-of-the-art research has moved beyond lexical-overlap metrics like BLEU to model-based metrics such as COMET, which better capture semantic fidelity. The absence of such metrics makes the quantitative claims (e.g., 34% correct for ChatGPT) difficult to verify and compare against other work.

Response
Thank you very much for your valuable comment. Please note that our use of the categories correct, incorrect, and appropriate was intentionally motivated by the nature of the task: translating San’ani Arabic, a dialect for which no standardized reference corpora or validated automatic evaluation benchmarks currently exist. As a result, commonly used model-based metrics such as COMET, which require high-quality reference translations and have been trained predominantly on Standard Arabic and high-resource language pairs, are not yet reliable indicators of translation quality for this dialect. I would also like to clarify that our quantitative results are indicative rather than absolute. I have also developed this part by incorporating the following text:

"We collected the study data from different sources on the web, such as Wikipedia and the first author’s relatives, who are native speakers of San’ani Arabic. The dataset contained 50 terms. Our criteria for selecting these 50 terms include: i) they should be representative, viz., belonging to places, clothes, animals, stuff, households, ii) they should give us enough room to have a representative sample on the aspects of YSA culture, iii) they should represent all lexical items, viz., nouns, verbs, adjectives, and adverbs, and iv) they should allow us both linguistic and cultural analyses."

2) The study's methodology does not reflect the current best practices for evaluating LLMs in translation tasks.

No Prompt Engineering: The paper fails to specify the prompts used, but the results suggest a simple, zero-shot directive (e.g., "Translate X"). This is a critical flaw. It is well-documented that LLM performance is highly sensitive to prompting strategies. The models were not given a fair chance to perform well.

Term-Level Translation: The evaluation is conducted on isolated terms. This is not representative of real-world usage and ignores the crucial role of context in disambiguation. Translating the terms within full sentences would have provided a more realistic and challenging test.

Response
Thank you very much for your valuable comment. I have addressed this aspect and acknowledged it as one limitation of the study. Please see Section 7.

The recommendations for AI developers are generic (e.g., "expand the dialectal data coverage"). The paper would be substantially strengthened by engaging with more specific, high-impact research in machine translation to offer more sophisticated solutions.......

Response
Thank you very much for your valuable comment. I have addressed these outstanding points and referred to these references. Please see Section 7.

[1] Peng et al. Towards Making the Most of ChatGPT for Machine Translation. In Findings of EMNLP 2023.
[2] Zan et al., Vega-MT: The JD Explore Academy Translation System for WMT22. In WMT 2022.
[3] Ding et al., Improving Neural Machine Translation by Bidirectional Training. In EMNLP 2021.

Finally, thank you very much for your valuable comments and for the effort you put into improving the article.
Competing Interests: No competing interest to disclose.


Reviewer Report 24 Sep 2025

Cao-Tuong DINH, FPT University, Can Tho city, Vietnam

Approved with Reservations

https://doi.org/10.5256/f1000research.182657.r409822

Recommendation: Major revisions required

For author and editor
Thank you for the opportunity to review this paper. I appreciate the effort that went into this research and the attempt to apply computational methods to analyze diplomatic discourse. The topic is timely and valuable for MT/LLM evaluation beyond standard Arabic. The manuscript is generally clear, and the dataset is shared. However, methodological transparency and rigor need strengthening (sampling, prompting, validation, and statistics). A few internal inconsistencies and copy-editing issues also require attention. With moderate revisions, the paper will provide a useful exploratory baseline, hence Approved with major revision.
However, as a researcher, I have some concerns as below:
1) Clarity & literature context
Mostly clear, with a good high-level MT background. Literature is broadly cited, but several claims (e.g., model capabilities, architecture details, and comparative statements) would benefit from primary/technical citations and tighter focus on dialect MT work. In particular, these points need improvement:

Tighten the background to emphasize prior work on Arabic dialect MT and dialectal evaluation, and separate general MT history from directly relevant work.
Add citations for specific DeepSeek details you reference (Mixture-of-Experts, training corpora) and for Arabic dialect evaluation benchmarks/tools where applicable.
Copy-edit for typos/wording (examples in Minor comments).

2) Study design & academic merit:
The exploratory design (50 terms) is a reasonable pilot, but sampling and gold-standard construction need more rigor to support conclusions. However, these points need modification:

Sampling: Clarify how the 50 terms were selected (criteria, sources, representativeness across categories; frequency in real usage). Consider expanding to include contextualized uses (sentences) alongside isolated terms to reduce ambiguity.
Gold standard: Specify the annotation protocol (number of native speakers, expertise, independence, adjudication process). Report inter-annotator agreement (e.g., Cohen’s κ) if multiple raters were used or describe how disagreements were resolved.
Task framing: Define what counts as “correct,” “appropriate,” and “incorrect” a priori, with examples. Consider an error taxonomy (literal, SA-bias, cultural connotation loss, POS confusion, etc.).
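To make the inter-annotator agreement suggestion concrete, the following is a minimal sketch of Cohen's κ for two raters who label the same items. The function name and the two label lists are hypothetical and serve only as illustration; they are not taken from the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of six translations (not data from the paper).
rater_1 = ["correct", "correct", "appropriate", "incorrect", "incorrect", "correct"]
rater_2 = ["correct", "appropriate", "appropriate", "incorrect", "correct", "correct"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.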

3) Methods detail & reproducibility: Important details are currently missing for full replication. In particular, these aspects need to be clarified:

Model versions & dates: Report exact model versions/variants, query timestamps (LLMs change over time), and any API/app settings.
Prompts: Publish the exact prompts, instructions (e.g., “translate to English; provide multiple senses?”), temperature/decoding parameters, number of attempts/retries, and whether any post-processing was applied.
Text normalization: Specify Unicode normalization, diacritic handling, tokenization, and whether Arabic script was normalized before translation.
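One possible way to document the text-normalization step is sketched below using Python's standard `unicodedata` module. The choice of NFKC and the set of characters stripped (short-vowel marks, superscript alef, and the tatweel elongation character) are assumptions made for illustration; the paper does not state which normalization, if any, was applied.

```python
import re
import unicodedata

# Assumed character set: harakat and tanwin (U+064B to U+0652),
# superscript alef (U+0670), and tatweel (U+0640, an elongation
# character rather than a true diacritic).
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")


def normalize_arabic(term: str) -> str:
    """Apply NFKC, drop vowel marks and tatweel, collapse whitespace."""
    text = unicodedata.normalize("NFKC", term)
    text = _MARKS.sub("", text)
    return " ".join(text.split())


# A vocalized spelling of the term discussed in the reviews.
vocalized = "\u0634\u064E\u0631\u0650\u0643\u064E\u0629"
print(normalize_arabic(vocalized))
```

Reporting the exact function applied to each term before it is sent to a model would let others reproduce the inputs character for character.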

4) Statistics & interpretation: Counts and percentages are reported, but statistical comparison is minimal. Below are suggestions for improvement:

Add 95% CIs for accuracy/“correct” rates by category and overall.
For paired categorical outcomes (ChatGPT vs DeepSeek on the same items), apply a McNemar test (or exact variant) to test whether differences are statistically significant.
Report per-category performance with CIs and consider a simple mixed-effects model (random intercept for term) to account for item variability.
If you keep “appropriate” as a middle category, consider ordinal models or report separate binary analyses (correct vs not; correct+appropriate vs incorrect).
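As a rough illustration of the first two suggestions, the sketch below computes a Wilson 95% interval for a proportion and an exact (binomial) McNemar p-value with the Python standard library only. The 17-out-of-50 count assumes that the 34% correct rate quoted for ChatGPT refers to all 50 terms; the discordant counts `b` and `c` are hypothetical placeholders, not results from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items model A got right and model B got wrong.
    c: items model B got right and model A got wrong.
    Items on which both models agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


low, high = wilson_interval(17, 50)   # assumed: 34% of 50 terms
p_value = mcnemar_exact(b=1, c=6)     # hypothetical discordant counts
print(f"95% CI: {low:.3f} to {high:.3f}; exact McNemar p = {p_value:.3f}")
```

With only 50 items the interval is wide, which is one reason to present the percentages as indicative rather than definitive.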

5) Conclusions vs results: Broadly aligned, but some statements verge on over-generalization given sample size and item selection. Below are suggestions for improvement:

Temper general claims about “SA bias” and cross-dialect performance; frame as evidence from this 50-term sample and invite replication on larger, balanced sets.
Highlight that performance may change with contextual sentences and few-shot prompting; consider adding a small follow-up experiment (even in supplementary) to show sensitivity to prompt design.
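A follow-up on prompt sensitivity could compare a bare directive with variants that add a contextual sentence or a few demonstrations. The sketch below only assembles such prompt strings; the wording, the function name `build_prompt`, and the placeholder term are hypothetical and are not the prompts used in the study. The single demonstration pair reuses the gloss reported in the reviews (شركة as 'meat').

```python
def build_prompt(term, context=None, examples=()):
    """Assemble a zero-shot or few-shot translation prompt (illustrative wording)."""
    lines = ["Translate the following Yemeni San'ani Arabic term into English."]
    # Few-shot demonstrations: (dialect term, English gloss) pairs.
    for source, gloss in examples:
        lines.append(f"Example: {source} -> {gloss}")
    # Optional sentence showing the term in use, to help disambiguation.
    if context:
        lines.append(f"Sentence containing the term: {context}")
    lines.append(f"Term: {term}")
    return "\n".join(lines)


zero_shot = build_prompt("TERM")  # "TERM" is a placeholder
few_shot = build_prompt("TERM", examples=[("شركة", "meat")])
print(zero_shot)
print(few_shot)
```

Running the same term list through each variant and comparing the ratings would show how much of the reported gap depends on prompt design.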

6) Minor comments (clarity, style, presentation)

Terminology consistency: Use “YSA” consistently (a few instances seem to vary).
Typos & phrasing (examples): “concisderable” → considerable; “corp” → crops (re: silo); “wording”/“orthograpgical” → orthographical; ensure consistent capitalization of model names and sections.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Partly
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: self-regulated learning, EMI, technology-based teaching & learning in higher education

Author Response 02 Jan 2026

Mohammed Q. Shormani, English Studies, Ibb University, Ibb, Yemen

Dear reviewer,

First, let me express my sincere appreciation and thanks for the insightful comments you have provided, which have improved the paper in form and content. Please find our responses to your comments point-by-point below.

Thank you for the opportunity to review this paper. I appreciate the effort that went into this research and the attempt to apply computational methods to analyze diplomatic discourse. The topic is timely and valuable for MT/LLM evaluation beyond standard Arabic. The manuscript is generally clear, and the dataset is shared.

Response
Thank you very much for your valuable remark.

Mostly clear, with a good high-level MT background. Literature is broadly cited, but several claims (e.g., model capabilities, architecture details, and comparative statements) would benefit from primary/technical citations and tighter focus on dialect MT work. In particular, these points need improvement:

Tighten the background to emphasize prior work on Arabic dialect MT and dialectal evaluation, and separate general MT history from directly relevant work.

Add citations for specific DeepSeek details you reference (Mixture-of-Experts, training corpora) and for Arabic dialect evaluation benchmarks/tools where applicable.

Copy-edit for typos/wording (examples in Minor comments)

Response
Thank you very much for your valuable comment. I have addressed these aspects by adding a subsection titled "3.1. Translating Arabic dialects" and several references, as recommended by you and the other reviewer.

The exploratory design (50 terms) is a reasonable pilot, but sampling and gold-standard construction need more rigor to support conclusions. However, these points need modification:

Sampling: Clarify how the 50 terms were selected (criteria, sources, representativeness across categories; frequency in real usage). Consider expanding to include contextualized uses (sentences) alongside isolated terms to reduce ambiguity.

Gold standard: Specify the annotation protocol (number of native speakers, expertise, independence, adjudication process). Report inter-annotator agreement (e.g., Cohen’s κ) if multiple raters were used or describe how disagreements were resolved.

Task framing: Define what counts as “correct,” “appropriate,” and “incorrect” a priori, with examples. Consider an error taxonomy (literal, SA-bias, cultural connotation loss, POS confusion, etc.).

Response
Thank you very much for your valuable comment. I have addressed these aspects, detailing the criteria adopted, defining what counts as “correct,” “appropriate,” and “incorrect,” and covering other related aspects. For example, I have added the following text describing the criteria:

"We collected the study data from different sources on the web, such as Wikipedia and the first author’s relatives, who are native speakers of San’ani Arabic. The dataset contained 50 terms. Our criteria for selecting these 50 terms include: i) they should be representative, viz., belonging to places, clothes, animals, stuff, households, ii) they should give us enough room to have a representative sample on the aspects of YSA culture, iii) they should represent all lexical items, viz., nouns, verbs, adjectives, and adverbs, and iv) they should allow us both linguistic and cultural analyses."

As for what counts as “correct,” “appropriate,” and “incorrect”, I have added the following text:

"We assessed the translation (in)correctness and appropriateness of dialectical terms between the two AI models, ChatGPT and DeepSeek, and human translation. We considered the translation of an item correct if cultural and linguistic aspects are both maintained in the translation of this term. However, if one of these aspects is violated in the translation, we consider it appropriate, and if both the cultural and linguistic aspects are violated in the translation, we consider it incorrect."

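The three-way rule quoted above can be written as a small decision function. This is a sketch of the rule as worded in the response, assuming each rater records two yes/no judgments per translation; the function name and the sample judgments are hypothetical.

```python
from collections import Counter


def rate_translation(cultural_ok: bool, linguistic_ok: bool) -> str:
    """Map two yes/no judgments to the three categories of the quoted rule."""
    if cultural_ok and linguistic_ok:
        return "correct"      # both aspects maintained
    if cultural_ok or linguistic_ok:
        return "appropriate"  # exactly one aspect violated
    return "incorrect"        # both aspects violated


# Hypothetical judgments as (cultural_ok, linguistic_ok) pairs.
judgments = [(True, True), (True, False), (False, True), (False, False)]
tally = Counter(rate_translation(c, l) for c, l in judgments)
print(dict(tally))
```

Recording the two underlying judgments, and not only the final label, would also make it possible to report which aspect fails more often.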
Important details are currently missing for full replication. In particular, these aspects need to be clarified: Model versions & dates: Report exact model versions/variants, query timestamps (LLMs change over time), and any API/app settings. Prompts: Publish the exact prompts, instructions (e.g., “translate to English; provide multiple senses?”), temperature/decoding parameters, number of attempts/retries, and whether any post-processing was applied. Text normalization: Specify Unicode normalization, diacritic handling, tokenization, and whether Arabic script was normalized before translation.

Response
Thank you very much for your valuable comment. I have addressed these aspects, but because of the word limit I could not cover them all; doing so would require a full-fledged new paper.

Temper general claims about “SA bias” and cross-dialect performance; frame as evidence from this 50-term sample and invite replication on larger, balanced sets.

Highlight that performance may change with contextual sentences and few-shot prompting; consider adding a small follow-up experiment (even in supplementary) to show sensitivity to prompt design.

Response
Thank you very much for your valuable comment. I have addressed these aspects; please see Section 7.

Terminology consistency: Use “YSA” consistently (a few instances seem to vary).

Typos & phrasing (examples): “concisderable” → considerable; “corp” → crops (re: silo); “wording”/“orthograpgical” → orthographical; ensure consistent capitalization of model names and sections.

Response
Thank you very much for your valuable comment. I have addressed these aspects, revising the paper carefully for these issues.

Finally, thank you very much once again for your valuable comments.
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 02 Jan 2026

Mohammed Q. Shormani, English Studies, Ibb University, Ibb, Yemen

Dear reviewer,

First, let me express my sincere appreciation and thanks for the insightful comments you have provided, which have improved the paper in form and content. Please find our responses to your comments point-by-point below.

Thank you for the opportunity to review this paper. I appreciate the effort that went into this research and the attempt to apply computational methods to analyze diplomatic discourse. The topic is timely and valuable for MT/LLM evaluation beyond standard Arabic. The manuscript is generally clear, and the dataset is shared.

Response
Thank you very much for your valuable remark.

Mostly clear, with a good high-level MT background. Literature is broadly cited, but several claims (e.g., model capabilities, architecture details, and comparative statements) would benefit from primary/technical citations and tighter focus on dialect MT work. In particular, these points need improvement:

Tighten the background to emphasize prior work on Arabic dialect MT and dialectal evaluation, and separate general MT history from directly relevant work.

Add citations for specific DeepSeek details you reference (Mixture-of-Experts, training corpora) and for Arabic dialect evaluation benchmarks/tools where applicable.

Copy-edit for typos/wording (examples in Minor comments)

Response
Thank you very much for your valuable comment. I have addressed these aspects, adding a subsection dubbed as "3.1. Translating Arabic dialects", and added several references as recommended by you and the other reviewer.

The exploratory design (50 terms) is a reasonable pilot, but sampling and gold-standard construction need more rigor to support conclusions. However, these points need modification:

Sampling: Clarify how the 50 terms were selected (criteria, sources, representativeness across categories; frequency in real usage). Consider expanding to include contextualized uses (sentences) alongside isolated terms to reduce ambiguity.

Gold standard: Specify the annotation protocol (number of native speakers, expertise, independence, adjudication process). Report inter-annotator agreement (e.g., Cohen’s κ) if multiple raters were used or describe how disagreements were resolved.

Task framing: Define what counts as “correct,” “appropriate,” and “incorrect” a priori, with examples. Consider an error taxonomy (literal, SA-bias, cultural connotation loss, POS confusion, etc.).

Response
Thank you very much for your valuable comment. I have addressed these aspects, detailing the criteria adopted, defining what counts as “correct,” “appropriate,” and “incorrect,” and covering other related aspects. For example, I have added the following text describing the criteria:

"We collected the study data from different sources on the web, such as Wikipedia and the first author’s relatives, who are native speakers of San’ani Arabic. The dataset contained 50 terms. Our criteria of selecting these 50 terms include: i) they should be representative, viz., belonging to places, clothes, animals, stuff, households, ii) they should give us enough room to have a representative sample on the aspects of YSA culture, iii) they should represent all lexical items, viz., nouns, verbs, adjectives, and adverbs, and iv) they should allow us both linguistic and cultural analyses."

As for what counts as “correct,” “appropriate,” and “incorrect,” I have added the following text:

"We assessed the translation (in)correctness and appropriateness of dialectical terms between the two AI models, ChatGPT and DeepSeek, and human translation. We considered the translation of an item correct if cultural and linguistic aspects are both maintained in the translation of this term. However, if one of these aspects is violated in the translation, we consider it appropriate, and if both the cultural and linguistic aspects are violated in the translation, we consider it incorrect."
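The three-way rule in the quoted text can be written down as a small decision function. The following is a minimal sketch, not the authors' procedure: the names `Verdict` and `classify` are hypothetical, and it assumes a human rater has already judged, for each term, whether the linguistic aspect and the cultural aspect are each maintained.

```python
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    APPROPRIATE = "appropriate"
    INCORRECT = "incorrect"


def classify(linguistic_ok: bool, cultural_ok: bool) -> Verdict:
    """Map two rater judgments for one translated term to a category.

    Both aspects maintained -> correct; exactly one -> appropriate;
    neither -> incorrect. How each aspect is judged is left to the rater.
    """
    maintained = int(linguistic_ok) + int(cultural_ok)
    if maintained == 2:
        return Verdict.CORRECT
    if maintained == 1:
        return Verdict.APPROPRIATE
    return Verdict.INCORRECT
```

Recording the two judgments separately, rather than only the final category, would also make it possible to report which aspect was lost in each "appropriate" case.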

Important details are currently missing for full replication. In particular, these aspects need to be clarified:

Model versions & dates: Report exact model versions/variants, query timestamps (LLMs change over time), and any API/app settings.
Prompts: Publish the exact prompts, instructions (e.g., “translate to English; provide multiple senses?”), temperature/decoding parameters, number of attempts/retries, and whether any post-processing was applied.
Text normalization: Specify Unicode normalization, diacritic handling, tokenization, and whether Arabic script was normalized before translation.
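To make the text-normalization point concrete, here is a minimal sketch of one possible preprocessing step using only the Python standard library. It is an illustration, not the pipeline used in the study: the function name `normalize_arabic` and the specific choices (NFC composition, removal of tatweel, optional removal of short-vowel diacritics) are assumptions. Reporting which of these steps were applied is what would let others reproduce the inputs.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character (kashida)


def normalize_arabic(text: str, strip_diacritics: bool = True) -> str:
    """Return a normalized copy of an Arabic-script string."""
    # NFC first, so that a base letter plus combining hamza or madda is
    # composed into its single-code-point form before marks are stripped.
    text = unicodedata.normalize("NFC", text)
    text = text.replace(TATWEEL, "")
    if strip_diacritics:
        # Category "Mn" (nonspacing mark) covers the Arabic harakat
        # such as fatha, damma, kasra, shadda, and sukun.
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())
```

Whether diacritics should be stripped at all is a design choice: removing them matches how dialectal Arabic is usually typed, but it can also remove information that disambiguates a term.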

Response
Thank you very much for your valuable comment. I have addressed these aspects, but because of the word limit I could not cover them all; full coverage would require a separate, full-length paper.

Temper general claims about “SA bias” and cross-dialect performance; frame as evidence from this 50-term sample and invite replication on larger, balanced sets.

Highlight that performance may change with contextual sentences and few-shot prompting; consider adding a small follow-up experiment (even in supplementary) to show sensitivity to prompt design.

Response
Thank you very much for your valuable comment. I have addressed these aspects; please see section 7.

Terminology consistency: Use “YSA” consistently (a few instances seem to vary).

Typos & phrasing (examples): “concisderable” → considerable; “corp” → crops (re: silo); “wording”/“orthograpgical” → orthographical; ensure consistent capitalization of model names and sections.

Response
Thank you very much for your valuable comment. I have addressed these aspects, revising the paper carefully for these issues.

Finally, thank you very much once again for your valuable comments.

Version 2

VERSION 2 PUBLISHED 16 Jul 2025

Open Peer Review

Reviewer Status

Reviewer Reports

Invited Reviewers
1) Cao-Tuong DINH, FPT University, Can Tho city, Vietnam
2) Liang Ding, The University of Sydney, Sydney, Australia
3) Hend S Al-Khalifa, King Saud University, Riyadh, Saudi Arabia

Reports by version
Version 2 (revision), 16 Jan 2026: reviewers 2 and 3
Version 1, 16 Jul 2025: reviewers 1 and 2

Reviewer Report

06 Mar 2026 | for Version 2

Hend S Al-Khalifa, King Saud University, Riyadh, Saudi Arabia

Not Approved

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and is the work technically sound?
No

Are sufficient details of methods and analysis provided to allow replication by others?
No

If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Arabic NLP

Reviewer Report

20 Jan 2026 | for Version 2

Liang Ding, The University of Sydney, Sydney, Australia

Approved

Thank the authors for addressing most of my concerns.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

natural language processing, machine learning, large language models, machine translation

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

25 Sep 2025 | for Version 1

Liang Ding, The University of Sydney, Sydney, Australia

Approved With Reservations

No Prompt Engineering: The paper fails to specify the prompts used, but the results suggest a simple, zero-shot directive (e.g., "Translate X"). This is a critical flaw. It is well-documented that LLM performance is highly sensitive to prompting strategies. The models were not given a fair chance to perform well.
Term-Level Translation: The evaluation is conducted on isolated terms. This is not representative of real-world usage and ignores the crucial role of context in disambiguation. Translating the terms within full sentences would have provided a more realistic and challenging test.
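The two points above (prompt design and context) can be compared directly by generating several prompt variants for the same term. The sketch below is illustrative only: the function `build_prompt`, its parameters, and every wording choice in the prompt text are assumptions, not the prompts used in the study, which are not reported. It only builds strings and does not call any model API.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(
    term: str,
    sentence: Optional[str] = None,
    examples: Sequence[Tuple[str, str]] = (),
    name_dialect: bool = False,
) -> str:
    """Assemble one translation prompt from optional components."""
    lines = []
    if name_dialect:
        # Tell the model which variety it is reading instead of leaving it to guess.
        lines.append(
            "The source text is in the Yemeni San'ani Arabic dialect, "
            "not Standard Arabic."
        )
    for source, target in examples:  # optional few-shot demonstrations
        lines.append(f"Term: {source}\nEnglish: {target}")
    if sentence is not None:  # optional sentence that shows the term in use
        lines.append(f"Sentence containing the term: {sentence}")
    lines.append(f"Term: {term}\nEnglish:")
    return "\n".join(lines)
```

Calling `build_prompt(term)` gives the bare zero-shot form, while adding `name_dialect=True`, a carrier sentence, or a few demonstration pairs gives the richer variants. Running every term through each variant would show how sensitive the reported accuracy is to prompt design.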

The recommendations for AI developers are generic (e.g., "expand the dialectal data coverage"). The paper would be substantially strengthened by engaging with more specific, high-impact research in machine translation to offer more sophisticated solutions.

To address these shortcomings, the authors should incorporate and cite the following highly relevant papers:

Peng, K. et al. (2023) [1]. Towards Making the Most of ChatGPT for Machine Translation: This paper is essential reading. It explicitly demonstrates that simple prompting limits ChatGPT's translation ability. The authors should have experimented with the Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP) proposed in this work to see if defining the task more clearly ("You are a machine translation system translating the Yemeni Sana'ani dialect") improves performance. Furthermore, this paper advocates for using COMET as a primary metric, a practice this study should adopt.
Zan, C. et al. (2022) [2]. Vega-MT: The JD Explore Academy Translation System for WMT22: While focused on a traditional NMT system, this paper offers a blueprint for building high-quality multilingual models. Its concept of "multi-directional pretraining" (using data from all language pairs to exploit common knowledge) is a concrete technical strategy that goes beyond the paper's generic recommendation to simply "add more data". The authors could propose a "multi-dialectal" pre-training approach inspired by this work as a specific path forward.
Ding, L. et al. (2021) [3]. Improving Neural Machine Translation by Bidirectional Training: This paper introduces Bidirectional Training (BiT), a simple yet effective strategy of pre-training a model on both src -> tgt and tgt -> src data simultaneously. This method was shown to improve performance, especially in low-resource settings. Suggesting a BiT-based fine-tuning approach (using both YSA-to-English and English-to-YSA data) would be a specific, testable hypothesis for improving the models' grasp of YSA's linguistic structure and improving bilingual alignment.
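As a rough illustration of the data side of the BiT suggestion, the sketch below turns a list of YSA-to-English pairs into a training set that contains both directions. It is a simplified reading of the idea as described in this report, not the reference implementation from the cited paper; the function name `bidirectional_pairs` is hypothetical, and the example pair uses the gloss for شركة mentioned elsewhere on this page.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def bidirectional_pairs(pairs: Sequence[Pair]) -> List[Pair]:
    """Return the original (source, target) pairs followed by their reversals."""
    forward = [(src, tgt) for src, tgt in pairs]
    backward = [(tgt, src) for src, tgt in pairs]
    return forward + backward


# One YSA-to-English pair becomes two training examples.
ysa_to_english = [("شركة", "meat")]
training_pairs = bidirectional_pairs(ysa_to_english)
```

A model would first be trained on `training_pairs` and then tuned on the forward direction only; that second stage is outside the scope of this sketch.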

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
No source data required

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

natural language processing, machine learning, large language models, machine translation

Author Response

02 Jan 2026

Mohammed Q. Shormani, English Studies, Ibb University, Ibb, Yemen

Dear reviewer,

First, let me express my sincere appreciation and thanks for the insightful comments you have provided, which have improved the paper in form and content. Please find our responses to your comments point-by-point as follows.

This study evaluates the performance of two large language models, ChatGPT-4o and DeepSeek v3, on the task of translating 50 culturally specific dialectical terms from Yemeni Sana'ani Arabic (YSA) into English. The authors perform a qualitative and quantitative analysis, comparing the models' outputs against a human-provided translation. The core findings indicate that both models struggle significantly with the nuances of the YSA dialect, frequently defaulting to Standard Arabic (SA) interpretations or producing literal, contextually incorrect translations. The study concludes that ChatGPT performs marginally better than DeepSeek, but both are inadequate for reliable dialect translation, highlighting a critical need for dialect-aware data and model development.

Response
Thank you very much for your valuable remark.

1) The translation of low-resource and non-standard dialects is a critical frontier for machine translation. This work addresses a significant gap by focusing on YSA, an understudied variety, providing valuable initial insights into the limitations of current state-of-the-art LLMs.

Response
Thank you very much for your valuable remark.

2) To my knowledge, this is one of the first academic papers to benchmark the DeepSeek model against ChatGPT on a dialectal translation task, making a timely contribution to the comparative analysis of emerging LLMs.

Response
Thank you very much for your valuable remark.

3) The paper excels in its qualitative discussion. The term-by-term breakdown in Table 2 and the subsequent discussion of cultural, linguistic, and contextual subtleties (Sections 5.2 and 6) provide clear, compelling evidence of the models' failures and the reasons behind them (e.g., mistaking شركة for 'company' instead of 'meat').

Response
Thank you very much for your valuable remark.

The analysis relies solely on the authors' classification of translations as "correct," "incorrect," or "appropriate". This is highly subjective and lacks the rigor of standard MT evaluation. State-of-the-art research has moved beyond lexical-overlap metrics like BLEU to model-based metrics such as COMET, which better capture semantic fidelity. The absence of such metrics makes the quantitative claims (e.g., 34% correct for ChatGPT) difficult to verify and compare against other work.

Response
Thank you very much for your valuable comment. Please note that our use of the categories correct, incorrect, and appropriate was intentionally motivated by the nature of the task: translating San’ani Arabic, a dialect for which no standardized reference corpora or validated automatic evaluation benchmarks currently exist. As a result, commonly used model-based metrics such as COMET, which require high-quality reference translations and have been trained predominantly on Standard Arabic and high-resource language pairs, are not yet reliable indicators of translation quality for this dialect. I would also like to clarify that our quantitative results are indicative rather than absolute. I have also developed this part by incorporating the following text:

"We collected the study data from different sources on the web, such as Wikipedia and the first author’s relatives, who are native speakers of San’ani Arabic. The dataset contained 50 terms. Our criteria of selecting these 50 terms include: i) they should be representative, viz., belonging to places, clothes, animals, stuff, households, ii) they should give us enough room to have a representative sample on the aspects of YSA culture, iii) they should represent all lexical items, viz., nouns, verbs, adjectives, and adverbs, and iv) they should allow us both linguistic and cultural analyses."

2) The study's methodology does not reflect the current best practices for evaluating LLMs in translation tasks.

No Prompt Engineering: The paper fails to specify the prompts used, but the results suggest a simple, zero-shot directive (e.g., "Translate X"). This is a critical flaw. It is well-documented that LLM performance is highly sensitive to prompting strategies. The models were not given a fair chance to perform well.
Term-Level Translation: The evaluation is conducted on isolated terms. This is not representative of real-world usage and ignores the crucial role of context in disambiguation. Translating the terms within full sentences would have provided a more realistic and challenging test.

Response
Thank you very much for your valuable comment. I have addressed this aspect and acknowledged it as one limitation of the study. Please see section 7.

The recommendations for AI developers are generic (e.g., "expand the dialectal data coverage"). The paper would be substantially strengthened by engaging with more specific, high-impact research in machine translation to offer more sophisticated solutions.......

Response
Thank you very much for your valuable comment. I have addressed these outstanding points and referred to these references. Please see section 7.

[1] Peng et al. Towards Making the Most of ChatGPT for Machine Translation. In Findings of EMNLP 2023.
[2] Zan et al., Vega-MT: The JD Explore Academy Translation System for WMT22. In WMT 2022.
[3] Ding et al., Improving Neural Machine Translation by Bidirectional Training. In EMNLP 2021.

Finally, thank you very much for your valuable comments and for the efforts you exerted to improve the article.

Competing Interests

No competing interest to disclose.

Reviewer Report

24 Sep 2025 | for Version 1

Cao-Tuong DINH, FPT University, Can Tho city, Vietnam

Approved With Reservations

Tighten the background to emphasize prior work on Arabic dialect MT and dialectal evaluation, and separate general MT history from directly relevant work.
Add citations for specific DeepSeek details you reference (Mixture-of-Experts, training corpora) and for Arabic dialect evaluation benchmarks/tools where applicable.
Copy-edit for typos/wording (examples in Minor comments).

Sampling: Clarify how the 50 terms were selected (criteria, sources, representativeness across categories; frequency in real usage). Consider expanding to include contextualized uses (sentences) alongside isolated terms to reduce ambiguity.
Gold standard: Specify the annotation protocol (number of native speakers, expertise, independence, adjudication process). Report inter-annotator agreement (e.g., Cohen’s κ) if multiple raters were used or describe how disagreements were resolved.
Task framing: Define what counts as “correct,” “appropriate,” and “incorrect” a priori, with examples. Consider an error taxonomy (literal, SA-bias, cultural connotation loss, POS confusion, etc.).
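For the inter-annotator agreement requested above, Cohen's κ for two raters can be computed directly from their label sequences. The following is a minimal standard-library sketch; the function name `cohens_kappa` is hypothetical, and it assumes two raters have each labelled the same list of terms with categories such as "correct", "appropriate", and "incorrect".

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

Because the three categories are ordered, a weighted variant of κ may be more appropriate than this unweighted form; that extension is not shown here.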

Add 95% CIs for accuracy/“correct” rates by category and overall.
For paired categorical outcomes (ChatGPT vs DeepSeek on the same items), apply a McNemar test (or exact variant) to test whether differences are statistically significant.
Report per-category performance with CIs and consider a simple mixed-effects model (random intercept for term) to account for item variability.
If you keep “appropriate” as a middle category, consider ordinal models or report separate binary analyses (correct vs not; correct+appropriate vs incorrect).
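Two of the analyses suggested above, a 95% confidence interval for a "correct" rate and an exact McNemar test on the paired outcomes, need only the standard library. The sketch below is one way to compute them, under stated assumptions: the function names are hypothetical, the Wilson score interval is chosen here as a reasonable default for small samples (the reviewer does not name a specific interval), and the discordant counts passed to the McNemar function must come from the study's own item-level data.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a proportion (default z gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items model A translated correctly and model B did not.
    c: items model B translated correctly and model A did not.
    Items where both models agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

As a usage example, `wilson_interval(17, 50)` corresponds to a 34% correct rate on 50 terms and gives an interval of roughly 0.22 to 0.48, which shows how wide the uncertainty is at this sample size.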

5) Conclusions vs results: Broadly aligned, but some statements verge on over-generalization given sample size and item selection. Below are suggestions for improvement:

Temper general claims about “SA bias” and cross-dialect performance; frame as evidence from this 50-term sample and invite replication on larger, balanced sets.
Highlight that performance may change with contextual sentences and few-shot prompting; consider adding a small follow-up experiment (even in supplementary) to show sensitivity to prompt design.

6) Minor comments (clarity, style, presentation)

Terminology consistency: Use “YSA” consistently (a few instances seem to vary).
Typos & phrasing (examples): “concisderable” → considerable; “corp” → crops (re: silo); “wording”/“orthograpgical” → orthographical; ensure consistent capitalization of model names and sections.

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and is the work technically sound?
Partly

Are sufficient details of methods and analysis provided to allow replication by others?
Partly

If applicable, is the statistical analysis and its interpretation appropriate?
Partly

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

self-regulated learning, EMI, technology-based teaching & learning in higher education

Author Response

02 Jan 2026

Mohammed Q. Shormani, English Studies, Ibb University, Ibb, Yemen

Dear reviewer,

First, let me express my sincere appreciation and thanks for the insightful comments you have provided, which have improved the paper in form and content. Please find our responses to your comments point-by-point below.

Thank you for the opportunity to review this paper. I appreciate the effort that went into this research and the attempt to apply computational methods to analyze diplomatic discourse. The topic is timely and valuable for MT/LLM evaluation beyond standard Arabic. The manuscript is generally clear, and the dataset is shared.

Response
Thank you very much for your valuable remark.

Mostly clear, with a good high-level MT background. Literature is broadly cited, but several claims (e.g., model capabilities, architecture details, and comparative statements) would benefit from primary/technical citations and tighter focus on dialect MT work. In particular, these points need improvement:

Tighten the background to emphasize prior work on Arabic dialect MT and dialectal evaluation, and separate general MT history from directly relevant work.
Add citations for specific DeepSeek details you reference (Mixture-of-Experts, training corpora) and for Arabic dialect evaluation benchmarks/tools where applicable.
Copy-edit for typos/wording (examples in Minor comments)

Response
Thank you very much for your valuable comment. I have addressed these aspects by adding a subsection titled "3.1. Translating Arabic dialects" and several references recommended by you and the other reviewer.

The exploratory design (50 terms) is a reasonable pilot, but sampling and gold-standard construction need more rigor to support conclusions. However, these points need modification:

Sampling: Clarify how the 50 terms were selected (criteria, sources, representativeness across categories; frequency in real usage). Consider expanding to include contextualized uses (sentences) alongside isolated terms to reduce ambiguity.
Gold standard: Specify the annotation protocol (number of native speakers, expertise, independence, adjudication process). Report inter-annotator agreement (e.g., Cohen’s κ) if multiple raters were used or describe how disagreements were resolved.
Task framing: Define what counts as “correct,” “appropriate,” and “incorrect” a priori, with examples. Consider an error taxonomy (literal, SA-bias, cultural connotation loss, POS confusion, etc.).
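
To make the agreement statistic mentioned in the gold-standard point concrete, the following is a minimal Python sketch of Cohen's κ for two raters. The function name and the six example labels are hypothetical illustrations and are not data or code from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six terms (illustration only).
rater_1 = ["correct", "appropriate", "incorrect", "correct", "incorrect", "appropriate"]
rater_2 = ["correct", "appropriate", "incorrect", "appropriate", "incorrect", "appropriate"]
kappa = cohens_kappa(rater_1, rater_2)
```

With more than two raters, a different statistic (for example Fleiss' κ) would be needed; this sketch covers only the two-rater case.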

Response
Thank you very much for your valuable comment. I have addressed these aspects by detailing the criteria adopted, defining what counts as “correct,” “appropriate,” and “incorrect,” and covering other related aspects. For example, I have added the following text describing the criteria:

"We collected the study data from different sources on the web, such as Wikipedia and the first author’s relatives, who are native speakers of San’ani Arabic. The dataset contained 50 terms. Our criteria of selecting these 50 terms include: i) they should be representative, viz., belonging to places, clothes, animals, stuff, households, ii) they should give us enough room to have a representative sample on the aspects of YSA culture, iii) they should represent all lexical items, viz., nouns, verbs, adjectives, and adverbs, and iv) they should allow us both linguistic and cultural analyses."

As for what counts as “correct,” “appropriate,” and “incorrect,” I have added the following text:

"We assessed the translation (in)correctness and appropriateness of dialectical terms between the two AI models, ChatGPT and DeepSeek, and human translation. We considered the translation of an item correct if cultural and linguistic aspects are both maintained in the translation of this term. However, if one of these aspects is violated in the translation, we consider it appropriate, and if both the cultural and linguistic aspects are violated in the translation, we consider it incorrect."

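The three-way rule quoted above can be expressed as a small decision function. This is a sketch under the assumption that each translation has already been judged on two yes/no criteria (cultural and linguistic); the function name and the example judgments are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def label_translation(cultural_ok: bool, linguistic_ok: bool) -> str:
    """Apply the quoted rule: both kept, one violated, or both violated."""
    if cultural_ok and linguistic_ok:
        return "correct"
    if cultural_ok or linguistic_ok:
        return "appropriate"
    return "incorrect"


# Hypothetical (cultural_ok, linguistic_ok) judgments for four terms.
judgments = [(True, True), (True, False), (False, True), (False, False)]
summary = Counter(label_translation(c, l) for c, l in judgments)
```

Tallying labels this way per model would yield the kind of counts a summary table reports.
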
Important details are currently missing for full replication. In particular, these aspects need to be clarified:

Model versions & dates: Report exact model versions/variants, query timestamps (LLMs change over time), and any API/app settings.
Prompts: Publish the exact prompts, instructions (e.g., “translate to English; provide multiple senses?”), temperature/decoding parameters, number of attempts/retries, and whether any post-processing was applied.
Text normalization: Specify Unicode normalization, diacritic handling, tokenization, and whether Arabic script was normalized before translation.
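
As an illustration of what the text-normalization point could involve, here is a minimal Python sketch using the standard `unicodedata` module. The choices made (NFC form, removing tatweel, optionally stripping the marks in U+064B to U+0652 and U+0670) are assumptions for illustration; the source does not state which normalization, if any, was applied.

```python
import re
import unicodedata

# Arabic short-vowel and related marks (U+064B to U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, carries no lexical content


def normalize_arabic(text: str, strip_diacritics: bool = True) -> str:
    """Return a consistently normalized form of an Arabic-script string."""
    text = unicodedata.normalize("NFC", text)  # canonical composition
    text = text.replace(TATWEEL, "")
    if strip_diacritics:
        text = DIACRITICS.sub("", text)
    return " ".join(text.split())  # collapse runs of whitespace


# The word "kitab" written with kasra and fatha, then with both marks stripped.
vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628"
bare = normalize_arabic(vocalized)
```

Whether diacritics should be stripped is itself a design decision worth reporting, since vocalization can disambiguate dialectal forms.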

Response
Thank you very much for your valuable comment. I have addressed these aspects, but because of the word limit I could not cover them all; covering them fully would require a separate, full-fledged paper.

Temper general claims about “SA bias” and cross-dialect performance; frame as evidence from this 50-term sample and invite replication on larger, balanced sets.
Highlight that performance may change with contextual sentences and few-shot prompting; consider adding a small follow-up experiment (even in supplementary) to show sensitivity to prompt design.
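
A prompt-sensitivity check of the kind suggested here could start from a set of prompt variants per term. The sketch below only builds the prompt strings; the wording, function names, and variant labels are hypothetical and are not the prompts used in the study, and no model API is called.

```python
def build_prompt(term, context_sentence=None, examples=()):
    """Assemble one translation prompt; all wording here is illustrative."""
    lines = ["Translate the following Yemeni San'ani Arabic term into English."]
    # Few-shot: prepend worked (term, gloss) pairs supplied by the evaluator.
    for source, gloss in examples:
        lines.append(f"Term: {source}\nEnglish: {gloss}")
    if context_sentence:
        lines.append(f"Sentence in which the term occurs: {context_sentence}")
    lines.append(f"Term: {term}\nEnglish:")
    return "\n\n".join(lines)


def prompt_variants(term, context_sentence, examples):
    """Return the four prompt conditions to compare for a single term."""
    return {
        "zero_shot": build_prompt(term),
        "with_context": build_prompt(term, context_sentence=context_sentence),
        "few_shot": build_prompt(term, examples=examples),
        "few_shot_with_context": build_prompt(term, context_sentence, examples),
    }


# Placeholder inputs; real terms, sentences, and glosses would come from the dataset.
variants = prompt_variants("TERM", "SENTENCE", [("EX_TERM", "EX_GLOSS")])
```

Sending each variant to both models and labeling the outputs with the same criteria would show how much the results depend on prompt design.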

Response
Thank you very much for your valuable comment. I have addressed these aspects; please see section 7.

Terminology consistency: Use “YSA” consistently (a few instances seem to vary).
Typos & phrasing (examples): “concisderable” → considerable; “corp” → crops (re: silo); “wording”/“orthograpgical” → orthographical; ensure consistent capitalization of model names and sections.

Response
Thank you very much for your valuable comment. I have addressed these aspects, revising the paper carefully for these issues.

Finally, thank you very much once again for your valuable comments.

Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, or sometimes more significant revisions, are required to address specific details and improve the paper's academic merit

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Alafnan M: Large Language Models as Computational Linguistics Tools: A Comparative Analysis of ChatGPT and Google Machine Translations. J. Artif. Intell. Technol. 2024; 5: 20–32. Publisher Full Text

[2] Ali G, Ali N, Syed K: Understanding Shifting Paradigms of Translation Studies in 21^st Century.2023. Publisher Full Text

[3] Al-Mannai K, Sajjad H, Khader A, et al.: Unsupervised word segmentation improves dialectal Arabic to English machine translation. Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP). 2014; pp. 207–216.

[4] de Almeida G , O’Brien S: Analysing post-editing performance: correlations with years of translation experience. Proceedings of the 14th Annual Conference of the European Association for Machine Translation, St. Raphaël, France. 2010; pp. 27–28.

[5] Alshawsh H, Shormani MQ: (Un) translatability of Yemeni (Ibbi) Zawaamil and Ballads into English: Ibb University Students as a Case Study. Int. J. Linguist. Lit. Transl. 2025; 8(3): 188–205. Publisher Full Text

[6] Alwagieh N, Shormani MQ: Translating Arabic free poetry texts into English by ChatGPT: Success and Failure. Int. J. Linguist. Lit. Transl. 2024; 7(9): 183–198. Publisher Full Text

[7] Aransa W: Statistical Machine Translation of the Arabic Dialect. Ph.D. thesis. University of Maine, doctoral school STIM, 2015.

[8] Arzu A, Issa T: An effect on cultural identity: Dialect. Procedia Soc. Behav. Sci. 2014; 143: 555–562. Publisher Full Text

[9] Bamunusinghe K, Bamunusinghe S: The Importance of the Knowledge on Dialects for a Translator.2014. Reference Source

[10] Bassnett S: Translation studies. Routledge; 2013.

[11] Brown PF, Della Pietra SA, Della Pietra VJ, et al.: The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 1993; 19(2): 263–311.

[12] Cao Y, Kementchedjhieva Y, Cui R, et al.: Cultural Adaptation of Recipes. Trans. Assoc. Comput. Linguist. 2024; 12: 80–99. Publisher Full Text

[13] Castellani B: Automatic generation of morpheme level reordering rules for Korean to English machine translation. MA thesis, Seoul National University; 2017.

[14] Çetin Ȍ, Duran A: A Comparative Analysis of the Performances of ChatGPT, DeepL, Google Translate and a Human Translator in Community Based Settings. Amasya Universitesi Sosyal Bilimler Dergisi. 2024; 9(15): 120–173.

[15] Datta S: Investigating English-Language Dialect-Adjusted Models. Computer Science Senior Theses.2023; 11. Reference Source

[16] Daulay E: The social meaning of language and dialect. VISION. 2017; 12(12).

[17] Deilen S, Garrido S, Lapshinova-Koltunski E, et al.: Using ChatGPT as a CAT Tool in Easy Language Translation. Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability Associated with RANLP. 2023; pp. 1–10. Publisher Full Text

[18] Ding L, Wu D, Tao D: Improving neural machine translation by bidirectional training. arXiv preprint arXiv: 2109.07780. 2021.

[19] Durrani N, Al-Onaizan Y, Ittycheriah A: Improving Egyptian-to-English SMT by mapping Egyptian into MSA.International Conference on Intelligent Text Processing and Computational Linguistics.Berlin, Heidelberg: Springer Berlin Heidelberg; 2014; pp. 271–282. Publisher Full Text

[20] Elkaffash SM: Corpus-Based Quality Evaluation of Ar-En Neural Machine Translation: Google Translate as a Case Study. Master’s thesis, Hamad Bin Khalifa University (Qatar); 2020.

[21] Fairclough N: Analysing discourse. London: Routledge; 2003.

[22] Federici FM: Translating Dialects and Languages of Minorities: Challenges and Solutions. Die Deutsche Nationalbibliothek; 2011.

[23] Ferguson CA: Diglossia. Word. 1959; 15: 325–340. Publisher Full Text

[24] Gill S, Kaur R: ChatGPT: Vision and Challenges. TCPS. Elsevier; 2023.

[25] Groves D, Dag S: Identification and analysis of post-editing patterns for MT. Proceedings of the Twelfth Machine Translation Summit, August 26–30, Ottawa. 2009; 429–436.

[26] Guo D, Zhu Q, Yang D, et al.: DeepSeek-Coder: When the Large Language Model Meets Programming- the Rise of Code Intelligence.2024. Reference Source

[27] Hofmann H, Sakti S, Isotani R, et al.: Sequence-based pronunciation modeling using a noisy-channel approach. Spoken Dialogue Systems for Ambient Environments: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010, Gotemba, Shizuoka, Japan, October 1-2, 2010. Proceedings. Berlin Heidelberg: Springer; 2010; pp. 156–162.

[28] Hutchins WJ: Machine translation: past, present, future. Chichester: Ellis Horwood; 1986.

[29] Jiang L, Jiang Y, Han L: The Potential of ChatGPT in Translation Evaluation: A Case Study of the Chinese-Portuguese Machine Translation. Casernos de Traduçao. 2024; 44: 1–22. Publisher Full Text

[30] Jiao W, Wang W, Huang J, et al.: Is ChatGPT a good translator? Yes with GPT-4 as the engine.2023. Reference Source

[31] Joshi S: A Comprehensive Review of DeepSeek: Performance, Architecture and Capabilities.2025. Publisher Full Text

[32] Kadaoui K, Magdy SM, Waheed A, et al.: Tarjamat: Evaluation of bard and chatgpt on machine translation of ten arabic varieties. arXiv preprint arXiv:2308.03051. 2023.

[33] Koehn P: Statistical Machine Translation. Cambridge: Cambridge University Press; 2009.

[34] Koehn P, Och FJ, Marcu D: Statistical phrase-based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North Ameri can Chapter of the Association of Computational Linguistics (HLT-NAACL).2003. Reference Source

[35] Kong JW: Translating Dialects and Languages of Minorities: Challenges and Solutions. Review. Babel. 2013; 59(1): 121–124. Federici, F. M. (ed.). (2011). Publisher Full Text

[36] Krings H: Repairing texts: empirical investigations of machine translation post-editing processes. Kent, OH: The Kent State University Press; 2001.

[37] Kumar Y, Gordon Z, Alabi O, et al.: ChatGPT Translation of Program Code for Image Sketch Abstraction. Appl. Sci. 2024; 14(3): 992. Publisher Full Text

[38] Lee TK: Artificial intelligence and Posthumanist translation: ChatGPT vs the translator. Appl. Linguist. Rev. 2023; 15: 2351–2372. Publisher Full Text

[39] Lilli S: ChatGPT-4 and Italian dialects: assessing linguistic competence. Umanistica Digit. 2023; 16: 235–263.

[40] Macken L: Machine Translation Meets Large Language Models: Evaluating ChatGPT’s Ability to Aautomatically Post-Edit Literary Texts. Proceedings of the 1st Workshop on Creative-Text Translation and Technology. 2024; pp. 65–81.

[41] Neubig G, Akita Y, Mori S, et al.: Improved statistical models for SMT-based speaking style transformation. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2010, March; pp. 5206–5209.

[42] Newmark P: A textbook of translation. New York: Prentice Hall; 1988.

[43] Peng K, Ding L, Zhong Q, et al.: Towards making the most of chatgpt for machine translation. arXiv preprint arXiv: 2303.13780. 2023.

[44] Peng Y, Chen Q, Shih G: DeepSeek is Open-Access and the Next AI Disrupter for Radiology. Radiol. Adv. 2025; 2: 1. Publisher Full Text

[45] Postigo M: ChatGPT and MT-Systems: Advantages and Limitations when Translating English to Spanish and Portuguese. Vol. 28. . Lengua Y Habla; 2024.

[46] Puppel M, Borg C: Evaluating ChatGPT’s Performance in Creative Text Translation for Communication: A Case Study from English into German. Media and Intercultural Communication: A Multidisciplinary Journal. 2024; 3(1): 1–27. Publisher Full Text

[47] Saito D, Watanabe S, Nakamura A, et al.: Statistical voice conversion based on noisy channel model. IEEE Trans. Audio Speech Lang. Process. 2012; 20(6): 1784–1794. Publisher Full Text

[48] Salloum W, Habash N: Elissa: A dialectal to standard Arabic machine translation system. Proceedings of COLING 2012: Demonstration papers. 2012; pp. 385–392.

[49] Sawaf H: Arabic dialect handling in hybrid machine translation. Pro-ceedings of the Conference of the Association for Machine Translation in the Americas (AMTA). Denver, Colorado; 2010.

[50] Shormani MQ: An introduction to English syntax. A generative approach. LAP Lambert Academic Publishing; 2013.

[51] Shormani MQ: Vocatives in Yemeni (ibbi) Arabic: Functions, types and approach. J. Semit. Stud. 2019; 64(1): 221–250. Publisher Full Text

[52] Shormani MQ: Does culture translate? Evidence from translating proverbs. Babel, John Benjamins. 2020; 66(6): 902–927. Publisher Full Text

[53] Shormani MQ: Can ChatGPT capture swearing nuances? Evidence from translating Arabic oaths.2024a. Reference Source

[54] Shormani MQ: Linguistics contribution to artificial intelligence Where this contribution lies.2024b. Publisher Full Text

[55] Shormani MQ: Introducing minimalism: A parametric variation. Lincom Europa Press; 2024c.

[56] Shormani MQ: Non-native speakers of English or ChatGPT: Who thinks better? F1000Res. 2025a; 14. Publisher Full Text

[57] Shormani MQ: AI translation and culture-based expressions. A lecture given at AlQalam University, held on 16/01/2025.2025b.

[58] Shormani MQ, Al-samki AA: Translating dialects between ChatGPT and DeepSeek: Yemeni San’ani Arabic terms as a case-in-point.2025.

[59] Shormani MQ, Alfahd A: Artificial Intelligence or Human: The use of ChatGPT in the academic translation for religious texts (To appear in Sage Open).2025.

[60] Shormani MQ, AlSohbani YA: Artificial intelligence contribution to translation industry: looking back and forward. Discov. Artif. Intell. 2025; 5: 389. Publisher Full Text

[61] Shormani MQ, Watson JC, Dickins J: Poems from Ibb and Hadramawt. In Morano R, Watson J, Dickins, editors. Yemeni Poetry on the Frontline: love and conflict. Routledge; 2025; pp. 14–31.

[62] Sindhuja R: Translation Theory and Practice.2021. Reference Source

[63] Siu SC: ChatGPT and GPT-4 for professional translators: exploring the potential of large language models in translation. Preprint. 2023; 1–36. Reference Source

[64] Vaswani A, Shazeer N, Parmar N, et al.: Attention is all you need. Adv. Neural Inf. Proces. Syst. 2017; 30.

[65] Wang C, Kantarcioglu M: A Review of DeepSeek Models’ Key Innovative Techniques. arXiv:2503.11486v1. 2025.

[66] Wu Y, Schuster M, Chen Z, et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. 2016.

[67] Zan C, Peng K, Ding L, et al.: Vega-mt: The jd explore academy translation system for wmt22. arXiv preprint arXiv: 2209.09444. 2022

[68] Zbib R, Malchiodi E, Devlin J, et al.: Machine translation of Arabic dialects. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2012: 49–59.
