Research Article

Translating dialects between ChatGPT and DeepSeek: Yemeni Sana’ani Arabic terms as a case-in-point

[version 1; peer review: awaiting peer review]
PUBLISHED 16 Jul 2025

This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

Background

This study assesses the efficiency of two Artificial Intelligence (AI) translation models, ChatGPT and DeepSeek, in translating Yemeni Sana’ani Arabic (YSA) dialectal terms into English. Because dialectal Arabic exhibits significant linguistic variability and cultural specificity, accurate translation remains a major challenge for ChatGPT and DeepSeek (and perhaps for other AI models).

Methods

Fifty Sana’ani Arabic terms were translated by both models, and the outputs were assessed for semantic fidelity, cultural relevance, and contextual accuracy.

Results

The findings reveal that, while both models demonstrate a foundational understanding of Standard Arabic (SA), their performance diminishes considerably when faced with the nuances and idiomatic expressions of the Sana’ani Arabic dialect. ChatGPT performs relatively better in certain cases, particularly when translating terms with dialectal connotations. However, both models exhibit limitations, such as literal translation, misinterpretation, or complete failure to capture the intended meaning.

Conclusions

The study concludes by highlighting the critical need for dialect-aware AI development and provides recommendations for improving the dialectal accuracy and cultural sensitivity of AI translation models.

Keywords

Dialect translation, ChatGPT, DeepSeek, Sana’ani Arabic, dialectical and cultural nuances

1. Introduction

In an era of global communication and rapid technological and digital advancement, and in the interest of convenience and of building cultural relationships among nations that speak different languages, conventional manual translation has been revolutionized by the integration of automated machine translation and artificial intelligence (AI) tools, which now play a significant part in translation procedures (Alafnan, 2024; Koehn, 2009; Ali et al., 2023; Çetin & Duran, 2024; Puppel & Borg, 2024). Several AI tools have been employed in translation, the most prominent of which include ChatGPT, DeepSeek, DeepL, Google Translate, Babylon, Amazon Translate, Yandex Translate, Systran, and Bing Microsoft Translator. There is considerable debate regarding the efficient use of such tools and their potential to replace humans in the translation process, and millions of people use them daily without scrutiny or evaluation (Çetin & Duran, 2024; Postigo, 2024). The output of such tools, however, is not error-free; thus, many studies have examined their effectiveness, not to obtain perfect translation models but to minimize inadequacies in translation outcomes and to enhance their outputs.

Building on such studies, this study focuses on the efficiency of two notable AI models, viz. ChatGPT and DeepSeek (cf. Çetin & Duran, 2024), and compares their performance in translating 50 Sana’ani dialectal terms into English, since dialects are of great importance in linguistic studies and in cross-cultural global communication (Bamunusinghe & Bamunusinghe, 2014). It also aims to identify the strengths and limitations of both models; these insights are expected to help designers of AI translation tools, as well as human translators, optimize the efficiency of such tools. To the best of our knowledge, this is the first academic study to address the efficiency of the DeepSeek model and compare it with ChatGPT with respect to the Sana’ani dialect. Sana’ani Arabic is spoken in and around Sana’a, the capital of Yemen. Like other Yemeni Arabic varieties, Sana’ani Arabic is an understudied language variety whose distinct linguistic features call for attention from linguists (cf. Shormani, 2019).

The remainder of this paper is organized as follows. Section 2 sets out the conceptual framework, giving a historical overview of machine translation and two of its currently notable AI models. Section 3 outlines the translation of dialects and reviews related studies. Section 4 describes the study methodology. Section 5 presents and discusses the results. Section 6 presents the study conclusions, recommendations, and limitations.

2. Conceptual framework

2.1 Machine translation systems

The use of technology in translation has a long history; however, it has become especially prominent in recent decades (Postigo, 2024). Machine Translation (MT) systems “have been at the forefront of translation technology since the 1950s” and have undergone major developments (Postigo, 2024, p. 1; Alafnan, 2024; Çetin & Duran, 2024). Over time, they have come to mirror natural language more closely, marking a striking shift in translation technology (Koehn, 2009; Postigo, 2024). MT is “one of the applications studied in computational linguistics” (Alafnan, 2024, p. 21). Its processes are based on codes, encoding, and decoding.

MT has three main approaches, categorized according to how they function: Rule-based Machine Translation (RMT), which extended from around the 1960s to the early 1990s; Statistical Machine Translation (SMT), which extended from the early 1990s to the 2010s; and Neural Machine Translation (NMT), which took shape around 2010 and continues to the present (Hutchins, 1986; Koehn et al., 2003; Hofmann et al., 2010; Wu et al., 2016; Castellani, 2017; Vaswani et al. 2017; Elkaffash, 2020; Shormani, 2024a, b & c; Alafnan, 2024; Çetin & Duran, 2024; Postigo, 2024; Sindhuja, 2021; Shormani & Alfahd, 2025). The first is based on predefined forms, rules, and lexemes posited by expert linguists, according to which translations are generated. It revolves around mapping word forms and phrase structure rules in the source language (SL) onto counterpart word forms and phrase structures in the target language (TL). This approach is best represented by Systran and is suitable for translating simple texts rather than complex linguistic constructions and idioms (Shormani, 2024b). The second approach is based on statistics over corpora of phrase strings rather than just word forms: a large amount of text in two languages is analyzed, patterned, and modeled for use in translating new texts (Shormani, 2024b). However, it still has grammatical and semantic weaknesses, as well as contextual limitations. The third approach, which is the most widely used nowadays, is distinguished from the other two in that it uses deep neural networks and is more context-sensitive. That is, when translating a text, it looks at the context of the text as a whole, providing better translations than RMT and SMT (Shormani, 2024b). The three approaches are discussed further in the following paragraphs.

Rule-based machine translation was the earliest approach to MT; it emerged in the 1960s as a solution to linguistic barriers in cross-border communication. Initially, RMT was developed to assist the US Air Force in translating Russian documents during the Cold War, aiming to facilitate intelligence gathering, diplomatic efforts, and communication among people (Hutchins, 1986). The term “rule-based” refers to the application of grammatical rules, syntactic parsing, and predefined linguistic structures to convert text from one language to another (see e.g., Hutchins, 1986; Koehn et al., 2003; Shormani, 2024b). RMT relies on explicit human-crafted rules rather than learning from data. Linguists and computational experts manually define the syntax, morphology, and semantics of both source and target languages, creating structured frameworks for MT (cf. Shormani, 2024b & c). This approach requires extensive linguistic expertise, making it both resource-intensive and time-consuming. However, the deterministic nature of RMT ensures predictable translations for specific language pairs, making it a valuable tool for structured and well-defined text types such as legal and technical documents (Hutchins, 1986).

RMT systems performed well when translating between typologically similar languages with similar syntactic structures, as in the case of English and German, where word order and syntax follow relatively similar patterns. However, these systems struggled with language pairs that differ significantly in structure, such as English and Korean, owing to variations in sentence formation and morphological complexity (see also Castellani, 2017). Additionally, RMT struggled to process idiomatic and proverbial expressions (cf. Shormani, 2020), which often do not have direct equivalents in the target language. Cultural nuances, figurative language, and polysemous words (i.e., words with multiple meanings) posed major obstacles, leading to awkward or inaccurate translations. For example, RMT would translate the English idiom kick the bucket, meaning to die, into Arabic literally as يركل الدلو, which is far removed from “to die” and thus loses the intended meaning. Another source of difficulty for RMT is center-embedded sentence structure, where clauses are nested within other clauses, resulting in a complex sentence (cf. Shormani, 2025a). Such nesting often results in poor translations because RMT has difficulty processing these structures. These issues highlight the limitations of a purely rule-based approach and demonstrate the need for more flexible translation methods.
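The idiom problem described above can be illustrated with a minimal sketch. The toy lexicon, the idiom table, and both function names below are hypothetical and invented for illustration; actual RMT systems rely on much richer morphological and syntactic rules. The sketch only shows why a word-for-word lookup yields the literal rendering, and why an idiom needs its own entry.

```python
# Hypothetical toy lexicon (an assumption for illustration only).
LEXICON = {"kick": "يركل", "the": "", "bucket": "الدلو"}

# Hypothetical idiom table mapping a whole expression to its meaning.
IDIOMS = {"kick the bucket": "يموت"}


def translate_word_by_word(phrase):
    """Look up each word separately, as a naive rule-based system might."""
    words = [LEXICON.get(word, word) for word in phrase.lower().split()]
    return " ".join(word for word in words if word)


def translate_with_idiom_table(phrase):
    """Check the idiom table first, then fall back to word-by-word lookup."""
    key = phrase.lower().strip()
    return IDIOMS.get(key) or translate_word_by_word(phrase)


literal = translate_word_by_word("kick the bucket")
idiomatic = translate_with_idiom_table("kick the bucket")
```

Here `literal` holds the literal rendering cited in the text, while `idiomatic` holds a verb meaning "to die". Every idiom requires such a hand-written entry, which is one reason the approach scales poorly.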

Another major problem with RMT is its scalability. Because linguistic rules had to be manually crafted and refined for each language pair, developing RMT for multiple languages was laborious, time-consuming, and costly. Maintaining and updating rule sets requires continuous input from expert linguists, which makes it difficult to adapt to new linguistic changes or variations (see also Castellani, 2017; Elkaffash, 2020). Additionally, RMT is unable to handle the vast diversity of natural language effectively, as language evolves with time, context, and usage. Despite these limitations, RMT laid the groundwork for later advancements in machine translation by emphasizing the importance of linguistic structure. Recognizing the inefficiencies of RMT, researchers sought alternative approaches that could handle translation tasks more dynamically and efficiently (Brown et al., 1993; Elkaffash, 2020; Shormani, 2024b). This led to the emergence of Statistical Machine Translation (SMT), which shifted from manually defined rules to a probabilistic, data-driven approach to translation.

Statistical machine translation emerged in the 1990s as a probabilistic alternative to RMT, leveraging large bilingual corpora and statistical models to improve translation accuracy (see e.g., Hutchins, 1986; Shormani, 2024a & b). Unlike RMT, which relies on predefined linguistic rules, SMT introduces a probabilistic approach using vast amounts of parallel texts to predict the most accurate translations (Brown et al., 1993; Elkaffash, 2020). At its core, SMT operates by analyzing bilingual corpora, where sentences in one language are aligned with their corresponding translations in another. The system then applies statistical models to determine the probability of a given target language sentence based on source language input. One of the earliest and most influential SMT frameworks was IBM’s model series, which introduced word-alignment techniques and phrase-based translation methods (Koehn et al., 2003). Another major innovation in SMT is the noisy-channel model (see e.g. Neubig et al., 2010; Hofmann et al., 2010; Saito et al., 2012), which treats translation as a decoding process in which the most probable output is selected based on statistical similarity (Koehn et al., 2003). These statistical methods significantly improve translation fluency compared to RMT by allowing the system to adapt dynamically to large datasets.
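The noisy-channel formulation mentioned above can be stated compactly. As a sketch in the notation commonly used in the SMT literature (the symbols f for the source sentence and e for a candidate target sentence are notational assumptions, not taken from this paper), the decoder selects the translation

```latex
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)
```

where P(f | e) is the translation model estimated from aligned bilingual corpora and P(e) is a language model of the target language. The second equality follows from Bayes' rule, because P(f) does not depend on e and can be dropped from the maximization.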

However, SMT faces several challenges, particularly in handling complex syntactic structures and long-range dependencies (Hofmann et al., 2010). Since SMT relies on phrase-level probabilities rather than deep linguistic understanding, it often struggles with sentence coherence and grammatical correctness. Thus, while SMT can accurately translate isolated words or short phrases, it often fails to maintain the logical flow of longer sentences, resulting in disjointed or incoherently structured outputs. Additionally, SMT models require extensive training data, so low-resource languages such as Hindi, Marathi, and Irish are often inadequately represented (Shormani, 2024b & c). As a result, translations involving these languages tend to be less reliable, with significant errors in syntax and semantics. Early versions of Google Translate, which originally relied on SMT (Elkaffash, 2020), demonstrated these weaknesses by frequently generating grammatically inconsistent and contextually inaccurate translations. Although SMT improved translation accuracy compared to RMT, it was still far from achieving human-level accuracy.

One of the major limitations of SMT is its dependence on large, high-quality bilingual corpora. The availability of these corpora varies greatly between languages: high-resource languages such as English, Spanish, and French benefit from better-trained models, while low-resource languages remain underrepresented. Moreover, as noted by Elkaffash (2020), SMT requires significant human intervention to refine translations, known as post-editing (cf. Groves & Dag, 2009; Krings, 2001; de Almeida & O’Brien, 2010), which is often necessary to correct errors in grammar and meaning. Recognizing these shortcomings, researchers sought more advanced techniques that could better capture the nuances of natural language and improve contextual accuracy. In response to these challenges, neural machine translation (NMT) was introduced, marking a significant leap forward in machine translation through the use of deep learning and artificial neural networks.

NMT aims to overcome the limitations of SMT by modeling entire sentences as continuous representations, thereby allowing for greater contextual understanding and fluency. NMT revolutionized the field of MT in the 2010s by introducing deep learning and artificial neural networks. Unlike SMT, which translates text at the phrase level using probabilistic models, NMT considers entire sentences and their broader contextual relationships. This approach allows for more fluent, coherent, and natural-sounding translations, reducing the disjointed outputs that are common in SMT. The fundamental shift brought about by NMT was its ability to process words not as isolated units but as part of a continuous representation, using deep learning techniques to capture the semantic and syntactic structures of a sentence; this ability became even more pronounced with the later introduction of transformers and vector representations. Early NMT models were based on Neural Networking Algorithms (NNAs) and Recurrent Neural Networks (RNNs), which helped maintain sequential dependencies during translation. However, RNNs struggle with long-range dependencies, meaning that translation quality decreases for longer and more complex sentences. This limitation prompted the search for more effective architectures that could process language in a non-sequential, context-aware manner.

A major breakthrough came in 2017 when Vaswani and colleagues introduced the transformer architecture, which replaced RNNs in many NMT models and underlies ChatGPT (see also Lee, 2023; Siu, 2023; Kumar et al., 2024; Shormani, 2024a, b & c). Unlike RNNs, transformers process entire sequences of words, phrases, and sentences simultaneously, making them significantly more efficient and capable of handling long-range dependencies (Vaswani et al., 2017). The self-attention mechanism in transformers allows models to weigh the importance of different words in a sentence, helping translations preserve their meaning across languages. This innovation has led to the rise of Large Language Models (LLMs), such as OpenAI’s Generative Pre-trained Transformer (GPT), which leverage transformers to produce more contextually aware and fluent translations. NMT has dramatically improved translation accuracy, minimized common issues such as word-for-word errors, and enhanced contextual understanding. As a result, major translation platforms, including Google Translate and DeepL, have adopted NMT to replace older SMT-based models. The continued refinement of transformer-based architectures has brought machine translation closer to sound, human-level translation output (see also Jiao et al., 2023; Lee, 2023; Siu, 2023; Kumar et al., 2024). Another major advancement in AI translation is the incorporation of neural vectors.
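To make the self-attention idea more concrete, the following is a minimal sketch of scaled dot-product attention as described by Vaswani et al. (2017), written in plain Python. The function names and the toy two-dimensional vectors are illustrative assumptions. As a further simplification, the token vectors serve directly as queries, keys, and values, whereas real models use learned projection matrices, multiple attention heads, and far larger dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention over a list of token vectors.

    Simplifying assumption: queries, keys, and values are the token
    vectors themselves (no learned projections).
    """
    d_k = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        output = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(d_k)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three "tokens" as 2-dimensional vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, including itself. Because all rows are computed from the whole sequence at once, no token has to wait for the previous one to be processed, unlike in an RNN.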

Despite its advancements, NMT still faces challenges, particularly when dealing with culture- and religion-based texts (see e.g., Shormani, 2024b, 2025b). Unlike other types of texts, such as technical and scientific texts, culture- and religion-based texts require more than simply rendering the wording (Shormani, 2020); they demand a deep understanding of cultural, religious, and historical contexts (Shormani, 2025b). Many NMT models, including ChatGPT, struggle to accurately translate idiomatic expressions, culturally specific texts, and religious terminology (cf. Shormani, 2020). This is because AI models primarily learn from vast amounts of (Internet) data, which may not capture these cultural nuances or the sensitivities of religious discourse. Additionally, ethical concerns arise when translating sensitive materials, as different cultures interpret words and concepts differently (Shormani, 2024b & c). Addressing these limitations requires further advancements in AI translation training data and methods, dataset diversification, and post-editing to refine NMT’s ability to handle complex cultural and religious texts effectively (see e.g., Shormani, 2024a, b & c).

ChatGPT was introduced in 2022 by OpenAI. It is based on the syntactic, morphological, logical, and algorithmic processing of a large number of language samples, patterns, dictionaries, and other information by what are known as Large Language Models (LLMs), which enables it to generate new outputs and to adhere more closely to context when performing new tasks (Gill & Kaur, 2023; Çetin & Duran, 2024; Puppel & Borg, 2024). Built on Generative Pre-Trained Transformer (GPT) systems, specifically GPT-3.5 and GPT-4, ChatGPT can perform many tasks, including transforming data, translating texts, post-editing, translation evaluation, summarizing, and responding to users’ inquiries and conversations (Gill & Kaur, 2023; Jiang et al., 2024; Macken, 2024; Puppel & Borg, 2024). It can generate written, visual, or auditory outputs that closely resemble those produced by humans. It essentially benefits from deep learning and Natural Language Processing (NLP), a branch of AI that teaches machines to comprehend human language and produce similar output (Gill & Kaur, 2023).

Amid sustained technological advancement, DeepSeek, an accessible open-source model, emerged in January 2025 as a novel alternative that competes with other AI tools on the market through its high scalability and efficient performance (Guo et al., 2024; Joshi, 2025; Peng et al., 2025; Wang & Kantarcioglu, 2025). Like ChatGPT, DeepSeek is a transformer-based Large Language Model (LLM) that performs various tasks, such as mathematical and structural reasoning, language processing, problem solving, finance, and healthcare diagnosis. It is comprehensively pretrained on high-quality linguistic and syntactic corpora of code (Guo et al., 2024). Its architecture features Mixture of Experts and Multi-Head Latent Attention (Joshi, 2025; Peng et al., 2025; Wang & Kantarcioglu, 2025). Despite challenges in innovative tasks and user safety, DeepSeek is reported to have many advantages and a promising future relative to other AI tools, including ChatGPT and Codex (Guo et al., 2024; Joshi, 2025). Its major advantages include grammatical accuracy and contextual evaluation (Joshi, 2025).

3. Translating dialects

Translating non-standard dialects presents significant challenges for human translators because of their lack of formal codification, regional variation, and deep cultural embedding. Unlike standardized languages, which have established grammatical rules and extensive documentation, non-standard dialects often rely on oral traditions and are shaped by local customs, idioms, and phonetic shifts that may not have direct equivalents in the target language (Federici, 2011; Kong, 2013). Additionally, dialects often carry connotations of social class, regional identity, and cultural heritage. Translating these aspects requires more than linguistic proficiency; cultural fluency is also needed to ensure that the translation resonates with the target audience while preserving the authenticity of the source. For example, when translating Swedish dialects into English, preserving local identity and authenticity is crucial, as dialects are expressions of local identity and community (Federici, 2011).

Thus, if translating dialects is difficult for human translators, MT tools can be expected to face even greater difficulty, because these dialects are rarely represented in their training data. Put simply, these tools struggle with non-standard dialects: their training data typically consist of standard-language corpora, so they may not process vernacular speech accurately, leading to errors and loss of meaning (see also Puppel & Borg, 2024). Arabic is a diglossic language with several dialects and vernaculars (see also Ferguson, 1959). In this study, we examine this issue using both ChatGPT and DeepSeek to determine the extent to which they can translate Yemeni Sana’ani Arabic (YSA).

Over the past decade or so, a number of studies have examined and evaluated AI translation tools, seeking to identify their efficiency and usefulness, as well as their weaknesses and limitations in translation. Taking the English-German pair as a case study, Puppel and Borg (2024) evaluated the performance of ChatGPT and showed its strengths and weaknesses based on prompts. The most prominent strength of ChatGPT translation is the coherence of the translated output. However, the prevailing limitation of ChatGPT exhibited in their study is related to style. Other limitations include fluency and accuracy. Their study highlighted the need for post-editing of ChatGPT’s translated output to detect and correct errors. Therefore, “the intervention of trained translators is required to correct errors and fine-tune the machine-translated text” (Puppel & Borg, 2024, p. 22).

In a study of three short stories machine-translated from English into Dutch, Macken (2024) evaluated the ability of ChatGPT-4o, used as a post-editing tool, to edit literary translations automatically, in comparison with post-editing by experienced professional translators of literary works from English into Dutch. The study showed that the automatic changes made by ChatGPT were at the word level and that it made more lexical changes than the human editors did. In contrast, professional editors made changes not only at the word level but also at the level of style. Overall, the study concluded that although ChatGPT could correct a number of errors, its edited texts still contained more problems than texts post-edited by professional translators.

In a comparative study of the advantages and limitations of conventional MT systems, new chatbots, and ChatGPT in translation from English into Spanish and Portuguese, Postigo (2024) stated that there are differences between the translation output obtained from MT and that obtained from AI tools. Nevertheless, she found that all MT and AI tools present several grammatical and semantic challenges. Additionally, Jiang et al. (2024) conducted a mixed quantitative and qualitative study on the potential of automated evaluation of machine translation from Chinese into Portuguese by means of two ChatGPT models, i.e., 3.5 and 4.0, in addition to five human raters, for the sake of analytic comprehensiveness. The sample consisted of 20 sentences translated from Chinese into Portuguese. The study concluded that ChatGPT, especially the 4.0 model, reached the efficiency level of conventional human evaluation and that its prospects will improve further if it is enriched with more balanced capabilities.

To examine how well AI tools integrate culture into their translations, Cao et al. (2024) addressed the adaptation of nuanced cultural contexts by translating Chinese recipes into English. These recipes were both automatically generated and human-enriched. The study concluded that, although GPT-4 showed impressive capability in adapting recipes across cultures, it still fell short of human ability and expertise. Additionally, Deilen et al. (2023) examined intralingual machine translation from German into Easy Language using ChatGPT. The analysis was based on readability, correctness, and syntactic complexity. The results showed that the content was not translated entirely correctly and that some content was missing, while syntactic constructions were simplified, but not to the required degree. The paper concluded that professional human translators are needed both to pre-instruct the tool and to edit its translated output.

Alafnan (2024) examined the effectiveness of ChatGPT and Google Translate in translating selected Arabic and English speeches of the King of Jordan, Abdullah II, from Arabic into English and from English into Arabic. The study revealed that Google Translate’s outputs were inadequate and required major editing. ChatGPT’s outputs, especially from Arabic into English, were acceptable, though they needed some adjustment. However, the study emphasized that machine translation complements professional human translators and does not replace them, since human mediation is indispensable for the adequacy of translation.

Thus, this study addresses three questions:

  1. Can AI models, specifically ChatGPT and DeepSeek, capture the dialectal nuances of YSA?

  2. Which model is better at translating YSA terms, ChatGPT or DeepSeek?

  3. What are the problematic nuances that ChatGPT and DeepSeek face in translating such terms?

4. Methods

4.1 Data collection

We collected the study data from different sources, including web sources such as Wikipedia and the first author’s relatives, who are native speakers of Sana’ani Arabic. The dataset contained 50 terms. After collecting the data, we classified the terms into categories. The resulting classification is presented in Table 1.

Table 1. Categories of Yemeni Sana’ani Arabic terms.

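(Nominals: Places, Clothes, Animals, Stuff, Households. Nonnominals: Verbs, Adj & Advs.)

| Places | Clothes | Animals | Stuff | Households | Verbs | Adj & Advs |
|---|---|---|---|---|---|---|
| شاقوص | مُغْمُق | دِمِّة | شِرْكِة | طاسة | نِبْدَع | مطنن |
| صومعة | مَحْزَق | حلباني | جَعالة | بالدي | وخّر | شوعة |
| دَيْمِة | صُماطة | تيس | قِرِّيْح | بُورِي | قوٌّى | حالي |
| قاع | أدوان | معزة | تـُتنْ | مَعْشَرَة | كوزر | بحين |
| غُرقة | مشمع | شقري | بيعة | مَدَق | سير | أيحين |
| دِرجان | بَرْدِة | | | مَلَتْ | بيشقص | فيسع |
| حانوت | نصلة | | | قوطي | يلغّج | لَلْمه |
| | زنة | | | المدعي | سمخ | سَعْما |
| | | | | محواش | | |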

As shown in Table 1, the data were divided into seven categories: Places, Clothes, Animals, Stuff, Households, Verbs, and Adj & Advs. The purpose of this categorization is twofold: (i) to present the data clearly and (ii) to clarify the nature of each category. Places contains 7 terms, Clothes 8, Animals 5, Stuff 5, Households 9, Verbs 8, and Adj & Advs 8.
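As a quick consistency check, the category sizes reported above can be tallied against the stated dataset size of 50 terms. The short sketch below does this and also derives each category's share of the dataset; the variable names are illustrative, and the shares are simple derived arithmetic rather than figures reported by the study.

```python
# Category sizes as stated in the text for Table 1.
CATEGORY_COUNTS = {
    "Places": 7,
    "Clothes": 8,
    "Animals": 5,
    "Stuff": 5,
    "Households": 9,
    "Verbs": 8,
    "Adj & Advs": 8,
}

# The seven categories should add up to the 50 collected terms.
total_terms = sum(CATEGORY_COUNTS.values())

# Share of each category in the dataset, in percent (derived, not reported).
category_shares = {
    name: 100.0 * count / total_terms
    for name, count in CATEGORY_COUNTS.items()
}

for name, share in category_shares.items():
    print(f"{name}: {CATEGORY_COUNTS[name]} terms ({share:.0f}%)")
print(f"Total: {total_terms} terms")
```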

4.2 Procedure

The study proceeded in three stages. Stage 1 concerned data collection and classification: we collected the data from the sources mentioned above and classified them. Stage 2 consisted of translating the collected data into English using ChatGPT-4o and DeepSeek v3. After obtaining the translations from both AI models, we translated the terms ourselves. Our translation depended heavily on two factors: i) our knowledge of YSA and ii) the first author’s relatives, who are native speakers of YSA. If we did not know what a term meant, we asked them to give us its meaning in SA or to explain it, for instance by means of an example. Once the meaning was understood, the translation task became easier. We then tabulated the results as ChatGPT, DeepSeek, and human translations. Stage 3 dealt with data analysis and discussion.

4.3 Methods of analysis

The study employed a combined qualitative and quantitative approach to evaluate the accuracy and contextual appropriateness of the translations. For the quantitative analysis, we used simple counts and percentages. For the qualitative part, we adopted a linguistic and sociocultural analysis framework, known as micro and macro analysis (see e.g. Fairclough, 2003). We assessed the (in)correctness and appropriateness of the translations of dialectal terms produced by the two AI models, ChatGPT and DeepSeek, against the human translation. We set the following evaluation criteria: i) dialectal and cultural subtleties, where we examined whether both AI models capture these nuances and which is more successful; ii) linguistic subtleties, where we assessed whether both models capture subtleties that lie within morphophonology/orthography and syntax; iii) contextual subtleties, where we investigated whether the models’ translations reflect the intended meaning in different contexts; and iv) Standard Arabic (SA) biases, where we evaluated whether the translations produced by both models are biased towards SA.
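The count-and-percentage step can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the study: the matching rule (a model output counts as a match when the human gloss appears among its slash- or "or"-separated alternatives) is an assumption introduced here, whereas the study relies on the authors' own judgment. The six rows are copied from Table 2 (terms 2, 6, 13, 18, 27, and 34), and all function names are hypothetical.

```python
import re

# A few rows copied from Table 2: (term number, ChatGPT, DeepSeek, human).
ROWS = [
    (2, "Silo/Minaret/Monastery", "Silo or minaret", "minaret"),
    (6, "Stairs", "Stairs or steps", "Stairs"),
    (13, "Curtain", "A type of cloak or garment", "Curtain"),
    (18, "Male goat", "Male goat", "Male goat"),
    (27, "Move away/Step aside", "He delayed or postponed", "Step aside"),
    (34, "Cold/Chilled", "Cold or chilly", "Bucket"),
]


def alternatives(gloss):
    """Split a gloss such as 'Silo or minaret' into lowercase alternatives."""
    parts = re.split(r"/|\bor\b", gloss.lower())
    return {part.strip() for part in parts if part.strip()}


def matches(model_gloss, human_gloss):
    """ASSUMED rule: the human gloss must be one of the model's alternatives."""
    return human_gloss.lower().strip() in alternatives(model_gloss)


def count_and_percentage(rows, column):
    """Return (number of matches, percentage of matches) for one model column."""
    hits = sum(matches(row[column], row[3]) for row in rows)
    return hits, 100.0 * hits / len(rows)


chatgpt_hits, chatgpt_pct = count_and_percentage(ROWS, 1)
deepseek_hits, deepseek_pct = count_and_percentage(ROWS, 2)
```

Because the rows are a small hand-picked subset and the matching rule is naive, the resulting figures are illustrative only and should not be read as the study's findings. A string-matching rule of this kind cannot recognize synonyms or partially correct renderings, which is why the qualitative analysis remains necessary.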

5. Results and discussion

5.1 Results

We first present the results in the form of ChatGPT, DeepSeek, and human translations. Table 2 presents these findings.

Table 2. AI Translation (ChatGPT and DeepSeek) and Human translation.

No | Term | ChatGPT | DeepSeek | Human
1 | شاقوص | Hook (or a type of tool with a hooked end) | A type of fish (likely a local or dialectal term) | A small and narrow window
2 | صومعة | Silo/Minaret/Monastery | Silo or minaret | minaret
3 | دَيْمِة | Continuous rain (light drizzle) | Light rain or drizzle | Kitchen
4 | قاع | Valley/Lowland/Open land | Bottom or base | Bottom
5 | غُرقة | Room | Drowning or flood | hole
6 | دِرجان | Stairs | Stairs or steps | Stairs
7 | حانوت | Shop/Store | Shop or small store | shop
8 | مُغْمُق | Dark/Dim | Confused or muddled | Veil
9 | مَحْزَق | Tight place/Constricted area | Narrow or tight spot | belt
10 | صُماطة | Leftover food/Small food scraps | A small bag or pouch | shawl
11 | أدوان | Enemies | Tools or utensils | clothes
12 | مشمع | Waxed cloth/Plastic cover | Oiled cloth or tarpaulin | Plastic cover
13 | بَرْدِة | Curtain | A type of cloak or garment | Curtain
14 | نصلة | Blade/Tip of a knife or sword | Arrowhead or sharp point | Djambia
15 | زنة | Weight | Weight or measure | garment
16 | دِمِّة | Silence/Stillness | A type of plant or herb (context-dependent) | cat
17 | حلباني | Milkman | Related to milk or milking | Millipede
18 | تيس | Male goat | Male goat | Male goat
19 | معزة | Female goat | Female goat | Female goat
20 | شقري | Blond/Fair-skinned | Chestnut-colored (often used for horses) | Cock
21 | شِرْكِة | Company/Partnership | Partnership or company | meat
22 | جَعالة | A small reward or treat | Reward or compensation | Sweet presents
23 | قِرِّيْح | Wound/Injury | A type of plant or herb | firecrackers
24 | تـُتنْ | Tobacco | Tobacco | Tobacco
25 | بيعة | Sale/Transaction | Oath of allegiance or sale | leek
26 | نِبْدَع | We start/We begin | We innovate or create | We start
27 | وخّر | Move away/Step aside | He delayed or postponed | Step aside
28 | قوّى | Strengthened/Made stronger | He strengthened or empowered | welcome/please
29 | كوزر | Drank excessively | A type of plant or herb | sit
30 | سير | Go/Walk/Travel | Walk or journey | go
31 | بيشقص | Cuts into pieces/Trims | He cuts or slices | oversee
32 | يلغج | Chews noisily/Mumbles while eating | He combs or untangles | Talkative
33 | طاسة | Bowl/Metal cup | A small bowl or cup | bowl
34 | بالدي | Cold/Chilled | Cold or chilly | Bucket
35 | بُورِي | Pipe/Exhaust pipe | A type of mat or straw | Tobacco container
36 | مَعْشَرَة | Gathering/Assembly | Companionship or group | Big and wide bowl
37 | مَدَق | Mortar and pestle (used for grinding) | He pounded or crushed | pestle
38 | مَلَتْ | Got bored/Fed up | She tilted or inclined | Wooden pot used to mix tobacco
39 | قوطي | Can/Tin | A type of plant or herb | Can
40 | مدعي | Pretender/Faker | Claimant or plaintiff | hookah
41 | محواش | Farmyard/Enclosure | He erased or removed | Small wooden mixer used for Aseed
42 | مطنن | Distracted/Not paying attention | He exaggerated or embellished | upset
43 | شوعة | Bright light/Flash | A small branch or twig | ugly
44 | حالي | Sweet/Delicious | My condition or state | beautiful
45 | بحين | When/At the time | At the time or meanwhile | fast
46 | أيحين | Right now/At this moment | When or at what time | when
47 | فيسع | Quickly/Fast | He expands or makes room | quickly
48 | لَلْمه | Gather it up/Collect it | A type of plant or herb | why
49 | سَعْما | Sometimes | A type of plant or herb | like
50 | سمخ | Jumped/Leaped | He raised or elevated | brave

Table 2 shows several types of renderings produced by ChatGPT and DeepSeek: correct, incorrect, and appropriate (acceptable) translations. For some terms, a model provides a correct and an incorrect translation simultaneously. As for ChatGPT, the incorrect translations include term 1, viz. شاقوص , which ChatGPT translated as ‘Hook (or a type of tool with a hooked end)’. This rendering is far from the correct translation; the human translation, ‘a small and narrow window,’ is correct. A شاقوص is found in old houses, specifically those in the Old Sana’a City. It is located on the stairs (or even in rooms) and is used to look outside without being noticed. For example, if someone knocks on the door and only women are in the house, they use the شاقوص to see who is knocking before opening the door. To exemplify the correct translations by ChatGPT, take the term دِرجان , which corresponds to SA دِرج , a plural form of دِرجة ‘stair’; ChatGPT rendered it as ‘stairs.’ As for the appropriate translations by ChatGPT, take, for example, the term جَعالة , which is a mixed present of biscuits and sweets brought by a father, mother, or older brother for children. ChatGPT translates this term as ‘a small reward or treat,’ which is acceptable but not fully correct. The full sense is captured by the human translation, ‘sweet presents.’

DeepSeek translations likewise fall into several types of renderings. We obtained seven correct translations, 34 incorrect translations, and eight appropriate translations. Like ChatGPT, DeepSeek provides both a correct and an incorrect translation simultaneously for one term.

DeepSeek has several correct translations. For example, the term أيحين was translated as ‘When or at what time.’ أيحين is in fact a wh-word in YSA which is used for asking about time (see also Shormani et al., 2025). To exemplify the incorrect translations by DeepSeek, the term شاقوص was translated as ‘a type of fish.’ This translation is far from the correct translation, which is ‘a small and narrow window.’ As for appropriate translations by DeepSeek, take, for instance, the term بَرْدِة , which means ‘curtain’ as in the human translation, but which DeepSeek translates as ‘a type of cloak or garment.’ This translation is acceptable because a بَرْدِة is a piece of cloth or garment used to cover windows. DeepSeek also provides both a correct and an incorrect translation in the case of صومعة , which was translated as ‘Silo or minaret,’ the first of which is incorrect while the latter is correct. The rendering ‘silo’ is incorrect because a silo is used for storing crops after harvesting, while ‘minaret’ (of a mosque) is what is meant here.

Table 3 displays the frequency and percentage of correct, incorrect, and appropriate translations by ChatGPT and DeepSeek. It also presents the frequency and percentage of terms for which a model provides a correct and an incorrect translation simultaneously (Co&inc). ChatGPT translated 17 terms correctly (34%) and 29 terms incorrectly (58% of the total number of terms involved). There are three appropriate translations by ChatGPT (6%). ChatGPT provides correct and incorrect translations at the same time for only one term (2%), namely صومعة , which it rendered as ‘Silo/Minaret/Monastery.’ DeepSeek differs from ChatGPT. It translated only seven terms correctly, fewer than ChatGPT, amounting to 14% of the total number of terms involved in the study. It translated 34 terms incorrectly (68%), and its appropriate translations amount to eight terms (16%). Finally, DeepSeek also provides both a correct and an incorrect translation for one term (2%), namely صومعة , as discussed above. Interestingly, both AI models provided correct and incorrect translations for the same term.

Table 3. Summary of the results.

Tra. category | ChatGPT Freq | ChatGPT % | DeepSeek Freq | DeepSeek %
Correct | 17 | 34 | 7 | 14
Incorrect | 29 | 58 | 34 | 68
Appropriate | 3 | 6 | 8 | 16
Co&inc | 1 | 2 | 1 | 2
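
The percentages in Table 3 follow from dividing each frequency by the 50 terms in the study. A minimal Python sketch of this count-and-percentage step is given below. The frequencies are copied from Table 3, while the dictionary layout and the helper name `to_percentages` are illustrative assumptions, not the authors' tooling.

```python
# Frequencies copied from Table 3; the data layout is an illustrative assumption.
TOTAL_TERMS = 50

frequencies = {
    "ChatGPT": {"Correct": 17, "Incorrect": 29, "Appropriate": 3, "Co&inc": 1},
    "DeepSeek": {"Correct": 7, "Incorrect": 34, "Appropriate": 8, "Co&inc": 1},
}


def to_percentages(counts, total=TOTAL_TERMS):
    """Convert category counts into whole-number percentages of `total`."""
    # Guard against a tally that does not cover every term exactly once.
    if sum(counts.values()) != total:
        raise ValueError("category counts must add up to the number of terms")
    return {category: round(100 * n / total) for category, n in counts.items()}


if __name__ == "__main__":
    for model, counts in frequencies.items():
        print(model, to_percentages(counts))
```

The sum check is a design choice: it makes a miscounted category fail loudly rather than silently producing percentages that do not add up to 100.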

5.2 Discussion

5.2.1 ChatGPT translations

As shown in Tables 2 and 3, ChatGPT yielded 17 correct translations. These include دِرجان , translated as ‘stairs.’ This is accurate because in Standard Arabic, دِرجان is the plural form of دِرجة , meaning ‘stair.’ This term is widely used in YSA to refer to a set of stairs leading to another floor of a multifloored building. Another accurate translation is that of قاع , rendered as ‘valley/lowland/open land’. In YSA, قاع refers to a flat, low-lying area of land, often used for agriculture. The term is commonly used in regions where such landscapes are found, so ChatGPT’s translation is correct. Similarly, حانوت was correctly translated as ‘shop/store’. This word, originating from “older” Yemeni Arabic, remains in use in YSA to describe a small store or shop selling various goods, which aligns well with ChatGPT’s translation. Another well-rendered term is تـُتنْ , translated by ChatGPT as ‘tobacco.’ تـُتنْ is a commonly used word in YSA for the substance smoked in a hookah, making this translation highly accurate. The term بَرْدِة was also correctly translated by ChatGPT as ‘curtain.’ The term is widely used in YSA households to refer to a piece of cloth covering a window or doorway.

Additionally, the term نِبْدَع was accurately translated by ChatGPT as ‘we start/we begin’. In YSA, it is used to indicate the initiation of an action or event, such as starting work or a journey. Another correct translation is سير , translated as ‘go/walk/travel,’ though ‘go’ is the best translation, and this is what Sana’anis mean when using it. ‘Walk’ has another term in YSA, which is إخطى . In YSA, سير is a commonly used verb to tell somebody ‘to go,’ specifically by walking. Also, طاسة was correctly translated by ChatGPT as ‘bowl/metal cup’. In YSA, it refers to a small, often metallic container used for drinking (e.g. soup) or eating. Likewise, مَدَق was correctly translated as ‘mortar and pestle’ (used for grinding). It is often made of copper or another metal and is an essential kitchen tool in Yemen used for grinding spices or grains. The term قوطي was also correctly translated as ‘can/tin’. By contrast, the term بُورِي was incorrectly rendered as ‘pipe/exhaust pipe’; it denotes the container of تـُتنْ that is put on the hookah.

We now turn to ChatGPT’s incorrect translations. This category included 29 terms. For example, ChatGPT translated شاقوص as ‘Hook (or a type of tool with a hooked end),’ which is incorrect. In YSA, شاقوص refers to a small window found in traditional houses, particularly in the Old Sana’a City. These windows are strategically placed on stairs or in rooms, allowing residents to see outside without being noticed. For example, if someone knocks on the door and only women are home, they use the شاقوص to identify the visitor before opening the door, as noted above. This highlights the cultural and architectural significance of the term that ChatGPT missed. ChatGPT translated دَيْمِة as ‘Continuous rain (light drizzle),’ which is incorrect. The human translation, ‘kitchen,’ is accurate. Although the term مطبخ is used in other parts of Yemen, some regions retain older Yemeni terms for ‘kitchen,’ such as سقيفة in the Ibb region. Likewise, the term دَيْمِة is widely used in YSA to mean ‘kitchen’. This term is deeply tied to daily life in YSA, and ChatGPT’s translation is unrelated to its actual meaning. ChatGPT translated غُرقة as ‘room,’ which is incorrect. The human translation, ‘hole,’ is accurate because غُرقة refers to a hole in YSA. ChatGPT’s translation of this term reflects a lack of understanding of the term’s true meaning. It translated مُغْمُق as ‘dark/dim,’ which is incorrect compared to the human translation, ‘veil.’ مُغْمُق refers to a woman’s veil, a piece of cloth that Sana’ani women use to cover their faces and that has cultural and religious significance in Yemeni society. ChatGPT’s translation of this term does not capture these cultural and religious connotations. ChatGPT translated مَحْزَق as ‘tight place/constricted area,’ which is incorrect. This translation captures neither the cultural nor the dialectical nuances. In YSA, the term مَحْزَق refers to a belt with holes for gun bullets. In Sana’a and in northern parts of Yemen, the مَحْزَق is worn as a sign of “manhood” and courage and as a signal of readiness to fight. It is often worn with a gun. The human translation, ‘belt,’ reflects this meaning.

ChatGPT translated شِرْكِة as ‘company/partnership,’ which is far from the denotation of this term in YSA. The term شِرْكِة simply means ‘meat,’ a staple of Yemeni cuisine and daily life. ChatGPT’s translation reflects a misunderstanding of the term, likely due to the word’s similarity to the Arabic term for ‘partnership’ ( شركة ), from which it nevertheless differs orthographically. The latter is written as شَرِكَة ; note the different harakat, as we will see shortly.

Finally, the appropriate (acceptable) translations by ChatGPT included only three terms. In this category, ChatGPT provides a translation that is somewhat acceptable but not fully accurate, as in the case of جَعالة , translated by ChatGPT as ‘a small reward or treat’. While this conveys the general idea, the specific meaning in YSA culture is a mix of sweets, biscuits, and treats brought by a father, mother, or older sibling for children as a gesture of care and love. It is a traditional and cultural practice rather than just a generic ‘treat.’ Another example is بحين , translated as ‘when/at the time’. While this is an acceptable translation, the term in YSA more precisely means ‘as soon as’ or even ‘fast’. Similarly, أيحين was translated as ‘right now/at this moment’, which is mostly correct. However, in YSA, أيحين carries a sense of questioning, meaning ‘when,’ as can be observed in the human translation.

5.2.2 DeepSeek translations

Recall that DeepSeek has seven correct translations, 34 incorrect translations, eight appropriate translations, and one term with both a correct and an incorrect translation. We exemplify some of the correct translations here. For example, DeepSeek translated the term قاع as ‘bottom or base.’ This is accurate because in YSA, قاع also refers to the lowest part of something, such as the bottom of a container. However, as we have seen in ChatGPT’s translation, the term قاع also refers to a valley that is fertile for agriculture. There are many well-known قيعان (the plural of قاع ) in Yemen, such as قاع البون، قاع جهران (Bawn valley and Jahran valley, respectively), and almost all types of crops are grown in these fertile valleys. Another term translated correctly by DeepSeek is دِرجان ‘stairs.’ DeepSeek adds ‘steps’ as an alternative translation, which does not reflect the actual context as well as ‘stairs’ does. حانوت was also accurately translated as ‘shop or small store,’ reflecting its meaning in the local dialect. Note that this translation aligns with ChatGPT’s translation. DeepSeek adds ‘small,’ which is true; in the Sana’ani dialect, a حانوت is a small shop in a traditional old market like Souq almilħ, a famous souq in Sana’a Old City. Additionally, تيس was correctly rendered as ‘male goat’ and معزة as ‘female goat,’ both of which match the human translations. Another accurate translation is تـُتنْ , which DeepSeek translated as ‘tobacco,’ a common term in YSA for tobacco products.

However, several incorrect translations were made. The term شاقوص , which DeepSeek translated as ‘a type of fish (likely a local or dialectal term),’ is a good case in point. This is incorrect, as شاقوص in YSA refers to a small and narrow window in old houses, particularly in Old Sana’a City, as we have noted earlier in relation to ChatGPT’s translation. The term دَيْمِة was mistranslated as ‘light rain or drizzle,’ while in reality it means ‘kitchen.’ Another significant mistranslation is مُغْمُق , which was rendered as ‘confused or muddled,’ whereas in YSA it actually refers to a woman’s veil. (The term صومعة , by contrast, received the mixed rendering ‘silo or minaret,’ where ‘minaret,’ i.e. the tall, narrow, circular tower of a mosque, is the intended meaning.)

Differently from ChatGPT, which rendered it as ‘leftover food/small food scraps,’ DeepSeek translated the term صُماطة as ‘a small bag or pouch,’ which is also incorrect. The term refers to a sort of ‘shawl,’ a traditional Yemeni piece of clothing that men wear on their heads or put on their shoulders. Another incorrect translation concerns the term زنة , which DeepSeek rendered as ‘weight or measure,’ while its actual meaning in YSA is ‘garment,’ like a robe. دِمِّة was mistranslated by DeepSeek as ‘a type of plant or herb,’ but it actually means ‘cat,’ as in the human translation. The term بُورِي was translated as ‘a type of mat or straw,’ which is not correct. The correct translation is the container of تـُتنْ that is put on the hookah, as noted above. DeepSeek also mistranslated يلغج as ‘he combs or untangles,’ whereas it actually means to repeat something several times, or simply ‘talkative,’ in YSA. بالدي was translated as ‘cold or chilly,’ but it means ‘bucket’ in the Sana’ani dialect. مَعْشَرَة was mistranslated as ‘companionship or group,’ but its actual meaning is ‘big and wide bowl,’ which is used for serving Aseed. مَدَق was translated as ‘he pounded or crushed,’ but it refers to a small wooden or copper tool used for grinding, especially spices, as indicated in the human translation. مَلَتْ was rendered as ‘she tilted or inclined,’ whereas its correct meaning is a wooden or aluminum container in which tobacco is mixed.

Regarding appropriate translations, DeepSeek translated غُرقة as ‘drowning or flood,’ which is somewhat close but not fully accurate, as its proper meaning is ‘hole.’ أدوان was translated as ‘tools or utensils,’ which is acceptable but not entirely precise, as its actual meaning is ‘clothes.’ مشمع was translated as ‘oiled cloth or tarpaulin,’ which is reasonable but slightly different from the correct meaning of ‘plastic cover.’ بَرْدِة was translated as ‘a type of cloak or garment,’ while it actually means ‘curtain.’ نصلة was translated as ‘arrowhead or sharp point,’ which is somewhat close but does not fully convey its actual meaning, Djambia (the traditional Yemeni dagger). جَعالة was translated as ‘reward or compensation,’ but it should be understood as a mixture of sweets, biscuits, and treats given to children. سير was translated as ‘walk or journey,’ which is close but does not capture its everyday use in YSA as ‘go.’ بحين was acceptably translated as ‘at the time or meanwhile,’ though the human translation conveys the meaning more naturally.

6. ChatGPT and DeepSeek: Convergence and divergence

6.1 Dialectal and cultural subtleties

Language reflects culture and vice versa (see e.g. Newmark, 1988; Bassnett, 2013; Shormani, 2020), and given that a dialect is a regional variety of a language, one could argue that a dialect reflects culture more precisely than the language as a whole because it represents a regional identity (Arzu & Issa, 2014; Daulay, 2017). A dialect carries regional variations mirroring everyday affairs, people’s needs, and emotions. Thus, a dialect is closer to its speakers than the standard language is. In our case, YSA is closer than SA to Sana’anis, as it is their mother tongue. A dialect may also include lexical items that are not found in the standard language. For example, the term جَعالة belongs to YSA, but not to SA. Given these dialectical peculiarities, we examined how ChatGPT and DeepSeek deal with such dialectical subtleties and whether they both understand them. For instance, the term جَعالة refers to a gift of sweets for children brought by a father, mother, or relatives. ChatGPT translates it as ‘a small reward or treat,’ which is close to the meaning, whereas DeepSeek renders it more generically as ‘reward or compensation,’ losing the cultural sense. The term نِبْدَع was accurately translated by ChatGPT as ‘we start/we begin’. However, DeepSeek was not able to capture this dialectical aspect, providing an incorrect translation, ‘we innovate or create.’

ChatGPT and DeepSeek incorrectly translated حلباني as ‘milkman’ and ‘related to milk or milking,’ respectively, which shows their failure to capture the cultural and dialectical nuances of this term. The term حلباني in YSA refers to an insect commonly found in Yemen, specifically in rainy seasons, whose English equivalent is ‘millipede.’ ChatGPT and DeepSeek also translated شقري as ‘blond/fair-skinned’ and ‘chestnut-colored,’ respectively. Neither rendering reflects the dialectal meaning or Sana’ani Yemeni culture: شقري refers to a live cock, specifically when sold or bought in the market. Another term that has been mistranslated by both models is مدعي . ChatGPT translated it as ‘pretender/faker’ and DeepSeek as ‘claimant or plaintiff.’ Both translations are neither correct nor appropriate, as can be observed from contrasting them with the human translation, viz., ‘hookah.’ Another example is محواش , which manifests both YSA dialect and culture. As for dialect, the term محواش refers to a wooden tool used for mixing Aseed. However, the same tool is referred to as مجحي ‘majhi’ in the Ibbi dialect, for instance. Here lies the concept of dialectical difference. Additionally, this word may not belong to SA because, as far as we can tell, there is nothing called “Aseed” in SA. “Aseed” is a Yemeni dish; no other Arab country has it, and here lies the cultural peculiarity of the term “Aseed,” in general, and محواش , in particular. A further term reflecting Sana’ani dialect and culture is لَلْمه , which is a Sana’ani Arabic-specific term, and Sana’anis are sometimes mocked for using it by Yemenis from other regions. This term is an adverb, specifically a wh-word meaning ‘why.’ A final term that reflects both Sana’ani dialect and culture is مغمق . As discussed above, the term مغمق simply means ‘veil’; though the “veil” is used all over Yemen, مغمق has a specific (dialectical) cultural connotation. In Sana’a, the ‘veil’ is different from any ‘veil’ used in other Yemeni regions; it is a piece of cloth worn by women only in Sana’a governorate and some parts of Amran, which was part of Sana’a and has recently become an independent governorate.

6.2 Linguistic subtleties

In this category, we examine the ability of both models to capture linguistic nuances in terms of syntax and morphophonology/orthography. Regarding the former, there are four syntactic categories in our data: nouns, verbs, adjectives, and adverbs. The noun category includes most of the terms involved, covering subcategories such as Place, Clothes, Animals, Stuff, and Households. The latter is discussed in terms of the morphophonological/orthographic features that these terms involve.

6.2.1 Syntax

In our corpus, verbs, which comprise seven terms, seem to be the most difficult syntactic category for both models. For example, while ChatGPT translated three verbal terms (out of seven) correctly, viz., وخّر , نِبْدَع , and سير , DeepSeek has no correctly translated verbs. It has only one verb that was translated appropriately, namely سير . Additionally, the category best translated by both models seems to be nouns, although ChatGPT scores more correct translations than DeepSeek. While the former translated 15 nouns, one adjective, and one adverb correctly (as marked in black), the latter translated six nouns and only one adverb correctly.

6.2.2 Morphophonology/orthography

To correctly provide the dialectical representation of some words in our data, we provided diacritics, known in Arabic as harakat, to differentiate them from their SA equivalents or to mark the Sana’ani-specific peculiarities. These include َ , ِ , ُ , and ْ (fathah, kasrah, dammah, and sukun, respectively) (see e.g. Shormani, 2013). These harakat, except sukun, can be equated with the English short vowels a, i, and u, respectively. They are placed on letters to indicate how they should be pronounced. Words having these harakat include تـُتنْ , شِرْكِة , بَرْدِة , مُغْمُق and مَعْشَرَة . For example, in SA, we have the term نُبْدِع , which means ‘we innovate’ and which is different from the YSA نِبْدَع . Linguistically, the two words have different pronunciations. Most of these words were mistranslated by both models (Table 2). However, ChatGPT seems to capture the morphophonological/orthographic features of these words more than DeepSeek.
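
To illustrate why the harakat matter, the following minimal Python sketch builds the YSA form نِبْدَع and the SA form نُبْدِع from explicit Unicode code points and shows that, once the diacritics are removed, both collapse to the same consonant skeleton. The helper name `strip_harakat` is a hypothetical name chosen for illustration, and the sketch makes no claim about how ChatGPT or DeepSeek actually process diacritics internally.

```python
import unicodedata

# Unicode code points for the harakat named in the text.
FATHA, DAMMA, KASRA, SUKUN = "\u064E", "\u064F", "\u0650", "\u0652"
# Base letters shared by both words: nun, ba, dal, ain.
NUN, BA, DAL, AIN = "\u0646", "\u0628", "\u062F", "\u0639"

# YSA form glossed in the text as 'we start'.
ysa_word = NUN + KASRA + BA + SUKUN + DAL + FATHA + AIN
# SA form glossed in the text as 'we innovate'.
sa_word = NUN + DAMMA + BA + SUKUN + DAL + KASRA + AIN


def strip_harakat(word):
    """Drop combining marks (including harakat), keeping only base letters."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(ysa_word == sa_word)                                # differ with harakat
    print(strip_harakat(ysa_word) == strip_harakat(sa_word))  # same skeleton
```

Under these assumptions, input typed without harakat would give a reader, human or machine, no orthographic cue for choosing between the two readings.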

6.3 Contextual subtleties

The SL context in which a word is used is a crucial issue in translation, as it determines the intended meaning conveyed to the TL audience. The fact that a specific term is used in more than one context is signaled by the operator “or” or “/”. ChatGPT used “or” or “/” 39 times, while DeepSeek used “or” 47 times (Table 2). Considering this aspect alongside the (in)correctness of the translations provided by both models, there seem to be two contradictory findings: i) regardless of the (in)correctness of translations, both models seem to be “aware” of the context of terms, providing more than one translation for a considerable number of terms, that is, 39 vs. 47; in this respect, DeepSeek seems to capture context better than ChatGPT does; and ii) if, however, we consider the (in)correctness of the translations, ChatGPT captures dialectical and cultural nuances more than DeepSeek. The numbers of correct (and incorrect) translations by ChatGPT and DeepSeek indicate that ChatGPT demonstrates a stronger grasp of the cultural and dialectical contexts of YSA than DeepSeek, as it has 17 correct translations, while DeepSeek has only seven.
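
The operator count described above can be sketched in a few lines of Python. This is a minimal illustration that assumes each occurrence of a standalone “or” and each “/” is counted once; the text does not state the exact counting rule, so the function name `count_alternative_markers` and the rule itself are assumptions. Only a handful of cells from Table 2 (rows 1, 2, and 18) are used, so the sketch does not attempt to reproduce the reported totals of 39 and 47.

```python
import re

# Matches the standalone word "or", not "or" inside words such as "store".
OR_WORD = re.compile(r"\bor\b", flags=re.IGNORECASE)


def count_alternative_markers(translation):
    """Count standalone 'or' words plus '/' characters in one translation."""
    return len(OR_WORD.findall(translation)) + translation.count("/")


# Cells taken from Table 2, rows 1, 2, and 18.
chatgpt_sample = [
    "Hook (or a type of tool with a hooked end)",
    "Silo/Minaret/Monastery",
    "Male goat",
]
deepseek_sample = [
    "A type of fish (likely a local or dialectal term)",
    "Silo or minaret",
    "Male goat",
]

if __name__ == "__main__":
    print(sum(count_alternative_markers(t) for t in chatgpt_sample))
    print(sum(count_alternative_markers(t) for t in deepseek_sample))
```

Counting every marker, rather than counting each term at most once, is a design choice; a per-term count would only require replacing the sum with a count of translations whose marker count is non-zero.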

6.4 Standard Arabic biases

Given that both models’ incorrect translations outnumber their correct ones, although ChatGPT succeeds in capturing YSA nuances to some extent, it seems that both models are SA-biased. Both models often seem to default to SA or broad literal meanings (cf. Alwagieh & Shormani, 2024). For instance, both models translate دَيْمِة incorrectly. While DeepSeek translates it as ‘light rain or drizzle,’ ChatGPT translates it as ‘continuous rain,’ both missing the true meaning ‘kitchen.’ The term شقري , which means ‘cock’ in YSA, is mistranslated by DeepSeek as ‘chestnut-colored,’ a more generic SA meaning, perhaps from the SA term أشقر ‘blond’. ChatGPT makes a similar mistake with ‘blond/fair-skinned’. Both models thus retain a link to the color-based SA meaning, showing a clear SA bias. Another term showing SA bias is قوّى , translated as ‘strengthened/made stronger’ and ‘he strengthened or empowered’ by ChatGPT and DeepSeek, respectively. Rather than its correct dialectical meaning, ‘welcome,’ both models’ translations, though somewhat different, rely heavily on the SA meaning of this term. Note that the term قوّى can also mean ‘please’ in YSA, as reflected in the human translation. The SA term قوي ‘strong’ seems to influence both models’ translations, again indicating their SA bias. This SA bias could be ascribed to the training data; that is, both models appear to be trained predominantly on SA data.

7. Conclusions, recommendations and limitations

The findings demonstrate a clear disparity in the models’ abilities, with ChatGPT consistently outperforming DeepSeek in capturing the YSA-specific dialectical features. Several conclusions can be drawn from this study. First, dialectal and cultural subtleties pose considerable challenges. Both models struggle significantly with dialectal and culturally embedded terms, although ChatGPT demonstrates a better approximation of meaning in several instances. This is most evident in culturally loaded terms such as نِبْدَع and جَعالة , where ChatGPT translations are closer to the intended meanings, while DeepSeek often fails to capture cultural connotations. However, in several cases such as مدعي , شقري , محواش , and مغمق , both models fail to provide culturally and contextually appropriate translations, highlighting a systemic challenge in handling regionally specific lexicon. Difficulty is particularly centered around the “stuff” category, which includes highly dialectical and culturally specific items. Culture-based terms are reported as difficult to translate, even for advanced English students (see Alshawsh & Shormani, 2025). None of the five terms in this category were correctly translated by either model, emphasizing a significant gap in the ability of current AI models to handle deeply localized cultural elements (cf. Lilli, 2023; Datta, 2023). Second, ChatGPT outperforms DeepSeek in linguistic subtleties. For example, in terms of syntactic categorization, both models struggled most with verbs, which are often morphologically complex and context dependent. While ChatGPT translated three out of the seven verbs correctly, DeepSeek translated none correctly and produced only one acceptable rendering ( سير ), showing a stark contrast in performance. Third, ChatGPT outperforms DeepSeek concerning the morphophonological and orthographic distinctions. The use of diacritics (harakat) in the dataset helped distinguish YSA terms from their SA counterparts (e.g., نِبْدَع vs. نُبْدِع ), yet most of these marked terms were still mistranslated by both models. However, ChatGPT handles these distinctions with relatively higher accuracy, suggesting a more nuanced internalization of orthographic and morphophonological cues. Fourth, while both models exhibit some degree of contextual awareness, as seen in their use of multiple translation options (i.e., use of “or” and “/”; ChatGPT: 39 instances, DeepSeek: 47), this does not necessarily translate into correct translations. DeepSeek’s higher frequency of “or” usage suggests greater surface-level awareness of polysemy or ambiguity, but this does not correlate with translation accuracy. The final conclusion concerns the SA bias exhibited by both models. This is particularly visible in the mistranslations of terms such as قوّى , شقري , and صومعة , where both models defaulted to SA meanings that do not reflect the YSA usage. This SA bias can be attributed to the composition of the models’ training data, which are likely dominated by SA sources. Consequently, dialectical terms that deviate from SA norms are either misinterpreted or forcibly aligned with their SA counterparts, leading to semantic distortions and a lack of cultural and contextual fidelity.

Thus, these aspects require careful attention from the AI developers. AI developers should expand the dialectal data coverage in the training data to improve the performance of both AI models in translating dialects. Put simply, AI models, including ChatGPT and DeepSeek, should be trained on larger and more diverse linguistic datasets that include significant representations of nonstandard dialects. For example, in the Sana’ani dialect, data should be collected from various sources such as recorded conversations, social media content, and dialect-specific literature or oral history (see e.g. Morano et al., 2025). AI developers are advised to incorporate cultural and social contexts into training data. Many dialectal expressions are deeply rooted in the local culture and social norms. Therefore, AI models should be designed to consider these contexts to avoid literal or inaccurate translations that miss the implied meanings or connotations. Additionally, AI models, here ChatGPT and DeepSeek, should include or improve features that can automatically detect the dialect used in the input text. This would enable a more precise adaptation of translation strategies specific to the YSA dialect (or other dialects across several languages), improving the overall output quality. Finally, AI developers should consider enabling community- or researcher-led fine-tuning of models in niche dialects. They could also encourage (and perhaps fund) studies that tackle dialectical varieties (see also Kadaoui et al., 2023).

However, this study has some limitations. The first limitation is that it focuses exclusively on the Sana’ani dialect. While this dialect is widely spoken, the results may not generalize to other Yemeni dialects, such as Ibbi, Adeni, Hadhrami, or Arabic dialects from other regions, which may exhibit distinct lexical, morphophonological, or syntactic features. The second limitation concerns the sample of the dialectal terms and phrases involved. It was relatively limited in size. A larger and more diverse dataset can improve the reliability of the findings and better reflect the full linguistic complexity of the dialect. Third, the study involved only two AI translation models: ChatGPT and DeepSeek. While these models are prominent and state-of-the-art, other platforms, such as Grok 3, Felo, and Meta, could also be used in future studies to widen the breadth of comparative analysis. Finally, AI models such as ChatGPT and DeepSeek are frequently updated; hence, the performance observed during this study may not remain static over time, potentially affecting the reproducibility or relevance of the results for future users.

Ethics and consent

No ethics and consent statements are required for this study.

Version 1
VERSION 1 PUBLISHED 16 Jul 2025
How to cite this article
Shormani MQ and Al-Samki AA. Translating dialects between ChatGPT and DeepSeek: Yemeni Sana’ani Arabic terms as a case-in-point [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:694 (https://doi.org/10.12688/f1000research.165879.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
