F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.180408.1

Data Note

Articles

Human-Reviewed Uzbek Legal Named Entity Recognition Dataset

[version 1; peer review: awaiting peer review]

Bobur Saidov (a, 1): Conceptualization, Formal Analysis, Funding Acquisition, Investigation, Methodology, Project Administration, Software, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing. https://orcid.org/0009-0000-5540-2013

Zarnigor Fayzullaeva (2): Data Curation, Formal Analysis, Resources, Writing – Original Draft Preparation, Writing – Review & Editing.

Umida Bazarova (3): Data Curation, Resources, Writing – Original Draft Preparation, Writing – Review & Editing.

Gulnoza Narkabilova (4): Data Curation, Resources, Writing – Original Draft Preparation, Writing – Review & Editing.

Nasiba Azizova (5): Data Curation, Resources, Validation, Writing – Original Draft Preparation, Writing – Review & Editing. https://orcid.org/0000-0001-8579-197X

Feruzakhon Rustamova (6): Data Curation, Resources, Writing – Review & Editing.

Firuza Halimova (7): Data Curation, Resources.

1 Urgench State University named after Abu Rayhan Biruni, Urgench, Khorezm Province, Uzbekistan
2 Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent, Tashkent Province, Uzbekistan
3 Navoi State University, Navoi, Uzbekistan
4 Fergana State University, Fergana, Uzbekistan
5 Karshi State University, Qarshi, Kashkadarya Province, Uzbekistan
6 Andijan State Institute of Foreign Languages, Andijan, Uzbekistan
7 Samarkand State Institute of Foreign Languages, Samarkand, Uzbekistan

a Corresponding author: saidovboburbek9629@gmail.com

No competing interests were disclosed.

10 6 2026

2026

909

23 5 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article describes a human-reviewed Uzbek legal-domain named entity recognition (NER) dataset developed as a reusable resource for low-resource legal NLP. The release contains 12 entity categories: PER, ORG, LOC, DATE, MONEY, POSITION, DOCNO, LAW, COURT, BANK, TIN, and CADASTRE. The dataset is provided in XLSX, CSV, JSON, and JSONL formats and is structured into two complementary layers: a core subset of manually reviewable source-grounded records and an extended augmented subset used to support lower-frequency labels in training-oriented settings. The package also includes supporting documentation, split guidance, a data dictionary, and review-related metadata, including provenance, verification status, and quality flags. Character-level start and end offsets are included where recoverable. The release is intended to facilitate Uzbek legal NER research, resource curation, and transparent reuse under provenance-aware conditions.

Uzbek language; legal named entity recognition; legal NLP; low-resource NLP; dataset; information extraction; synthetic augmentation; sequence labeling.

The author(s) declared that no grants were involved in supporting this work.

Introduction

Named entity recognition (NER) is an important task in natural language processing and information extraction, especially in domains where entities carry legal, administrative, and operational value. In Uzbek legal and quasi-legal texts, entities such as persons, organizations, locations, dates, monetary amounts, document identifiers, legal references, courts, banks, tax identifiers, and cadastral identifiers are important for document understanding, indexing, retrieval, and downstream language technology applications. ¹ However, Uzbek remains a low-resource language in legal-domain NER, and publicly reusable resources for this setting are still limited in both label coverage and release design. ²

The dataset described in this article was prepared as a reusable Uzbek legal-domain NER resource intended to support data curation, controlled reuse, and future resource development. ³ The release covers 12 entity categories: PER, ORG, LOC, DATE, MONEY, POSITION, DOCNO, LAW, COURT, BANK, TIN, and CADASTRE. In addition to the data records themselves, the package includes supporting documentation, provenance-aware metadata, review-related status fields, and synchronized exports in XLSX, CSV, JSON, and JSONL formats. Character-level start and end offsets are included where recoverable. ⁴

A central feature of the release is its layered structure. The package distinguishes between a core subset of manually reviewable source-grounded records and an extended augmented subset intended for training support in lower-frequency labels. ⁵ This separation was introduced to improve transparency and to make it easier for future users to distinguish source-grounded material from synthetic support data. ⁶ The release should therefore be interpreted as a human-reviewed, gold-ready resource rather than as a fully finalized gold-standard benchmark. ⁷

The present article focuses on describing the dataset, its construction logic, package organization, and validation-oriented release structure. The resource is intended to support Uzbek legal NER research, dataset organization, and transparent reuse under provenance-aware conditions.

Materials and methods

Dataset design and scope

The dataset was designed as a reusable Uzbek legal-domain named entity recognition (NER) resource for low-resource information extraction research. ⁸ The release covers 12 entity categories: PER, ORG, LOC, DATE, MONEY, POSITION, DOCNO, LAW, COURT, BANK, TIN, and CADASTRE. The main goal of the construction process was to create a provenance-aware, human-reviewable resource that supports dataset curation, transparent reuse, and training in Uzbek legal NLP. ⁹

The release was organized as a multi-layer package rather than as a single flat table. In particular, the dataset distinguishes between:

1. A core benchmark-oriented subset containing the most reviewable and source-grounded records, and

2. An extended augmented subset containing additional examples reserved for training support, especially for lower-frequency labels. ¹⁰

This layered design was adopted to preserve methodological transparency and to prevent benchmark-oriented records from being mixed with augmentation-oriented material without explicit provenance tracking.

Data sources

The dataset was compiled from Uzbek legal and quasi-legal texts collected from publicly accessible and reusable sources. Source selection was guided by legal relevance, practical reusability, and the need to cover both standard entity classes (such as persons, organizations, and locations) and legal-administrative entity classes (such as document numbers, legal references, tax identifiers, and cadastral identifiers). ¹¹

Because the target schema includes specialized labels that are not uniformly represented in public texts, the collection process was label-aware. High-frequency entity classes such as PER, ORG, and LOC were gathered from broader institutional and formal texts, whereas lower-frequency and domain-specific classes such as BANK, COURT, TIN, and CADASTRE required more targeted retrieval. ¹² Source-level provenance was preserved wherever possible through metadata fields such as Source_Name, Source_URL, Data_Origin, and Source_Group.

Record assembly and preprocessing

After source collection, candidate records were assembled in a label-wise manner rather than through a single uniform pipeline. Intermediate label-specific tables were created first and then merged into a unified release structure. ¹³ This approach made it possible to monitor class coverage, identify low-resource labels early, and perform targeted refinement where required.

During preprocessing, the dataset underwent several harmonization steps:

• Alignment of label-wise tables into a common schema;
• Sentence cleaning and normalization;
• Standardization of entity-bearing fields;
• Preservation of provenance metadata;
• Integration of review-related fields;
• Consolidation of repeated or redundant rows.
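Two of the harmonization steps above, sentence normalization and consolidation of repeated rows, can be illustrated with a minimal sketch. The normalization rules (Unicode NFC plus whitespace collapsing) and the deduplication key are assumptions chosen for illustration; the release's actual rules are documented in the Zenodo package and may differ. The sample rows are invented.

```python
import re
import unicodedata


def normalize_sentence(text: str) -> str:
    """Apply Unicode NFC normalization and collapse whitespace runs (assumed rules)."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def consolidate(rows):
    """Keep the first row for each (Sentence, Extracted_Text, Label) key (assumed key)."""
    seen = set()
    kept = []
    for row in rows:
        cleaned = dict(row, Sentence=normalize_sentence(row["Sentence"]))
        key = (cleaned["Sentence"], cleaned["Extracted_Text"], cleaned["Label"])
        if key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept


# Invented rows that differ only in whitespace.
rows = [
    {"Sentence": "To'lov  Agrobank orqali amalga oshirildi.",
     "Extracted_Text": "Agrobank", "Label": "BANK"},
    {"Sentence": "To'lov Agrobank orqali amalga oshirildi. ",
     "Extracted_Text": "Agrobank", "Label": "BANK"},
]
unique_rows = consolidate(rows)
print(len(unique_rows))
```

Because normalization can shift character positions, a pipeline of this kind would need to compute Start_Char and End_Char after normalization rather than before it.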

Each record in the final package was represented as a row containing a text context (Sentence), an associated entity mention (Extracted_Text), a label (Label), and supporting provenance, verification, and usage metadata. Whenever recoverable, character-level offsets (Start_Char and End_Char) were also included to facilitate later conversion into stricter span-based sequence-labeling formats. The principal record-level fields included in the released dataset are summarized in Table 1. These fields describe not only the text and entity content of each record, but also its provenance, review status, and intended use within the release structure.

Table 1. Main fields provided in each dataset record.

Field	Type	Description
Record_ID	string	unique record identifier
Sentence	string	sentence or text snippet
Extracted_Text	string	target entity mention
Label	string	assigned entity category
Start_Char	integer/null	start offset of entity
End_Char	integer/null	end offset of entity
Source_Name	string	source identifier
Source_URL	string/null	source link when available
Data_Origin	string	provenance tag
Source_Group	string	open-source or synthetic
Is_Verified	string	verification indicator
Verification_Status	string	current review state
Gold_Status	string	benchmark role
Quality_Flag	string	warning/quality note
Recommended_Split	string	suggested subset assignment

The table summarizes the principal text, provenance, review, and release-management fields included in the dataset package.
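As an illustration of how the offset fields could support conversion into a sequence-labeling format, the sketch below turns one record into BIO-tagged tokens. It assumes that End_Char is exclusive (Python slice convention), that each row carries a single entity mention, and that whitespace tokenization is acceptable; none of these points is confirmed by the release documentation quoted here. The record itself is invented.

```python
import re


def to_bio(record):
    """Return (token, tag) pairs for one record, or None if offsets are missing."""
    sentence = record["Sentence"]
    start, end = record["Start_Char"], record["End_Char"]
    if start is None or end is None:
        return None  # offsets were not recoverable for this row
    if sentence[start:end] != record["Extracted_Text"]:
        raise ValueError("offsets do not match Extracted_Text")
    pairs = []
    inside = False
    for match in re.finditer(r"\S+", sentence):
        # A token belongs to the entity if its character range overlaps the span.
        if match.start() < end and match.end() > start:
            tag = ("I-" if inside else "B-") + record["Label"]
            inside = True
        else:
            tag = "O"
        pairs.append((match.group(), tag))
    return pairs


# Invented example record using the field names from Table 1.
record = {
    "Record_ID": "demo-0001",
    "Sentence": "Shartnoma 2024-yil 5-martda imzolandi.",
    "Extracted_Text": "2024-yil 5-martda",
    "Label": "DATE",
    "Start_Char": 10,
    "End_Char": 27,
}
for token, tag in to_bio(record):
    print(token, tag)
```

Rows whose offsets are null return None here, so a user can decide separately whether to skip them or to localize the span first.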

Annotation schema

The dataset uses a 12-class entity schema designed specifically for Uzbek legal and quasi-legal documents. The selected labels were intended to balance practical legal relevance with annotation interpretability. In addition to general-domain categories such as PER, ORG, LOC, and DATE, the schema includes legal-administrative categories such as DOCNO, LAW, COURT, BANK, TIN, and CADASTRE, which are important for legal document analysis, structured extraction, and retrieval-oriented NLP applications.

To reduce ambiguity, the release documentation distinguishes between labels that may be superficially similar but functionally different in legal texts, such as ORG vs BANK, ORG vs COURT, DOCNO vs TIN, and DOCNO vs CADASTRE. ¹⁴ The full set of entity categories included in the release is summarized in Table 2. This overview clarifies the scope of the annotation schema and highlights the legal-domain relevance of the selected labels.

Table 2. Entity labels included in the dataset.

Label	Description	Typical legal use
PER	Person names	parties, signatories, representatives
ORG	Organizations and institutions	companies, agencies, ministries
LOC	Locations and administrative places	cities, regions, districts
DATE	Explicit date expressions	agreement dates, deadlines
MONEY	Monetary amounts	payments, contract values
POSITION	Official positions and roles	director, manager
DOCNO	Document identifiers	contract numbers, decree IDs
LAW	Legal references	laws, codes, regulations
COURT	Judicial institutions	court names
BANK	Banking institutions	payment banks
TIN	Tax/personal identifiers	STIR, INN, JSHSHIR
CADASTRE	Property/cadastral identifiers	cadastral numbers

The table lists the 12 named entity categories included in the release together with their general interpretation and typical legal-domain use. Abbreviations: PER, person; ORG, organization; LOC, location; DOCNO, document number; LAW, legal reference; COURT, judicial institution; BANK, banking institution; TIN, tax identification number; CADASTRE, cadastral identifier.

Review and validation workflow

The release was prepared through a review-oriented refinement workflow rather than as a fully finalized gold-standard benchmark. After merging and harmonization, the records were screened using a staged quality-control procedure. Priority was given to:

• Rows with missing extracted entity text;
• Lower-frequency and structurally sensitive labels;
• Rows with incomplete provenance;
• Rows reserved for augmentation-oriented use.

Review decisions were recorded explicitly using categories such as keep, edit, drop, and move_to_augmented_only. The release also preserves row-level review metadata through fields such as Is_Verified, Verification_Status, Gold_Status, Eligibility, and Quality_Flag. This design makes the current version more transparent and easier to refine collaboratively in future iterations. ¹⁵
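A minimal sketch of how such review decisions might be applied to a set of rows is given below. The four action names come from the text above; the shape of the decisions mapping, the "changes" key, and the default of keeping rows without a recorded decision are hypothetical choices made for illustration.

```python
def apply_review(rows, decisions):
    """Apply per-record review decisions and return (core, augmented_only) lists."""
    core, augmented_only = [], []
    for row in rows:
        # Assumption: rows without a recorded decision are kept unchanged.
        decision = decisions.get(row["Record_ID"], {"action": "keep"})
        action = decision["action"]
        if action == "drop":
            continue
        if action == "edit":
            row = {**row, **decision["changes"]}  # copy, leaving the input untouched
        if action == "move_to_augmented_only":
            augmented_only.append(row)
        elif action in ("keep", "edit"):
            core.append(row)
        else:
            raise ValueError(f"unknown review action: {action}")
    return core, augmented_only


# Invented rows and decisions.
rows = [
    {"Record_ID": "r1", "Label": "PER"},
    {"Record_ID": "r2", "Label": "ORG"},
    {"Record_ID": "r3", "Label": "LOC"},
    {"Record_ID": "r4", "Label": "BANK"},
]
decisions = {
    "r2": {"action": "edit", "changes": {"Label": "BANK"}},
    "r3": {"action": "drop"},
    "r4": {"action": "move_to_augmented_only"},
}
core, augmented_only = apply_review(rows, decisions)
print([r["Record_ID"] for r in core], [r["Record_ID"] for r in augmented_only])
```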

The present release should therefore be interpreted as a human-reviewed, gold-ready resource, rather than as a fully adjudicated final gold benchmark.

Synthetic augmentation

To support labels with limited naturally available coverage in public Uzbek legal texts, a controlled synthetic augmentation strategy was applied selectively. Synthetic rows were introduced only for lower-frequency classes where source-grounded examples were insufficient for practical experimental support. The augmentation process followed a template-based generation strategy, designed to preserve legal-domain plausibility and label clarity. ¹⁶

Synthetic records were explicitly marked in the metadata and were kept separate from benchmark-oriented material. These rows are intended for training support only and should not be treated as equivalent to manually reviewable source-grounded examples in evaluation settings.
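The general idea of template-based generation with explicit provenance marking can be sketched as follows. The templates, filler values, identifier scheme, and the "synthetic" value of Source_Group are all hypothetical; they do not reproduce the templates actually used for the release, and the cadastral number is a made-up placeholder rather than a valid identifier.

```python
import random

# Hypothetical templates and fillers, for illustration only.
TEMPLATES = {
    "BANK": ["To'lov {entity} orqali amalga oshirildi."],
    "CADASTRE": ["Kadastr raqami: {entity}."],
}
FILLERS = {
    "BANK": ["Agrobank", "Asakabank"],
    "CADASTRE": ["10:01:02:03:0045"],  # made-up placeholder value
}


def generate(label, rng, index):
    """Fill one template and record offsets plus an explicit synthetic marker."""
    template = rng.choice(TEMPLATES[label])
    entity = rng.choice(FILLERS[label])
    start = len(template.split("{entity}")[0])  # text before the slot
    return {
        "Record_ID": f"synthetic-{label}-{index:04d}",
        "Sentence": template.format(entity=entity),
        "Extracted_Text": entity,
        "Label": label,
        "Start_Char": start,
        "End_Char": start + len(entity),  # exclusive end, by assumption
        "Source_Group": "synthetic",
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic_rows = [generate("BANK", rng, i) for i in range(3)]
synthetic_rows.append(generate("CADASTRE", rng, 3))
for row in synthetic_rows:
    print(row["Record_ID"], row["Sentence"])
```

Since the entity is inserted into a known slot, its offsets are known by construction, which is one reason template-generated rows can always carry Start_Char and End_Char.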

Export formats and package structure

The final dataset was released in four synchronized formats: XLSX, CSV, JSON, and JSONL. These formats were chosen to support different user needs:

• XLSX for manual inspection and metadata review,
• CSV for tabular processing and descriptive analysis,
• JSON for structured record-based storage,
• JSONL for NLP pipelines and line-based machine processing.

In addition to the data files, the package includes supporting documentation such as a README, data dictionary, split description, known limitations, citation file, changelog, and license information. The main components of the released package and their practical roles are summarized in Table 3. This table makes the multi-format structure of the release explicit and shows how the package supports both manual inspection and machine-readable reuse.

Table 3. Main files and formats included in the released dataset package.

Component	Format	Purpose
Main dataset	XLSX	manual inspection and metadata review
Main dataset	CSV	tabular processing and statistics
Main dataset	JSON	structured record storage
Main dataset	JSONL	NLP pipelines and batch processing
README	TXT/MD	package overview
Data dictionary	CSV/TXT	field explanation
Split description	TXT	train/dev/test usage guidance
Changelog	TXT	version history
License	TXT	legal reuse conditions

The table summarizes the principal package components, their file formats, and their intended practical role in reuse. Abbreviations: XLSX, Microsoft Excel workbook; CSV, comma-separated values; JSON, JavaScript Object Notation; JSONL, line-delimited JSON; TXT, plain text; MD, Markdown.
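To illustrate what "synchronized" exports mean in practice, the following sketch serializes the same records to CSV, JSON, and JSONL using only the Python standard library and then checks that all three carry the same record identifiers in the same order. It is not the release's export script; the reduced field list and the sample record are assumptions, and XLSX is omitted because it would require a third-party library such as openpyxl.

```python
import csv
import io
import json

# Reduced field list for the sketch; the released files contain more columns.
FIELDS = ["Record_ID", "Sentence", "Extracted_Text", "Label", "Start_Char", "End_Char"]


def to_csv(records):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # None values are written as empty cells
    return buffer.getvalue()


def to_json(records):
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_jsonl(records):
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def ids_from_csv(text):
    return [row["Record_ID"] for row in csv.DictReader(io.StringIO(text))]


def ids_from_json(text):
    return [r["Record_ID"] for r in json.loads(text)]


def ids_from_jsonl(text):
    return [json.loads(line)["Record_ID"] for line in text.split("\n") if line]


# Invented records; the second has no recoverable offsets.
records = [
    {"Record_ID": "demo-0001", "Sentence": "Shartnoma 2024-yil 5-martda imzolandi.",
     "Extracted_Text": "2024-yil 5-martda", "Label": "DATE",
     "Start_Char": 10, "End_Char": 27},
    {"Record_ID": "demo-0002", "Sentence": "To'lov Agrobank orqali amalga oshirildi.",
     "Extracted_Text": "Agrobank", "Label": "BANK",
     "Start_Char": None, "End_Char": None},
]
id_lists = [
    ids_from_csv(to_csv(records)),
    ids_from_json(to_json(records)),
    ids_from_jsonl(to_jsonl(records)),
]
print(id_lists[0] == id_lists[1] == id_lists[2])
```

Note that CSV has no native null, so missing offsets come back as empty strings after a round trip, whereas JSON and JSONL preserve them as null.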

Software and reproducibility

The dataset preparation, harmonization, and export workflow was carried out using standard spreadsheet and scripting tools. Structured preprocessing and export operations were performed in Python 3.10 using pandas 1.5.3 and openpyxl together with the standard-library json and re (regular expressions) modules. The scripts and supporting files used for preprocessing, entity extraction, and formatting are included in the associated Zenodo repository: https://doi.org/10.5281/zenodo.19682709. ¹⁷

No task-specific model training was required for the generation of the released tabular package itself; however, semi-automatic processing steps such as field normalization, recoverable span localization, and export conversion were performed using the above software environment. Parameters that may affect reproducibility include text normalization rules, row filtering criteria, provenance-based subset separation, and the logic used to assign Recommended_Split, Gold_Status, and Verification_Status fields. ¹⁸ The overall dataset construction and release workflow is illustrated in Figure 1.
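One plausible reading of "recoverable span localization" is sketched below: an offset pair is assigned only when the entity mention occurs exactly once in the sentence, and is left null otherwise. This uniqueness rule and the exclusive end offset are assumptions for illustration; the logic actually used for the release is described in the repository scripts.

```python
def locate_span(sentence, mention):
    """Return (start, end) if the mention occurs exactly once, else (None, None)."""
    if not mention:
        return None, None
    first = sentence.find(mention)
    if first == -1:
        return None, None  # mention not present verbatim
    if sentence.find(mention, first + 1) != -1:
        return None, None  # more than one occurrence: treated as not recoverable
    return first, first + len(mention)


print(locate_span("Shartnoma 2024-yil 5-martda imzolandi.", "2024-yil 5-martda"))
print(locate_span("a b a", "a"))
```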

Figure 1. Overview of the dataset construction and release workflow.

The workflow proceeds from source collection and label-wise candidate gathering to record assembly and preprocessing, review and validation, provenance separation, creation of the core and augmented subsets, export in XLSX, CSV, JSON, and JSONL formats, and final release via Zenodo. Abbreviations: XLSX, Microsoft Excel workbook; CSV, comma-separated values; JSON, JavaScript Object Notation; JSONL, line-delimited JSON.

Dataset validation

The current release was validated through a staged quality-control and review workflow intended to improve structural consistency, provenance transparency, and practical reusability. Validation did not rely on a single binary accept/reject decision for the entire dataset; instead, it combined structural checks, review-oriented prioritization, and record-level status tracking. ¹⁹

At the structural level, the dataset was checked for consistency across the released XLSX, CSV, JSON, and JSONL formats. These checks included field alignment, label consistency, preservation of provenance metadata, and consistency of release-specific fields such as Gold_Status, Verification_Status, Eligibility, Recommended_Split, and Quality_Flag. Where possible, recoverable character-level offsets (Start_Char and End_Char) were retained to support later conversion into stricter span-based NER formats.
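Structural checks of this kind can be approximated by a small per-record validator. The sketch below uses the field names from Table 1 and the 12 labels from Table 2; the specific rules (both offsets set or both null, end offset exclusive, slice equal to Extracted_Text) are assumptions, and the placeholder metadata values are invented.

```python
LABELS = {"PER", "ORG", "LOC", "DATE", "MONEY", "POSITION",
          "DOCNO", "LAW", "COURT", "BANK", "TIN", "CADASTRE"}
REQUIRED = ["Record_ID", "Sentence", "Extracted_Text", "Label", "Start_Char",
            "End_Char", "Source_Name", "Source_URL", "Data_Origin", "Source_Group",
            "Is_Verified", "Verification_Status", "Gold_Status", "Quality_Flag",
            "Recommended_Split"]


def check_record(record):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    missing = [field for field in REQUIRED if field not in record]
    if missing:
        problems.append("missing fields: " + ", ".join(missing))
    label = record.get("Label")
    if label not in LABELS:
        problems.append(f"unknown label: {label!r}")
    start, end = record.get("Start_Char"), record.get("End_Char")
    if (start is None) != (end is None):
        problems.append("only one offset is set")
    elif start is not None:
        sentence = record.get("Sentence") or ""
        if not (0 <= start < end <= len(sentence)):
            problems.append("offsets out of range")
        elif sentence[start:end] != record.get("Extracted_Text"):
            problems.append("offsets do not match Extracted_Text")
    return problems


# Invented records: metadata fields hold placeholder strings.
good = {field: "placeholder" for field in REQUIRED}
good.update({
    "Sentence": "Shartnoma 2024-yil 5-martda imzolandi.",
    "Extracted_Text": "2024-yil 5-martda",
    "Label": "DATE",
    "Start_Char": 10,
    "End_Char": 27,
})
bad = dict(good, Label="DATES", End_Char=None)
print(check_record(good))
print(check_record(bad))
```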

At the record level, validation followed a review-oriented workflow. Priority was given to rows that were more likely to affect downstream benchmark quality, including records with missing extracted entity text, incomplete provenance, lower-frequency labels, and rows reserved for augmentation-oriented use. Review outcomes were tracked explicitly through categories such as keep, edit, drop, and move_to_augmented_only, together with supporting metadata fields such as Is_Verified, Verification_Status, Gold_Status, and Quality_Flag.

As shown in Table 4, the current release provides stronger coverage for high-frequency classes such as PER, ORG, and LOC, whereas lower-frequency classes such as BANK and COURT remain more limited and should therefore be interpreted with additional caution in benchmark-oriented settings.

Table 4. Number of records per entity label in the current release.

Label	Number of records
PER	2000
ORG	2000
LOC	1700
POSITION	1300
DATE	1200
MONEY	1200
DOCNO	1200
LAW	1200
TIN	800
CADASTRE	700
BANK	411
COURT	325
Total	14,036
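The counts in Table 4 can be used directly to quantify the class imbalance. The short sketch below recomputes the total, the share of each label, and the ratio between the largest and smallest class; the inverse-frequency weights at the end are one common heuristic for training and are not a recommendation made by the release.

```python
# Record counts copied from Table 4.
counts = {
    "PER": 2000, "ORG": 2000, "LOC": 1700, "POSITION": 1300,
    "DATE": 1200, "MONEY": 1200, "DOCNO": 1200, "LAW": 1200,
    "TIN": 800, "CADASTRE": 700, "BANK": 411, "COURT": 325,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency class weights (illustrative heuristic only).
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(total)  # 14036, matching the Total row of Table 4
for label in sorted(counts, key=counts.get, reverse=True):
    print(f"{label:9s} {counts[label]:5d} {shares[label]:6.1%} weight={weights[label]:.2f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```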

A further validation principle was explicit provenance separation. The release distinguishes between source-grounded open-source rows and synthetic augmentation rows, and this distinction was preserved through record-level metadata. Synthetic rows were not treated as equivalent to benchmark-oriented source-grounded examples and were retained only for augmentation-aware training use. This separation improves interpretability and reduces the risk of unintentionally mixing evaluation material with training support data.

The released package should therefore be interpreted as a human-reviewed, gold-ready dataset resource rather than as a fully finalized gold-standard benchmark. Its main strengths are its multi-format release design, explicit provenance metadata, and review-aware structure. These features make the dataset immediately usable for exploratory analysis, dataset refinement, and controlled training use under provenance-aware conditions.

At the same time, several limitations should be noted. First, not all records have undergone exhaustive final manual span adjudication. This is especially relevant for label pairs with higher ambiguity potential, such as ORG vs BANK, ORG vs COURT, DOCNO vs TIN, and DOCNO vs CADASTRE. Second, the current release is class-imbalanced, with stronger coverage for higher-frequency classes such as PER, ORG, and LOC, and more limited coverage for lower-frequency classes such as BANK and COURT. Third, the package contains both source-grounded and synthetic material with different evidential status, which requires provenance-aware filtering depending on the intended use case.

For conservative benchmark-oriented evaluation, users should prioritize the most reliable source-grounded rows and apply additional manual verification where necessary. For training-oriented experiments, the augmented subset may also be used, provided that the inclusion of synthetic support is reported explicitly.
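The two usage modes described above can be expressed as a simple provenance-aware filter. In this sketch the field names follow Table 1, but the concrete values ("synthetic", "open-source", "yes") and the decision to require Is_Verified for evaluation are assumptions; users should check the data dictionary in the Zenodo package for the actual value sets.

```python
def select_rows(rows, purpose):
    """Select rows for 'evaluation' (source-grounded, verified) or 'training' (all)."""
    if purpose == "evaluation":
        return [r for r in rows
                if r["Source_Group"] != "synthetic" and r["Is_Verified"] == "yes"]
    if purpose == "training":
        return list(rows)
    raise ValueError(f"unknown purpose: {purpose}")


def synthetic_share(rows):
    """Fraction of synthetic rows, so that their inclusion can be reported."""
    if not rows:
        return 0.0
    return sum(r["Source_Group"] == "synthetic" for r in rows) / len(rows)


# Invented rows with assumed field values.
rows = [
    {"Record_ID": "a", "Source_Group": "open-source", "Is_Verified": "yes"},
    {"Record_ID": "b", "Source_Group": "open-source", "Is_Verified": "no"},
    {"Record_ID": "c", "Source_Group": "synthetic", "Is_Verified": "yes"},
    {"Record_ID": "d", "Source_Group": "synthetic", "Is_Verified": "no"},
]
evaluation_rows = select_rows(rows, "evaluation")
training_rows = select_rows(rows, "training")
print([r["Record_ID"] for r in evaluation_rows])
print(f"synthetic share in training set: {synthetic_share(training_rows):.0%}")
```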

Ethical considerations

This work did not involve animal experiments or direct human-subject research. The dataset was compiled from publicly accessible and reusable Uzbek legal and quasi-legal textual materials, together with controlled synthetic augmentation used for training support in lower-frequency labels. No direct participant recruitment, intervention, or experimental data collection was conducted. Therefore, ethical approval and informed consent were not required.

The release was prepared for research reuse with attention to provenance, documentation, and transparency. Users of the dataset are expected to follow applicable legal, ethical, and institutional requirements when handling identifier-like fields and other potentially sensitive legal-domain information.

Data availability

The dataset is publicly available via Zenodo: https://doi.org/10.5281/zenodo.19682709. ²⁰ The Zenodo record includes the released dataset files, supporting documentation, and release metadata required to interpret and reuse the package. Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Extended data

Extended data associated with this article are available in the same Zenodo repository: https://doi.org/10.5281/zenodo.19682709. ²⁰ These materials include the README file, annotation guidelines, label definitions and boundary rules, preprocessing and script notes, data dictionary, split guidance, changelog, license information, and supporting documentation describing provenance, verification, and package interpretation. These files are provided to support transparent reuse and correct interpretation of the dataset.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Acknowledgements

The authors would like to thank the colleagues and reviewers who contributed to the review, organization, and refinement of the dataset and its documentation. Their feedback helped improve the structure and clarity of the released resource. During the preparation of this manuscript, the authors used ChatGPT (GPT-5.2, OpenAI) only for grammar and spelling checks and not for study design, data collection, data labeling, model training, statistical analysis, or interpretation of results. All scientific content, analyses, and conclusions were produced and verified by the authors, who take full responsibility for the publication.

References

1. Saidov, Barakhnin: Sentiment analysis of Uzbek texts using NER: A comparative study of SVM, LSTM, and BERT models. The Herald of the Siberian State University of Telecommunications and Information Science. 2025;19:3–17. doi:10.55648/1998-6920-2025-19-4-3-16

2. Abdullaeva, Khamidov, Iskandarov: Lexical resources and named entity recognition for low-resource languages: A comparative study. Int J Comput Linguist Appl. 2025;16:45–60.

3. Bakhtiyarov, Zokirov, Gaybullaev: Neural architectures for entity-aware sentiment analysis in multilingual corpora. Appl. Artif. Intell. 2025;39:412–428. doi:10.1080/08839514.2025.2345612

4. Panjiyeva, Begmatov, Djalilova: Knowledge-based and neural hybrid models for named entity recognition in educational texts. Educ. Inf. Technol. 2025;30:5011–5030. doi:10.1007/s10639-025-11890-3

5. Jumaniyozova, Ravikumar, Aarthi: Cross-lingual transfer learning for sentiment and entity recognition in low-resource settings. ACM Trans Asian Low-Resour Lang Inf Process. 2025;24:1–23. doi:10.1145/3678912

6. Mamadiyarov, Ngongo, Buriev: Hybrid deep learning models for multilingual sentiment and entity extraction. J Artif Intell Soft Comput Res. 2025;15:201–214.

7. Saidov, Barakhnin, Saparbaev: A hybrid NER–sentiment model for Uzbek texts: Integrating lexical, deep learning, and entity-based approaches. Big Data Cogn. Comput. 2026;10:92. doi:10.3390/bdcc10030092

8. Mengliev, Barakhnin, Abdurakhmonova: Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation. Data Brief. 2024;54:110413. PMID: 38708296. doi:10.1016/j.dib.2024.110413. PMCID: PMC11067374

9. Mengliev, Barakhnin, Eshkulov: A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language. Data Brief. 2025;58:111249. PMID: 39811531. doi:10.1016/j.dib.2024.111249. PMCID: PMC11732609

10. Mengliev: Dataset of Named Entity Recognition for Uzbek language. Mendeley Data. 2024:V1. doi:10.17632/xf7pyvhb2v.1

11. Yusufu, Jiang, Ainiwaer: UZNER: A Benchmark for Named Entity Recognition in Uzbek. Natural Language Processing and Chinese Computing. Cham, Switzerland: Springer; 2023; pp. 171–183. doi:10.1007/978-3-031-44693-1_14

12. Au TWT, Lampos, Cox: E-NER—An annotated named entity recognition corpus of legal text. Proceedings of the Natural Legal Language Processing Workshop 2022. Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics; 2022; pp. 246–255. doi:10.18653/v1/2022.nllp-1.22

13. Păiș, Mitrofan, Gasan: LegalNERo: A linked corpus for named entity recognition in the Romanian legal domain. Semant Web. 2024;15:831–844. doi:10.3233/SW-233351

14. Yulianti, Bhary, Abdurrohman: Named entity recognition on Indonesian legal documents: A dataset and study using transformer-based models. International Journal of Electrical and Computer Engineering (IJECE). 2024;14:5489–5501. doi:10.11591/ijece.v14i5.pp5489-5501

15. Ullah, Gelbukh, Zamir: Enhancement of named entity recognition in low-resource languages with data augmentation and BERT models: A case study on Urdu. Computers. 2024;13:258. doi:10.3390/computers13100258

16. Chen, Pei: Low-resource named entity recognition via the pre-training model. Symmetry. 2021;13:786. doi:10.3390/sym13050786

17. Saidov, Fayzullaeva, Bazarova: Uzbek Legal NER Dataset Package: A Gold-Ready Multi-Format Resource with Core Gold and Extended Augmented Layers. Zenodo. 2026. doi:10.5281/zenodo.19682709

18. Torge, Politov, Lehmann: Named entity recognition for low-resource languages—Profiting from language families. Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Dubrovnik, Croatia: Association for Computational Linguistics; 2023; pp. 1–10. doi:10.18653/v1/2023.bsnlp-1.1

19. Bahad, Mishra, Krishnamurthy: Fine-tuning pre-trained named entity recognition models for Indian languages. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop). Mexico City, Mexico: Association for Computational Linguistics; 2024; pp. 75–82. doi:10.18653/v1/2024.naacl-srw.9

20. Saidov, Fayzullaeva, Bazarova: Uzbek Legal NER Dataset Package: A Gold-Ready Multi-Format Resource with Core Gold and Extended Augmented Layers. Zenodo. 2026. doi:10.5281/zenodo.19682709