Paper ID: 2025EDP7098
Tibetan text recognition plays a key role in preserving the Tibetan language, religion, and traditions. While text recognition has progressed for high-resource languages, handwritten Tibetan character recognition remains difficult because data are limited and public large language models are lacking. Most existing datasets cover printed or historical documents or online handwriting, and large offline handwritten Tibetan datasets remain scarce. To address this gap, we construct TibHCR, a large-scale offline handwritten character recognition dataset for the Tibetan language. To diversify the linguistic content and writing styles, we include more character categories and recruit participants from five provinces in China. To collect and label the data efficiently, we introduce a grid sheet design that reduces manual annotation to just 1% of the samples. The design then allows each character sample and its corresponding label to be extracted automatically. The resulting TibHCR dataset contains 141,698 samples from 235 Tibetan writers, covering 47 character classes. We evaluate TibHCR with two recognition models: a convolutional recurrent neural network (CRNN) and a cross-lingual fine-tuning method that adapts a Chinese pretrained model with the PP-OCRv4 architecture to Tibetan data. The results show that both models recognize handwritten Tibetan characters effectively, with accuracies of 99.48% for CRNN and 99.70% for the fine-tuning method. The TibHCR dataset is publicly available at https://huggingface.co/datasets/qixiaoke/TibHCR.
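The automatic extraction enabled by the grid sheet can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `split_grid_sheet`, the equal-sized cells, the assumption that the scan is already deskewed and cropped to the grid, and the convention that every cell in a row holds the same character class are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # grayscale pixels, row-major


def split_grid_sheet(
    sheet: Image,
    n_rows: int,
    n_cols: int,
    row_labels: Sequence[str],
) -> List[Tuple[str, Image]]:
    """Cut a scanned grid sheet into (label, cell_image) pairs.

    Assumptions (not from the paper): the sheet is already deskewed and
    cropped to the grid, all cells have equal size, and every cell in a
    row holds the character class given by row_labels for that row.
    """
    if len(row_labels) != n_rows:
        raise ValueError("need exactly one label per grid row")
    height, width = len(sheet), len(sheet[0])
    if height % n_rows or width % n_cols:
        raise ValueError("sheet size must be a multiple of the grid size")
    cell_h, cell_w = height // n_rows, width // n_cols

    samples = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Slice out the pixel block belonging to cell (r, c).
            cell = [
                row[c * cell_w:(c + 1) * cell_w]
                for row in sheet[r * cell_h:(r + 1) * cell_h]
            ]
            # The label comes from the sheet template, not from a human.
            samples.append((row_labels[r], cell))
    return samples


# Toy 4x6 "scan" holding a 2x3 grid of 2x2-pixel cells.
sheet = [[10 * y + x for x in range(6)] for y in range(4)]
samples = split_grid_sheet(sheet, n_rows=2, n_cols=3, row_labels=["ka", "kha"])
```

Under these assumptions, labels are inherited from the sheet template, so a human only needs to check a small fraction of the cells, which is the general idea behind the reduced annotation effort described above.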