The Bi-Lingual Annotation Text Image Dataset (BLATID)
The Bi-Lingual Annotation Text Image Dataset (BLATID) was
built for the research scenario in which a text image in one language is
recognized and transcribed into text of another language, also known as
cross-lingual text image recognition (CLTIR).
The CASIA-BLATID dataset contains Chinese text images, each annotated with
both the Chinese transcript and its English translation.
In building the dataset, we took advantage of an existing machine translation
corpus to save labeling effort. The AI Challenger dataset
(AIC, https://github.com/AIChallenger/AI_Challenger_2018) consists of about 12M
Chinese-English sentence pairs for training and 7.8K for validation. We
synthesized Chinese text images from the sentences in the AIC corpus. The
generated dataset thus contains triplets of "Chinese text image - Chinese text
label - English text label".
We generated 1M triplets for training the text recognition model. Considering
the difficulty of collecting text images in practice, a scale of 1M text images
is sufficiently large. In addition, to increase the diversity of the validation
set, we used the 7.8K sentence pairs of the AIC validation corpus to generate
100K triplets with varied graphic configurations.
To evaluate CLTIR performance in a more practical scenario, the test set was
derived from movies and their bilingual subtitles. We collected English-Chinese
bilingual subtitles from 50 English animated films, where the Chinese sentences
are rendered on the corresponding frames according to the timestamps. There are
55K samples in the test set in total.
In summary, the BLATID dataset has three
subsets:
Training set (TImage_train): 1M synthesized text line images with Chinese and
English translated text labels
Validation set (TImage_valid): 100K synthesized text line images with Chinese
and English translated text labels
Test set (subtitle_test): 55K text line images (movie subtitles) with Chinese
and English translated text labels
In addition, we also provide a copy of the AIC text corpus for possible use
(e.g., training an auxiliary text translation model to assist the CLTIR
model).
All the above datasets are stored in LMDB format (Lightning Memory-Mapped
Database, commonly accessed from Python), each consisting of a data file and a
lock file.
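The files can be opened with the lmdb Python package. The following is a
minimal reading sketch in which the key names ("image-%09d", "label-%09d",
"trans-%09d") and the example path are assumptions made for illustration; the
provided read_lmdb.py defines the actual key layout.

import io
import lmdb                     # pip install lmdb
from PIL import Image

def read_sample(lmdb_path, index):
    """Read one (image, Chinese label, English label) triplet from an LMDB folder.

    The key scheme below is an assumption for illustration; see the provided
    read_lmdb.py for the dataset's actual keys.
    """
    env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
    with env.begin(write=False) as txn:
        img_bytes = txn.get(f"image-{index:09d}".encode())
        zh_label = txn.get(f"label-{index:09d}".encode()).decode("utf-8")
        en_label = txn.get(f"trans-{index:09d}".encode()).decode("utf-8")
    env.close()
    image = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    return image, zh_label, en_label

# Example usage (path and index are illustrative):
# image, zh, en = read_sample("TImage_train", 1)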
Data Download
TImage_train (29,554MB)
TImage_valid (3,534MB)
subtitle_test (719MB)
AICtext (539MB)
read_lmdb.py (3KB)
Contact Information
Cheng-Lin Liu
Institute of Automation, Chinese Academy of Sciences (CASIA)
Beijing 100190, China
Email: liucl@nlpr.ia.ac.cn
Website: www.nlpr.ia.ac.cn/pal/