The Bi-Lingual Annotation Text Image Dataset (BLATID)
The Bi-Lingual Annotation Text Image Dataset (BLATID) was
built for the research scenario in which a text image in one language is
recognized and transcribed into text of another language, also known as
cross-lingual text image recognition (CLTIR).
The CASIA-BLATID dataset contains Chinese text images, each annotated with
both the Chinese transcript and its English translation.
In building the dataset, we took advantage of an existing machine translation
corpus to save labeling effort. The AI Challenger dataset
(AIC, https://github.com/AIChallenger/AI_Challenger_2018) consists of about 12M
Chinese-English sentence pairs for training and 7.8K for validation. We
synthesized Chinese text images from the sentences in the AIC corpus. The
generated dataset thus contains triplets of "Chinese text image - Chinese text
label - English text label".
We generated 1M triplets for training the text recognition model. Considering
the difficulty of collecting text images in practice, a scale of 1M text images
is sufficiently large. In addition, to increase the diversity of the validation
set, we used the 7.8K sentence pairs of the AIC validation corpus to generate
100K triplets with varied graphic configurations.
To evaluate CLTIR performance in a more practical scenario, the test set was
derived from movies and their bilingual subtitles. We collected English-Chinese
bilingual subtitles from 50 English animated films, where the Chinese sentences
are rendered on the corresponding frames according to the timestamps. There are
55K samples in the test set in total.
In summary, the BLATID dataset has three
subsets:
Training set (TImage_train): 1M synthesized text line images with Chinese and
English translated text labels
Validation set (TImage_valid): 100K synthesized text line images with Chinese
and English translated text labels
Test set (subtitle_test): 55K text line images (movie subtitles) with Chinese
and English translated text labels
In addition, we also provide a copy of the AIC text corpus for possible use
(e.g., training an auxiliary text translation model to assist the CLTIR
model).
All the above datasets are stored in LMDB format (Lightning Memory-Mapped
Database, commonly accessed from Python), each consisting of a data file and a
lock file.
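The files can be opened with the lmdb Python package. The following is a
minimal reading sketch in which the key names ("image-%09d", "label-%09d",
"trans-%09d") and the example path are assumptions made for illustration; the
provided read_lmdb.py defines the actual key layout.

import io
import lmdb                     # pip install lmdb
from PIL import Image

def read_sample(lmdb_path, index):
    """Read one (image, Chinese label, English label) triplet from an LMDB folder.

    The key scheme below is an assumption for illustration; see the provided
    read_lmdb.py for the dataset's actual keys.
    """
    env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
    with env.begin(write=False) as txn:
        img_bytes = txn.get(f"image-{index:09d}".encode())
        zh_label = txn.get(f"label-{index:09d}".encode()).decode("utf-8")
        en_label = txn.get(f"trans-{index:09d}".encode()).decode("utf-8")
    env.close()
    image = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    return image, zh_label, en_label

# Example usage (path and index are illustrative):
# image, zh, en = read_sample("TImage_train", 1)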
Data Download
TImage_train (29,554MB)
TImage_valid (3,534MB)
subtitle_test (719MB)
AICtext (539MB)
read_lmdb.py (3KB)
Contact Information
Cheng-Lin Liu
Institute of Automation, Chinese Academy of Sciences (CASIA)
Beijing 100190, China
Email: liucl@nlpr.ia.ac.cn
Website: www.nlpr.ia.ac.cn/pal/