The Ancient Handwritten Characters Database
(CASIA-AHCDB) is designed for character recognition research. The database
contains more than 2.2 million annotated character samples of 10,658 classes.
The character samples come from more than 12,000 pages of annotated ancient
Chinese handwritten documents. According to the source of the documents, the
database is divided into two sub-databases: Complete Library in Four
Sections (style1) and Ancient Buddhist Scriptures (style2). Each sub-database
is divided into three parts based on its intended use: a basic category set,
an enhanced category set and a reserved category set. The basic category sets of
style1 and style2 share the same 2,365 classes, while the enhanced category sets
of style1 and style2 have no classes in common. The reserved category sets are
not split into training and test sets because they contain too few samples.
Style1 contains 25 books, numbered “book_01” to “book_25”. Four pairs of books
were each written by a single person: (book_01, book_02), (book_03, book_04),
(book_05, book_06) and (book_07, book_08). The remaining books were written by
different people. Books 01-20 form the training set and books 21-25 form the
test set.
Style2 contains Buddhist scripture documents from 10 different periods. The
writer of each volume can no longer be verified. Volumes are numbered by period
and volume; for example, volume 001 of period 01 is numbered
“period_01/volume_001”. Scriptures from periods 09-10 form the training set and
scriptures from periods 01-08 form the test set.
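The split rules above can be sketched as a small lookup. This is only an illustration: the function name `split_of` is hypothetical, and it assumes samples can be traced back to identifiers of the form “book_NN” or “period_NN/volume_NNN” as described in the text. It applies to the basic and enhanced category sets only, since the reserved sets are not split.

```python
import re

# Ranges taken from the description above.
STYLE1_TRAIN_BOOKS = range(1, 21)     # books 01-20
STYLE1_TEST_BOOKS = range(21, 26)     # books 21-25
STYLE2_TRAIN_PERIODS = range(9, 11)   # periods 09-10
STYLE2_TEST_PERIODS = range(1, 9)     # periods 01-08


def split_of(name: str) -> str:
    """Return "train" or "test" for a style1 book or style2 volume name.

    Hypothetical helper: `name` is assumed to look like "book_07" or
    "period_03/volume_012".
    """
    book = re.fullmatch(r"book_(\d{2})", name)
    if book:
        number = int(book.group(1))
        if number in STYLE1_TRAIN_BOOKS:
            return "train"
        if number in STYLE1_TEST_BOOKS:
            return "test"
        raise ValueError(f"unknown style1 book: {name}")

    volume = re.fullmatch(r"period_(\d{2})/volume_(\d{3})", name)
    if volume:
        period = int(volume.group(1))
        if period in STYLE2_TRAIN_PERIODS:
            return "train"
        if period in STYLE2_TEST_PERIODS:
            return "test"
        raise ValueError(f"unknown style2 period: {name}")

    raise ValueError(f"unrecognized name: {name}")
```

Because the style1 pairs written by one person all fall within books 01-08, each of those writers appears only in the training set under this split.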
Table I. Structure and Statistics of CASIA-AHCDB

| Sub-database | Category set | Split | Classes | Characters |
|---|---|---|---|---|
| Style1 | Basic Category | Train | 2,365 | 832,939 |
| Style1 | Basic Category | Test | 2,365 | 254,162 |
| Style1 | Enhanced Category | Train | 3,227 | 89,204 |
| Style1 | Enhanced Category | Test | 3,227 | 36,258 |
| Style1 | Reserved Category | (not split) | 3,819 | 19,763 |
| Style2 | Basic Category | Train | 2,365 | 728,423 |
| Style2 | Basic Category | Test | 2,365 | 204,547 |
| Style2 | Enhanced Category | Train | 783 | 71,179 |
| Style2 | Enhanced Category | Test | 783 | 19,597 |
| Style2 | Reserved Category | (not split) | 2,450 | 8,213 |
| Summation | | | 12,229 | 2,264,285 |
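As a quick consistency check, the character counts in Table I can be added up per sub-database and overall. The dictionary below simply restates the table; the variable names are illustrative. Class counts are not summed here, because the basic category sets of the two styles share the same classes.

```python
# Character counts copied from Table I: (sub-database, category set, split).
CHARACTERS = {
    ("style1", "basic", "train"): 832_939,
    ("style1", "basic", "test"): 254_162,
    ("style1", "enhanced", "train"): 89_204,
    ("style1", "enhanced", "test"): 36_258,
    ("style1", "reserved", "all"): 19_763,
    ("style2", "basic", "train"): 728_423,
    ("style2", "basic", "test"): 204_547,
    ("style2", "enhanced", "train"): 71_179,
    ("style2", "enhanced", "test"): 19_597,
    ("style2", "reserved", "all"): 8_213,
}

# Total characters per sub-database.
per_style = {}
for (style, _category, _split), count in CHARACTERS.items():
    per_style[style] = per_style.get(style, 0) + count

total = sum(CHARACTERS.values())

# The overall total matches the "Summation" row of Table I.
assert total == 2_264_285
print(per_style, total)
```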
Table II. GNTX Format

| Item | Length | Comment |
|---|---|---|
| Sample size | 4 bytes | Number of bytes for one sample |
| Unicode | 4 bytes | Unicode |
| Width | 2 bytes | Number of pixels in a row |
| Height | 2 bytes | Number of rows |
| Bitmap | width * height bytes | Stored row by row |
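A minimal sketch of a reader for the record layout in Table II follows. It is not an official loader, and several points are assumptions the table does not settle: integer fields are taken to be little-endian, each bitmap byte is taken to be one pixel, and the 4-byte Unicode field is kept as raw bytes, with an integer interpretation offered only as a guess. The names `GntxSample`, `read_gntx` and `code_point_guess` are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple

# ASSUMPTION: little-endian fields; Table II does not state the byte order.
# Layout: sample size (4 bytes) | Unicode (4) | width (2) | height (2).
HEADER = struct.Struct("<I4sHH")


class GntxSample(NamedTuple):
    sample_size: int  # "Sample size" field, kept exactly as stored
    code_raw: bytes   # the 4 raw bytes of the "Unicode" field
    width: int        # number of pixels in a row
    height: int       # number of rows
    bitmap: bytes     # width * height bytes, stored row by row

    def row(self, r: int) -> bytes:
        """Return the bytes of row r (0-based)."""
        if not 0 <= r < self.height:
            raise IndexError(r)
        return self.bitmap[r * self.width:(r + 1) * self.width]

    def code_point_guess(self) -> int:
        # ASSUMPTION: the field holds a little-endian code point.
        # Check against known samples before relying on this.
        return int.from_bytes(self.code_raw, "little")


def read_gntx(stream: BinaryIO) -> Iterator[GntxSample]:
    """Yield samples from a binary stream until it is exhausted."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise ValueError("truncated sample header")
        sample_size, code_raw, width, height = HEADER.unpack(header)
        bitmap = stream.read(width * height)
        if len(bitmap) != width * height:
            raise ValueError("truncated bitmap")
        yield GntxSample(sample_size, code_raw, width, height, bitmap)


if __name__ == "__main__":
    # Synthetic 3x2 record built in memory, not real database content.
    # ASSUMPTION: sample size counts the 12 header bytes plus the bitmap.
    pixels = bytes(range(6))
    code = (0x66F8).to_bytes(4, "little")
    record = HEADER.pack(HEADER.size + len(pixels), code, 3, 2) + pixels
    for sample in read_gntx(io.BytesIO(record)):
        print(sample.width, sample.height, list(sample.row(1)))
```

The reader sizes the bitmap from width and height rather than from the sample size field, so it does not depend on whether that field includes the header.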
Data Download
style1_basic_test
style1_basic_train_part1
style1_basic_train_part2
style1_basic_train_part3
style1_enhanced
style2
Reference
Yue Xu, Fei Yin, Da-Han Wang, Xu-Yao Zhang, Zhaoxiang Zhang, Cheng-Lin Liu, CASIA-AHCDB: A large-scale Chinese ancient handwritten characters database, Proc. 15th ICDAR, Sydney, Australia, September 20-25, 2019, pp.793-798.
Haidian | Beijing | China
Phone : (+86-10)8254-4797
Fax : (+86-10) 8254-4594
Email:liucl@nlpr.ia.ac.cn
Website:www.nlpr.ia.ac.cn/pal/