CASIA Online and Offline Chinese Handwriting Databases

Offline Touching Characters Dataset

Overview:

To support the assessment of touching character segmentation algorithms, we present CASIA-HWDB-T, a database of touching characters collected from the Chinese handwriting database CASIA-HWDB. All the touching characters (or strings) are annotated with their character classes, the locations of the touching points, and auxiliary values such as string height (LH) and average stroke width (SW).

By character type, we partition the touching strings into four subsets: 2,788 all-digit strings (HWDB-T-allDigits), 328 all-letter strings (HWDB-T-allLetters), 50,157 all-Chinese strings (HWDB-T-allChinese), and 3,196 mixed-character strings (HWDB-T-other).

By the number of characters and touching points, we partition the dataset into three subsets: 48,536 single-touching pairs (HWDB-ST-P), 6,115 single-touching strings with more than two characters (HWDB-ST-M), and 1,818 multiple-touching pairs (HWDB-MT). More details about the dataset can be found in our paper listed under Publication.
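As a quick consistency check on the figures quoted above, the two partitions can be summed. The short sketch below only uses the counts from the text; the dictionary names are illustrative and not part of the dataset.

```python
# Subset sizes exactly as quoted in the overview text.
by_character_type = {
    "HWDB-T-allDigits": 2788,
    "HWDB-T-allLetters": 328,
    "HWDB-T-allChinese": 50157,
    "HWDB-T-other": 3196,
}
by_touching_structure = {
    "HWDB-ST-P": 48536,
    "HWDB-ST-M": 6115,
    "HWDB-MT": 1818,
}

total_by_type = sum(by_character_type.values())
total_by_structure = sum(by_touching_structure.values())

# Both partitions add up to the same number of samples.
print(total_by_type, total_by_structure)  # 56469 56469
```

Both sums come to 56,469, which is consistent with the two partitions being different views of the same set of touching strings.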

Download:

HWDB-T-allDigits (3.73 MB)
HWDB-T-allLetters (0.54 MB)
HWDB-T-allChinese (207 MB)
HWDB-T-other (8.6 MB)

HWDB-ST-P (178 MB)
HWDB-ST-M (34.6 MB)
HWDB-MT (7.5 MB)

All the datasets are stored in the tcs file format, which is specified in Table 1 and also described in FileFormat-tcs.pdf.

Table 1. Offline ground-truthed Touching Character String image (*.tcs) file format

Item                                  Type            Length (B: byte)      Instance
------------------------------------  --------------  --------------------  ------------------------------------------
File Header
  Size of Header                      Long int        4B                    Number of bytes: 36 + strlen(illustr)
  Format code                         ASCII (char*)   8B                    “tcs”
  Illustration                        Text            Arbitrary             “#CASIA-HWDB-T”
  Code type                           ASCII (char*)   20B                   “ASCII” or “GB”
  Code Length (CL)                    Short int       2B                    1 (ASCII), 2 (GB)
  Bits per pixel                      Short int       2B                    8 (Gray image)
String Image Records (concatenated)
  Stroke Width (SW)                   Short int       2B
  Line Height (LH)                    Short int       2B
  Number of Touching Points (NTP)     Short int       2B                    1, 2, 3, …
  Position sequence of top and        Short int       2 * (2B + 2B) * NTP   For each touching point: first top terminal,
    bottom terminals of                                                     then bottom terminal. For each terminal:
    touching point(s)                                                       first row, then column.
  *Number of Characters (NC)          Short int       2B                    2, 3, 4, …
  Character labels (code)             Code type       NC * CL
  Height of image (H)                 Short int       2B
  Width of image (W)                  Short int       2B
  bitmap                              Byte            H * W                 Gray image


*For a single-touching character string image, NC = NTP + 1. For a multiple-touching character pair image, NC = 2 and NC <= NTP.

Three example files are available (download the files), one each for allDigits, allLetters, and allChinese. You can view them using the software developed by us (download the software).
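Because each record stores its image as H * W bytes at 8 bits per pixel (Table 1), a bitmap can also be inspected with a generic image viewer by wrapping it in a binary PGM (P5) container. This is a minimal sketch, assuming the pixels are stored row by row, which Table 1 does not state; the function name to_pgm is hypothetical.

```python
def to_pgm(height: int, width: int, bitmap: bytes) -> bytes:
    """Wrap an 8-bit gray bitmap of height * width bytes in a binary PGM (P5) file."""
    if len(bitmap) != height * width:
        raise ValueError("bitmap length does not match height * width")
    # PGM header: magic number, width and height, maximum gray value.
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(bitmap)


# Tiny synthetic 2 x 3 image for illustration (not real dataset content).
demo = to_pgm(2, 3, bytes([0, 128, 255, 255, 128, 0]))
```

Writing the returned bytes to a file opened in binary mode, with a .pgm extension, gives an image that most viewers can open. If the result looks sheared or scrambled, the row-major assumption should be checked against FileFormat-tcs.pdf.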

Publication:

Liang Xu, Fei Yin, Qiu-Feng Wang, Cheng-Lin Liu, “A Touching Character Database from Chinese Handwriting for Assessing Segmentation Algorithms,” Proceedings of the 13th International Conference on Frontiers in Handwriting Recognition (ICFHR), Bari, Italy, 2012.