Offline Touching Characters Dataset
Overview:
To support the evaluation of touching-character segmentation algorithms, we present CASIA-HWDB-T, a database of touching characters collected from the Chinese handwriting database CASIA-HWDB. Every touching character string is annotated with its character classes, the locations of its touching points, and auxiliary values such as string (line) height (LH) and average stroke width (SW).
By character type, the touching strings are partitioned into four subsets: 2,788 all-digit strings (HWDB-T-allDigits), 328 all-letter strings (HWDB-T-allLetters), 50,157 all-Chinese strings (HWDB-T-allChinese), and 3,196 mixed-character strings (HWDB-T-other).
By the number of characters and touching points, the same data are also partitioned into three subsets: 48,536 single-touching pairs (HWDB-ST-P), 6,115 single-touching strings with more than two characters (HWDB-ST-M), and 1,818 multiple-touching pairs (HWDB-MT). More details about the dataset can be found in our paper listed below.
Download:
HWDB-T-allDigits (3.73 MB)
HWDB-T-allLetters (0.54 MB)
HWDB-T-allChinese (207 MB)
HWDB-T-other (8.6 MB)
HWDB-ST-P (178 MB)
HWDB-ST-M (34.6 MB)
HWDB-MT (7.5 MB)
All subsets are stored in the tcs file format specified in Table 1. The format is also described in FileFormat-tcs.pdf.
Table 1. Off-line ground-truthed Touching Character String image (*.tcs) file format
Item | Type | Length (B: Byte) | Instance |
File Header | | | |
Size of Header | Long int | 4B | Number of bytes: 36+strlen(illustr) |
Format code | ASCII (char*) | 8B | “tcs” |
Illustration | Text | Arbitrary | “#CASIA-HWDB-T” |
Code type | ASCII (char*) | 20B | “ASCII” or “GB”. |
Code Length (CL) | Short int | 2B | 1 (ASCII), 2 (GB). |
Bits per pixel | Short int | 2B | 8 (Gray image) |
String Image Records (concatenated) | | | |
Stroke Width (SW) | Short int | 2B | |
Line Height (LH) | Short int | 2B | |
Number of Touching Points (NTP) | Short int | 2B | 1, 2, 3, … |
Positions of the top and bottom terminals of the touching point(s) | Short int | 2 * (2B + 2B) * NTP | For each touching point: top terminal first, then bottom terminal. For each terminal: row first, then column. |
*Number of Characters (NC) | Short int | 2B | 2, 3, 4, … |
Character labels (code) | Code type | NC * CL | |
Height of image (H) | Short int | 2B | |
Width of image (W) | Short int | 2B | |
bitmap | Byte | H * W | Gray image |
*For a single-touching character string image, NC = NTP + 1. For a multiple-touching character pair image, NC = 2 and NC <= NTP.
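The layout in Table 1 can be read with a short script. The following Python sketch is illustrative only and is not the viewer software mentioned on this page. It assumes little-endian byte order, signed integers, null- or space-padded ASCII text fields, and a row-major bitmap with one byte per pixel; none of these details are stated in Table 1, so please check them against FileFormat-tcs.pdf. The names `TcsHeader`, `TcsRecord`, `read_header`, and `read_records` are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import BinaryIO, Iterator, List, Tuple

Point = Tuple[int, int]  # (row, column)


@dataclass
class TcsHeader:
    format_code: str
    illustration: str
    code_type: str
    code_length: int      # CL: 1 (ASCII) or 2 (GB)
    bits_per_pixel: int


@dataclass
class TcsRecord:
    stroke_width: int                           # SW
    line_height: int                            # LH
    touching_points: List[Tuple[Point, Point]]  # (top terminal, bottom terminal)
    labels: List[bytes]                         # NC raw codes, CL bytes each
    height: int
    width: int
    bitmap: bytes                               # H * W gray values


def _read_exact(f: BinaryIO, n: int) -> bytes:
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def _text(raw: bytes) -> str:
    # Assumption: fixed-width text fields are padded with NULs or spaces.
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace").strip()


def read_header(f: BinaryIO) -> TcsHeader:
    # Assumption: little-endian, signed 4-byte "long int".
    (header_size,) = struct.unpack("<i", _read_exact(f, 4))
    illustr_len = header_size - 36  # header size = 36 + strlen(illustr)
    if illustr_len < 0:
        raise ValueError(f"implausible header size: {header_size}")
    format_code = _text(_read_exact(f, 8))
    illustration = _text(_read_exact(f, illustr_len))
    code_type = _text(_read_exact(f, 20))
    code_length, bpp = struct.unpack("<hh", _read_exact(f, 4))
    return TcsHeader(format_code, illustration, code_type, code_length, bpp)


def read_records(f: BinaryIO, header: TcsHeader) -> Iterator[TcsRecord]:
    cl = header.code_length
    while True:
        first = f.read(2)
        if not first:  # clean end of file
            return
        if len(first) != 2:
            raise EOFError("truncated record")
        (sw,) = struct.unpack("<h", first)
        lh, ntp = struct.unpack("<hh", _read_exact(f, 4))
        # Per touching point: top row, top column, bottom row, bottom column.
        c = struct.unpack(f"<{4 * ntp}h", _read_exact(f, 8 * ntp))
        points = [((c[i], c[i + 1]), (c[i + 2], c[i + 3]))
                  for i in range(0, 4 * ntp, 4)]
        (nc,) = struct.unpack("<h", _read_exact(f, 2))
        raw = _read_exact(f, nc * cl)
        labels = [raw[i:i + cl] for i in range(0, len(raw), cl)]
        h, w = struct.unpack("<hh", _read_exact(f, 4))
        # Assumption: 8 bits per pixel, so the bitmap is exactly H * W bytes.
        bitmap = _read_exact(f, h * w)
        yield TcsRecord(sw, lh, points, labels, h, w, bitmap)
```

To use the sketch, open a tcs file in binary mode, call `read_header` once, and then iterate over `read_records` with the returned header. Labels are kept as raw bytes because the right decoding depends on the header's code type. The relation in the footnote above (NC = NTP + 1 for single-touching strings) can serve as a sanity check on each parsed record.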
Three example files are provided (download the files), one each for allDigits, allLetters, and allChinese. They can be viewed with this software (download the software), which we developed.
Publication:
Liang Xu, Fei Yin, Qiu-Feng Wang, Cheng-Lin Liu, “A Touching Character Database from Chinese Handwriting for Assessing Segmentation Algorithms,” Proceedings of the 13th International Conference on Frontiers in Handwriting Recognition (ICFHR), Bari, Italy, 2012.