Touching Characters Databases

Offline Touching Characters Dataset

Overview:

To support the evaluation of touching-character segmentation algorithms, we present CASIA-HWDB-T, a database of touching characters collected from the Chinese handwriting database CASIA-HWDB. Every touching character string is annotated with its character classes, the locations of its touching points, and auxiliary values such as string height (LH) and average stroke width (SW).

By character type, the touching strings are partitioned into four subsets: 2,788 all-digit strings (HWDB-T-allDigits), 328 all-letter strings (HWDB-T-allLetters), 50,157 all-Chinese strings (HWDB-T-allChinese), and 3,196 mixed-character strings (HWDB-T-other).

By number of characters and touching points, the same dataset is also partitioned into three subsets: 48,536 single-touching pairs (HWDB-ST-P), 6,115 single-touching strings with more than two characters (HWDB-ST-M), and 1,818 multiple-touching pairs (HWDB-MT). More details about the dataset can be found in the paper listed under Publication.
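As a quick sanity check on the figures above, the two partitions should cover the same set of strings. The short sketch below only re-adds the subset sizes quoted in this section; the dictionary names are illustrative and not part of the dataset.

```python
# Subset sizes copied from the overview text.
by_character_type = {
    "HWDB-T-allDigits": 2788,
    "HWDB-T-allLetters": 328,
    "HWDB-T-allChinese": 50157,
    "HWDB-T-other": 3196,
}
by_touching_type = {
    "HWDB-ST-P": 48536,
    "HWDB-ST-M": 6115,
    "HWDB-MT": 1818,
}

total_by_character_type = sum(by_character_type.values())
total_by_touching_type = sum(by_touching_type.values())

# Both partitions add up to the same number of touching strings.
print(total_by_character_type, total_by_touching_type)  # 56469 56469
```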

Download:

HWDB-T-allDigits (3.73 MB)
HWDB-T-allLetters (0.54 MB)
HWDB-T-allChinese (207 MB)
HWDB-T-other (8.6 MB)

HWDB-ST-P (178 MB)
HWDB-ST-M (34.6 MB)
HWDB-MT (7.5 MB)

All datasets are stored in the tcs file format, which is specified in Table 1 and also described in FileFormat-tcs.pdf.

Table 1. Off-line ground-truthed Touching Character String image (*.tcs) file format

Item	Type	Length (B: Byte)	Instance
File Header
Size of Header	Long int	4B	Number of bytes: 36+strlen(illustr)
Format code	ASCII (char*)	8B	“tcs”
Illustration	Text	Arbitrary	“#CASIA-HWDB-T”
Code type	ASCII (char*)	20B	“ASCII” or “GB”.
Code Length (CL)	Short int	2B	1 (ASCII), 2 (GB).
Bits per pixel	Short int	2B	8 (Gray image)
String Image Records (concatenated)
Stroke Width (SW)	Short int	2B
Line Height (LH)	Short int	2B
Number of Touching Points (NTP)	Short int	2B	1, 2, 3, …
Position sequence of top and bottom terminal of touching point(s)	Short int	2 * (2B + 2B) * NTP	For each touching point, first top terminal, then bottom terminal; For each terminal, first row, then column.
*Number of Characters (NC)	Short int	2B	2, 3, 4, …
Character labels (code)	Code type	NC * CL
Height of image (H)	Short int	2B
Width of image (W)	Short int	2B
bitmap	Byte	H * W	Gray image

*For single-touching character string image, NC = (NTP + 1); For multiple-touching character pair image, NC <= NTP and NC = 2.
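The layout in Table 1 can be read with a few lines of Python. The following is a minimal sketch rather than a reference reader. It assumes little-endian byte order and unsigned integers (Table 1 states neither), and the names TcsHeader, TcsRecord, read_header, and read_records are hypothetical. Character labels are returned as raw byte strings of CL bytes each, since the exact text encoding behind the "GB" code type is not specified here.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO, Iterator, List, Tuple

Point = Tuple[int, int]  # (row, column)


@dataclass
class TcsHeader:
    format_code: str
    illustration: str
    code_type: str
    code_length: int      # CL: 1 (ASCII) or 2 (GB)
    bits_per_pixel: int


@dataclass
class TcsRecord:
    stroke_width: int                             # SW
    line_height: int                              # LH
    touching_points: List[Tuple[Point, Point]]    # (top, bottom) per point
    labels: List[bytes]                           # NC raw codes, CL bytes each
    height: int
    width: int
    bitmap: bytes                                 # H * W gray values, row-major


def _read_exact(f: BinaryIO, n: int) -> bytes:
    data = f.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def _text(raw: bytes) -> str:
    # Fixed-width text fields are assumed to be NUL-padded.
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace").strip()


def read_header(f: BinaryIO) -> TcsHeader:
    # Assumption: little-endian, unsigned 4-byte header size.
    (header_size,) = struct.unpack("<I", _read_exact(f, 4))
    if header_size < 36:
        raise ValueError(f"header size {header_size} is smaller than 36")
    format_code = _text(_read_exact(f, 8))
    # Header size is 36 + strlen(illustration), so the rest is the text.
    illustration = _text(_read_exact(f, header_size - 36))
    code_type = _text(_read_exact(f, 20))
    code_length, bpp = struct.unpack("<HH", _read_exact(f, 4))
    if code_length < 1:
        raise ValueError(f"invalid code length {code_length}")
    return TcsHeader(format_code, illustration, code_type, code_length, bpp)


def read_records(f: BinaryIO, header: TcsHeader) -> Iterator[TcsRecord]:
    cl = header.code_length
    while True:
        head = f.read(6)
        if not head:          # clean end of file
            return
        if len(head) != 6:
            raise EOFError("truncated record")
        sw, lh, ntp = struct.unpack("<HHH", head)
        # Per touching point: top row, top column, bottom row, bottom column.
        coords = struct.unpack(f"<{4 * ntp}H", _read_exact(f, 8 * ntp))
        points = [((coords[i], coords[i + 1]), (coords[i + 2], coords[i + 3]))
                  for i in range(0, 4 * ntp, 4)]
        (nc,) = struct.unpack("<H", _read_exact(f, 2))
        raw = _read_exact(f, nc * cl)
        labels = [raw[i:i + cl] for i in range(0, len(raw), cl)]
        h, w = struct.unpack("<HH", _read_exact(f, 4))
        bitmap = _read_exact(f, h * w)
        yield TcsRecord(sw, lh, points, labels, h, w, bitmap)
```

A file would be opened in binary mode, passed to read_header once, and then to read_records to iterate over the concatenated string images. If the parsed values look implausible (for example, very large heights or widths), the byte-order assumption is the first thing to check against FileFormat-tcs.pdf.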

Here are three example files (download the files), one each for allDigits, allLetters, and allChinese. You can view them with the viewer software we developed (download the software).

Publication:

Liang Xu, Fei Yin, Qiu-Feng Wang, Cheng-Lin Liu, “A Touching Character Database from Chinese Handwriting for Assessing Segmentation Algorithms,” Proceedings of the 13th International Conference on Frontiers in Handwriting Recognition (ICFHR), Bari, Italy, 2012.
