Offline Database

　For offline data collection, the handwritten pages were scanned at a resolution of 300 DPI to obtain color images, which were then segmented and labeled using annotation tools. After annotation, the database images have the background labeled as 255 and the foreground pixels stored in 255 gray levels (0-254). Binary images can therefore be obtained by simply setting the foreground pixels to 1 and the background pixels to 0.
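The binarization rule described above can be sketched in a few lines of Python. This is only an illustration: the function name binarize and the flat bytes representation of an image are assumptions, not part of the database tools.

```python
def binarize(gray: bytes) -> bytes:
    """Map background pixels (255) to 0 and any foreground level (0-254) to 1."""
    return bytes(0 if pixel == 255 else 1 for pixel in gray)


# A tiny made-up row of pixels: background, three foreground levels, background.
sample = bytes([255, 0, 128, 254, 255])
binary = binarize(sample)
print(list(binary))
```

The same rule applies unchanged to a full bitmap, since it acts on each pixel independently.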

Figure: Offline data examples. (a) Isolated character samples; (b) handwritten text sample.

CASIA-HWDB1.0-1.2

　There are three datasets of isolated characters in the offline handwriting database. The statistics of these datasets are shown in Table 1. The datasets include 1,020 files, and each file (*.gnt) stores concatenated gray-scale character images of one writer. The file format of *.gnt is specified in Table 2.

Table 1. Statistics of offline isolated character datasets

Dataset	#writers	#samples (total)	#symbol samples	#Chinese samples/#classes
HWDB1.0	420	1,680,258	71,122	1,609,136/3,866
HWDB1.1	300	1,172,907	51,158	1,121,749/3,755
HWDB1.2	300	1,041,970	50,981	990,989/3,319
Total	1,020	3,895,135	173,261	3,721,874/7,185

　HWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Among the 3,866 Chinese characters, 3,740 are in the GB2312-80 level-1 set (which contains 3,755 characters in total).
　HWDB1.1 includes the 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols.
　HWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in HWDB1.2 (3,319 classes) is disjoint from that of HWDB1.0.
　HWDB1.0 and HWDB1.2 together include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.

Table 2. Format of offline isolated character data file (*.gnt)

Item	Type	Length	Instance	Comment
Sample size	unsigned int	4B		Number of bytes for one sample (byte count to next sample)
Tag code (GB)	char	2B	"啊"=0xb0a1	Stored as 0xa1b0
Width	unsigned short	2B		Number of pixels in a row
Height	unsigned short	2B		Number of rows
Bitmap	unsigned char	Width*Height bytes		Stored row by row
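
The layout in Table 2 can be read with a short Python sketch such as the one below. It is not the authors' reader. It assumes that all integers are little-endian, that the two tag bytes taken in file order form the GB code, and that single-byte labels are padded with a null byte. The names read_gnt and make_sample are hypothetical, and the demo data is synthetic rather than taken from the database.

```python
import io
import struct


def read_gnt(stream):
    """Yield (label, width, height, bitmap) for each sample in a *.gnt stream."""
    while True:
        header = stream.read(10)  # 4B sample size + 2B tag code + 2B width + 2B height
        if len(header) < 10:
            return  # end of file
        # Assumption: all integers are little-endian.
        sample_size, tag, width, height = struct.unpack("<I2sHH", header)
        bitmap = stream.read(width * height)  # one byte per pixel, row by row
        if len(bitmap) != width * height:
            raise ValueError("truncated sample")
        # Skip any bytes left before the next sample, as given by the sample size.
        stream.read(max(0, sample_size - 10 - len(bitmap)))
        # Assumption: the tag bytes in file order form the GB code, and a
        # trailing null byte pads single-byte (alphanumeric) labels.
        label = tag.rstrip(b"\x00").decode("gbk", errors="replace")
        yield label, width, height, bitmap


def make_sample(tag, width, height, fill):
    """Build one synthetic sample record (for demonstration only)."""
    bitmap = bytes([fill]) * (width * height)
    return struct.pack("<I2sHH", 10 + len(bitmap), tag, width, height) + bitmap


demo = make_sample(b"\xb0\xa1", 3, 2, 255) + make_sample(b"a\x00", 2, 2, 0)
samples = list(read_gnt(io.BytesIO(demo)))
print([(label, w, h) for label, w, h, _ in samples])
```

For a real file, the same generator can be fed a file object opened in binary mode, for example open(path, "rb"), where path points to one of the *.gnt files.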

Here are three example files (download the files), one for each dataset; they can be viewed using this software (download the software) developed by us. Example C++ code for reading images from a *.gnt file is given in GntRead.cpp.pdf.
　The full datasets of CASIA-HWDB1.0-1.2 can be downloaded here.

CASIA-HWDB2.0-2.2

　The offline text databases were produced by the same writers as the isolated character datasets. Each person wrote five pages of given texts. One writer (no. 371) and four pages are missing because of data loss. Each page is stored in a *.dgrl file named after the writer index and page number. In addition to the gray-scale image, the data file also includes the ground truth of text line segmentation and character class labels (in GB codes). The statistics of the datasets and the format of the *.dgrl file are shown in Table 3 and Table 4, respectively.

Table 3. Statistics of offline handwritten text datasets

Dataset	#writers	#pages	#lines	#character/#class	#out-of-class sample
HWDB2.0	419	2,092	20,495	538,868/1,222	1,106
HWDB2.1	300	1,500	17,292	429,553/2,310	172
HWDB2.2	300	1,499	14,443	380,993/1,331	581
Total	1,019	5,091	52,230	1,349,414/2,703	1,859

　Out-of-class samples are samples that fall outside the 7,356 classes (all the classes in HWDB1.0-1.2, i.e., 7,185 Chinese characters plus 171 alphanumeric characters and symbols).

A DGRL (*.dgrl) file stores one page of document image. The image has its background eliminated (encoded as 255) and its foreground (text strokes) encoded in gray levels 0-254, one byte per pixel. Each page is stored as a series of lines. Each line record consists of the number of characters, the sequence of character codes (GBK), the top-left position, the line height and width, and then the bitmap block (height*width bytes).

When concatenating the lines into a page image, note that the regions of different lines may overlap, because the text strokes of different lines may overlap along the vertical axis. To restore the page image, the foreground pixels of the different lines should therefore be combined.
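
One way to combine overlapping lines is sketched below in Python. It starts from an all-background page and keeps the darker of the two pixels wherever a line is pasted, so that strokes already on the page are never overwritten by another line's background. Taking the minimum is an assumption about what "combined" means here, and the function name paste_line and the demo values are made up.

```python
def paste_line(page, page_width, line, top, left, height, width):
    """Merge one line bitmap into a page buffer, keeping the darker pixel.

    page is a bytearray of page_height * page_width gray values.
    line is a bytes object of height * width gray values, row by row.
    """
    page_height = len(page) // page_width
    for r in range(height):
        for c in range(width):
            y, x = top + r, left + c
            if 0 <= y < page_height and 0 <= x < page_width:
                i = y * page_width + x
                # Background is 255, so min() keeps any foreground pixel.
                page[i] = min(page[i], line[r * width + c])


# A 3-row by 4-column page, initially all background.
page = bytearray([255]) * 12
line_a = bytes([255, 255, 255, 255,
                0, 0, 255, 255])      # occupies rows 0-1
line_b = bytes([255, 255, 10, 255,
                255, 255, 255, 20])   # occupies rows 1-2, overlapping row 1
paste_line(page, 4, line_a, top=0, left=0, height=2, width=4)
paste_line(page, 4, line_b, top=1, left=0, height=2, width=4)
print(list(page))
```

In the shared row, the strokes of both lines survive, which would not be the case if the second line simply overwrote the first.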

Table 4. Format of offline text data file (*.dgrl)

Item	Type	Length	Instance
File Header
Size of Header	int	4B	Number of bytes: 36+strlen(illustr)
Format code	ASCII (char*)	8B	"DGRL"
Illustration	Text	Arbitrary	"#......\0"
Code type	ASCII (char*)	20B	"ASCII", "GB", etc.
Code length	Short	2B	1, 2, 4, etc.
Bits per pixel	Short	2B	Typically 1(B/W image), 8 (Gray image)
Image Records (concatenated)
Image height	int	4B	Height (pixels) of document image
Image width	int	4B	Width (pixels) of document image
Line number	int	4B	Number of lines in the image
Line Records (concatenated)
Char number	int	4B	Number of characters in a line
Label (code)	Code type	Code length* Char number	Each byte is 0xff(-1) for garbage
Top-left coordinates	int	4B + 4B	(top, left) of a line
Height (H)	int	4B	Height (pixels) of a line
Width (W)	int	4B	Width (pixels) of a line
Bitmap	BYTE	H*((W+7)/8) or H*W	Binary or gray image
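
The field order of Table 4 can be followed in a short Python sketch like the one below. It is not the authors' reader. It assumes little-endian integers, null-padded ASCII fields, a single image record per file, and GBK-decodable labels. The function name read_dgrl, the dictionary keys, and the demo bytes are all hypothetical.

```python
import io
import struct


def read_dgrl(stream):
    """Parse one page from a *.dgrl stream, field by field as listed in Table 4."""
    # Assumption: all integers are little-endian.
    (header_size,) = struct.unpack("<i", stream.read(4))
    format_code = stream.read(8).rstrip(b"\x00").decode("ascii")
    illustration = stream.read(header_size - 36)  # variable-length text
    code_type = stream.read(20).rstrip(b"\x00").decode("ascii")
    code_len, bits_per_pixel = struct.unpack("<hh", stream.read(4))
    page_h, page_w, n_lines = struct.unpack("<iii", stream.read(12))
    lines = []
    for _ in range(n_lines):
        (n_chars,) = struct.unpack("<i", stream.read(4))
        raw = stream.read(code_len * n_chars)
        labels = []
        for i in range(0, len(raw), code_len):
            code = raw[i:i + code_len]
            if code == b"\xff" * code_len:
                labels.append(None)  # garbage marker: every byte is 0xff
            else:
                labels.append(code.rstrip(b"\x00").decode("gbk", errors="replace"))
        top, left, height, width = struct.unpack("<iiii", stream.read(16))
        # Binary images pack 8 pixels per byte; gray images use 1 byte per pixel.
        row_bytes = (width + 7) // 8 if bits_per_pixel == 1 else width
        bitmap = stream.read(height * row_bytes)
        lines.append({"labels": labels, "top": top, "left": left,
                      "height": height, "width": width, "bitmap": bitmap})
    return {"format": format_code, "illustration": illustration,
            "code_type": code_type, "size": (page_h, page_w), "lines": lines}


# Synthetic page with one line of two labels (one valid, one garbage).
illustr = b"#demo\x00"
demo = (struct.pack("<i8s", 36 + len(illustr), b"DGRL") + illustr
        + struct.pack("<20shh", b"GB", 2, 8)
        + struct.pack("<iii", 4, 6, 1)
        + struct.pack("<i", 2) + b"\xb0\xa1\xff\xff"
        + struct.pack("<iiii", 1, 0, 2, 3) + bytes(6))
page_record = read_dgrl(io.BytesIO(demo))
print(page_record["format"], page_record["size"], page_record["lines"][0]["labels"])
```

The top, left, height and width of each parsed line are the values needed to place its bitmap back onto the page.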

　DGRL page images can be viewed using this software (download the software) developed by us. Example C++ code for reading images from a *.dgrl file is given in DGRLRead.cpp.pdf.
　The text line data (page images annotated in text lines) of CASIA-HWDB2.0-2.2 can be downloaded here.

CASIA Online and Offline Chinese Handwriting Databases