CASIA Online and Offline Chinese Handwriting Databases

Offline Database

 For offline data collection, the handwritten pages were scanned at a resolution of 300 DPI to obtain color images, which were then segmented and labeled using annotation tools. After annotation, the database images have the background labeled as 255 and the foreground pixels stored in 255 gray levels (0-254). Binary images can therefore be obtained by simply setting the foreground pixels to 1 and the background pixels to 0.
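The binarization rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `binarize` and the flat row-major pixel sequence are assumptions made for the example, not part of the database tools.

```python
def binarize(gray_pixels, background=255):
    """Map the background level (255) to 0 and every foreground level (0-254) to 1.

    gray_pixels is any iterable of integers in 0..255, e.g. a bytes object
    holding one image row by row (an assumption made for this sketch).
    """
    return [0 if p == background else 1 for p in gray_pixels]


row = bytes([255, 0, 128, 254, 255])
print(binarize(row))  # [0, 1, 1, 1, 0]
```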

Offline data examples: (a) isolated character samples; (b) handwritten text sample.

CASIA-HWDB1.0-1.2

 There are three datasets of isolated characters in the offline handwriting database. The statistics of these datasets are shown in Table 1. The datasets include 1,020 files, and each file (*.gnt) stores concatenated gray-scale character images of one writer. The file format of *.gnt is specified in Table 2. 

Table 1. Statistics of offline isolated character datasets

Dataset | #writers | #character samples (total) | Symbol samples | Chinese samples / #classes
HWDB1.0 | 420 | 1,680,258 | 71,122 | 1,609,136 / 3,866
HWDB1.1 | 300 | 1,172,907 | 51,158 | 1,121,749 / 3,755
HWDB1.2 | 300 | 1,041,970 | 50,981 | 990,989 / 3,319
Total | 1,020 | 3,895,135 | 173,261 | 3,721,874 / 7,185

 

 HWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Of the 3,866 Chinese characters, 3,740 are in the GB2312-80 level-1 set (which contains 3,755 characters in total).
 HWDB1.1 includes the 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols.
 HWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in HWDB1.2 (3,319 classes) is disjoint from that of HWDB1.0.
 HWDB1.0 and HWDB1.2 together include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.

Table 2. Format of offline isolated character data file (*.gnt)

Item | Type | Length | Instance | Comment
Sample size | unsigned int | 4B | | Number of bytes for one sample (byte count to next sample)
Tag code (GB) | char | 2B | "啊"=0xb0a1 | Stored as 0xa1b0
Width | unsigned short | 2B | | Number of pixels in a row
Height | unsigned short | 2B | | Number of rows
Bitmap | unsigned char | Width*Height bytes | | Stored row by row
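The record layout in Table 2 can be read with a short loop. The following Python sketch is not the GntRead.cpp code referenced on this page. It assumes little-endian integers and that the two tag-code bytes appear on disk in GB byte order, so that reading them as a little-endian short gives the "stored as 0xa1b0" value. The names `read_gnt` and `decode_tag` are hypothetical.

```python
import io
import struct


def read_gnt(stream):
    """Yield (tag_bytes, width, height, bitmap) for each sample in a *.gnt stream.

    Assumed layout per Table 2: 4-byte sample size, 2-byte tag code,
    2-byte width, 2-byte height, then width*height gray bytes, row by row.
    """
    while True:
        header = stream.read(10)
        if len(header) < 10:
            return  # end of data
        sample_size, tag, width, height = struct.unpack("<I2sHH", header)
        bitmap = stream.read(width * height)
        if len(bitmap) != width * height:
            raise ValueError("truncated bitmap")
        # The sample size is the byte count to the next sample, so skip
        # anything beyond the fields read above (normally nothing).
        extra = sample_size - 10 - width * height
        if extra > 0:
            stream.read(extra)
        yield tag, width, height, bitmap


def decode_tag(tag):
    """Decode a 2-byte tag code to text (assumes GB byte order on disk)."""
    return tag.rstrip(b"\x00").decode("gbk", errors="replace")


# Usage on a synthetic in-memory sample (a 2x2 image labeled "啊").
bitmap = bytes([255, 0, 10, 255])
data = struct.pack("<I2sHH", 10 + len(bitmap), b"\xb0\xa1", 2, 2) + bitmap
for tag, width, height, pixels in read_gnt(io.BytesIO(data)):
    print(decode_tag(tag), width, height, len(pixels))
```

For a real file, open it in binary mode and pass the file object to `read_gnt`. If the decoded labels look wrong, the byte-order assumption should be checked first.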

 

    Three example files are available (download the files), one for each dataset; they can be viewed using software developed by us (download the software). Example C++ code for reading images from a *.gnt file is given in GntRead.cpp.pdf.
 The full datasets of CASIA-HWDB1.0-1.2 can be downloaded here.

CASIA-HWDB2.0-2.2

 The offline text databases were produced by the same writers as the isolated character datasets. Each person wrote five pages of given texts. One writer (no. 371) and four further pages are missing because of data loss. Each page is stored in a *.dgrl file named after the writer index and page number. In addition to the gray-scale image, the data file also includes the ground truth of text line segmentation and the character class labels (in GB codes). The statistics of the datasets and the format of the *.dgrl file are shown in Table 3 and Table 4, respectively.

Table 3. Statistics of offline handwritten text datasets

Dataset | #writers | #pages | #lines | #characters / #classes | #out-of-class samples
HWDB2.0 | 419 | 2,092 | 20,495 | 538,868 / 1,222 | 1,106
HWDB2.1 | 300 | 1,500 | 17,292 | 429,553 / 2,310 | 172
HWDB2.2 | 300 | 1,499 | 14,443 | 380,993 / 1,331 | 581
Total | 1,019 | 5,091 | 52,230 | 1,349,414 / 2,703 | 1,859

  Out-of-class samples are samples that fall outside the 7,356 classes (all classes in HWDB1.0-1.2).

    A DGRL (*.dgrl) file stores one page of document image. The image has its background eliminated (encoded as 255) and its foreground (text strokes) encoded in gray levels 0-254, one byte per pixel. Each page is stored as a series of lines. Each line has a header giving the number of characters, the sequence of character codes (GBK), the top-left position, and the line height and width, followed by the bitmap block (height*width bytes).

    When concatenating the lines into a page image, note that the rectangular regions of different lines may overlap, because the text strokes of neighboring lines may overlap along the vertical axis. To restore the page image, the foreground pixels of the different lines should therefore be combined.
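One way to combine the foreground of overlapping lines is to start from an all-background page and keep the darker pixel wherever two line bitmaps cover the same position. The Python sketch below illustrates that idea. Using the pixel-wise minimum is an assumption made here (it works because the background is 255 and the foreground is 0-254), and the name `compose_page` and the tuple layout are hypothetical.

```python
def compose_page(page_height, page_width, lines, background=255):
    """Rebuild a page from line bitmaps, keeping foreground where lines overlap.

    lines is an iterable of (top, left, height, width, bitmap) tuples, where
    bitmap holds height*width gray bytes stored row by row.
    Returns a bytearray of page_height*page_width gray bytes.
    """
    page = bytearray([background]) * (page_height * page_width)
    for top, left, height, width, bitmap in lines:
        for r in range(height):
            for c in range(width):
                y, x = top + r, left + c
                if 0 <= y < page_height and 0 <= x < page_width:
                    i = y * page_width + x
                    # Background is 255, so the minimum keeps any stroke pixel.
                    page[i] = min(page[i], bitmap[r * width + c])
    return page


# Two 2x2 lines that share the middle row of a 3x2 page.
line_a = (0, 0, 2, 2, bytes([0, 255, 10, 255]))
line_b = (1, 0, 2, 2, bytes([255, 50, 255, 255]))
print(list(compose_page(3, 2, [line_a, line_b])))  # [0, 255, 10, 50, 255, 255]
```

Simply pasting the second line over the first would erase the stroke pixel with value 10 in the shared row, which is why the line bitmaps are merged rather than copied.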

Table 4. Format of offline text data file (*.dgrl)

Item | Type | Length | Instance

File Header
Size of Header | int | 4B | Number of bytes: 36+strlen(illustr)
Format code | ASCII (char*) | 8B | "DGRL"
Illustration | Text | Arbitrary | "#......\0"
Code type | ASCII (char*) | 20B | "ASCII", "GB", etc.
Code length | Short | 2B | 1, 2, 4, etc.
Bits per pixel | Short | 2B | Typically 1 (B/W image), 8 (Gray image)

Image Records (concatenated)
Image height | int | 4B | Height (pixels) of document image
Image width | int | 4B | Width (pixels) of document image
Line number | int | 4B | Number of lines in the image

Line Records (concatenated)
Char number | int | 4B | Number of characters in a line
Label (code) | Code type | Code length * Char number | Each byte is 0xff (-1) for garbage
Top-left coordinates | int | 4B + 4B | (top, left) of a line
Height (H) | int | 4B | Height (pixels) of a line
Width (W) | int | 4B | Width (pixels) of a line
Bitmap | BYTE | H*((W + 7) / 8) or H*W | Binary or gray image
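The fields in Table 4 can be parsed in the order listed. The Python sketch below is not the DGRLRead.cpp code referenced on this page. It assumes little-endian integers, an illustration field of exactly (header size - 36) bytes, and a single image record per file. The name `read_dgrl` and the dictionary keys are hypothetical.

```python
import io
import struct


def _read_exact(stream, n):
    """Read exactly n bytes or raise, so truncated files fail loudly."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of data")
    return data


def read_dgrl(stream):
    """Parse one *.dgrl page following Table 4 (assumed little-endian)."""
    (header_size,) = struct.unpack("<i", _read_exact(stream, 4))
    format_code = _read_exact(stream, 8).rstrip(b"\x00")
    illustration = _read_exact(stream, header_size - 36)
    code_type = _read_exact(stream, 20).rstrip(b"\x00")
    code_length, bits_per_pixel = struct.unpack("<hh", _read_exact(stream, 4))
    page_h, page_w, line_count = struct.unpack("<iii", _read_exact(stream, 12))

    lines = []
    for _ in range(line_count):
        (char_count,) = struct.unpack("<i", _read_exact(stream, 4))
        raw = _read_exact(stream, code_length * char_count)
        # One label per character; a label whose bytes are all 0xff is garbage.
        labels = [raw[i:i + code_length] for i in range(0, len(raw), code_length)]
        top, left, height, width = struct.unpack("<iiii", _read_exact(stream, 16))
        row_bytes = (width + 7) // 8 if bits_per_pixel == 1 else width
        bitmap = _read_exact(stream, height * row_bytes)
        lines.append({"labels": labels, "top": top, "left": left,
                      "height": height, "width": width, "bitmap": bitmap})

    return {"format_code": format_code, "illustration": illustration,
            "code_type": code_type, "code_length": code_length,
            "bits_per_pixel": bits_per_pixel, "height": page_h,
            "width": page_w, "lines": lines}
```

For a real file, open it in binary mode and pass the file object to `read_dgrl`; each returned line carries the (top, left) offset and size needed to place its bitmap on the page.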


 DGRL page images can be viewed using software developed by us (download the software). Example C++ code for reading images from a *.dgrl file is given in DGRLRead.cpp.pdf.
 The text line data (page images annotated in text lines) of CASIA-OLHWDB2.0-2.2 can be downloaded here.