Offline Database
For offline data collection, the handwritten pages were scanned at a resolution of 300 DPI to obtain color images, which were segmented and labeled using annotation tools. After annotation, the database images have background pixels labeled as 255 and foreground pixels in 255 gray levels (0-254). Binary images can therefore be obtained by simply setting the foreground pixels to 1 and the background pixels to 0.
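The binarization rule above can be sketched in a few lines of Python. This is only an illustration: the function name `binarize` and the tiny sample image are hypothetical, and the image is assumed to be a flat sequence of bytes stored row by row, one byte per pixel.

```python
def binarize(gray: bytes) -> bytes:
    """Map background (255) to 0 and every foreground level (0-254) to 1."""
    return bytes(0 if pixel == 255 else 1 for pixel in gray)


# Hypothetical 2x3 gray image, stored row by row.
sample = bytes([255, 0, 128,
                254, 255, 255])
binary = binarize(sample)
```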
Offline data examples
(a) Isolated character samples
(b) Handwritten text sample
CASIA-HWDB1.0-1.2
There are three datasets of isolated characters in the offline handwriting database. The statistics of these datasets are shown in Table 1. The datasets include 1,020 files, and each file (*.gnt) stores concatenated gray-scale character images of one writer. The file format of *.gnt is specified in Table 2.
Table 1. Statistics of offline isolated character datasets
Dataset | #writers | #character samples (total) | #character samples (symbol) | #character samples (Chinese)/#class |
HWDB1.0 | 420 | 1,680,258 | 71,122 | 1,609,136/3,866 |
HWDB1.1 | 300 | 1,172,907 | 51,158 | 1,121,749/3,755 |
HWDB1.2 | 300 | 1,041,970 | 50,981 | 990,989/3,319 |
Total | 1,020 | 3,895,135 | 173,261 | 3,721,874/7,185 |
HWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols.
Among the 3,866 Chinese characters, 3,740 are in the GB2312-80
level-1 set (which contains 3,755 characters in total).
HWDB1.1 includes the 3,755 GB2312-80 level-1 Chinese characters and 171
alphanumeric characters and symbols.
HWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols.
The set of Chinese characters in HWDB1.2 (3,319 classes) is disjoint from
that of HWDB1.0.
HWDB1.0 and HWDB1.2 together include 7,185 Chinese characters
(7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.
Table 2. Format of offline isolated character data file (*.gnt)
Item | Type | Length | Instance | Comment |
Sample size | unsigned int | 4B | | Number of bytes for one sample (byte count to next sample) |
Tag code (GB) | char | 2B | "啊"=0xb0a1 | Stored as 0xa1b0 |
Width | unsigned short | 2B | | Number of pixels in a row |
Height | unsigned short | 2B | | Number of rows |
Bitmap | unsigned char | Width*Height bytes | | Stored row by row |
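As a rough illustration of the layout in Table 2, the following Python sketch walks through the samples of a *.gnt stream. It is not the authors' GntRead.cpp. It assumes little-endian integers and that the two tag bytes appear on disk in GB byte order (which is consistent with 0xb0a1 being read back as 0xa1b0 as a little-endian 16-bit value). The function name `read_gnt`, the null-byte stripping for one-byte symbol codes, and the synthetic sample are assumptions made for this example.

```python
import io
import struct


def read_gnt(stream):
    """Yield (label, width, height, bitmap) for each sample in a *.gnt stream."""
    while True:
        header = stream.read(10)
        if len(header) < 10:
            break  # end of data
        # Assumption: little-endian fields; tag bytes are kept in on-disk order.
        sample_size, tag, width, height = struct.unpack("<I2sHH", header)
        bitmap = stream.read(width * height)
        if len(bitmap) != width * height:
            raise ValueError("truncated bitmap")
        # Sample size is the byte count to the next sample; skip any extra bytes.
        extra = sample_size - 10 - width * height
        if extra > 0:
            stream.read(extra)
        # Assumption: one-byte symbol codes are padded with a null byte.
        label = tag.rstrip(b"\x00").decode("gb2312", errors="replace")
        yield label, width, height, bitmap


# Synthetic single-sample stream: a 3x2 bitmap tagged with GB code 0xb0a1.
demo = (struct.pack("<I", 10 + 6) + b"\xb0\xa1"
        + struct.pack("<HH", 3, 2) + bytes(range(6)))
samples = list(read_gnt(io.BytesIO(demo)))
```

With a real file, the same generator could be fed `open(path, "rb")`, where `path` points to one writer's *.gnt file.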
Here are three example files (download the
files), one for each dataset; you can view them using this
software (download the
software)
developed by us. Example C++ code for reading images from a *.gnt file is
given in GntRead.cpp.pdf.
The full datasets of CASIA-HWDB1.0-1.2 can be downloaded
here.
CASIA-HWDB2.0-2.2
The offline text databases were produced by the same writers as the isolated character datasets. Each person wrote five pages of given texts. One writer (no. 371) and four pages are missing because of data loss. Each page is stored in a *.dgrl file named after the writer index and page number. In addition to the gray-scale image, the data file also includes the ground truth of text line segmentation and character class labels (in GB codes). The statistics of the datasets and the format of the *.dgrl file are shown in Table 3 and Table 4, respectively.
Table 3. Statistics of offline handwritten text datasets
Dataset | #writers | #pages | #lines | #character/#class | #out-of-class sample |
HWDB2.0 | 419 | 2,092 | 20,495 | 538,868/1,222 | 1,106 |
HWDB2.1 | 300 | 1,500 | 17,292 | 429,553/2,310 | 172 |
HWDB2.2 | 300 | 1,499 | 14,443 | 380,993/1,331 | 581 |
Total | 1,019 | 5,091 | 52,230 | 1,349,414/2,703 | 1,859 |
Out-of-class samples are samples outside the 7,356 classes (all classes in HWDB1.0-1.2).
A DGRL (*.dgrl) file stores one page of a document image. The image has its background eliminated (encoded as 255) and its foreground (text strokes) encoded in gray levels 0-254, one byte per pixel. Each page is stored as a series of lines. Each line has a header giving the number of characters, the sequence of character codes (GBK), the top-left position, and the line height and width, followed by the bitmap block (height*width bytes).
When concatenating the lines into a page image, note that the bounding boxes of different lines may overlap, because the text strokes of different lines can overlap along the vertical axis. To restore the page image, the foreground pixels of the different lines should therefore be combined.
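One possible way to combine overlapping lines is to start from an all-background page and keep the darker pixel wherever line rectangles overlap. The sketch below illustrates that idea. The function name `compose_page`, the tuple layout, and the use of a per-pixel minimum are assumptions for this example, not something the format prescribes.

```python
def compose_page(page_h, page_w, lines):
    """Paste 8-bit gray line bitmaps onto a blank page.

    lines: iterable of (top, left, height, width, bitmap), where bitmap is
    stored row by row, one byte per pixel, with background = 255.
    """
    page = bytearray([255]) * (page_h * page_w)
    for top, left, h, w, bitmap in lines:
        for r in range(h):
            for c in range(w):
                y, x = top + r, left + c
                if 0 <= y < page_h and 0 <= x < page_w:
                    i = y * page_w + x
                    # Keep the darker value so that the background of one line
                    # never erases the strokes of an overlapping line.
                    page[i] = min(page[i], bitmap[r * w + c])
    return bytes(page)


# Two hypothetical 2x2 lines whose rectangles share the page row y = 1.
line_a = (0, 0, 2, 2, bytes([0, 255, 7, 5]))
line_b = (1, 0, 2, 2, bytes([255, 10, 255, 20]))
page = compose_page(3, 4, [line_a, line_b])
```

Simply copying each rectangle in turn would overwrite the strokes of the earlier line with the background of the later one, which is why the pixels are merged instead.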
Table 4. Format of offline text data file (*.dgrl)
Item | Type | Length | Instance |
File Header | | | |
Size of Header | int | 4B | Number of bytes: 36+strlen(illustr) |
Format code | ASCII (char*) | 8B | "DGRL" |
Illustration | Text | Arbitrary | "#......\0" |
Code type | ASCII (char*) | 20B | "ASCII", "GB", etc. |
Code length | Short | 2B | 1, 2, 4, etc. |
Bits per pixel | Short | 2B | Typically 1 (B/W image), 8 (Gray image) |
Image Records (concatenated) | | | |
Image height | int | 4B | Height (pixels) of document image |
Image width | int | 4B | Width (pixels) of document image |
Line number | int | 4B | Number of lines in the image |
Line Records (concatenated) | | | |
Char number | int | 4B | Number of characters in a line |
Label (code) | Code type | Code length * Char number | Each byte is 0xff (-1) for garbage |
Top-left coordinates | int | 4B + 4B | (top, left) of a line |
Height (H) | int | 4B | Height (pixels) of a line |
Width (W) | int | 4B | Width (pixels) of a line |
Bitmap | BYTE | H*((W + 7) / 8) or H*W | Binary or gray image |
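The field order in Table 4 can be read with a short Python routine such as the one below. This is a minimal sketch, not the authors' DGRLRead.cpp. It assumes little-endian integers and null-padded ASCII fields, and it derives the illustration length from the header size (header size minus 36). The function name `read_dgrl` and the dictionary keys are hypothetical.

```python
import io
import struct


def read_dgrl(stream):
    """Parse one *.dgrl page from a binary stream into (meta, lines)."""
    # Assumption: all integer fields are little-endian.
    (header_size,) = struct.unpack("<i", stream.read(4))
    rest = stream.read(header_size - 4)
    illus_len = header_size - 36  # header size = 36 + illustration length
    format_code = rest[:8].split(b"\x00")[0].decode("ascii", "replace")
    pos = 8 + illus_len
    code_type = rest[pos:pos + 20].split(b"\x00")[0].decode("ascii", "replace")
    code_len, bits_per_pixel = struct.unpack("<hh", rest[pos + 20:pos + 24])

    height, width, n_lines = struct.unpack("<iii", stream.read(12))
    meta = {"format": format_code, "code_type": code_type,
            "code_len": code_len, "bits_per_pixel": bits_per_pixel,
            "height": height, "width": width}

    lines = []
    for _ in range(n_lines):
        (n_chars,) = struct.unpack("<i", stream.read(4))
        raw = stream.read(code_len * n_chars)
        # One raw code per character; all-0xff codes mark garbage.
        codes = [raw[i:i + code_len] for i in range(0, len(raw), code_len)]
        top, left, h, w = struct.unpack("<iiii", stream.read(16))
        # Binary images pack 8 pixels per byte; gray images use 1 byte per pixel.
        row_bytes = (w + 7) // 8 if bits_per_pixel == 1 else w
        bitmap = stream.read(h * row_bytes)
        lines.append({"codes": codes, "top": top, "left": left,
                      "height": h, "width": w, "bitmap": bitmap})
    return meta, lines


# Synthetic one-line page used only to exercise the parser.
illus = b"#demo\x00"
demo = (struct.pack("<i", 36 + len(illus)) + b"DGRL".ljust(8, b"\x00") + illus
        + b"GB".ljust(20, b"\x00") + struct.pack("<hh", 2, 8)
        + struct.pack("<iii", 10, 12, 1)
        + struct.pack("<i", 1) + b"\xb0\xa1"
        + struct.pack("<iiii", 1, 2, 2, 3) + bytes(range(6)))
meta, lines = read_dgrl(io.BytesIO(demo))
```

The character codes are returned as raw bytes rather than decoded text, since the code type and code length are declared per file.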
DGRL page images can be viewed using this software (download the
software) developed by us. Example C++
code for reading images from a *.dgrl file is given in
DGRLRead.cpp.pdf.
The text line data (page images annotated in text lines) of CASIA-HWDB2.0-2.2
can be downloaded here.