CASIA Online and Offline Chinese Handwriting Databases

Ground-Truthing Text Lines and Characters tool (GTLC)

GTLC is a practical annotation tool for unconstrained off-line handwritten Chinese documents. Whereas most previous work aligns word boundaries, GTLC aligns characters within text lines without word segmentation, because Chinese text has no spaces between words. We have validated the effectiveness of the tool and have applied it to annotate the document images of the HIT-HW database.

How to use the annotation tool (GTLC):

1) Download the GTLC.zip file, and unpack it to your computer;
GTLC includes an executable file (GTLC.exe) and a dictionary file (char-dic.csp). The dictionary file must be stored in the subdirectory "classifier", next to the executable. (If GTLC.exe is stored as D:\GTLC.exe, then char-dic.csp must be stored as D:\classifier\char-dic.csp.) Several example document images and their ground-truth files are included in the subdirectory "img" of GTLC.zip.
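
The layout described above can be checked before starting the tool. The following is a minimal sketch, not part of GTLC: the function name check_gtlc_layout is hypothetical, and the subdirectory name is a parameter that defaults to the name given in the text.

```python
from pathlib import Path


def check_gtlc_layout(install_dir, subdir="classifier"):
    """Return a list of problems found in an unpacked GTLC folder.

    An empty list means GTLC.exe and <subdir>/char-dic.csp were both found.
    This helper is illustrative only; it is not shipped with GTLC.
    """
    root = Path(install_dir)
    problems = []
    if not (root / "GTLC.exe").is_file():
        problems.append("missing GTLC.exe")
    if not (root / subdir / "char-dic.csp").is_file():
        problems.append(f"missing {subdir}/char-dic.csp")
    return problems
```

For example, check_gtlc_layout("D:/") returns an empty list when both files are where the tool expects them.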

2) Prepare the Ground-truth file (*.txt);
In the ground-truth file, every line of characters (ending with a line break, i.e., the ENTER key) corresponds to one text line in the document image. An example is shown in Fig. 1.

(a) Document image file

(b) Ground-truth file (text)

Fig. 1. Input files of GTLC

The document image and the ground-truth file must have the same base file name (with different extensions), e.g., gtlc_test.bmp and gtlc_test.txt.
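
The naming rule and the one-line-per-text-line convention can be sketched as follows. This is an illustrative helper, not part of GTLC: both function names are hypothetical, and the text encoding is an assumption (the page does not say which encoding GTLC expects), so it is left as a parameter.

```python
from pathlib import Path


def ground_truth_path(image_path):
    """Return the ground-truth path that matches an image path.

    Same folder and base name, with the extension replaced by .txt.
    """
    return Path(image_path).with_suffix(".txt")


def read_ground_truth(txt_path, encoding="utf-8"):
    """Return the ground-truth text as a list, one entry per text line.

    The default encoding is an assumption; pass the encoding your
    ground-truth files actually use.
    """
    return Path(txt_path).read_text(encoding=encoding).splitlines()
```

The length of the list returned by read_ground_truth should equal the number of text lines in the document image.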

3) Open the input document image file (*.bmp) that is to be annotated;
GTLC V1.0 can only process bi-level (binary, 1-bit) and 8-bit gray-level bitmap (BMP) image files.
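
Whether a file meets this restriction can be checked by reading the bit depth from the BMP header. The sketch below is not part of GTLC and its function names are hypothetical. It assumes the common BITMAPINFOHEADER layout (DIB header of at least 40 bytes), where the bits-per-pixel field is a 16-bit little-endian value at byte offset 28. It checks only the bit depth, so an 8-bit file with a color palette would also pass.

```python
import struct


def bmp_bits_per_pixel(path):
    """Return the bits-per-pixel value stored in a BMP file header."""
    with open(path, "rb") as f:
        header = f.read(30)
    if len(header) < 30 or header[:2] != b"BM":
        raise ValueError("not a BMP file")
    (dib_size,) = struct.unpack_from("<I", header, 14)
    if dib_size < 40:
        raise ValueError("unsupported (pre-BITMAPINFOHEADER) BMP header")
    (bits,) = struct.unpack_from("<H", header, 28)
    return bits


def has_supported_depth(path):
    """True if the file is a 1-bit or 8-bit BMP, as GTLC V1.0 requires."""
    return bmp_bits_per_pixel(path) in (1, 8)
```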

4) Extract the characters (binarize the document image);
This step is not necessary for a bi-level image.
For a gray-level image, binarize it by clicking the menu item Text extraction or pressing Ctrl+E.

Fig. 2. Command for gray-level image binarization.
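
For readers unfamiliar with binarization, the sketch below shows one widely used global method, Otsu's threshold. This page does not state which method GTLC uses, so the code is a generic illustration only, not the tool's algorithm. The function names are hypothetical, and the image is assumed to be a list of rows of gray values in 0..255 with dark ink on a light background.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Map a gray image to 0/1, with 1 for dark (ink) pixels."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]
```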

5) Segment the text lines;
Segment the text lines by clicking the menu item Textline segmentation or pressing Ctrl+G.

Fig. 3. Segment the text lines.

After text line segmentation, the connected components (CCs) are linked by lines into text lines. Mis-segmented CCs can be corrected manually. For example, Fig. 4(a) shows some CCs that were assigned to the wrong text line. To correct them, draw a box around the CCs (Ctrl + left mouse button), then click in the region of the text line they belong to; the CCs are merged into that line (merged line in Fig. 4(b)).

(a) Select the mis-segmented CCs

(b) The corrected text lines

Fig. 4. Correct mis-segmented CCs.
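
The box-and-click correction can be pictured as reassigning every CC inside the drawn box to the clicked text line. The sketch below is a hypothetical data model for that idea, not GTLC's internal representation: the Component class, the bounding-box convention (x0, y0, x1, y1), and the rule that a CC must lie fully inside the box are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A connected component: bounding box plus the text line it belongs to."""
    x0: int
    y0: int
    x1: int
    y1: int
    line: int


def inside(cc, box):
    """True if the component's bounding box lies entirely within box."""
    bx0, by0, bx1, by1 = box
    return bx0 <= cc.x0 and by0 <= cc.y0 and cc.x1 <= bx1 and cc.y1 <= by1


def move_selected(components, box, target_line):
    """Assign every component inside box to target_line; return how many moved."""
    moved = 0
    for cc in components:
        if inside(cc, box) and cc.line != target_line:
            cc.line = target_line
            moved += 1
    return moved
```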

6) Align the characters;
After text line segmentation, segment the characters in each text line and align them with the ground truth by clicking the menu item Character alignment or pressing Ctrl+A (Fig. 5).

Fig. 5. Align the characters.

Mis-segmented or mis-aligned characters can be corrected manually. To correct one, draw a box (red box in Fig. 6(b)) around the CCs that belong to a character; the boxed character and its neighboring characters are then adjusted automatically. Fig. 6 shows an example.

(a) An example of alignment error

(b) Draw a black box to embrace the CCs

(c) Corrected alignment

Fig. 6. Correct character alignment errors.

If some characters are still touching (under-segmented) after automatic alignment, they can be split manually. To do this, draw a line (black line in Fig. 7(b)) across the CC to be split (Shift + left mouse button draws the line).

(a) An example of touching characters

(b) Draw a line to cut the touching CCs

(c) The corrected touching character

Fig. 7. Split touching characters manually.