Online Database
For handwriting data collection with the Anoto pen, all template pages were printed on paper bearing the dot pattern. On the printed template pages, each isolated character was written in the space below the pre-printed character, and each text was written on a separate page with the template text printed in the upper part of the page. During writing, the online data (stroke trajectories, i.e. sequences of (x, y) coordinates) were recorded by the Anoto pen and later transferred to computers.
Online data examples
(a) Isolated character samples
(b) Handwritten text sample
CASIA-OLHWDB1.0-1.2
There are three datasets of isolated characters in the online database. The statistics of these datasets are shown in Table 1. The datasets comprise 1,020 files, and each file (*.pot) stores the character samples written by one person. The *.pot file format is specified in Table 2.
Table 1. Statistics of online isolated character datasets
Dataset | #writers | #character samples (total) | #symbol samples | #Chinese samples/#classes |
OLHWDB1.0 | 420 | 1,694,741 | 71,806 | 1,622,935/3,866 |
OLHWDB1.1 | 300 | 1,174,364 | 51,232 | 1,123,132/3,755 |
OLHWDB1.2 | 300 | 1,042,912 | 51,181 | 991,731/3,319 |
Total | 1,020 | 3,912,017 | 174,219 | 3,737,798/7,185 |
OLHWDB1.0 includes 3,866 Chinese characters and 171 alphanumeric characters and symbols. Among the 3,866 Chinese characters, 3,740 are in the GB2312-80 level-1 set (which contains 3,755 characters in total).
OLHWDB1.1 includes the 3,755 GB2312-80 level-1 Chinese characters and 171 alphanumeric characters and symbols.
OLHWDB1.2 includes 3,319 Chinese characters and 171 alphanumeric characters and symbols. The set of Chinese characters in OLHWDB1.2 (3,319 classes) is disjoint from that of OLHWDB1.0.
OLHWDB1.0 and OLHWDB1.2 together include 7,185 Chinese characters (7,185 = 3,866 + 3,319), which cover all 6,763 Chinese characters in GB2312.
Table 2. Format of online isolated character data file (*.pot)
Item | Type | Length | Instance | Comment |
Sample size | unsigned short | 2B | | Number of bytes for one sample (byte count to next sample) |
Tag code (GB) | DWORD | 4B | "啊"=0x0000b0a1, stored as 0xa1b00000 | Only two bytes (GB2312 or GBK) are meaningful |
Stroke number | unsigned short | 2B | | Number of strokes in a sample |
Strokes (concatenated). Each stroke is a point sequence from pen-down to pen-lift | ||||
Coordinates (x, y) (concatenated) | short | 2B+2B | | All values less than 32768 |
Stroke end (-1, 0) | signed short | 2B+2B | | |
Character end tag | ||||
Character end (-1, -1) | signed short | 2B+2B | | |
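The layout in Table 2 can be read with a short parser. The following Python sketch is illustrative only and is not the viewer software mentioned on this page. It assumes little-endian byte order (suggested by the "啊" instance, where 0x0000b0a1 is stored as the bytes a1 b0 00 00), assumes that the sample size counts its own 2 bytes, and assumes that single-byte symbols occupy only the low byte of the tag code. The names `decode_tag` and `read_pot` are hypothetical.

```python
import struct


def decode_tag(raw4):
    """Turn the 4-byte tag code into a character.

    Assumption: the two meaningful bytes are stored low byte first, so
    they are swapped back to GB order; a zero high byte is treated as a
    single-byte symbol.
    """
    gb = bytes([raw4[1], raw4[0]]).lstrip(b"\x00")
    return gb.decode("gbk", errors="replace")


def read_pot(data):
    """Yield (label, strokes) for each sample in the bytes of one *.pot file.

    Each stroke is a list of (x, y) tuples.
    """
    pos = 0
    while pos + 8 <= len(data):
        (size,) = struct.unpack_from("<H", data, pos)
        if size < 8 or pos + size > len(data):
            raise ValueError("truncated or corrupt sample at byte %d" % pos)
        label = decode_tag(data[pos + 2:pos + 6])
        (n_strokes,) = struct.unpack_from("<H", data, pos + 6)
        # Everything after the 8-byte sample header is signed 16-bit values.
        n_values = (size - 8) // 2
        values = struct.unpack_from("<%dh" % n_values, data, pos + 8)
        strokes, current = [], []
        for x, y in zip(values[0::2], values[1::2]):
            if (x, y) == (-1, -1):   # character end tag
                break
            if (x, y) == (-1, 0):    # stroke end tag
                strokes.append(current)
                current = []
            else:
                current.append((x, y))
        if len(strokes) != n_strokes:
            raise ValueError("stroke count mismatch at byte %d" % pos)
        yield label, strokes
        pos += size
```

A file would then be read with something like `samples = list(read_pot(open(path, "rb").read()))`, where `path` points to one writer's *.pot file.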
Three example files are available (download the files), one for each dataset; they can be viewed with this software (download the software) developed by us.
The full datasets of CASIA-OLHWDB1.0-1.2 can be downloaded here.
CASIA-OLHWDB2.0-2.2
The online handwritten text datasets were produced by the same writers as the isolated character datasets. Each person wrote five pages of given texts. One writer (no. 671) and three pages (two pages of no. 328 and one page of no. 685) are missing because of data loss. Each page is stored in a *.wptt file named after the writer index and page number. In addition to the stroke trajectory data of the page, the data file includes the ground truth of text line segmentation and the character class labels (text line transcripts in GB codes). The statistics of the datasets and the format of the *.wptt file are shown in Table 3 and Table 4, respectively.
Table 3. Statistics of online handwritten text datasets
Dataset | #writers | #pages | #lines | #characters/#classes | #out-of-class samples |
OLHWDB2.0 | 420 | 2,098 | 20,573 | 540,009/1,214 | 1,282 |
OLHWDB2.1 | 300 | 1,500 | 17,282 | 429,083/2,256 | 255 |
OLHWDB2.2 | 299 | 1,494 | 14,365 | 379,812/1,303 | 581 |
Total | 1,019 | 5,092 | 52,220 | 1,348,904/2,655 | 2,088 |
Out-of-class samples are samples that fall outside the 7,356 classes of OLHWDB1.0-1.2 (7,185 Chinese characters plus 171 alphanumeric characters and symbols).
Table 4. Format of online text file (*.wptt)
Item | Type | Length | Instance |
File Header | |||
Size of Header | long int | 4B | Number of bytes |
Format code | ASCII (char*) | 8B | “WPTT” |
Illustration | Text | Arbitrary | “#......\0” |
Code type | ASCII (char*) | 20B | “GB” |
Code length | short int | 2B | 2 |
Data type | ASCII (char*) | 20B | “short” |
Sample length | int | 4B | |
Page index | int | 4B | Corresponding to that in trajectory |
Stroke number | int | 4B | |
Strokes (concatenated) | |||
Point number | short | 2B | |
Points (concatenated) | |||
Coordinates (x, y) (concatenated) | unsigned short | 2B+2B | No PEN_DOWN or PEN_UP markers; all coordinates are multiplied by 10. |
Line number | unsigned short | 2B | |
Lines (concatenated) | |||
Line stroke number | unsigned short | 2B | |
Line stroke index (concatenated) | unsigned short | 2B*lineStrkNum | |
Line char number | unsigned short | 2B | |
Chars (concatenated) | |||
Tag code | Code type | codeLength*lineCharNum | A tag code equal to 0xffff marks an abnormal character. |
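Sample length: 4+4+4+strkNum*[2+strkPtNum*4]+2+lineNum*[2+2*lineStrkNum+2+lineCharNum*codeLength]. Because strkPtNum, lineStrkNum and lineCharNum differ from stroke to stroke and from line to line, the bracketed terms are summed over the strokes and over the lines, respectively.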
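The field order of Table 4 and the sample length formula can be combined into a small reader that checks itself. The Python sketch below is illustrative only and is independent of WPTTRead.cpp. It assumes little-endian byte order and assumes that the header size counts the whole header including its own 4 bytes, so that the code length field sits 22 bytes before the end of the header. Tag codes are returned as raw bytes because their byte order is not stated in the table. The name `parse_wptt` is hypothetical.

```python
import struct


def parse_wptt(data):
    """Parse the bytes of one *.wptt page following the field order of Table 4."""
    (header_size,) = struct.unpack_from("<i", data, 0)
    if data[4:8] != b"WPTT":
        raise ValueError("not a WPTT file")
    # Assumption: the header ends with code length (2B) + data type (20B),
    # so the code length is found relative to the end of the header.
    (code_len,) = struct.unpack_from("<h", data, header_size - 22)

    pos = header_size
    sample_len, page_index, n_strokes = struct.unpack_from("<iii", data, pos)
    pos += 12

    strokes = []
    for _ in range(n_strokes):
        (n_pts,) = struct.unpack_from("<h", data, pos)
        pos += 2
        xy = struct.unpack_from("<%dH" % (2 * n_pts), data, pos)
        pos += 4 * n_pts
        # Stored values are the pen coordinates multiplied by 10.
        strokes.append(list(zip(xy[0::2], xy[1::2])))

    (n_lines,) = struct.unpack_from("<H", data, pos)
    pos += 2
    lines = []
    for _ in range(n_lines):
        (n_line_strokes,) = struct.unpack_from("<H", data, pos)
        pos += 2
        indices = list(struct.unpack_from("<%dH" % n_line_strokes, data, pos))
        pos += 2 * n_line_strokes
        (n_chars,) = struct.unpack_from("<H", data, pos)
        pos += 2
        # Raw tag codes; b"\xff\xff" marks an abnormal character.
        codes = [data[pos + i * code_len:pos + (i + 1) * code_len]
                 for i in range(n_chars)]
        pos += n_chars * code_len
        lines.append({"stroke_indices": indices, "codes": codes})

    # The bytes consumed must equal the sample length given by the formula.
    if pos - header_size != sample_len:
        raise ValueError("sample length mismatch")
    return {"page_index": page_index, "strokes": strokes, "lines": lines}
```

Dividing each stored coordinate by 10 recovers the original pen coordinates, and the final check fails loudly if the assumptions about the header layout do not hold for a given file.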
WPTT handwritten pages can be viewed using this software (download the software) developed by us. Example C++ code for reading data from a *.wptt file is given in WPTTRead.cpp.pdf.
The full datasets of CASIA-OLHWDB2.0-2.2 can be downloaded here.