Offline recognition and transcription of handwritten text is a difficult problem for several reasons. First, the starting point of the process is an image of a digitized document, and digitization can involve different devices, qualities, lighting conditions, etc. Second, the paper on which the letter was written can carry multiple artifacts, such as stamps, stickers, drawings, and company logos. In addition, handwriting varies greatly between individuals, and interpreting handwritten characters and words usually requires correctly interpreting the surrounding text.

The following sections present an overview of the process and techniques used for developing a handwritten letter recognition system for the German language.

Handwritten text data

Open-source handwritten German text datasets that can be used to develop commercial applications are very limited. Crowdsourcing handwritten texts using a fixed text might seem like a good first approximation of a dataset for handwriting recognition. For instance, a predefined text can be designed to contain words relevant to the application domain and the most common character combinations of the German language. Although this simplifies the labeling process, it also restricts the variability of the words in the dataset, which limits the generalization capability of the model. It is therefore preferable to use a large number of documents with unique texts instead of a fixed text, even though this creates the need to label every document with its corresponding ground truth. In addition, the training data can be enriched with synthetic samples of text, created for example from character images of the EMNIST dataset [1] or rendered with different fonts using fontTools [2].
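One simple way to build such synthetic samples is to concatenate per-character glyph images (such as 28x28 EMNIST crops) with random spacing. The sketch below assumes a `glyphs` dictionary mapping characters to 28x28 arrays; in practice this would be filled with real EMNIST crops or rendered font glyphs.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_word(word, glyphs, max_pad=4):
    """Stack 28x28 character glyphs (e.g. EMNIST crops) side by side
    with random horizontal padding to form a synthetic word image."""
    parts = []
    for ch in word:
        parts.append(glyphs[ch])                                    # the glyph itself
        parts.append(np.zeros((28, rng.integers(1, max_pad + 1))))  # random gap
    return np.hstack(parts[:-1])  # drop the trailing gap
```

Randomizing the spacing (and, in a fuller version, slant, thickness, and baseline jitter) helps the synthetic words cover more of the variability seen in real handwriting.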

Image Preprocessing

Before segmenting the text into lines and words, it is sensible to remove artifacts in the image that are irrelevant to word and character recognition and that can hinder the line and word segmentation algorithms. For instance, stamps are very common in handwritten letters sent to an office. Some have a fixed location and color and can be easily removed, while others may have a random location, orientation, and color. For the latter, contour detection and removal techniques are available in OpenCV [3]. If the use case involves diverse types of paper, a gridline-removal routine might also be necessary to remove background gridlines, as well as dense artifacts corresponding to images, drawings, and other decorations that might disrupt the line and word segmentation algorithms.
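As a crude illustration of the fixed-color case, a stamp of a known color (say, red ink) can be masked with simple channel thresholding in plain NumPy. The margin value below is an assumption that would need tuning; randomly placed stamps would instead call for the OpenCV contour-based methods mentioned above.

```python
import numpy as np

def remove_red_stamp(img_rgb, margin=60):
    """Paint pixels whose red channel dominates green and blue
    paper-white -- a crude heuristic for a red ink stamp.
    The margin of 60 is an assumed value, not a tuned one."""
    r = img_rgb[..., 0].astype(int)  # cast avoids uint8 overflow in the subtraction
    g = img_rgb[..., 1].astype(int)
    b = img_rgb[..., 2].astype(int)
    mask = (r - g > margin) & (r - b > margin)
    cleaned = img_rgb.copy()
    cleaned[mask] = 255
    return cleaned
```

Black ink stays untouched because its channels are all equally low, so the red-dominance condition never fires on the handwriting itself.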

The previous methods are purely image-processing based; more sophisticated methods based on machine learning models, which offer more flexibility and robustness, should therefore also be assessed.

Line and word segmentation

The handwriting recognition system discussed here follows the modular approach shown in Figure 1. The image preprocessing module segments the input image into lines of text, each containing a set of words that must be detected and segmented once more before being transcribed by the image recognition module. The word recognition module then improves the word predictions with post-correction techniques, resulting in a transcribed document.

Figure 1. Flowchart of a handwritten text recognizing pipeline.

The neural-network-based handwriting recognition model requires its inputs to be images of words; the same architecture can be extended to recognize whole lines of text. This creates the need for specific pipelines to detect and extract lines and words. In neatly written texts with uniform line separation, projection profile techniques [4] can be used for line segmentation, but projecting the pixel values of the binarized image onto the vertical axis gives poor results when the text lines are not parallel. Blob analysis [5] merges the characters of a word into a single blob-like entity by applying a Gaussian filter to the image. The same concept can be extended to generate blobs of lines within a page using a dynamic Gaussian filter (Figure 2). Once the line blobs are generated, enclosing boxes can be created and postprocessed to resolve affiliation conflicts for components that belong to two boxes at the same time. Blob analysis is highly reliable for line detection, but its results are affected by image artifacts that extend across several lines; it is therefore highly dependent on the success of the image preprocessing stage.

Figure 2. Line blobs for line segmentation on a sample handwritten letter.
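The line-blob idea can be sketched with a separable, anisotropic Gaussian blur in plain NumPy. This is a simplified stand-in for the dynamic Gaussian filter: the kernel is wide in x so characters of a line merge into one blob, and line bands are read off the row profile instead of doing full connected-component labeling and box postprocessing.

```python
import numpy as np

def _gauss1d(sigma):
    # normalized 1-D Gaussian kernel, truncated at 3 sigma
    x = np.arange(-int(3 * sigma), int(3 * sigma) + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def line_bands(ink, sigma_x=15.0, sigma_y=2.0, rel_thresh=0.1):
    """Blur the binary ink map with a wide-in-x, narrow-in-y Gaussian
    so the characters of each line merge into one blob, then read the
    line bands off the row-wise maximum of the blurred image."""
    kx, ky = _gauss1d(sigma_x), _gauss1d(sigma_y)
    blobs = np.apply_along_axis(lambda r: np.convolve(r, kx, mode="same"), 1,
                                ink.astype(float))
    blobs = np.apply_along_axis(lambda c: np.convolve(c, ky, mode="same"), 0, blobs)
    active = blobs.max(axis=1) > rel_thresh * blobs.max()
    bands, start = [], None
    for y, a in enumerate(active):  # collect runs of active rows
        if a and start is None:
            start = y
        elif not a and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(active)))
    return bands
```

The sigma values and relative threshold are assumptions; in practice they would be adapted to the resolution and line spacing of the scanned documents, which is what makes the filter "dynamic".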

Once line detection is performed, word segmentation is the next sensible step. Blob analysis with a specific word-like Gaussian filter seems like a reasonable approach. Nevertheless, variability in the intra-word and inter-word distances requires additional steps to separate words correctly. For instance, the intra-word/inter-word distances in each document can be used to merge or split the detected blobs. If labeled data of whole lines exists, however, a character recognition model (see next section) trained to predict the characters in the image of a line can be used to detect the location of whitespaces with higher confidence (Figure 3). Both of these methods rely on line extraction being successful, so image preprocessing, line detection, and word segmentation form a dependency chain that ideally would be avoided.

Figure 3. Ground truth and prediction of a model trained on lines.
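The intra-word/inter-word distance heuristic can be sketched as a gap-merging step over the ink runs of a segmented line. The gap threshold below is an assumed constant; in the scenario described above it would be estimated per document from the observed gap distribution.

```python
import numpy as np

def word_spans(line_ink, gap_thresh):
    """Split a segmented text line into word spans: find runs of ink
    columns, then merge runs whose gap is below gap_thresh (treated
    as intra-word spacing)."""
    cols = line_ink.sum(axis=0) > 0  # columns containing any ink
    runs, start = [], None
    for x, has_ink in enumerate(cols):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(cols)))
    if not runs:
        return []
    merged = [list(runs[0])]
    for s, e in runs[1:]:
        if s - merged[-1][1] < gap_thresh:   # small gap: same word
            merged[-1][1] = e
        else:                                # large gap: new word
            merged.append([s, e])
    return [tuple(m) for m in merged]
```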

End-to-end techniques for scene text detection are another option that can circumvent the issues of the previous approach. Trainable and out-of-the-box frameworks for text detection exist [6,7], but initial experiments with them did not give better results than the line-blobs technique.

Training a text recognition classifier

A custom-made labeling tool facilitates the labeling of the training images for a word-based deep neural network model (Figure 4). The preprocessed images of the handwritten letters are loaded into the labeling interface, where the line and word segmentation routines allow visualizing the bounding boxes of each word. Once a model is trained with an initial set of labeled data, it can be incorporated into the labeling tool, suggesting probable words for each detected word box and speeding up the labeling process. Further labeling, in turn, leads to better models being incorporated into the labeling tool.

Figure 4. Custom-made labeling tool with segmented lines, detected words and their predicted transcription for an example handwritten letter.

The handwritten text recognition model should receive the image of a single word or a single line as input and map that image to a sequence of characters as output. The standard architecture for handwritten text recognition [8] consists of a few convolutional neural network (CNN) layers, followed by recurrent neural network (RNN) layers and a Connectionist Temporal Classification (CTC) layer [9]. The CTC loss is commonly used in speech and handwriting recognition tasks because it allows comparing the ground truth sequence to the generated transcript without requiring a concrete alignment between the elements of the sequence and the input values. In general, the CTC algorithm [10] receives, for each input step, the probability distribution over the output characters from the previous layer and outputs the probability of an output sequence. For inference, beam search is used to find a probable output sequence for a given input.
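Training the full CNN+RNN+CTC model requires a deep learning framework, but the decoding step can be illustrated on its own. Below is a minimal sketch of best-path (greedy) CTC decoding in NumPy, with a hypothetical character set; beam search, mentioned above, explores multiple high-probability paths instead of only the per-step argmax.

```python
import numpy as np

# Hypothetical character set; by CTC convention, index 0 is the blank symbol.
CHARS = ["<blank>", "h", "a", "l", "o"]

def ctc_best_path(probs):
    """Greedy (best-path) CTC decoding: take the argmax character at
    each time step, collapse repeated characters, then drop blanks.
    probs has shape (time_steps, len(CHARS))."""
    best = probs.argmax(axis=1)
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:  # skip repeats and blanks
            decoded.append(CHARS[idx])
        prev = idx
    return "".join(decoded)
```

Note how the blank symbol lets the model emit the same character twice in a row (as in "ll"): a blank between the two emissions prevents them from being collapsed into one.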

After training a model, the quality of the predictions at inference time can very likely be improved by text correction strategies. For instance, swapping characters for similar-looking characters that are likely to be misinterpreted, and searching a corpus for valid words within a limited edit distance of the predicted word, can significantly improve the transcription quality.
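The corpus-search idea can be sketched as follows, assuming a plain Levenshtein distance and a hypothetical corpus of valid words; a production version would also weight edits by character confusability.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, corpus, max_dist=2):
    """Return the closest corpus word within max_dist edits,
    or the word unchanged if nothing is close enough."""
    best, best_d = word, max_dist + 1
    for cand in corpus:
        d = levenshtein(word, cand)
        if d < best_d:
            best, best_d = cand, d
    return best
```

Leaving out-of-corpus predictions unchanged is important: proper names and reference numbers in a letter will rarely be in the corpus and should not be "corrected" away.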

Performance Metrics

The most common metrics in handwriting recognition are the Character Error Rate (CER) and the Word Error Rate (WER). CER is based on the normalized edit distance, or Levenshtein distance, which is, in broad terms, the number of character edit operations (insertions, deletions, and substitutions) needed to turn a recognized word into the ground truth word, normalized by the length of the ground truth. WER applies the same normalized edit distance at the word level: whole words take the place of characters, and the edit operations are therefore counted over the predicted words.

For evaluating the model's capability to transcribe handwritten documents correctly, a simple word accuracy can be used: the number of correctly predicted words divided by the total number of words in the document, taking capitalization into account and ignoring punctuation. Additional metrics could be implemented to give more weight to words that are especially relevant to the use case.

About the author

Silvia Oviedo has been working at InfAI as a Data Scientist/Software Developer for the past three years. Her background is in electronics engineering, and she is especially interested in signal/image processing and applied machine learning in the areas of fault detection, biomedicine, and energy. Her latest experience includes image classification for handwritten document transcription.

[4] Jaekyu Ha, R. M. Haralick and I. T. Phillips, "Document page decomposition by the bounding-box projection," Proceedings of the 3rd International Conference on Document Analysis and Recognition, 1995, pp. 1119-1122 vol. 2, doi: 10.1109/ICDAR.1995.602115.

[5] R. Manmatha and Nitin Srimal. 1999. “Scale Space Technique for Word Segmentation in Handwritten Documents”. In Proceedings of the Second International Conference on Scale-Space Theories in Computer Vision (SCALE-SPACE ’99). Springer-Verlag, Berlin, Heidelberg, 22–33.

[9] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning (ICML ’06). Association for Computing Machinery, New York, NY, USA, 369–376.