In this GitHub repository, we provide a practical guide showing how to quickly create a dataset and train a custom object detection model for signature detection.
At Hyperlex, we offer software that speeds up identifying and analyzing important information in your contracts, leveraging the shift from physical to digital documents. Most of the methods we use at Hyperlex are related to Natural Language Processing (NLP). In this post, we aim to show how we also leverage state-of-the-art computer vision algorithms to enhance contract analysis and support our NLP pipeline where it falls short.
When using the Hyperlex software, one starts by uploading documents (often scanned beforehand) to our platform, and an optical character recognition (OCR) algorithm extracts text from the images. To support the document processing workflow, one should also leverage visual elements to make sense of non-textual sections such as signatures or initials. As it is a common use case, we propose a pipeline for signature detection on scanned paper documents in the form of a practical guide, using the open-source TensorFlow Object Detection API.
We will use object detection algorithms, which can detect and tag multiple objects in the same image. The outputs are bounding boxes located around the objects of interest. We leverage a convolutional neural network based model (Faster R-CNN) pretrained on the Common Objects in Context (COCO) dataset and fine-tune it on our signature detection task.
Our dataset is composed of a variety of private signed and unsigned contracts provided by our clients. For confidentiality reasons, we cannot disclose this dataset, but a sample dataset of ~40 contracts is available in the GitHub repository.
We randomly sampled ~1000 images from some of our clients' signed and unsigned contracts to constitute our training set. The next step was to put together a "realistic" test set. The signature detection system we are designing should generalize well over a wide distribution, as contracts can vary tremendously depending on the client (NDAs, lease agreements, etc.). To construct our test set, we picked a total of ~200 samples from a population of unseen contracts (documents coming from clients that are not represented in the training set) to generate a "new data" distribution and appropriately assess the robustness of our detector.
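To make this split concrete, here is a minimal sketch of holding out entire clients for the test set; the client names, file names, and counts below are made up for illustration:

```python
import random

# Hypothetical page images grouped by client, so that test clients
# are never seen during training.
pages_by_client = {
    "client_a": ["a_001.png", "a_002.png"],
    "client_b": ["b_001.png"],
    "client_c": ["c_001.png", "c_002.png"],
}

random.seed(0)
clients = list(pages_by_client)
random.shuffle(clients)

# Hold out some clients entirely to form the "new data" distribution.
test_clients = set(clients[:1])
train = [p for c, ps in pages_by_client.items() if c not in test_clients for p in ps]
test = [p for c, ps in pages_by_client.items() if c in test_clients for p in ps]
```

Splitting by client (rather than by page) is what makes the test set measure generalization to unseen document styles.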
The next step is to annotate the contracts. Once again, because of confidentiality constraints, we annotated the contracts locally using VOTT, an open-source tool developed by Microsoft. Other great image annotation tools (e.g. Labelbox, VGG Image Annotator) are available.
VOTT’s project creation interface
Using VOTT is quite intuitive. After creating a new project, you'll need to set a source connection (where to load assets from) and a target connection (where to save the project and exported data), and add the tags or labels you need (the only tag in our case is "signature"). After the project is created and the assets are imported, you can start the manual annotation. Once you're done with the annotation, export the project.
VOTT’s image annotation interface
In the path you set in the target connection, you'll find a "NAME_OF_YOUR_PROJECT-export.json" file containing all your bounding box annotations. This JSON file stores a unique id for each asset ("39469fc1e79e0a3e8235ea772be6dd" in the example below) mapped to a dictionary containing information about the image ("asset" in the example below) and a list of all ground truth bounding boxes ("regions" in the example below), where each element of the list is itself a dictionary storing data about one bounding box.
VOTT’s json output snippet
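To make that structure concrete, here is a minimal sketch of parsing such an export into flat (image, tag, box) records. The field names follow VOTT's export format, but the asset id, file name, and coordinate values are invented for illustration:

```python
import json

# Invented example mirroring VOTT's export layout: each asset id maps to
# image metadata ("asset") and a list of ground truth boxes ("regions").
vott_export = {
    "assets": {
        "39469fc1e79e0a3e8235ea772be6dd": {
            "asset": {"name": "contract_p1.png",
                      "size": {"width": 1240, "height": 1754}},
            "regions": [
                {"tags": ["signature"],
                 "boundingBox": {"left": 100.0, "top": 900.0,
                                 "width": 220.0, "height": 80.0}}
            ],
        }
    }
}

def extract_boxes(export):
    """Yield (image_name, tag, [xmin, ymin, xmax, ymax]) tuples."""
    for asset in export["assets"].values():
        name = asset["asset"]["name"]
        for region in asset.get("regions", []):
            bb = region["boundingBox"]
            box = [bb["left"], bb["top"],
                   bb["left"] + bb["width"], bb["top"] + bb["height"]]
            for tag in region["tags"]:
                yield name, tag, box

boxes = list(extract_boxes(vott_export))
```

In practice you would load the real file with `json.load` and write the records out in whatever format your training pipeline expects (e.g. TFRecord).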
TRAINING A SIGNATURE DETECTOR
We will use the Faster R-CNN model, originally published at NIPS 2015. The authors (Shaoqing Ren et al.) enabled significant advances in the object detection community by designing a single convolutional neural network that learns the region proposals (it decides "where" to look for objects in the image), thus removing the computationally expensive Selective Search algorithm used in previous models such as R-CNN and Fast R-CNN to determine these regions. For a general review of these architectures, I recommend reading this blog post. For a thorough review of Faster R-CNN, I recommend reading this as well as the original paper.
In the README.md of the GitHub repository, which is a modified version of the TensorFlow Object Detection API code (we removed all code not related to object detection and incorporated our own data and inference pipeline), we detail how you can quickly train a signature detector in a few steps without having to deal with the code or TensorFlow.
We train our customized Faster R-CNN model for 20 epochs with a batch size of 8 and a learning rate of 2e-4. All details about the dataset creation, training, and inference process are available here.
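In the Tensorflow Object Detection API, these hyperparameters live in the pipeline configuration file. Purely as an illustration, a heavily truncated fragment might look like the following; the checkpoint path is a placeholder, and many required fields (image resizer, feature extractor, anchor generator, etc.) are omitted here:

```
model {
  faster_rcnn {
    num_classes: 1  # our only class is "signature"
  }
}
train_config {
  batch_size: 8
  optimizer {
    momentum_optimizer {
      learning_rate {
        constant_learning_rate { learning_rate: 0.0002 }  # 2e-4
      }
      momentum_optimizer_value: 0.9
    }
  }
  fine_tune_checkpoint: "path/to/coco_pretrained/model.ckpt"
}
```

The `fine_tune_checkpoint` field is what points training at the COCO-pretrained weights being fine-tuned.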
If you have come across a project involving object detection, you might be familiar with the metric of choice in the community: the mean Average Precision (mAP). Let's start by defining an overlap criterion. We usually compute the Intersection over Union (IoU), which is the ratio of the area of overlap to the area of union between the predicted bounding box and the ground truth bounding box. We then use this ratio to determine whether a predicted bounding box is a:
- True Positive (TP): IoU > 0.5 and correct class detected;
- False Positive (FP): IoU < 0.5, wrong class detected, or duplicated bounding box;
- False Negative (FN): a ground truth box with no matching detection (a missed object).
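The IoU criterion itself is only a few lines of code. A minimal sketch for axis-aligned boxes in [xmin, ymin, xmax, ymax] format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [xmin, ymin, xmax, ymax]."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 10x10 boxes shifted by half their width overlap on 50 of their combined 150 units of area, giving an IoU of 1/3 (below the 0.5 threshold).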
Once these counts are computed (TP, FP, FN), we can compute two other crucial metrics, precision and recall, and plot the Precision-Recall (PR) curve. The Average Precision (AP) is then the area under the interpolated PR curve. Finally, the mAP is the average of the AP calculated over all classes (in our problem, the only class is "signature", so mAP = AP).
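As a sketch of that interpolation step, assuming precision/recall pairs already sorted by increasing recall, the AP can be computed as:

```python
def average_precision(precisions, recalls):
    """Area under the interpolated precision-recall curve.

    `precisions` and `recalls` are parallel lists sorted by increasing recall.
    """
    # Interpolation: replace each precision by the maximum precision at any
    # higher recall, making the curve monotonically non-increasing.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    # Sum rectangle areas between consecutive recall points.
    ap = recalls[0] * interp[0]
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * interp[i]
    return ap
```

In practice the PR points come from sweeping the detector's confidence threshold; evaluation toolkits (e.g. the COCO API) implement variants of this same idea.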
For a more comprehensive understanding of the above, or if you’re not familiar with the notion of mAP, I suggest reading this blog.
Overall, the mAP is the metric of choice (over accuracy and AUC, for instance) as it is model agnostic and insensitive to class imbalance (which is common in object detection datasets).
Overall, we achieve a performance of mAP = 0.894 for an IoU threshold of 0.5, which is the norm.
When setting the IoU threshold to 0.75 (meaning that a prediction is a true positive only if IoU > 0.75), we observed a very significant drop in performance, to mAP = 0.40. However, in practice an IoU threshold of 0.5 is more than enough, as our main goal is to localize where the signature is in the document rather than to produce a pixel-perfect bounding box.
To nuance these promising results, we can note that signatures are relatively large and distinctive elements in contracts and documents of any type. Other tasks, such as detecting initials or handwritten sentences, may be harder, either because the objects are too small (one way to overcome this issue is to rescale the image, at the expense of computational cost) or because they are hardly distinguishable from typed characters.
In this post, we focused on signature detection as it is a useful problem that we encountered in many client projects. However, this work can be extended to any other kind of handwriting or "elements" that could be misinterpreted by an OCR.
Indeed, entity extraction and other tasks remain challenging because of the many variations we can find in a population of documents: poor quality and contrast, dirty or creased documents, fonts, non-Latin characters, etc. The presence of tables and images can also affect the accuracy of the OCR. These errors introduce skipped and misread letters and make analysis and entity extraction harder.
Of course, you can also train a model to detect different types of elements in a document (object detection models are originally built to deal with a large number of classes). Here are some other tasks we tried:
- signatures and initials
- handwritten sentences (full names, dates, comments, etc.)
- filled checkboxes
Signature prediction examples
Signature and other handwriting prediction examples
Author: Khalil Ouardini, Data Scientist Intern at Hyperlex