The OCR match-up: our opinion

This article explores OCR (Optical Character Recognition), a technology for extracting data from images. Tools such as Google Cloud Vision, Amazon Textract and Tesseract are compared. Cloud solutions dominate; Tesseract, although free, may be less effective on handwritten text or low-quality documents.

Matthieu Maso, Data Scientist

What is OCR?

Optical Character Recognition (OCR) technologies are methods of extracting data from unstructured image documents. They belong to the large family of Computer Vision algorithms and are present all around us every day. OCR makes it possible, for example, to read credit cards, bank checks, invoices or expense reports automatically.

 


OCR tools have become very effective and deliver gains in both time and quality. As a result, their use is increasing across all economic sectors.

In this article, we compare three OCR tools that we were able to test during missions at Aqsone: Google Cloud Vision, Amazon Textract and Tesseract.

 

State of the art

Optical Character Recognition is not a recent subject: the first applications appeared at the beginning of the 20th century. Early use cases were aimed, for example, at recognizing Morse code, Braille or typed characters.

Over the years, and with the arrival of the first CPUs, larger-scale projects emerged in areas such as postal services, customs and the military.

In 2005, Hewlett-Packard and the University of Nevada, Las Vegas released Tesseract as an open-source OCR engine, expanding the use of these technologies.

Another milestone is the MNIST database, released in the late 1990s, which brings together 60,000 training images (plus 10,000 test images) of handwritten digits in grayscale. This dataset is widely used in Machine Learning for Computer Vision topics.

Finally, recent years have seen the emergence of Cloud OCR models such as Google Cloud Vision or Amazon Textract.

The current state of the art is mainly composed of:

  • Very specific paid software dedicated to certain tasks (expense reports, mail, technical documents). It is difficult to estimate its performance and to generalize it to other tasks.
  • Open-source solutions such as Tesseract, which remains quite popular and offers the advantage of being completely free while performing relatively well.
  • Cloud OCR services, whose two main players are Google Cloud Vision and Amazon Textract. They offer:
    • better performance on character and language recognition,
    • a variety of complex models,
    • the possibility of training these models on a very large volume of data,
    • the possibility of being coupled with other Natural Language Processing (NLP) building blocks specific to the GCP and AWS environments.

Unless you want a 100% free solution, cloud providers are probably the leaders in the current market.

 

Comparison of OCR solutions

The table below compares the different OCR solutions. The criteria we highlight come from the business use cases we have encountered, and have generally been decisive in choosing one OCR solution over another.

 

Examples of use:

Conclusion

In conclusion, Amazon Textract and Google Cloud Vision offer very similar capabilities and performance. Their prices are also very close: basic OCR functionality costs about $1.50 per 1,000 units, apart from the volume discounts that apply to large numbers of units.
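
To get a feel for what this rate means on a real volume, here is a minimal back-of-the-envelope sketch. Only the $1.50 per 1,000 units figure comes from this article; the function name, the discount threshold and the discounted rate are hypothetical placeholders, not published prices.

```python
def estimate_ocr_cost(units, price_per_1000=1.5,
                      discount_threshold=None, discounted_price_per_1000=None):
    """Estimate the cost in dollars of running basic OCR on `units` units.

    price_per_1000 is the base rate quoted in this article.
    discount_threshold and discounted_price_per_1000 are hypothetical:
    units beyond the threshold are billed at the discounted rate.
    """
    if units < 0:
        raise ValueError("units must be non-negative")
    if discount_threshold is None or discounted_price_per_1000 is None:
        # No volume discount configured: flat rate on every unit.
        return units / 1000 * price_per_1000
    base_units = min(units, discount_threshold)
    extra_units = max(units - discount_threshold, 0)
    return (base_units * price_per_1000
            + extra_units * discounted_price_per_1000) / 1000


if __name__ == "__main__":
    # Flat rate on 10,000 units.
    print(estimate_ocr_cost(10_000))
    # Hypothetical discount: units above 1,000,000 billed at $0.50 per 1,000.
    print(estimate_ocr_cost(2_000_000,
                            discount_threshold=1_000_000,
                            discounted_price_per_1000=0.5))
```

The actual tiers and free quotas should be checked on each provider's pricing page before budgeting a project.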

Google Cloud Vision is particularly easy to use for existing GCP users and has the advantage of integrating with other Google Cloud services. However, its configuration can be complex for newcomers to the platform. Amazon Textract provides a drag-and-drop interface, which makes it even easier to use.
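
For readers who prefer code to a console, the sketch below shows what a basic text-detection call could look like with each provider's Python client (`google-cloud-vision` and `boto3`). It assumes both packages are installed and that credentials are already configured in the environment; the function names are ours, and this is an illustration rather than the exact setup used on our missions.

```python
def ocr_with_google_vision(image_path):
    """Return the full text detected by Google Cloud Vision in one image."""
    # Imported lazily so this file loads even without the client installed.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    annotations = response.text_annotations
    # The first annotation, when present, holds the whole detected text.
    return annotations[0].description if annotations else ""


def textract_lines(response):
    """Extract the text of LINE blocks from a Textract response dictionary."""
    return [block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"]


def ocr_with_textract(image_path):
    """Return the lines of text detected by Amazon Textract in one image."""
    import boto3

    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return textract_lines(response)
```

Keeping the response parsing (`textract_lines`) separate from the network call makes it possible to test the parsing on a saved response without calling the paid API again.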

In terms of performance, Google Cloud Vision (GCV) seems to do better than Textract at detecting handwritten text. On the other hand, GCV falls short on table extraction: the text is detected as ordinary text, not in table form.
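
One way to put numbers behind this kind of comparison is the character error rate (CER): the edit distance between the OCR output and a manually transcribed reference, divided by the length of the reference. This article does not state which metric was used, so the sketch below is only one possible choice, written in plain Python with no dependencies.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance between OCR output and ground truth, per reference char."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the OCR engine read the letter "l" as the digit "1".
    print(character_error_rate("total", "tota1"))
```

Running the same set of reference documents through each engine and comparing the average CER gives a like-for-like view, for example on handwritten versus printed samples.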

Tesseract has the advantage of being free and not requiring any special configuration. It is easy to use, on the command line or via the Python pytesseract library.
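
As an illustration of the command-line route, here is a minimal sketch that wraps the `tesseract` executable from Python. It assumes Tesseract is installed and available on the PATH; the function names and the file name in the usage note are ours.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=None):
    """Build the argument list for: tesseract IMAGE stdout -l LANG [--psm N]."""
    command = ["tesseract", str(image_path), "stdout", "-l", lang]
    if psm is not None:
        # Page segmentation mode, e.g. 6 for a single uniform block of text.
        command += ["--psm", str(psm)]
    return command


def ocr_with_tesseract(image_path, lang="eng", psm=None):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(build_tesseract_command(image_path, lang, psm),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

A call such as `ocr_with_tesseract("scan.png", lang="fra")` would then return the text of a French document, provided the matching language data is installed. With pytesseract and Pillow, the equivalent is `pytesseract.image_to_string(Image.open("scan.png"), lang="fra")`.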

In terms of performance, Tesseract works well on high-resolution data. Its performance deteriorates on low-quality data or handwritten text. Table extraction is possible, but can be complex to implement and is also very sensitive to document quality.
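
To give an idea of why table extraction takes extra work, the sketch below rebuilds rows from word positions. It assumes each word comes as a hypothetical `(text, left, top)` tuple in pixels, which could be derived from the bounding boxes returned by `pytesseract.image_to_data`. It only groups words into rows; detecting columns and merged cells is a further step, and the tolerance value is an arbitrary default.

```python
def group_words_into_rows(words, row_tolerance=10):
    """Group (text, left, top) word boxes into rows sorted left to right.

    A word joins the current row when its top coordinate is within
    row_tolerance pixels of the first word of that row.
    """
    rows = []
    # Visit words from top to bottom, then left to right.
    for text, left, top in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(top - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


if __name__ == "__main__":
    # Hypothetical word boxes from a two-row table, in arbitrary order.
    words = [("Total", 10, 102), ("Item", 10, 50),
             ("12.00", 200, 100), ("Price", 200, 52)]
    for row in group_words_into_rows(words):
        print(row)
```

On a skewed or noisy scan the top coordinates drift along a row, which is one reason this approach is so sensitive to document quality.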

Tesseract remains an interesting solution thanks to its accessibility and its effectiveness on certain specific data formats.

 
