Best OCR in 2019: In-depth data extraction benchmark

What is the difference between data extraction and OCR? How can we determine the best data extraction solution? What is the most accurate data extraction solution? Are there other criteria that could affect a companies’ procurement decision? What are the areas where data extraction solutions fail?

What is the difference between data extraction and OCR?

In short,

OCR turns documents into text which is a form of unstructured data which needs to be processed by humans
Data extraction solutions provide structured data which is machine readable

Therefore, data extraction solutions enable documents to be automatically processed. For more, feel free to read our OCR article where we explain the difference between OCR and data extraction.

‍

How can we determine the best data extraction solution?

Any AI solution can be measured against its competitors by comparing its accuracy against manually labeled data. This approach forms the basis of most PoC projects by large companies. These companies ask several leading vendors to produce predictions based on their data which has been manually labelled. The accuracy of these solutions is an important input to the companies’ procurement decision.

‍

What is the most accurate data extraction solution?

As you can see below, Hypatos was by far the best solution for these documents in terms of both

Number of all entities extracted
Accuracy of all extracted entities

Another important metric is crucial fields. If companies are not interested in discovering the insights in their spending, they can capture just the critical fields necessary to make a payment and record key aspects of the transaction in SAP. Hypatos was again the best solution in terms of both

Number of crucial fields extracted
Accuracy of crucial extracted fields

Summary of benchmarking in terms of critical fields, demonstrating Hypatos leadership

For most clients, crucial fields include:

Invoice Number
Document Type
Invoice Date
Service time
Net Amount
VAT rate
Total VAT amount
Total Gross amount
Currency
Sender Name

‍

Sample used in benchmarking

We used a relatively small set of 10 invoices from Germany in this initial benchmarking exercise. A major limitation on the sample size is that we needed to use documents which may need to be shared publicly. Because we wanted to be able to share the data set with the tech press and potential customers so they could reproduce our results if they want to. Therefore, we relied on invoices that we received and could not use any of our customers’ documents.

‍

Methodology

We could only benchmark Hypatos against other solutions that offered trial products, but we believe we covered all modern data extraction solutions that deal with semi structured documents including offers, orders, invoices, receipts payslips etc. We excluded solutions that focus on a single type of document as we have seen our clients use our services for multiple types of documents and we have not seen demand for document specific solutions from enterprise clients.

Of course, feel free to add a comment here if you think another products should be listed here. Products we used in the benchmarking are:

Textract available on AWS
Sypht: Free account available on sypht.com

‍

Are there other criteria that could affect a companies’ procurement decision?

Accuracy is not the only factor in the decision. Deployment options, ease of integration and advanced processing options are also important metrics and we benchmarked the sample companies against these metrics:

This table is based on public data and we are happy to update if the benchmark companies share more details with us. Continue below for a more detailed explanation of these metrics:

‍

Deployment options

Most European Fortune 500 prefer to have on-prem or private cloud solutions due to their security and data privacy policies. This can become a deciding factor in the procurement.

‍

Ease of integration

All of these solutions provide APIs which are easy to integrate into most applications. However, having existing integrations to enterprise software makes integration even easier.

‍

Advanced processing options

Extraction is the first step, in almost all cases companies do additional manual processing on extracted data. For example, invoices need to be assigned to accounts if they are not matched with a purchase order. In such cases, your service provider’s support is important to further automate the process.

This is not a requirement; companies can also work with software companies to build customized solutions that increase their level of automation. However, in areas such as back-office automation, most companies in the same industry have similar data and data does not confer them a competitive advantage. In such cases, companies should strive to get the best solution at the best terms and only companies with experience in the topic can offer such terms.

‍

Support

Most companies in the benchmark set a public claim that they offer extensive support options. Even if they did not publicly claim this, we expect all companies in the field to offer support, especially for large companies so we do not deep dive into this area.

‍

Vendor track record

Similar to support, we have seen that all benchmark companies have Fortune 500 customers. We could get into more details here as we believe we have the strongest network of partners and customers in this space. However, given that Amazon is one of the benchmark companies, this is a hard exercise as it is difficult to split their AWS customers from their Textract customers just based on public data. Therefore, for now, we do not deep dive into this area.

‍

Price

Finally, price is also definitely a factor in decision making. However, given that almost none of the companies in the benchmark set disclose their enterprise prices, we couldn’t compare companies by price.

‍

What are the areas where data extraction solutions fail?

While items like sender and recipient are relatively straightforward, others like line item extraction and multiple VAT rates proved challenging to our competitors.

Line item extraction

Line items, located near the bottom of invoices in a table format, include a list of all items that make up the purchase. They are hard to extract since these table-like structures are not formatted clearly like tables. Some extraction fails from our competitors and Hypatos’ successful results for the same document are below:

Multiple VAT rates

Multiple VAT rates are possible when an invoice contains multiple line items (multiple services or products) with different VAT rates. This was not handled successfully by most competitors. However, Hypatos deep learning tech is able to extract multiple VAT rates correctly.

Hope you find the benchmark useful. Choosing a supplier is hard, hopefully our approach helps you in formulating your own approach. And if you need support in document automation, we would love to help.

‍