PDF Data Extraction

paladave · June 18, 2020, 7:11am

Hi,

I have been exploring data extraction from PDF (invoice) documents. I can extract the data successfully, but I need help with its alignment. Right now the output is a flat, sequential dump of the text on the invoice, with no link between a header and its value.

How do I map, let’s say, the Invoice Date header to its date and the Invoice Amount header to its value?

a.polianskii · June 18, 2020, 5:12pm

Hello, @paladave!

Can you please tell me which function (engine) you used?
Do you mean how to extract specific information from an array?

If the layout of these documents stays the same and only the values change, you can use the “Template Recognition” function, which lets you select areas for recognition and create variables to store the recognized data. You can then reuse the created template in a loop for other similar documents.

If the location of the information in the document can change, you can recognize the document completely and extract the information by keywords (or keyword library), or you can use regular expressions in JavaScript that allow you to specify a pattern for searching information.
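To illustrate the regular expression approach, here is a minimal JavaScript sketch. It assumes the recognized text is available as a single string; the sample text, the field names, and the `extractFields` helper are made up for the example and are not part of any specific engine. The patterns will likely need adjusting to the labels and formats on your own invoices.

```javascript
// Hypothetical sample of recognized text. In a real run this would be
// the output of the recognition step, not a hard-coded string.
const recognizedText = [
  "ACME Corp",
  "Invoice Date: 18/06/2020",
  "Invoice No: INV-1042",
  "Invoice Amount: $1,250.00",
].join("\n");

// Hypothetical patterns: the header label, an optional separator,
// then a capture group for the value that follows it.
const fieldPatterns = {
  invoiceDate: /Invoice\s+Date\s*[:\-]?\s*(\d{1,2}[\/.\-]\d{1,2}[\/.\-]\d{2,4})/i,
  invoiceAmount: /Invoice\s+Amount\s*[:\-]?\s*([$€£]?\s?\d[\d,]*(?:\.\d{2})?)/i,
};

// Apply every pattern to the text and return an object that maps
// each field name to its captured value (or null when not found).
function extractFields(text, patterns) {
  const result = {};
  for (const [name, pattern] of Object.entries(patterns)) {
    const match = text.match(pattern);
    result[name] = match ? match[1].trim() : null;
  }
  return result;
}

const fields = extractFields(recognizedText, fieldPatterns);
console.log(fields);
```

Because `\s*` also matches line breaks, the same patterns still work when the value is recognized on the line after its header. A field that is not found comes back as `null`, so you can check for missing data before using it.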

Let me know if you have any further questions.