Data scraping from a list of scanned PDFs and saving to Excel

zoe · May 11, 2020, 8:01pm

Hi,

We are on our trial period, and I have been watching the academy videos and reading the resource page to learn how to extract data from a list of low-quality scanned PDFs.

But it is difficult to work with the platform without detailed tutorials on how the basic functionality works.

Can anyone share a sample workflow that extracts multiple fields from a list of scanned PDFs and saves them to Excel? My team and I want to learn how the activities interact and how to declare variables of different types to implement a ‘for each’ activity.

(A sample workflow or a Zoom technical meeting would be great.)
Thanks in advance!

zoe · May 11, 2020, 8:39pm

Also, only the Yandex and Google OCR support data extraction from PDFs. Do I need a list of PDFs that are all the same size to complete the task? (My PDFs are different sizes, with different heights and widths.)

a.polianskii · May 12, 2020, 3:09pm

Hello, @zoe!

If the PDF files always differ in format and size, then instead of using the ‘Recognition Template’ function you can create an algorithm that extracts elements from the text based on keywords.

Since the number of files may vary, the easiest way is to build two algorithms and include one of them in the loop body of the other.

Thus, the first algorithm will process each file in the specified folder and may look like this:

The subprogram in the loop body will process each file with OCR, save all elements from the OCR results, and search for the required element. In this example we process invoices and search for the invoice number, which follows the symbol ‘#’.
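For readers who want to see the logic outside the platform, here is a minimal Python sketch of the same two-level structure: an outer loop over the files in a folder and an inner step that searches the OCR text for the value following ‘#’. It is only an illustration, not the contents of the robot. The names `find_invoice_number`, `process_folder`, `save_rows`, and the `recognize` callable are hypothetical, the OCR step is left as a function you supply, and the results are written to a CSV file (which Excel can open) rather than a native Excel workbook.

```python
import csv
import re
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Assumption: the invoice number is the first run of letters, digits, or
# hyphens that follows a '#' sign, optionally separated by spaces.
INVOICE_NUMBER = re.compile(r"#\s*([A-Za-z0-9-]+)")


def find_invoice_number(ocr_text: str) -> Optional[str]:
    """Inner step: return the value after '#', or None if it is missing."""
    match = INVOICE_NUMBER.search(ocr_text)
    return match.group(1) if match else None


def process_folder(
    folder: Path, recognize: Callable[[Path], str]
) -> List[Tuple[str, Optional[str]]]:
    """Outer loop: run the inner step once for every PDF in the folder.

    `recognize` is a placeholder for whatever OCR call you use; it takes
    a file path and returns the recognized text.
    """
    rows = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        text = recognize(pdf_path)
        rows.append((pdf_path.name, find_invoice_number(text)))
    return rows


def save_rows(rows: List[Tuple[str, Optional[str]]], out_path: Path) -> None:
    """Write one line per file; a missing number becomes an empty cell."""
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "invoice_number"])
        writer.writerows(rows)
```

Because the extraction depends only on the ‘#’ keyword and not on where the value sits on the page, the page size of each PDF does not matter for this approach.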

‘ABBYY’ and ‘Yandex’ OCR can work with PDF files. For more information about the limitations and formats used, see the description of each function in this section.

You can read about possible types of variables in the platform in this article.

If you have any questions, please feel free to ask.

zoe · May 12, 2020, 3:51pm

Hi, thank you for your response! I now have the big picture of how the activities interact. It would be great to see the properties panel of each activity. Could you send me a sample .neek file so I can look at the details? (It’s hard to tell which activities are used in your example just by looking at the activity icons…)

Thanks again!

a.polianskii · May 12, 2020, 4:18pm

@zoe, yes.

I mailed you the .neek file

zoe · May 12, 2020, 6:06pm

Thank you again for your support,

I ran the scanned PDF extraction bot but got the error below:

“Selectors: Zip exception.Unable to read value at the specified position - end of stream was reached.
. Please refer to documentation to make sure the function ‘Subprogram’ was set up correctly. If it doesn’t help, click this link to contact support.”

How can I track which file is throwing the error and which field is unreadable? Is there a debug option?

Also, is your ‘Recognition Template’ activity designed for a list of PDFs that have a fixed format and size?

Thank you

a.polianskii · May 12, 2020, 7:24pm

@zoe, please open the second subprogram (for PDF recognition) in the platform, change the path in the properties of the ‘Recognize text (Yandex)’ function, and start the robot. The block that causes the error will be highlighted and given focus.

If you open the ‘Variables’ tab, you will see the values the variables had at the moment the robot stopped: the file name, the number of iterations, and so on.

Please record a video in which you reproduce this error and try the recommendations above. This will help us determine the cause.
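The same idea, knowing which file and which iteration were active when something failed, can be sketched in Python for anyone prototyping outside the platform. This is an illustration only and says nothing about how the platform's debugger works internally. The function `run_with_error_report` and the `process` callable are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def run_with_error_report(
    paths: List[Path], process: Callable[[Path], str]
) -> Tuple[Dict[str, str], List[Tuple[int, str, str]]]:
    """Process every file and record where each failure happened.

    Returns the successful results keyed by file name, plus a list of
    (iteration number, file name, error description) for the failures.
    """
    results: Dict[str, str] = {}
    failures: List[Tuple[int, str, str]] = []
    for iteration, path in enumerate(paths, start=1):
        try:
            results[path.name] = process(path)
        except Exception as error:
            # Keep going so that one unreadable file does not hide the rest.
            failures.append(
                (iteration, path.name, f"{type(error).__name__}: {error}")
            )
    return results, failures
```

Collecting failures instead of stopping at the first one makes it easier to tell whether a single damaged file or the whole batch is the problem.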

Yes, the ‘Recognition Template’ function lets you create a recognition pattern for documents that share the same layout but contain different information. The created template can then be applied to a number of documents in a loop.

zoe · May 12, 2020, 7:55pm

Very clear, thank you!