Extracting PDF Text

Post by **jeffp** » Fri Jan 18, 2013 2:45 pm

We are using your library to create searchable PDFs. I essentially get info from our OCR engine and then place the text using PXC_TextOutA.

Recently, we have a client that uses some open source software to grab a PDF file and extract the text inside. He reports that this software is not extracting any of the text in the searchable PDFs we are creating.

Here is what he said below. Do you know what this token stuff is that he is referring to?

I had a few minutes to do some tracing on both a ‘good’ pdf that is readable by our software and one output by your engine.

The output from your engine contains no ‘STRING’ tokens in the PDF stream, which is what our open source PDF text extractor is looking for. So I can confirm that it most definitely is a PDF file format issue. Now the question becomes: what token is your engine using to output the text blocks?

The software we are using recognizes the following token types:
NUMBER '1
STRING '2
NAME '3
COMMENT '4
START_ARRAY '5
END_ARRAY '6
START_DIC '7
END_DIC '8
REF '9
OTHER '10

Post by **Tracker Supp-Stefan** » Fri Jan 18, 2013 4:17 pm

Hi Jeff,

I am not sure what those "tokens" are either - but they are definitely not mandated by the PDF specification.
Please check the attached files: a .txt file that is the "source" of a very simple PDF document, and two PDF files.
The first PDF is exactly the same as the .txt file with just the extension changed, and the second was opened in our Viewer and resaved - so that the Viewer could fix the barebone sample and make it more "compliant" with the current standards.
In neither version is there any mention of "STRING" - and both files have selectable and searchable text in them.

Please ask the customer to run his text extraction tool on the test_fixed.pdf and see what the result will be.

Best,
Stefan

Post by **jeffp** » Fri Jan 18, 2013 4:54 pm

Ok. I'll have our customer run your sample files.

If it turns out that test_fixed.pdf does work, how can I deal with this?

Currently, I use your DLL and PXC_TextOutA to produce the hidden text PDF page. Am I going to need to run this through the ViewerAX to produce the results he needs? I'd like to avoid this if possible.

Will this be addressed in the new major builds of the viewer and DLLs that you are working on?

Thanks.
--Jeff

Post by **Tracker Supp-Stefan** » Fri Jan 18, 2013 5:02 pm

Hi Jeff,

I suspect the issue is in the third-party (and open source) tool being used, which seems to expect some special elements inside a PDF file. If it turns out to be an issue in our code, we will certainly do what is necessary to fix it.

Best,
Stefan

Post by **jeffp** » Fri Jan 18, 2013 7:06 pm

The customer got both PDF files you sent to work with his parser. I was expecting the test.pdf to fail.

Is it something I'm doing when I create the PDF using PXC_TextOutA? Attached is a sample searchable PDF that was created in our software using PXC_TextOutA.

--Jeff

Sat Jan 19, 2013 2:24 am

The content of a page often contains XForm XObjects (placed by the 'Do' operator), and text may be located inside them too.

To extract all text from the page you need to parse these XForms as well.

Also, please note the text in the content may be stored as a string (for example, (ABC) Tj ) as well as an array ([(A)5(BC)] TJ ).

And please keep font encoding in mind: the bytes in these strings are character codes that have to be mapped through the font's encoding.
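
To illustrate the two forms mentioned above, here is a minimal Python sketch that pulls the string operands out of both Tj and TJ operators in an already-decompressed content stream. The function name extract_text is made up for this example, and the regular expressions are a deliberate simplification: they assume no escaped or nested parentheses, no hex strings, and no font encoding translation, so this is not how a real extractor should be written.

```python
import re

# A literal string followed by Tj, or an array followed by TJ.
# Simplifying assumptions: no escaped/nested parentheses, no hex strings,
# and no "]" inside the strings of a TJ array.
TEXT_OP = re.compile(r"\(([^()]*)\)\s*Tj\b|\[([^\]]*)\]\s*TJ\b")
LITERAL = re.compile(r"\(([^()]*)\)")


def extract_text(content):
    """Return the raw string shown by each Tj/TJ operator, in stream order."""
    pieces = []
    for match in TEXT_OP.finditer(content):
        if match.group(1) is not None:
            # (ABC) Tj : a single string operand
            pieces.append(match.group(1))
        else:
            # [(A)5(BC)] TJ : strings interleaved with kerning numbers;
            # keep the strings and drop the numbers
            pieces.append("".join(LITERAL.findall(match.group(2))))
    return pieces


if __name__ == "__main__":
    print(extract_text("BT /F1 24 Tf (ABC) Tj [(A)5(BC)] TJ ET"))
```

Both operators end up producing the same text here, which is why a parser that only looks for one of the two forms can miss text that is plainly visible in a viewer.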

Post by **jeffp** » Sat Jan 19, 2013 6:19 am

What I was asking was why your test.pdf worked with their parser but my attached MyPDF.pdf file didn't. Were there noticeable differences in the PDF stream in each doc? Mine was created by PXC_TextOutA, and I thought yours was too.

Sat Jan 19, 2013 6:39 am

The text in MyPDF.pdf is contained in an XForm, which is placed at the end of the page. It looks like the PDF with the text was merged (as an overlay) with the original image-based PDF:

q
609.12 0.00 -0.00 786.96 0.00 0.24 cm
/img0c0 Do
Q
...
q
0.353 0.357 0.380 rg
331.68 0.00 -0.00 27.60 81.36 666.24 cm
/img0b11 Do
Q
q
1 0 0 1 0 0 cm
/PCx0 Do
Q

Test.pdf contains only text:

BT
/F1 24 Tf
100 100 Td
( Hello World ) Tj
ET
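
The difference between the two listings can be sketched in Python. The example below models a page as a content stream plus a dictionary of form XObject streams; that dictionary, the function name page_text, and the contents assumed for /PCx0 are all hypothetical stand-ins for what a real parser would read from the page's /Resources. It only handles simple (text) Tj strings, so treat it as an illustration of why the 'Do' operator has to be followed, not as a working extractor.

```python
import re

# Either a simple literal string shown with Tj, or a named XObject invoked with Do.
TOKEN = re.compile(r"\(([^()]*)\)\s*Tj\b|/(\S+)\s+Do\b")


def page_text(content, forms, depth=0):
    """Collect Tj strings, descending into form XObjects placed with Do.

    forms is a hypothetical dict: XObject name -> decoded content stream.
    Names missing from it (for example image XObjects) are skipped.
    """
    if depth > 10:  # guard against forms that reference each other
        return []
    found = []
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            found.append(match.group(1))
        elif match.group(2) in forms:
            found.extend(page_text(forms[match.group(2)], forms, depth + 1))
    return found


if __name__ == "__main__":
    page = "q\n609.12 0.00 -0.00 786.96 0.00 0.24 cm\n/img0c0 Do\nQ\nq\n1 0 0 1 0 0 cm\n/PCx0 Do\nQ\n"
    # Assumed content of the form, for illustration only.
    forms = {"PCx0": "BT /F1 24 Tf 100 100 Td (Hello World) Tj ET"}
    print(page_text(page, forms))  # text is found by following /PCx0 Do
    print(page_text(page, {}))     # page content alone: no text at all
```

A parser that scans only the page's own content stream behaves like the second call: it sees images and a 'Do' operator but no strings.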

Post by **jeffp** » Tue Jan 22, 2013 7:33 pm

Ivan,

Would you mind taking a look at one more thing?

I had my client create two very simple "Hello World" PDF examples. Their extraction software is able to extract text from one but not the other (files named as such).

The non-working one was created by our software using your DLL components.

The working one was created by another software product they have.

Each is a scanned image followed by an OCR.

What is the difference between the two PDF streams that would cause the text not to get extracted in one but get extracted in the other? The only thing I can see is that one is in PDF format 1.4 and the other is in 1.7.

Lastly, here is the PDF library my client uses to extract PDF text.

http://itextpdf.com

--Jeff

Tue Jan 22, 2013 11:19 pm

Again, the difference is that in one file the text is located directly in the page's content (see working.png), while in the other file it is in an XForm object (not_working.png).

Post by **jeffp** » Wed Jan 23, 2013 1:31 am

Ok. Then I guess the question is how I can create a working PDF using your DLL library. Currently, I use PXC_TextOutA combined with PXCp_PlaceContents to place my OCR text. But it looks like that puts the text in an XForm object, which produces a non-working PDF. Is there another way I should be placing OCR text using your DLL library that would produce a working PDF?

--Jeff

Wed Jan 23, 2013 8:21 am

Hi Jeff.
I'm afraid that there is no such way (except recreating the PDF completely without using the PXCp_... functions).
I would think a much better solution is to use a text extraction library that handles XForm objects correctly - there are a lot of documents which use them.
