Extracting PDF Text

This Forum is for the use of Software Developers requiring help and assistance for Tracker Software's PDF-Tools SDK of Library DLL functions(only) - Please use the PDF-XChange Drivers API SDK Forum for assistance with all PDF Print Driver related topics or PDF-XChange Viewer SDK if appropriate.


jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Extracting PDF Text

Post by jeffp »

We are using your library to create searchable PDFs. I essentially get info from our OCR engine and then place the text using

PXC_TextOutA

Recently, we have a client that uses some open source software to grab a PDF file and extract the text inside. He reports that this software is not extracting any of the text in the searchable PDFs we are creating.

Here is what he said below. Do you know what this token stuff is that he is referring to?
I had a few minutes to do some tracing on both a ‘good’ pdf that is readable by our software and one output by your engine.

The output from your engine contains no ‘STRING’ tokens in the PDF stream. Which is what our open source PDF text extractor is looking for. So I can confirm that it most definitely is a PDF file format issue. Now the question becomes, what token is your engine using to output the text blocks.

The software we are using recognizes the following token types:
NUMBER '1
STRING '2
NAME '3
COMMENT '4
START_ARRAY '5
END_ARRAY '6
START_DIC '7
END_DIC '8
REF '9
OTHER '10
Tracker Supp-Stefan
Site Admin
Posts: 17948
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Extracting PDF Text

Post by Tracker Supp-Stefan »

Hi Jeff,

I am not sure what those "tokens" are either, but they are definitely not mandated by the PDF specification.
Please check the attached files: a .txt file that is the "source" of a very simple PDF document, and two PDF files.
The first PDF is exactly the same as the .txt file with just the extension changed. The second was opened in our Viewer and resaved, so that the Viewer could fix the barebones sample and make it more "compliant" with the current standards.
Neither version contains any mention of "STRING", and both files have selectable and searchable text in them.

Please ask the customer to run his text extraction tool on the test_fixed.pdf and see what the result will be.

Best,
Stefan
Attachments
sample_PDF.zip
(1.9 KiB) Downloaded 210 times
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: Extracting PDF Text

Post by jeffp »

Ok. I'll have our customer run your sample files.

If it turns out that test_fixed.pdf does work, how can I deal with this?

Currently, I use your DLL and PXC_TextOutA to produce the hidden-text PDF page. Am I going to need to run this through the ViewerAX to produce the results he needs? I'd like to avoid this if possible.

Will this be addressed in the new major builds of the viewer and DLLs that you are working on?

Thanks.
--Jeff
Tracker Supp-Stefan
Site Admin
Posts: 17948
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Extracting PDF Text

Post by Tracker Supp-Stefan »

Hi Jeff,

I suspect the issue is in the third-party (and open source) tool being used, and that it expects some special elements inside a PDF file. If it turns out to be an issue in our code, we will certainly do what is necessary to fix it.

Best,
Stefan
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: Extracting PDF Text

Post by jeffp »

The customer got both PDF files you sent to work with his parser. I was expecting the test.pdf to fail.

Is it something I'm doing when I create the PDF using PXC_TextOutA? Attached is a sample searchable PDF that was created in our software using PXC_TextOutA.

--Jeff
Attachments
MyPDF.pdf
(87.18 KiB) Downloaded 256 times
Ivan - Tracker Software
Site Admin
Posts: 3550
Joined: Thu Jul 08, 2004 10:36 pm
Location: Vancouver Island - Canada

Re: Extracting PDF Text

Post by Ivan - Tracker Software »

The content of a page often contains XForm XObjects (placed by the 'Do' operator), and text may be located there too.

To extract all the text from the page, you need to parse these XForms as well.

Also, please note that text in the content may be stored as a string (for example, (ABC) Tj ) as well as an array ([(A)5(BC)] TJ ).

And please keep font encoding in mind.
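
To illustrate the two text-showing forms mentioned above, here is a minimal Python sketch (not part of the PDF-Tools SDK) that pulls literal strings out of Tj and TJ operators in an already-decompressed content stream. The names shown_text and unescape are made up for this example. It assumes literal strings only: hex strings (<...>), nested parentheses, and font encoding are all ignored, so what comes back is raw character codes, not guaranteed Unicode.

```python
import re

# A PDF literal string: "(...)" with backslash escapes.
# Simplification: nested, unescaped parentheses are not supported.
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")

# Either "(ABC) Tj" (group 1) or "[(A)5(BC)] TJ" (group 2).
SHOW_TEXT = re.compile(
    r"(\((?:\\.|[^\\()])*\))\s*Tj"
    r"|\[((?:\((?:\\.|[^\\()])*\)|[^\]()])*)\]\s*TJ",
    re.S,
)


def unescape(literal: str) -> str:
    """Strip the outer parentheses and resolve \\( \\) \\\\ escapes only."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def shown_text(content: str) -> list[str]:
    """Return the strings shown by Tj/TJ operators, in stream order."""
    pieces = []
    for match in SHOW_TEXT.finditer(content):
        if match.group(1) is not None:
            # Single string operand: (ABC) Tj
            pieces.append(unescape(match.group(1)))
        else:
            # Array operand: strings interleaved with kerning numbers,
            # which are dropped here.
            strings = LITERAL.findall(match.group(2))
            pieces.append("".join(unescape(s) for s in strings))
    return pieces
```

For example, shown_text("[(A)5(BC)] TJ") joins the array elements into a single piece. A parser that only handles one of the two forms, or that never looks inside XForm streams, will miss text that a viewer shows without trouble.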
Tracker Software (Project Director)

When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: Extracting PDF Text

Post by jeffp »

What I was asking was why your test.pdf worked with their parser but my attached MyPDF.pdf file didn't. Were there noticeable differences in the PDF stream in each doc? Mine was created by PXC_TextOutA, and I thought yours was too.
Ivan - Tracker Software
Site Admin
Posts: 3550
Joined: Thu Jul 08, 2004 10:36 pm
Location: Vancouver Island - Canada

Re: Extracting PDF Text

Post by Ivan - Tracker Software »

The text in MyPDF.pdf is contained in an XForm that is placed at the end of the page content. It looks like the PDF with the text was merged (as an overlay) with the original image-based PDF:

q
609.12 0.00 -0.00 786.96 0.00 0.24 cm
/img0c0 Do
Q
...
q
0.353 0.357 0.380 rg
331.68 0.00 -0.00 27.60 81.36 666.24 cm
/img0b11 Do
Q
q
1 0 0 1 0 0 cm
/PCx0 Do
Q


Test.pdf contains only text:

BT
/F1 24 Tf
100 100 Td
( Hello World ) Tj
ET
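
As a rough illustration of the difference between the two streams above, the following Python sketch reports whether a decompressed content stream has its own text object (BT ... ET) and which XObjects it invokes with Do. The function names are invented for this example, and the regex scan is a simplification: it does not tokenize the stream properly, so a "Do" or "BT" inside a string literal would fool it.

```python
import re

# "/Name Do" invokes the XObject called Name from the page resources.
DO_OPERATOR = re.compile(r"/([^\s/\[\]()<>{}%]+)\s+Do\b")


def invoked_xobjects(content: str) -> list[str]:
    """Names of XObjects painted by this content stream, in order."""
    return DO_OPERATOR.findall(content)


def has_inline_text(content: str) -> bool:
    """True if the stream itself opens a text object with BT."""
    return re.search(r"\bBT\b", content) is not None


my_pdf = """q
609.12 0.00 -0.00 786.96 0.00 0.24 cm
/img0c0 Do
Q
q
1 0 0 1 0 0 cm
/PCx0 Do
Q
"""

test_pdf = """BT
/F1 24 Tf
100 100 Td
( Hello World ) Tj
ET
"""

for label, stream in (("MyPDF.pdf", my_pdf), ("test.pdf", test_pdf)):
    print(label, has_inline_text(stream), invoked_xobjects(stream))
```

The names alone do not say whether an XObject is an image or a form. That requires looking up each name in the page's /Resources /XObject dictionary and checking its /Subtype, then parsing the stream of every /Form entry for text.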
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: Extracting PDF Text

Post by jeffp »

Ivan,

Would you mind taking a look at one more thing.

I had my client create two very simple "Hello World" PDF examples. Their extraction software is able to extract text from one but not the other (files named as such).

The non working one was created by our software using your DLL components.

The working one was created by another software product they have.

Each is a scanned image followed by an OCR.

What is the difference in the two PDF streams that would cause the text not to get extracted in one but get extracted in the other? The only thing I can see is that one is in PDF format 1.4 and the other is in 1.7.

Lastly, here is the PDF library my client uses to extract PDF text.

http://itextpdf.com

--Jeff
Attachments
Hello World - Non Working OCR.pdf
(2.37 KiB) Downloaded 221 times
Hello World - Working OCR .pdf
(10.83 KiB) Downloaded 254 times
Ivan - Tracker Software
Site Admin
Posts: 3550
Joined: Thu Jul 08, 2004 10:36 pm
Location: Vancouver Island - Canada

Re: Extracting PDF Text

Post by Ivan - Tracker Software »

Again, the problem is that in one file the text is located directly in the page's content (see working.png), while in the other file it is in an XForm object (see not_working.png).
Attachments
screens.zip
(54.94 KiB) Downloaded 217 times
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: Extracting PDF Text

Post by jeffp »

Ok. Then I guess the question is how I can create a working PDF using your DLL library. Currently, I use PXC_TextOutA combined with PXCp_PlaceContents to place my OCR text, but it looks like that puts the text in an XForm object, which produces a non-working PDF. Is there another way I should be placing OCR text using your DLL library that would produce a working PDF?

--Jeff
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: Extracting PDF Text

Post by Lzcat - Tracker Supp »

Hi Jeff.
I'm afraid that there is no such way (except complete PDF recreation not using PXCp_... functions).
I would think a much better solution is to use a text extraction library that handles this correctly, since there are a lot of documents that use XForm objects.
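
For what it is worth, here is a sketch in Python of what "handling XForm objects" means on the extraction side. It is only a toy model and not any real library's API: the page resources are represented as a plain dict (name -> {"subtype", "content", "xobjects"}), streams are assumed to be already decompressed, and extract_text is a hypothetical name. String escapes and font encoding are left untouched.

```python
import re

# One pattern for the three things we care about, tried in stream order:
# "(ABC) Tj", "[(A)5(BC)] TJ", and "/Name Do".
TOKEN = re.compile(
    r"(?P<tj>\((?:\\.|[^\\()])*\))\s*Tj"
    r"|\[(?P<arr>(?:\((?:\\.|[^\\()])*\)|[^\]()])*)\]\s*TJ"
    r"|/(?P<name>[^\s/\[\]()<>{}%]+)\s+Do\b",
    re.S,
)
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")


def extract_text(content, xobjects, depth=0, max_depth=10):
    """Collect shown strings, descending into Form XObjects painted by Do."""
    if depth > max_depth:  # guard against self-referencing forms
        return []
    pieces = []
    for m in TOKEN.finditer(content):
        if m.group("tj") is not None:
            pieces.append(m.group("tj")[1:-1])
        elif m.group("arr") is not None:
            strings = LITERAL.findall(m.group("arr"))
            pieces.append("".join(s[1:-1] for s in strings))
        else:
            xobj = xobjects.get(m.group("name"))
            # Images carry no text operators; only forms are parsed.
            if xobj and xobj.get("subtype") == "Form":
                # Assumption: a form without its own resources reuses
                # the caller's XObject dictionary.
                nested = xobj.get("xobjects", xobjects)
                pieces.extend(
                    extract_text(xobj["content"], nested, depth + 1, max_depth)
                )
    return pieces


page_content = "q\n1 0 0 1 0 0 cm\n/img0c0 Do\nQ\nq\n/PCx0 Do\nQ\n"
page_xobjects = {
    "img0c0": {"subtype": "Image"},
    "PCx0": {
        "subtype": "Form",
        "content": "BT /F1 24 Tf 100 100 Td (Hello World) Tj ET",
    },
}
print(extract_text(page_content, page_xobjects))
```

An extractor that stops at the page's own content stream sees only the two Do operators here and returns nothing, which matches the behaviour reported for the non-working file.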
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.