We are using your library to create searchable PDFs. I essentially get info from our OCR engine and then place the text using
PXC_TextOutA
Recently, we have a client that uses some open source software to grab a PDF file and extract the text inside. He reports that this software is not extracting any of the text in the searchable PDFs we are creating.
Here is what he said below. Do you know what this token stuff is that he is referring to?
I had a few minutes to do some tracing on both a ‘good’ pdf that is readable by our software and one output by your engine.
The output from your engine contains no ‘STRING’ tokens in the PDF stream, which is what our open source PDF text extractor is looking for. So I can confirm that it is most definitely a PDF file format issue. Now the question becomes: what token is your engine using to output the text blocks?
The software we are using recognizes the following token types:
NUMBER '1
STRING '2
NAME '3
COMMENT '4
START_ARRAY '5
END_ARRAY '6
START_DIC '7
END_DIC '8
REF '9
OTHER '10
I am not sure what those "tokens" are either - but they are definitely not mandated by the PDF specification.
Please check the attached files: a .txt file containing the "source" of a very simple PDF document, and two PDF files.
The first is exactly the same as the .txt file with just the extension changed, and the second was opened in our Viewer and resaved - so that the Viewer could fix the bare-bones sample and make it more "compliant" with the current standards.
In either version there is no mention of "STRING" - and both files have selectable and searchable text in them.
Please ask the customer to run his text extraction tool on the test_fixed.pdf and see what the result will be.
If it turns out that test_fixed.pdf does work now, how can I deal with this?
Currently, I use your DLL and PXC_TextOutA to produce the hidden text PDF page. Am I going to need to run this through the ViewerAX to produce the results he needs? I'd like to avoid this if possible.
Will this be addressed in the new major builds of the viewer and DLLs that you are working on?
I suspect the issue is in the third-party (and open source) tool being used, which may be expecting some special elements inside a PDF file. If it turns out to be an issue in our code, we will certainly do what is necessary to fix it.
The customer got both PDF files you sent to work with his parser. I was expecting the test.pdf to fail.
Is it something I'm doing when I create the PDF using PXC_TextOutA? Attached is a sample searchable PDF that was created in our software using PXC_TextOutA.
What I was asking was why your test.pdf worked with their parser but my attached MyPDF.pdf file didn't. Were there noticeable differences in the PDF stream in each doc? Mine was created by PXC_TextOutA and I thought yours was too.
The text in MyPDF.pdf is contained in an XForm (a Form XObject), which is placed at the end of the page's content. It looks like the PDF with the text was merged (overlaid) onto the original image-based PDF:
q
609.12 0.00 -0.00 786.96 0.00 0.24 cm
/img0c0 Do
Q
...
q
0.353 0.357 0.380 rg
331.68 0.00 -0.00 27.60 81.36 666.24 cm
/img0b11 Do
Q
q
1 0 0 1 0 0 cm /PCx0 Do
Q
Test.pdf contains only text:
BT
/F1 24 Tf
100 100 Td
( Hello World ) Tj
ET
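To make the difference between the two fragments above concrete, here is a minimal sketch of how a token-based extractor might classify them. The `tokenize` function and its token labels are hypothetical (loosely modelled on the customer's list) and are not taken from the customer's library; the two sample streams are the fragments quoted above.

```python
import re

# Characters that end a name or keyword in this simplified sketch.
WORD = re.compile(r"[^\s()<>\[\]{}/%]+")
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")

def tokenize(stream):
    """Yield (kind, text) pairs for a simplified PDF content stream."""
    i, n = 0, len(stream)
    while i < n:
        ch = stream[i]
        if ch.isspace():
            i += 1
        elif ch == "(":
            # Literal string: scan to the matching ")" with nesting and escapes.
            depth, j = 1, i + 1
            while j < n and depth:
                if stream[j] == "\\":
                    j += 1  # skip the escaped character
                elif stream[j] == "(":
                    depth += 1
                elif stream[j] == ")":
                    depth -= 1
                j += 1
            yield ("STRING", stream[i + 1:j - 1])
            i = j
        elif ch == "/":
            m = WORD.match(stream, i + 1)
            end = m.end() if m else i + 1
            yield ("NAME", stream[i:end])
            i = end
        else:
            m = WORD.match(stream, i)
            if m is None:
                yield ("OTHER", ch)  # delimiters this sketch does not model
                i += 1
                continue
            kind = "NUMBER" if NUMBER.fullmatch(m.group()) else "OTHER"
            yield (kind, m.group())
            i = m.end()

# Text written directly in the page content (as in Test.pdf).
DIRECT = "BT\n/F1 24 Tf\n100 100 Td\n( Hello World ) Tj\nET"
# Page content that only invokes an XForm (as at the end of MyPDF.pdf).
OVERLAY = "q\n1 0 0 1 0 0 cm /PCx0 Do\nQ"

print([kind for kind, _ in tokenize(DIRECT)])
print([kind for kind, _ in tokenize(OVERLAY)])
```

The first stream produces a STRING token for the `Tj` operand, while the second produces only numbers, a name and operators, because the text itself lives in the separate stream of the `/PCx0` object.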
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
I had my client create two very simple "Hello World" PDF examples. Their extraction software is able to extract text from one but not the other (files named as such).
The non working one was created by our software using your DLL components.
The working one was created by another software product they have.
Each is a scanned image followed by an OCR.
What is the difference in the two PDF streams that would cause the text not to get extracted in one but get extracted in the other? The only thing I can see is that one is in PDF format 1.4 and the other 1.7.
Lastly, here is the PDF library my client uses to extract PDF text.
Again, the problem is that in one file the text is located directly in the page's content (see working.png), while in the other file it is in an XForm object (not_working.png).
Ok. Then I guess the question is how I can create a working PDF using your DLL library. Currently, I use PXC_TextOutA combined with PXCp_PlaceContents to place my OCR text. But it looks like it's creating a non-working PDF with the text in an XForm object. Is there another way I should be placing OCR text using your DLL library that would produce a working PDF?
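For anyone maintaining the extraction side, here is a rough sketch of what handling XForm objects involves: when the extractor meets a `Do` operator, it looks up the named XObject and, if it is a form, scans that form's own content stream as well. Everything here is an assumption made for illustration: the `extract_text` function, the dictionary layout standing in for the page resources, and the content of the `PCx0` form (the real form stream in MyPDF.pdf is not shown in this thread). It only handles `Tj` with a simple literal string, not `TJ` arrays, hex strings or font encodings.

```python
import re

# Simplified tokens: literal string (no nested parentheses), name, or keyword/number.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/[^\s()\[\]<>/]*|[^\s()\[\]<>/]+")

def extract_text(content, xobjects, depth=0):
    """Collect Tj string operands, descending into Form XObjects invoked by Do."""
    if depth > 10:  # guard against forms that reference each other
        return []
    found, prev = [], None
    for tok in TOKEN.findall(content):
        if tok == "Tj" and prev and prev.startswith("("):
            found.append(prev[1:-1])
        elif tok == "Do" and prev and prev.startswith("/"):
            xobj = xobjects.get(prev[1:])
            if xobj and xobj["subtype"] == "Form":
                # The form has its own content stream and its own resources.
                found += extract_text(xobj["content"], xobj.get("xobjects", {}), depth + 1)
        prev = tok
    return found

# Page content in the style of the not-working file: an image, then an XForm.
page_content = "q\n609.12 0.00 -0.00 786.96 0.00 0.24 cm\n/img0c0 Do\nQ\nq\n1 0 0 1 0 0 cm /PCx0 Do\nQ"
# Hypothetical resources: the form content below is invented for the example.
page_xobjects = {
    "img0c0": {"subtype": "Image"},
    "PCx0": {"subtype": "Form", "content": "BT\n/F1 24 Tf\n100 100 Td\n( Hello World ) Tj\nET"},
}

print(extract_text(page_content, page_xobjects))  # follows /PCx0 into the form
print(extract_text(page_content, {}))             # flat scan of the page only
```

A parser that only scans the page's own content stream behaves like the second call and finds nothing, which matches what the customer's tool reports for the not-working file.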
Hi Jeff.
I'm afraid that there is no such way (except recreating the PDF completely without using PXCp_... functions).
I would think a much better solution is to use a text extraction library that handles XForm objects correctly - there are a lot of documents that use them.
Victor
Tracker Software
Project manager