text is garble

This forum is for software developers who need help with Tracker Software's PDF-Tools SDK of Library DLL functions (only). Please use the PDF-XChange Drivers API SDK forum for all PDF print driver topics, or the PDF-XChange Viewer SDK forum if appropriate.

Moderators: TrackerSupp-Daniel, Tracker Support, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan

inaniwa
User
Posts: 10
Joined: Thu Oct 22, 2009 4:24 am

text is garble

Post by inaniwa »

Dear Tracker Software Products,

I am using the PXCp_ET_GetElement function of the PDF-XChange PRO SDK to extract text.
However, the extracted text comes out garbled.
I have attached the PDFs that do not extract properly.

Please assist asap.

Software Purchased: PDF-XChange 4 PRO SDK Version: 4.0195
Attachments
files.zip
(804.72 KiB) Downloaded 290 times
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: text is garble

Post by Lzcat - Tracker Supp »

Your documents use multibyte encodings, which xcpro40 does not currently support. Please wait for the next major update.
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
inaniwa
User
Posts: 10
Joined: Thu Oct 22, 2009 4:24 am

Re: text is garble

Post by inaniwa »

Lzcat - Tracker Supp wrote: Your documents use multibyte encodings, which xcpro40 does not currently support. Please wait for the next major update.
Please let me know the release date.
Tracker Supp-Stefan
Site Admin
Posts: 17910
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: text is garble

Post by Tracker Supp-Stefan »

Hello inaniwa,

We do not have a firm release date yet, but are working as hard as we can to complete it as soon as possible.

We will announce more specific dates when possible. We are still on track to release the new PDF-XChange Viewer SDK, on which much of the required code for the libraries will be based, before the end of this quarter; the PDF-Tools SDK elements you are using will then follow later this summer.

Thanks for your understanding,
Stefan
ronystg
User
Posts: 4
Joined: Mon Sep 18, 2006 5:23 pm

Re: text is garble

Post by ronystg »

Hi,

I also have a problem with text extraction.

For some reason the words are split into many fragments. The PDF comes from ABBYY and contains both image and text; when I copy the text into a text document it looks fine.

Can you assist?

I don't know whether you can see the Hebrew letters, but in the CSV you can see many words with only one letter.
"
היחסים ששררו בין גולדמן לאביו, ואת העובדה שגולדמן לא מחשיב, ואף

מתעב, את הטקסים הללו, ואילו רק יכול היה ספק אם הוא עצמו היה

משתתף בהלוויתו של אביו, ומה עוד בהלוויתו של אביו של אחד מחבריו

"
"

============
Matrix.d Matrix.e Matrix.f TH fontsize count TEXT OFFSET:
1 51.36 547.8 " 99.000" 10.5 3 " [ףא]" " 0.000" 4.828 9.937
1 61.297 547.8 " 99.000" 10.5 2 " [ו]" " 0.000" 2.9
1 64.198 547.8 " 99.000" 10.5 7 " [ ;בישח]" " 0.000" 4.923 7.689 12.586 15.529 22.037 27.163
1 91.361 547.8 " 99.000" 10.5 2 " [מ]" " 0.000" 5.042
1 96.405 547.8 " 99.000" 10.5 3 " [ א]" " 0.000" 5.119 10.112
1 106.517 547.8 " 99.000" 10.5 2 " [ל]" " 0.000" 4.491
1 111.008 547.8 " 99.000" 10.5 8 " [ ןמדלוג]" " 0.000" 4.556 7.573 12.782 17.306 21.965 25.034 28.684
1 139.692 547.8 " 99.000" 10.5 2 " [ש]" " 0.000" 6.341
1 146.036 547.8 " 99.000" 10.5 7 " [ הדבוע]" " 0.000" 5.531 10.581 15.09 19.973 23.026 27.899
1 173.935 547.8 " 99.000" 10.5 2 " [ה]" " 0.000" 4.896
1 178.833 547.8 " 99.000" 10.5 4 " [ תא]" " 0.000" 5.375 10.799 16.026
1 194.859 547.8 " 99.000" 10.5 2 " [ו]" " 0.000" 2.9
1 197.76 547.8 " 99.000" 10.5 7 " [ ;ויבא]" " 0.000" 5.049 7.755 10.762 13.644 18.481 23.588
1 221.348 547.8 " 99.000" 10.5 2 " [ל]" " 0.000" 4.491
1 225.841 547.8 " 99.000" 10.5 7 " [ ןמדלו]" " 0.000" 4.691 7.52 12.543 16.879 21.351 24.233
1 250.074 547.8 " 99.000" 10.5 2 " [ג]" " 0.000" 3.482
1 253.558 547.8 " 99.000" 10.5 4 " [ ןי]" " 0.000" 4.816 7.771 10.652
1 264.21 547.8 " 99.000" 10.5 2 " [ב]" " 0.000" 4.73
1 268.94 547.8 " 99.000" 10.5 6 " [ וררש]" " 0.000" 5.155 8.38 13.58 18.78 25.445
1 294.385 547.8 " 99.000" 10.5 2 " [ש]" " 0.000" 6.341
============

Rony Steinberg
Attachments
זכרון דברים - 0010.pdf
(1.4 MiB) Downloaded 284 times
Tracker Supp-Stefan
Site Admin
Posts: 17910
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: text is garble

Post by Tracker Supp-Stefan »

Hello Rony,

I presume that you are extracting the text element by element, and the result you get is the correct one. When OCR-ing this file, ABBYY FineReader simply broke the words apart, and nothing requires a text element to be a whole word, so a three-letter word could consist of three elements.

Please try PXCp_ET_GetPageContentAsTextW
as this could produce a better result, but also please note that for now our tools might not work 100% correctly with right-to-left scripts like Hebrew.

Best,
Stefan
ronystg
User
Posts: 4
Joined: Mon Sep 18, 2006 5:23 pm

Re: text is garble

Post by ronystg »

Hello Stefan,

Yes, I extract the text element by element because I need each word with its position and attributes.

You are right, the result with PXCp_ET_GetPageContentAsTextW is better.

Is there a way to use this method to get entire words?

Anyway, ABBYY uses a dictionary to confirm words, so it does not make sense for it to split the words before writing them to the file.

Thanks,
Rony
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: text is garble

Post by Lzcat - Tracker Supp »

You can get either all the text from a page (without positions) or each element with its position, exactly as stored in the PDF content. Yes, in most cases words in a PDF are split, and we do not provide a mechanism to collate word fragments because there are a lot of special cases that require separate handling: text on curves, overlapping text and so on.
All I can recommend is to get the text element by element and then collate words where possible (we provide character metrics, so you can calculate where a fragment ends and decide whether to join it with the following one).
HTH.
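
A minimal sketch of that collation step, written in Python rather than against the SDK itself. It assumes each extracted element has already been reduced to a left edge, a baseline and a total advance width in points. The `Fragment` type, the `collate` and `words` functions and both tolerance values are hypothetical, chosen only for illustration; none of them are part of xcpro40.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    x: float      # left edge of the fragment, in points
    y: float      # baseline position, in points
    text: str     # characters exactly as extracted
    width: float  # total advance width, in points (from the character metrics)


def collate(fragments: List[Fragment], max_gap: float = 1.0,
            baseline_tol: float = 0.5) -> List[Fragment]:
    """Join fragments that sit on the same baseline and touch (or nearly touch)."""
    runs: List[Fragment] = []
    # Top of the page first (PDF y grows upwards), then left to right.
    # Assumption: fragments on one line report (almost) the same baseline.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if runs:
            last = runs[-1]
            gap = frag.x - (last.x + last.width)
            same_line = abs(frag.y - last.y) <= baseline_tol
            if same_line and abs(gap) <= max_gap:
                # Extend the current run so it covers both fragments.
                runs[-1] = Fragment(last.x, last.y, last.text + frag.text,
                                    frag.x + frag.width - last.x)
                continue
        runs.append(frag)
    return runs


def words(fragments: List[Fragment]) -> List[str]:
    """Collate fragments, then split each joined run on whitespace."""
    return [w for run in collate(fragments) for w in run.text.split()]
```

In the dump posted earlier in this thread the fragments appear to be contiguous, with the spaces carried inside the fragments, so splitting on whitespace after joining is what recovers the word boundaries. For right-to-left text the characters inside each run would still be in visual order and would need reversing separately.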
Victor
Tracker Software
Project manager

ronystg
User
Posts: 4
Joined: Mon Sep 18, 2006 5:23 pm

Re: text is garble

Post by ronystg »

Our project analyzes books, so there are no curves or other special cases.

Because Hebrew runs from right to left, we have to reverse all words in the PDF. Special characters such as .,;'\|" can appear on either the left or the right of a word.

So we analyze the position of each character relative to the word and decide on the fly where to put it.

We did not realize that words can be split, and now we are not sure where to put the fragments.

We would appreciate being able to tell the program that a space must be wider than 5-10 pixels, so that closely spaced text fragments are treated as one word.

Thanks,

Rony
Tracker Supp-Stefan
Site Admin
Posts: 17910
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: text is garble

Post by Tracker Supp-Stefan »

Hello Rony,

Please note that the coordinates are in points and not pixels.
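
For reference, a PDF point is 1/72 inch, so a tolerance expressed in pixels only has a meaning at a given resolution. A minimal Python sketch of the conversion follows; the function names and the 96 dpi figure in the usage note are illustrative assumptions, not SDK values.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a length in device pixels at `dpi` to PDF points."""
    return pixels * POINTS_PER_INCH / dpi


def points_to_pixels(points: float, dpi: float) -> float:
    """Convert a length in PDF points to device pixels at `dpi`."""
    return points * dpi / POINTS_PER_INCH
```

For example, a 10 pixel gap measured on a 96 dpi rendering corresponds to `pixels_to_points(10, 96)`, which is 7.5 points.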
As for the logic you will put behind concatenating those text elements - you will need to implement this on your own.
But you could, for example, combine a third-party dictionary with the result returned by PXCp_ET_GetPageContentAsTextW to help you decide whether the distance between two consecutive elements separates two different words.
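
One way such a comparison could look, sketched in Python under stated assumptions: `fragments` is a list of element strings already in reading order with no embedded whitespace, and `page_text` is the string obtained from PXCp_ET_GetPageContentAsTextW. The helper `merge_with_reference` is hypothetical and uses the page text as its only word list; a third-party dictionary could be added to `words` in the same way.

```python
from typing import List


def merge_with_reference(fragments: List[str], page_text: str) -> List[str]:
    """Greedily join fragments while the result is still the start of a known word."""
    words = set(page_text.split())
    # Every non-empty prefix of every reference word, including the word itself.
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}

    merged: List[str] = []
    current = ""
    for frag in fragments:
        candidate = current + frag
        if current and candidate not in prefixes:
            # Joining would no longer match any reference word: close the word.
            merged.append(current)
            current = frag
        else:
            current = candidate
    if current:
        merged.append(current)
    return merged
```

The merge is greedy, so it can over-join when one reference word is a prefix of another; combining it with a distance check between elements would reduce that risk.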

Best,
Stefan
inaniwa
User
Posts: 10
Joined: Thu Oct 22, 2009 4:24 am

Re: text is garble

Post by inaniwa »

Tracker Supp-Stefan wrote:Hello inaniwa,

We do not have a firm release date yet, but are working as hard as we can to complete it as soon as possible.

We will announce more specific dates when possible. We are still on track to release the new PDF-XChange Viewer SDK, on which much of the required code for the libraries will be based, before the end of this quarter; the PDF-Tools SDK elements you are using will then follow later this summer.

Thanks for your understanding,
Stefan
Dear Tracker Software Products,
Any update on this?
Please let me know the progress.
Tracker Supp-Stefan
Site Admin
Posts: 17910
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: text is garble

Post by Tracker Supp-Stefan »

Hello inaniwa,

No specific date yet I am afraid, but we hope to have more information for you by the end of the week.
Please note that the SDK products will be released after the end-user ones, as they need more finishing work and documentation than the end-user versions.

Best,
Stefan
inaniwa
User
Posts: 10
Joined: Thu Oct 22, 2009 4:24 am

Re: text is garble

Post by inaniwa »

Hello Stefan,

>No specific date yet I am afraid, but we hope to have more information for you by the end of the week.

I have not heard from you yet.

inaniwa
Paul - Tracker Supp
Site Admin
Posts: 6897
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: text is garble

Post by Paul - Tracker Supp »

Hi inaniwa,

There is a formal announcement on the progress of the new versions here: https://forum.pdf-xchange.com/ ... 521#p50521

hth
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
inaniwa
User
Posts: 10
Joined: Thu Oct 22, 2009 4:24 am

Re: text is garble

Post by inaniwa »

Lzcat - Tracker Supp wrote: Your documents use multibyte encodings, which xcpro40 does not currently support. Please wait for the next major update.
Does this file have the same cause?
Attachments
sample.pdf
(189.01 KiB) Downloaded 269 times
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: text is garble

Post by Lzcat - Tracker Supp »

Almost the same: it is another encoding of the same kind.

Currently the xcpro40 library can handle such fonts only if a 'ToUnicode' table is embedded with them.
Victor
Tracker Software
Project manager

inaniwa
User
Posts: 10
Joined: Thu Oct 22, 2009 4:24 am

Re: text is garble

Post by inaniwa »

Hi,
This was said to be fixed in the next version, but it is still not fixed.
When will it be fixed?
Best regards.
Attachments
files.zip
(711.47 KiB) Downloaded 219 times
Tracker Supp-Stefan
Site Admin
Posts: 17910
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: text is garble

Post by Tracker Supp-Stefan »

Hello inaniwa,

The "next version" Victor (Lzcat) has in mind is actually V5 of our PDF Tools SDK, which is not yet released. No build of V4 of the SDK will support predefined CJK encodings.

Best,
Stefan
inaniwa
User
Posts: 10
Joined: Thu Oct 22, 2009 4:24 am

Re: text is garble

Post by inaniwa »

Hi,
I got it.
Best regards.
Tracker Supp-Stefan
Site Admin
Posts: 17910
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: text is garble

Post by Tracker Supp-Stefan »

Thanks for your understanding!

Regards,
Stefan