text is garble

Post by **inaniwa** » Thu Apr 14, 2011 5:40 am

Dear Tracker Software Products,

I am using the PXCp_ET_GetElement function of the PDF-XChange PRO SDK to extract
text.
However, the extracted text is garbled.
I am attaching a PDF that does not extract properly.

Please assist as soon as possible.

Software Purchased: PDF-XChange 4 PRO SDK Version: 4.0195

Post by **Lzcat - Tracker Supp** » Thu Apr 14, 2011 7:01 am

Your documents use multibyte encodings, which are not currently supported by xcpro40. Please wait for the next major update.

Post by **inaniwa** » Thu Apr 14, 2011 8:53 am

Lzcat - Tracker Supp wrote: Your documents use multibyte encodings, which are not currently supported by xcpro40. Please wait for the next major update.

Please let me know the release date.

Post by **Tracker Supp-Stefan** » Thu Apr 14, 2011 1:17 pm

Hello inaniwa,

We do not have a firm release date yet, but are working as hard as we can to complete it as soon as possible.

We will announce more specific dates when possible. We are still on track to release the new PDF-XChange Viewer SDK, on which much of the required code for the libraries will be based, before the end of this quarter. The PDF-Tools SDK elements you are using will then follow later this summer.

Thanks for your understanding,
Stefan

Post by **ronystg** » Mon Jun 06, 2011 9:07 am

Hi,

I am also having a problem with text extraction.

For some reason the words are split into many fragments. The PDF comes from ABBYY and contains both image and text; when I copy the text into a text document it looks OK.

Can you assist?

I don't know if you can see the Hebrew letters, but in the CSV you can see many words with only one letter:
"
היחסים ששררו בין גולדמן לאביו, ואת העובדה שגולדמן לא מחשיב, ואף

מתעב, את הטקסים הללו, ואילו רק יכול היה ספק אם הוא עצמו היה

משתתף בהלוויתו של אביו, ומה עוד בהלוויתו של אביו של אחד מחבריו

"
"

============
Matrix.d Matrix.e Matrix.f TH fontsize count TEXT OFFSET:
1 51.36 547.8 " 99.000" 10.5 3 " [ףא]" " 0.000" 4.828 9.937
1 61.297 547.8 " 99.000" 10.5 2 " [ו]" " 0.000" 2.9
1 64.198 547.8 " 99.000" 10.5 7 " [ ;בישח]" " 0.000" 4.923 7.689 12.586 15.529 22.037 27.163
1 91.361 547.8 " 99.000" 10.5 2 " [מ]" " 0.000" 5.042
1 96.405 547.8 " 99.000" 10.5 3 " [ א]" " 0.000" 5.119 10.112
1 106.517 547.8 " 99.000" 10.5 2 " [ל]" " 0.000" 4.491
1 111.008 547.8 " 99.000" 10.5 8 " [ ןמדלוג]" " 0.000" 4.556 7.573 12.782 17.306 21.965 25.034 28.684
1 139.692 547.8 " 99.000" 10.5 2 " [ש]" " 0.000" 6.341
1 146.036 547.8 " 99.000" 10.5 7 " [ הדבוע]" " 0.000" 5.531 10.581 15.09 19.973 23.026 27.899
1 173.935 547.8 " 99.000" 10.5 2 " [ה]" " 0.000" 4.896
1 178.833 547.8 " 99.000" 10.5 4 " [ תא]" " 0.000" 5.375 10.799 16.026
1 194.859 547.8 " 99.000" 10.5 2 " [ו]" " 0.000" 2.9
1 197.76 547.8 " 99.000" 10.5 7 " [ ;ויבא]" " 0.000" 5.049 7.755 10.762 13.644 18.481 23.588
1 221.348 547.8 " 99.000" 10.5 2 " [ל]" " 0.000" 4.491
1 225.841 547.8 " 99.000" 10.5 7 " [ ןמדלו]" " 0.000" 4.691 7.52 12.543 16.879 21.351 24.233
1 250.074 547.8 " 99.000" 10.5 2 " [ג]" " 0.000" 3.482
1 253.558 547.8 " 99.000" 10.5 4 " [ ןי]" " 0.000" 4.816 7.771 10.652
1 264.21 547.8 " 99.000" 10.5 2 " [ב]" " 0.000" 4.73
1 268.94 547.8 " 99.000" 10.5 6 " [ וררש]" " 0.000" 5.155 8.38 13.58 18.78 25.445
1 294.385 547.8 " 99.000" 10.5 2 " [ש]" " 0.000" 6.341
============

Rony Steinberg

Post by **Tracker Supp-Stefan** » Mon Jun 06, 2011 1:15 pm

Hello Rony,

I presume that you are extracting the text element by element, and the result you get is the correct one. When OCR-ing this file, ABBYY FineReader simply broke the words apart, and there is no requirement that a text element be a whole word, so a three-letter word could consist of three elements.

Please try PXCp_ET_GetPageContentAsTextW, as this could produce a better result. Please also note that, for now, our tools might not work 100% correctly with right-to-left scripts like Hebrew.

Best,
Stefan

Post by **ronystg** » Mon Jun 06, 2011 8:17 pm

Hello Stefan,

Yes, I extract the text element by element; I need each word with its position and attributes.

You are right, with PXCp_ET_GetPageContentAsTextW the result is better.

Is there a way to use this method to receive entire words?

Anyway, ABBYY uses a dictionary to confirm words, so it doesn't make sense to split the words before writing them to the file.

Thanks,
Rony

Tue Jun 07, 2011 8:06 am

You can get all the text from a page (without positions) or each element with its position, exactly as it is stored in the PDF content. Yes, in most cases words in a PDF are split, and we do not provide a mechanism to collate word fragments because there are a lot of special cases that require separate handling: text on curves, overlapping text and so on.
All I can recommend is to get the text by elements and then collate words where possible (we provide character metrics, so you can calculate where a fragment ends and decide whether to collate it with the following one).
HTH.
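
A minimal sketch of the collation recommended above, written in Python rather than against the SDK itself. The `Fragment` class, its field names, and the `max_gap` and `y_tol` thresholds are assumptions made for illustration; none of them are part of the PDF-XChange API. The sketch assumes each fragment carries its start position and cumulative character offsets, with the last offset marking where the fragment ends. That appears to match the sample dump earlier in the thread, where 51.36 + 9.937 = 61.297 is the start of the next fragment.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float              # start of the fragment along the baseline, in points
    y: float              # baseline position, in points
    text: str             # characters exactly as stored in the PDF content
    offsets: list[float]  # cumulative offsets; the last one is taken as the fragment end

    @property
    def end_x(self) -> float:
        return self.x + self.offsets[-1]


def collate(fragments, max_gap=1.0, y_tol=0.5):
    """Group fragments into runs: same baseline, each starting within max_gap of the previous end."""
    runs = []
    # Top line first (PDF y grows upward), then left to right along the line.
    # Assumes fragments on one line report the same y value.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if runs:
            last = runs[-1][-1]
            same_line = abs(frag.y - last.y) <= y_tol
            if same_line and frag.x - last.end_x <= max_gap:
                runs[-1].append(frag)
                continue
        runs.append([frag])
    return runs


def run_text(run):
    """Concatenate the text of a run in stored (visual) order."""
    return "".join(f.text for f in run)


# Hypothetical fragments; the first two positions are taken from the dump above.
frags = [
    Fragment(51.36, 547.8, "ab", [0.0, 4.828, 9.937]),
    Fragment(61.297, 547.8, "c", [0.0, 2.9]),
    Fragment(120.0, 547.8, "de", [0.0, 5.0, 10.0]),
]
print([run_text(r) for r in collate(frags)])  # ['abc', 'de']
```

In the sample dump the fragments touch each other and the word breaks show up as space characters inside the fragments, so a run built this way would still need to be split on whitespace. The thresholds are in points and would need tuning per document.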

Post by **ronystg** » Tue Jun 07, 2011 10:00 am

Our project is to analyze books, so there are no curves or special cases.

Because Hebrew runs from right to left, we have to reverse all the words in the PDF. Special characters such as .,;'\|" can appear on either the left or the right of a word.

So we analyze the position of each character relative to the word and decide where to put it on the fly.

We didn't realize that words can be split, and now we are not sure where to put the fragments.

We would appreciate being able to tell the program that a space can be more than 5-10 pixels, so that text fragments are joined into a word.

Thanks,

Rony
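
As a rough illustration of the reversal step described in this post, the sketch below turns a visually ordered run (characters in left-to-right page order) into logical reading order. The function names are hypothetical, and the approach is a deliberate simplification: it reverses the whole run when right-to-left characters dominate, so punctuation moves with the word, but embedded digits or Latin text would come out reversed too. Handling those properly would require the full Unicode bidirectional algorithm.

```python
import unicodedata


def is_rtl(ch: str) -> bool:
    """True for characters with a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def visual_to_logical(run: str) -> str:
    """Reverse a visually ordered run when it is predominantly right-to-left."""
    rtl = sum(is_rtl(c) for c in run)
    ltr = sum(unicodedata.bidirectional(c) == "L" for c in run)
    return run[::-1] if rtl > ltr else run


def words(run: str) -> list[str]:
    """Logical-order words of a run, split on whitespace."""
    return visual_to_logical(run).split()


# Two fragments in page order: (final pe, alef) followed by (vav).
visual = "\u05e3\u05d0" + "\u05d5"
print(words(visual))  # a single word in reading order: vav, alef, final pe
```

Joining the fragments before reversing means a fragment boundary inside a word no longer matters; only the whitespace in the joined run decides where words end.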

Post by **Tracker Supp-Stefan** » Tue Jun 07, 2011 3:36 pm

Hello Rony,

Please note that the coordinates are in points, not pixels.
As for the logic behind concatenating those text elements, you will need to implement this on your own.
You could, for example, combine a third-party dictionary with a comparison against the result returned from PXCp_ET_GetPageContentAsTextW to help you decide whether the distance between two consecutive elements separates two different words or not.

Best,
Stefan
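
A small sketch of the two points in this reply, under stated assumptions. `points_to_pixels` uses the standard definition of a PDF point as 1/72 inch. `merge_with_reference` is a hypothetical helper: it assumes the page text has already been obtained separately (for example from PXCp_ET_GetPageContentAsTextW, which is not called here), and that the fragments are whitespace-free and already in logical reading order. It greedily extends a word while the result is still a prefix of some word in the reference text.

```python
def points_to_pixels(points: float, dpi: float = 96.0) -> float:
    """One PDF point is 1/72 inch, so the pixel size depends on the rendering DPI."""
    return points * dpi / 72.0


def merge_with_reference(fragments, page_text):
    """Join consecutive fragments while the result is a prefix of a reference word."""
    vocab = set(page_text.split())
    merged, current = [], ""
    for frag in fragments:
        candidate = current + frag
        if any(word.startswith(candidate) for word in vocab):
            current = candidate  # still consistent with a reference word
        else:
            if current:
                merged.append(current)
            current = frag       # start a new word with this fragment
    if current:
        merged.append(current)
    return merged


print(points_to_pixels(72))  # 96.0 at the default 96 DPI
print(merge_with_reference(["gold", "man", "to", "his"], "goldman to his father"))
# ['goldman', 'to', 'his']
```

The greedy prefix test can merge too eagerly when one reference word is a prefix of another, so in practice it would probably be combined with a distance check between consecutive elements, as suggested above.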

Post by **inaniwa** » Mon Jun 20, 2011 5:58 am

Tracker Supp-Stefan wrote: Hello inaniwa,

We do not have a firm release date yet, but are working as hard as we can to complete it as soon as possible.

We will announce more specific dates when possible. We are still on track to release the new PDF-XChange Viewer SDK, on which much of the required code for the libraries will be based, before the end of this quarter. The PDF-Tools SDK elements you are using will then follow later this summer.

Thanks for your understanding,
Stefan

Dear Tracker Software Products,
Any update on this?
Please let me know the progress.

Post by **Tracker Supp-Stefan** » Mon Jun 20, 2011 3:20 pm

Hello inaniwa,

No specific date yet I am afraid, but we hope to have more information for you by the end of the week.
Please note that the SDK products would be released after the end user ones, as there is some more finishing work + documentation for them compared to the end user versions.

Best,
Stefan

Post by **inaniwa** » Tue Jun 28, 2011 8:12 pm

Hello Stefan,

>No specific date yet I am afraid, but we hope to have more information for you by the end of the week.

I have not heard from you yet.

inaniwa

Post by **Paul - Tracker Supp** » Tue Jun 28, 2011 8:38 pm

Hi inaniwa,

There is a formal announcement on the progress of the new versions here: https://forum.pdf-xchange.com/ ... 521#p50521

hth

Post by **inaniwa** » Thu Sep 29, 2011 9:27 pm

Lzcat - Tracker Supp wrote: Your documents use multibyte encodings, which are not currently supported by xcpro40. Please wait for the next major update.

Does this file have the same cause?

Fri Sep 30, 2011 6:12 am

Almost the same: another encoding of the same kind.

Currently the xcpro40 Library can handle such fonts only if the 'ToUnicode' table is embedded with it.
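
For context, a rough illustration of what an embedded 'ToUnicode' table provides. This is not how xcpro40 works internally; it is a simplified Python sketch that parses only the `beginbfchar` sections of a ToUnicode CMap (ranges declared with `beginbfrange` are ignored), with hypothetical function names and a made-up two-entry table. Without such a table, an extractor only sees the font's character codes and has to know the predefined encoding to turn them into text.

```python
import re

# A bfchar entry maps a source code to a UTF-16BE destination, both written in hex.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap: str) -> dict[int, str]:
    """Collect code -> Unicode string pairs from the bfchar sections of a CMap."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in BFCHAR_ENTRY.findall(block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes, mapping):
    """Translate character codes to text; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Made-up table: codes 1 and 2 map to the Hebrew letters alef and bet.
sample = """
2 beginbfchar
<0001> <05D0>
<0002> <05D1>
endbfchar
"""
table = parse_bfchar(sample)
print(decode([1, 2, 3], table))  # alef, bet, then U+FFFD for the unmapped code 3
```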

Post by **inaniwa** » Mon Oct 01, 2012 8:26 am

Hi,
I was told this would be fixed in the next version, but it is still not fixed.
When will it be fixed?
Best regards.

Post by **Tracker Supp-Stefan** » Mon Oct 01, 2012 10:15 am

Hello inaniwa,

The "next version" Victor (Lzcat) had in mind is actually V5 of our PDF Tools SDK, which is not yet released. No build of V4 of the SDK will support predefined CJK encodings.

Best,
Stefan

Post by **inaniwa** » Tue Oct 02, 2012 12:06 am

Hi,
I got it.
Best regards.

Post by **Tracker Supp-Stefan** » Tue Oct 02, 2012 8:49 am

Thanks for your understanding!

Regards,
Stefan
