text is garble
Dear Tracker Software Products,
I am using the PXCp_ET_GetElement function of the PDF-XChange PRO SDK to extract text.
However, the extracted text comes out garbled.
I have attached the PDF that does not extract properly.
Please assist as soon as possible.
Software Purchased: PDF-XChange 4 PRO SDK Version: 4.0195
- Attachments
-
- files.zip
- (804.72 KiB) Downloaded 290 times
- Lzcat - Tracker Supp
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: text is garble
Your documents use multibyte encodings, which are not currently supported by xcpro40. Please wait for the next major update.
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Re: text is garble
Lzcat - Tracker Supp wrote:
>Your documents use multibyte encodings which are not supported by xcpro40. Please wait for next major update.
Please let me know the release date.
- Tracker Supp-Stefan
- Site Admin
- Posts: 17941
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: text is garble
Hello inaniwa,
We do not have a firm release date yet, but are working as hard as we can to complete it as soon as possible.
We will make more specific date announcements when possible, but we are still on track to release the new PDF-XChange Viewer SDK, on which much of the required code for the libraries will be based, before the end of this quarter. The PDF-Tools SDK elements you are using will then follow later this summer.
Thanks for your understanding,
Stefan
Re: text is garble
Hi,
I am also having a problem with text extraction.
For some reason the words are split into many fragments. The PDF comes from ABBYY and contains both image and text; when the text is copied to a text document it looks OK.
Can you assist?
I don't know if you can see the Hebrew letters, but in the CSV you can see many words with only one letter.
"
היחסים ששררו בין גולדמן לאביו, ואת העובדה שגולדמן לא מחשיב, ואף
מתעב, את הטקסים הללו, ואילו רק יכול היה ספק אם הוא עצמו היה
משתתף בהלוויתו של אביו, ומה עוד בהלוויתו של אביו של אחד מחבריו
"
"
============
Matrix.d Matrix.e Matrix.f TH fontsize count TEXT OFFSET:
1 51.36 547.8 " 99.000" 10.5 3 " [ףא]" " 0.000" 4.828 9.937
1 61.297 547.8 " 99.000" 10.5 2 " [ו]" " 0.000" 2.9
1 64.198 547.8 " 99.000" 10.5 7 " [ ;בישח]" " 0.000" 4.923 7.689 12.586 15.529 22.037 27.163
1 91.361 547.8 " 99.000" 10.5 2 " [מ]" " 0.000" 5.042
1 96.405 547.8 " 99.000" 10.5 3 " [ א]" " 0.000" 5.119 10.112
1 106.517 547.8 " 99.000" 10.5 2 " [ל]" " 0.000" 4.491
1 111.008 547.8 " 99.000" 10.5 8 " [ ןמדלוג]" " 0.000" 4.556 7.573 12.782 17.306 21.965 25.034 28.684
1 139.692 547.8 " 99.000" 10.5 2 " [ש]" " 0.000" 6.341
1 146.036 547.8 " 99.000" 10.5 7 " [ הדבוע]" " 0.000" 5.531 10.581 15.09 19.973 23.026 27.899
1 173.935 547.8 " 99.000" 10.5 2 " [ה]" " 0.000" 4.896
1 178.833 547.8 " 99.000" 10.5 4 " [ תא]" " 0.000" 5.375 10.799 16.026
1 194.859 547.8 " 99.000" 10.5 2 " [ו]" " 0.000" 2.9
1 197.76 547.8 " 99.000" 10.5 7 " [ ;ויבא]" " 0.000" 5.049 7.755 10.762 13.644 18.481 23.588
1 221.348 547.8 " 99.000" 10.5 2 " [ל]" " 0.000" 4.491
1 225.841 547.8 " 99.000" 10.5 7 " [ ןמדלו]" " 0.000" 4.691 7.52 12.543 16.879 21.351 24.233
1 250.074 547.8 " 99.000" 10.5 2 " [ג]" " 0.000" 3.482
1 253.558 547.8 " 99.000" 10.5 4 " [ ןי]" " 0.000" 4.816 7.771 10.652
1 264.21 547.8 " 99.000" 10.5 2 " [ב]" " 0.000" 4.73
1 268.94 547.8 " 99.000" 10.5 6 " [ וררש]" " 0.000" 5.155 8.38 13.58 18.78 25.445
1 294.385 547.8 " 99.000" 10.5 2 " [ש]" " 0.000" 6.341
============
Rony Steinberg
- Attachments
-
- זכרון דברים - 0010.pdf
- (1.4 MiB) Downloaded 284 times
- Tracker Supp-Stefan
- Site Admin
- Posts: 17941
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: text is garble
Hello Rony,
I presume that you are extracting the text element by element, and the result you get is the correct one. When OCR-ing this file, ABBYY FineReader simply broke the words apart, and nothing requires a text element to be a whole word, so a three-letter word could consist of three elements.
Please try PXCp_ET_GetPageContentAsTextW, as this could produce a better result, but please also note that for now our tools might not work 100% correctly with right-to-left scripts like Hebrew.
Best,
Stefan
Re: text is garble
Hello Stefan,
Yes, I extract the text element by element; I need each word with its position and attributes.
You are right, with PXCp_ET_GetPageContentAsTextW the result is better.
Is there a way to use this method to receive entire words?
In any case, ABBYY uses a dictionary to confirm words, so it doesn't make sense for it to split the words before writing them to the file.
Thanks,
Rony
- Lzcat - Tracker Supp
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: text is garble
You can get all the text from a page (without positions) or each element with its position, the same as in the PDF content. Yes, in most cases words in a PDF are split, and we do not provide a mechanism to collate word fragments because there are a lot of specific cases which require separate handling: text on curves, overlapping text and so on.
All I can recommend is to get the text by elements and then collate words where possible (we provide character metrics, so you can calculate where a fragment ends and decide whether or not to collate it with the following one).
HTH.
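As a rough illustration of the collation described above, the sketch below joins elements that sit on the same baseline when the gap between the end of one and the start of the next is small. It is plain Python, not SDK code: the `Fragment` class, `collate_words`, and the `max_gap` / `line_tol` thresholds are hypothetical names chosen for this example. It also assumes, based on the CSV dump earlier in this thread, that the last offset of an element equals its total width and that elements on one line report the same baseline value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:
    """One extracted text element (hypothetical structure, not an SDK type)."""
    x: float              # horizontal start of the element, in points
    y: float              # baseline position, in points
    text: str             # characters in the order the PDF stores them
    offsets: List[float]  # cumulative character boundaries; last = total width

    @property
    def end_x(self) -> float:
        # Assumption: the final offset is the advance width of the element.
        return self.x + self.offsets[-1]


def collate_words(fragments: List[Fragment],
                  max_gap: float = 1.0,
                  line_tol: float = 1.0) -> List[List[str]]:
    """Return one list of words per line, joining fragments that touch."""
    lines: List[str] = []
    current = ""
    prev: Optional[Fragment] = None
    # Top-to-bottom (PDF y grows upward), then left-to-right.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if prev is None:
            current = frag.text
        elif abs(frag.y - prev.y) > line_tol:
            lines.append(current)          # baseline changed: start a new line
            current = frag.text
        else:
            gap = frag.x - prev.end_x      # distance in points, not pixels
            current += (" " if gap > max_gap else "") + frag.text
        prev = frag
    if prev is not None:
        lines.append(current)
    # Spaces stored inside the fragments themselves also separate words.
    return [line.split() for line in lines]


# Positions taken from the first four CSV rows; ASCII letters stand in
# for the Hebrew characters to keep the example readable.
sample = [
    Fragment(51.36, 547.8, "ab", [0.0, 4.828, 9.937]),
    Fragment(61.297, 547.8, "c", [0.0, 2.9]),
    Fragment(64.198, 547.8, " ;defg",
             [0.0, 4.923, 7.689, 12.586, 15.529, 22.037, 27.163]),
    Fragment(91.361, 547.8, "h", [0.0, 5.042]),
]
print(collate_words(sample))  # [['abc', ';defgh']]
```

For right-to-left text the joined words are still in the order the PDF stores them, so any reversal or punctuation placement would have to happen after this step. The thresholds are guesses and would need tuning per document.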
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Re: text is garble
Our project is to analyze books, so there are no curves or special cases.
Because Hebrew runs from right to left, we have to reverse all words in the PDF; special characters such as .,;'\|" can be on the left or the right of a word.
So we analyze the position of each character relative to the word and decide where to put it on the fly.
We didn't realize that words can be split, and now we are not sure where to put the fragments.
We would appreciate being able to tell the program that a space between words is more than 5-10 pixels, so that text fragments closer than that are treated as one word.
Thanks,
Rony
- Tracker Supp-Stefan
- Site Admin
- Posts: 17941
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: text is garble
Hello Rony,
Please note that the coordinates are in points and not pixels.
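To make the unit difference concrete, here is a minimal conversion sketch in Python. It relies on the default PDF unit of 1/72 inch per point; the function names and the 96 dpi default are assumptions made for this example, not SDK values.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit: 1 point = 1/72 inch


def points_to_pixels(points: float, dpi: float = 96.0) -> float:
    """Convert a length in points to pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH


def pixels_to_points(pixels: float, dpi: float = 96.0) -> float:
    """Convert a length in pixels at the given resolution to points."""
    return pixels * POINTS_PER_INCH / dpi


# A gap threshold of 5-10 pixels, assuming a 96 dpi rendering:
print(pixels_to_points(5), pixels_to_points(10))  # 3.75 7.5
```

At a different rendering resolution the same pixel threshold maps to a different distance in points, so a threshold compared against extracted coordinates is best expressed in points from the start.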
As for the logic behind concatenating those text elements, you will need to implement this on your own.
But you could, for example, combine a third-party dictionary with a comparison against the result returned by PXCp_ET_GetPageContentAsTextW to help you decide whether the distance between two consecutive elements separates two different words or not.
Best,
Stefan
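One possible reading of this suggestion, sketched in Python under several assumptions: the page text has already been obtained as a string (for instance from PXCp_ET_GetPageContentAsTextW, which is not called here), the fragments are already stripped of spaces and in reading order, and both use the same character order. The function name `merge_with_reference` is hypothetical.

```python
from typing import List


def merge_with_reference(fragments: List[str], page_text: str) -> List[str]:
    """Greedily join fragments while the result still starts a known word."""
    vocabulary = set(page_text.split())   # words seen in the full-page text
    words: List[str] = []
    buffer = ""
    for piece in fragments:
        candidate = buffer + piece
        if any(word.startswith(candidate) for word in vocabulary):
            buffer = candidate            # still a prefix of a reference word
        else:
            if buffer:
                words.append(buffer)      # close the word built so far
            buffer = piece
    if buffer:
        words.append(buffer)
    return words


print(merge_with_reference(["Gold", "man", "fa", "ther"], "Goldman father"))
# ['Goldman', 'father']
```

This greedy approach can join too much when one reference word is a prefix of another, so in practice it would probably be combined with the distance between elements rather than used alone.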
Re: text is garble
Tracker Supp-Stefan wrote:
>Hello inaniwa,
>We do not have a firm release date yet, but are working as hard as we can to complete it as soon as possible.
>We will make more specific date announcements when possible, but we are still on track for release before the end of this quarter for the new PDF-XChange Viewer SDK on which much of the required code for the libraries will be based - the PDF-Tools SDK elements you are using will then follow later this summer.
>Thanks for your understanding,
>Stefan
Dear Tracker Software Products,
Is there any update on this?
Please let me know the progress.
- Tracker Supp-Stefan
- Site Admin
- Posts: 17941
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: text is garble
Hello inaniwa,
No specific date yet I am afraid, but we hope to have more information for you by the end of the week.
Please note that the SDK products will be released after the end-user ones, as they need some more finishing work and documentation than the end-user versions.
Best,
Stefan
Re: text is garble
Hello Stefan,
>No specific date yet I am afraid, but we hope to have more information for you by the end of the week.
I have not heard from you yet.
inaniwa
- Paul - Tracker Supp
- Site Admin
- Posts: 6901
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
- Contact:
Re: text is garble
Hi inaniwa,
There is a formal announcement on the progress of the new versions here: https://forum.pdf-xchange.com/ ... 521#p50521
hth
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Re: text is garble
Lzcat - Tracker Supp wrote:
>Your document's use multibyte encodings which is not currently supported by xcpro40. Please wait for next major update.
Does this file fail for the same reason?
- Attachments
-
- sample.pdf
- (189.01 KiB) Downloaded 270 times
- Lzcat - Tracker Supp
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: text is garble
Almost the same - another encoding of the same kind.
Currently the xcpro40 Library can handle such fonts only if the 'ToUnicode' table is embedded with it.
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Re: text is garble
Hi,
This was said to be fixed in the next version, but it is still not fixed.
When will it be fixed?
Best regards.
- Attachments
-
- files.zip
- (711.47 KiB) Downloaded 221 times
- Tracker Supp-Stefan
- Site Admin
- Posts: 17941
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: text is garble
Hello inaniwa,
The "next version" Victor (Lzcat) has in mind is actually V5 of our PDF Tools SDK, which is not yet released. No build of V4 of the SDK will support predefined CJK encodings.
Best,
Stefan
Re: text is garble
Hi,
I got it.
Best regards.
- Tracker Supp-Stefan
- Site Admin
- Posts: 17941
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: text is garble
Thanks for your understanding!
Regards,
Stefan