OCR adds another (new) image layer

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer


Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

OCR adds another (new) image layer

Post by Timur Born »

Hello.

I wonder why OCR adds another (new) image layer on top of the text layer when the option to preserve the original content is used. When I OCR a scanned image, I end up with two identical image layers.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Hi Timur,

Thanks for the post - I see the same thing here and I'm not entirely sure why that is, but we're currently in the process of re-writing the OCR module completely. It should be released within the next build or two after 320, if all goes well, so there will be some fairly major improvements.

Cheers,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

Hi Will,

Thanks for getting back to me. I will wait and see what the new OCR module brings.
Patrick-Tracker Supp
Site Admin
Posts: 1645
Joined: Thu Mar 27, 2014 6:14 pm
Location: Vancouver Island

Re: OCR adds another (new) image layer

Post by Patrick-Tracker Supp »

:D

Cheers,

Patrick Charest
Tracker Support North America
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

While you are at it: please allow Editor to be minimized while OCR is running. Currently only the OCR popup can be minimized, but not the main window.
Tracker Supp-Stefan
Site Admin
Posts: 17818
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR adds another (new) image layer

Post by Tracker Supp-Stefan »

Hello Timur,

Thanks for this suggestion. I will bring it up for discussion at the next meeting!

Regards,
Stefan
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

Out of curiosity: Do the upcoming changes to OCR also include better character recognition? Acrobat DC is kind of unbeatable in this department, even by specialized software like Omnipage and FineReader, but having more reliable OCR in XChange would be a nice bonus.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Hi Timur,

It should do, yes. The OCR re-write is going to be fairly heavy and comprehensive, but I can't say how major the improvement will be, or what specifically will be in the initial release of the re-write for now.

Cheers,
Mitch
User
Posts: 4
Joined: Mon May 01, 2017 3:14 am

Re: OCR adds another (new) image layer

Post by Mitch »

Xchange Editor V6 (build 321).
The output option to recreate a new document, or add a text layer (Document - OCR Pages) is not available in the workflow File - New Document (Image Post Processing).
Tracker Supp-Stefan
Site Admin
Posts: 17818
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: OCR adds another (new) image layer

Post by Tracker Supp-Stefan »

Hello Mitch,

I just tested it - and the OCR is still there as expected. Please note that you need to click the "OCR" check box for the button to select the languages to become active:
File_new_ocr.png
Regards,
Stefan
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

Any news when the new OCR implementation will arrive? I just tried to OCR a document where the current OCR would turn separated (italic) words into one big word, regardless of accuracy settings.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Hi Timur,

The new OCR is slated for release with Version 7, so not until roughly September (date obviously subject to change depending on content, etc.).
Timur Born wrote: I just tried to OCR a document where the current OCR would turn separated (italic) words into one big word, regardless of accuracy settings.
Just tried to reproduce with a document created here and wasn't able to. Can you send a sample doc.?

Cheers,
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

1. I sent you a file where Editor's OCR turns *every* line of every paragraph into single words without spaces.

2. This also is a good example of how "Copy as Rich Text" (of the original text already present) shows serious weaknesses. Most text pages are turned into empty pages with text being written one single character per line! This is why I even tried to OCR these files to begin with, even though they already provided text. Compare that to the copy & paste results of Adobe Reader and you will see that it is a night & day difference.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Hi Timur,

Thanks for that - it works perfectly for me here. The copy/paste results are identical between the Editor and Adobe for me (both before and after OCR), and the OCR results don't differ from the original. Can you advise on your OCR settings and/or walk me through step-by-step?

Also, do you see this on every single page, or on specific pages?

Thanks,
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Just thought to test pasting into MS Word to see the rich text difference in a better light - I do actually see the copy/paste difference between the Editor and Adobe, so I'll pass that along (ticket RT-3978).

However, I still don't see the OCR issue that you mentioned.
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

Copy & Paste:

- If you mark parts of page 4 and page 5 together then the resulting paste of page 5 gets turned into one character lines. For my original test I used CTRL-A to copy all text, but this time I manually selected text on page 5, page 4+5 and page 5+6. The problem only happens when page 4+5 are selected, but not with the combination of page 5+6. With the latter you rather get blank pages in between when the rich text is copied to Word.

- If you copy one or both columns of page 5 from Editor to Word as rich text then the formatting is all over the place. If you copy the same from Reader then you get a single coherent column in Word (Adobe Acrobat's full version offers a third option that keeps the original formatting intact, including the running text around the center image).

OCR:

Here is an original paragraph from page 5:

"You’re starting the Strange Aeons Adventure Path, but
what kind of character should you play? How much
should you develop your character’s backstory, knowing
that the characters can’t remember much of their past?"

Here is the OCR version (Language English, Accuracy medium):

"You’restartingtheStrangeAeonsAdventurePath,but
whatkindofcharactershouldyouplay?Howmuch
shouldyoudevelopyourcharacter’sbackstory,knowing
thatthecharacterscan’tremembermuchoftheirpast?"
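A quick way to confirm that the two paragraphs above differ only in spacing (i.e. the characters themselves were recognized correctly and only the word breaks were lost) is to compare them with all white space stripped. This is just a minimal Python sketch; the helper name `same_without_whitespace` is made up for illustration and has nothing to do with the Editor itself:

```python
def same_without_whitespace(original: str, ocr: str) -> bool:
    """Return True if both strings contain the same characters in the
    same order once every white space character is removed."""
    # str.split() with no arguments splits on any run of white space,
    # so joining the pieces drops spaces, tabs and line breaks alike.
    return "".join(original.split()) == "".join(ocr.split())


original_line = "You’re starting the Strange Aeons Adventure Path, but"
ocr_line = "You’restartingtheStrangeAeonsAdventurePath,but"

print(same_without_whitespace(original_line, ocr_line))
```

If the check passes while the raw strings differ, the difference is purely in the spaces.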
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Hi Timur,
Timur Born wrote: Copy & Paste:

- If you mark parts of page 4 and page 5 together then the resulting paste of page 5 gets turned into one character lines. For my original test I used CTRL-A to copy all text, but this time I manually selected text on page 5, page 4+5 and page 5+6. The problem only happens when page 4+5 are selected, but not with the combination of page 5+6. With the latter you rather get blank pages in between when the rich text is copied to Word.

- If you copy one or both columns of page 5 from Editor to Word as rich text then the formatting is all over the place. If you copy the same from Reader then you get a single coherent column in Word (Adobe Acrobat's full version offers a third option that keeps the original formatting intact, including the running text around the center image).
I've added that to the ticket - I see exactly what you mean and will admit, the result isn't pretty.
Timur Born wrote: OCR:

Here is an original paragraph from page 5:

"You’re starting the Strange Aeons Adventure Path, but
what kind of character should you play? How much
should you develop your character’s backstory, knowing
that the characters can’t remember much of their past?"

Here is the OCR version (Language English, Accuracy medium):

"You’restartingtheStrangeAeonsAdventurePath,but
whatkindofcharactershouldyouplay?Howmuch
shouldyoudevelopyourcharacter’sbackstory,knowing
thatthecharacterscan’tremembermuchoftheirpast?"
I don't see that here - I've attached a Word doc. that shows my results:
My Results.zip
I get identical results with the Create New Searchable PDF and Preserve Original Content... options and I used medium accuracy & English. Is there something that I'm doing differently?
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

Looks like a case of different "Copy white space mode" settings between our setups. I suspect that you use proportional white space, because there are some places in your Word example where more than one white space separates two words. I use "Preserve original" myself, so that's that. ;)

Distance words proportionally:

While   many   of  the   same   options   that   make   great
characters  in  any  Adventure  Path  work  well  for  this
campaign,   a   few   class   options   are   especially   suited
to  a   campaign  where  the  characters   struggle  against
Only one white space between words:

While many of the same options that make great
characters in any Adventure Path work well for this
campaign, a few class options are especially suited
to a campaign where the characters struggle against
Preserve original white spaces only:

Whilemanyofthesameoptionsthatmakegreat
charactersinanyAdventurePathworkwellforthis
campaign,afewclassoptionsareespeciallysuited
toacampaignwherethecharactersstruggleagainst
The latter is the one that reveals that OCR removes white spaces from the original.
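To illustrate why the three settings can give such different output for the same page, here is a rough Python sketch of how such modes might behave. This is only a hypothetical model, not how PDF-XChange Editor is actually implemented: the `Word` class, the mode names and the `space_width` value are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    gap_before: float      # assumed horizontal gap to the previous word, in points
    has_space_char: bool   # assumed flag: does the text layer contain a real space here?


def join_words(words: List[Word], mode: str, space_width: float = 3.0) -> str:
    """Join words on one line according to a (hypothetical) white space mode."""
    out = []
    for i, word in enumerate(words):
        if i > 0:
            if mode == "preserve_original":
                # Only emit a space if the text layer really contains one.
                sep = " " if word.has_space_char else ""
            elif mode == "single":
                # Always exactly one space between words.
                sep = " "
            elif mode == "proportional":
                # Number of spaces scales with the visual gap, at least one.
                sep = " " * max(1, round(word.gap_before / space_width))
            else:
                raise ValueError(f"unknown mode: {mode}")
            out.append(sep)
        out.append(word.text)
    return "".join(out)


# A line whose text layer has no space characters at all (as after the OCR above).
line = [Word("While", 0.0, False), Word("many", 9.0, False), Word("of", 6.0, False)]

for mode in ("proportional", "single", "preserve_original"):
    print(repr(join_words(line, mode)))
```

Under these assumptions only the "preserve original" style of mode exposes the missing space characters, because the other two synthesize spaces from the gaps.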
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Thanks Timur, I was using a different "Copy white space mode" - sorry, I've been away for 2 weeks and am still getting back into the 'swing of things' :wink:

I've reproduced this here and passed it along via a ticket (RT-3984).

Thanks,
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

Timur Born wrote: I wonder why OCR adds another (new) image layer on top of the text layer when the option to preserve the original content is used. When I OCR a (scanned) image then I end up with two identical image layers.
This seems to have been fixed in the meantime?!
Will - Tracker Support wrote: Thanks Timur, I was using a different "Copy white space mode" - sorry, I've been away for 2 weeks and am still getting back into the 'swing of things' :wink:

I've reproduced this here and passed it along via a ticket (RT-3984).
It's worth mentioning that "Edit Content" always uses a single white space regardless of what is set in the Copy Text options. I don't know if this is intended behavior or not?!
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Hi Timur,
Timur Born wrote: This seems to have been fixed in the meantime?!
Awesome, glad to hear that!
Timur Born wrote: It's worth mentioning that "Edit Content" always uses a single white space regardless of what is in the Copy Text options. I don't know if this is wanted behavior or not?!
I believe that this is by design. Copying text using the Select Text Tool is more difficult than it would visually appear, because spaces as such do not exist in PDF files. We have to determine what constitutes a space by looking at the gap between characters, rather than actually looking for a space character. But I believe that, when text is being edited, we're temporarily able to act like a word processor (e.g. Notepad, WordPad, etc.) that includes space characters, so it's much easier for us to copy text and those options are no longer necessary. I'll need to double check to confirm that, but the developer responsible is on holiday so I won't hear back for a little while.
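As a rough illustration of the gap-based approach described above, a space could be inferred whenever the horizontal distance between two neighbouring glyphs exceeds some fraction of the font size. The Python sketch below is only an assumption-laden example and not the Editor's actual algorithm: the `Glyph` tuple layout, the function name `infer_spaces` and the `threshold` of 0.25 are all invented for illustration.

```python
from typing import List, Tuple

# (character, left x, right x) in points; a hypothetical, simplified glyph record
Glyph = Tuple[str, float, float]


def infer_spaces(glyphs: List[Glyph], font_size: float, threshold: float = 0.25) -> str:
    """Rebuild a line of text, inserting a space wherever the gap between
    two consecutive glyphs is larger than threshold * font_size."""
    if not glyphs:
        return ""
    parts = [glyphs[0][0]]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur[1] - prev[2]  # left edge of current minus right edge of previous
        if gap > threshold * font_size:
            parts.append(" ")
        parts.append(cur[0])
    return "".join(parts)


# "to a": a small gap inside "to", a wide gap before "a"
sample = [("t", 0.0, 3.0), ("o", 3.2, 8.0), ("a", 11.5, 16.0)]
print(infer_spaces(sample, font_size=10.0))
```

A real implementation would also have to cope with kerning, justified text and mixed font sizes, which is where a single fixed threshold tends to break down.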

Thanks,
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR adds another (new) image layer

Post by Timur Born »

He will still have to look into the "Preserve original white spaces" bug for the normal (non edit) text selection tool then. And the presence of this option seems to suggest that there are original white spaces in PDFs?! At least Adobe Reader has no problems copying these. ;)
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Re: OCR adds another (new) image layer

Post by Will - Tracker Supp »

Timur Born wrote: He will still have to look into the "Preserve original white spaces" bug for the normal (non edit) text selection tool then.
Absolutely!
Timur Born wrote: And the presence of this option seems to suggest that there are original white spaces in PDFs?! At least Adobe Reader has no problems copying these. ;)
Not exactly - the UI text is likely phrased that way to make more sense to users who wouldn't otherwise understand.

Cheers,