OCR adds another (new) image layer

Posted: Sun Jan 22, 2017 11:54 pm
by Timur Born
Hello.

I wonder why OCR adds another (new) image layer on top of the text layer when the option to preserve the original content is used. When I OCR a (scanned) image, I end up with two identical image layers.

Re: OCR adds another (new) image layer

Posted: Mon Jan 23, 2017 8:22 am
by Will - Tracker Supp
Hi Timur,

Thanks for the post - I see the same thing here and I'm not entirely sure why that is, but we're currently in the process of re-writing the OCR module completely. It should be released within the next build or two after 320, if all goes well, so there will be some fairly major improvements.

Cheers,

Re: OCR adds another (new) image layer

Posted: Tue Jan 24, 2017 9:06 pm
by Timur Born
Hi Will,

Thanks for getting back to me. I will wait and see what the new OCR module brings.

Re: OCR adds another (new) image layer

Posted: Wed Jan 25, 2017 12:26 am
by Patrick-Tracker Supp
:D

Re: OCR adds another (new) image layer

Posted: Thu Jan 26, 2017 9:04 am
by Timur Born
While you are at it: Please allow the Editor to be minimized while OCR is running. Currently only the OCR popup can be minimized, but not the main window.

Re: OCR adds another (new) image layer

Posted: Thu Jan 26, 2017 12:05 pm
by Tracker Supp-Stefan
Hello Timur,

Thanks for this suggestion. I will bring it up for discussion at the next meeting!

Regards,
Stefan

Re: OCR adds another (new) image layer

Posted: Tue Jan 31, 2017 10:34 am
by Timur Born
Out of curiosity: Do the upcoming changes to OCR also include better character recognition? Acrobat DC is kind of unbeatable in this department, even by specialized software like Omnipage and FineReader, but having more reliable OCR in XChange would be a nice bonus.

Re: OCR adds another (new) image layer

Posted: Tue Jan 31, 2017 10:51 am
by Will - Tracker Supp
Hi Timur,

It should do, yes - the OCR re-write is going to be fairly heavy and comprehensive, but I can't say how major the improvement will be, or what specifically will be in the initial release of the re-write for now.

Cheers,

Re: OCR adds another (new) image layer

Posted: Mon May 01, 2017 3:31 am
by Mitch
Xchange Editor V6 (build 321).
The output option to recreate a new document, or add a text layer (Document - OCR Pages) is not available in the workflow File - New Document (Image Post Processing).

Re: OCR adds another (new) image layer

Posted: Wed May 03, 2017 8:58 am
by Tracker Supp-Stefan
Hello Mitch,

I just tested it - and the OCR is still there as expected. Please note that you need to click the "OCR" check box for the button to select the languages to become active:
File_new_ocr.png
Regards,
Stefan

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 10:35 am
by Timur Born
Any news when the new OCR implementation will arrive? I just tried to OCR a document where the current OCR would turn separated (italic) words into one big word, regardless of accuracy settings.

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 12:18 pm
by Will - Tracker Supp
Hi Timur,

The new OCR is slated for release with Version 7, so not until roughly September (date obviously subject to change depending on content, etc.).
Timur Born wrote: I just tried to OCR a document where the current OCR would turn separated (italic) words into one big word, regardless of accuracy settings.
I just tried to reproduce this with a document created here and wasn't able to. Can you send a sample doc.?

Cheers,

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 12:57 pm
by Timur Born
1. I sent you a file where the Editor's OCR turns *every* line of every paragraph into a single word without spaces.

2. This also is a good example of how "Copy as Rich Text" (of the original text already present) shows serious weaknesses. Most text pages are turned into empty pages with text being written one single character per line! This is why I even tried to OCR these files to begin with, even though they already provided text. Compare that to the copy & paste results of Adobe Reader and you will see that it is a night & day difference.

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 1:09 pm
by Will - Tracker Supp
Hi Timur,

Thanks for that - it works perfectly for me here. The copy/paste results are identical between the Editor and Adobe for me (both before and after OCR), and the OCR results don't differ from the original. Can you advise on your OCR settings and/or walk me through step-by-step?

Also, do you see this on every single page, or on specific pages?

Thanks,

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 1:12 pm
by Will - Tracker Supp
Just thought to test pasting into MS Word to see the rich text difference in a better light - I do actually see the copy/paste difference between the Editor and Adobe, so I'll pass that along (ticket RT-3978).

However, I still don't see the OCR issue that you mentioned.

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 1:54 pm
by Timur Born
Copy & Paste:

- If you mark parts of page 4 and page 5 together then the resulting paste of page 5 gets turned into one character lines. For my original test I used CTRL-A to copy all text, but this time I manually selected text on page 5, page 4+5 and page 5+6. The problem only happens when page 4+5 are selected, but not with the combination of page 5+6. With the latter you rather get blank pages in between when the rich text is copied to Word.

- If you copy one or both columns of page 5 from Editor to Word as rich text then the formatting is all over the place. If you copy the same from Reader then you get a single coherent column in Word (Adobe Acrobat's full version offers a third option that keeps the original formatting intact, including the running text around the center image).

OCR:

Here is an original paragraph from page 5:

"You’re starting the Strange Aeons Adventure Path, but
what kind of character should you play? How much
should you develop your character’s backstory, knowing
that the characters can’t remember much of their past?"

Here is the OCR version (Language English, Accuracy medium):

"You’restartingtheStrangeAeonsAdventurePath,but
whatkindofcharactershouldyouplay?Howmuch
shouldyoudevelopyourcharacter’sbackstory,knowing
thatthecharacterscan’tremembermuchoftheirpast?"
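
One way to confirm that the two versions above differ only in white space is to strip all white space from both and compare. This is a minimal sketch; the helper name `differs_only_by_whitespace` is hypothetical and not part of any product, and the sample input is the first line of the paragraph quoted above.

```python
def differs_only_by_whitespace(original: str, ocr: str) -> bool:
    """Return True when both strings match once all white space is removed."""
    def strip_ws(s: str) -> str:
        # str.split() with no arguments splits on any run of white space
        return "".join(s.split())
    return strip_ws(original) == strip_ws(ocr)


original = "You’re starting the Strange Aeons Adventure Path, but"
ocr = "You’restartingtheStrangeAeonsAdventurePath,but"

# True means the characters were recognized correctly and only the spaces were lost
print(differs_only_by_whitespace(original, ocr))
```

If this prints True for every line, the character recognition itself is fine and the problem is limited to word separation.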

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 2:20 pm
by Will - Tracker Supp
Hi Timur,
Timur Born wrote:
Copy & Paste:

- If you mark parts of page 4 and page 5 together then the resulting paste of page 5 gets turned into one character lines. For my original test I used CTRL-A to copy all text, but this time I manually selected text on page 5, page 4+5 and page 5+6. The problem only happens when page 4+5 are selected, but not with the combination of page 5+6. With the latter you rather get blank pages in between when the rich text is copied to Word.

- If you copy one or both columns of page 5 from Editor to Word as rich text then the formatting is all over the place. If you copy the same from Reader then you get a single coherent column in Word (Adobe Acrobat's full version offers a third option that keeps the original formatting intact, including the running text around the center image).
I've added that to the ticket - I see exactly what you mean and will admit, the result isn't pretty.
Timur Born wrote:
OCR:

Here is an original paragraph from page 5:

"You’re starting the Strange Aeons Adventure Path, but
what kind of character should you play? How much
should you develop your character’s backstory, knowing
that the characters can’t remember much of their past?"

Here is the OCR version (Language English, Accuracy medium):

"You’restartingtheStrangeAeonsAdventurePath,but
whatkindofcharactershouldyouplay?Howmuch
shouldyoudevelopyourcharacter’sbackstory,knowing
thatthecharacterscan’tremembermuchoftheirpast?"
I don't see that here - I've attached a Word doc. that shows my results:
My Results.zip
I get identical results with the Create New Searchable PDF and Preserve Original Content... options, and I used medium accuracy & English. Is there something that I'm doing differently?

Re: OCR adds another (new) image layer

Posted: Wed Jul 26, 2017 2:55 pm
by Timur Born
Looks like a case of different "Copy white space mode" settings between our setups. I suspect that you use proportional white space, because there are some places in your Word example where more than one white space separates two words. I use "Preserve original" myself, so that's that. ;)

Distance words proportionally:

While   many   of  the   same   options   that   make   great
characters  in  any  Adventure  Path  work  well  for  this
campaign,   a   few   class   options   are   especially   suited
to  a   campaign  where  the  characters   struggle  against

Only one white space between words:

While many of the same options that make great
characters in any Adventure Path work well for this
campaign, a few class options are especially suited
to a campaign where the characters struggle against

Preserve original white spaces only:

Whilemanyofthesameoptionsthatmakegreat
charactersinanyAdventurePathworkwellforthis
campaign,afewclassoptionsareespeciallysuited
toacampaignwherethecharactersstruggleagainst

The latter is the one that reveals that OCR removes white spaces from the original.
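
The difference between the three modes can be pictured with a small sketch. It is only an illustration under stated assumptions, not the Editor's actual implementation: the `Run` type, the mode names, the gap values and the `space_width` of 3 points are all hypothetical. Each word carries the gap in front of it and a flag saying whether a real space character is stored in the text layer. For the OCR output discussed here, that flag is assumed to be False everywhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    text: str             # one word as stored in the text layer
    gap_before: float     # horizontal gap to the previous word, in points
    has_space_char: bool  # True if a real space character precedes the word


def copy_line(runs: List[Run], mode: str, space_width: float = 3.0) -> str:
    """Join the words of one line according to the chosen (assumed) copy mode."""
    parts = []
    for i, run in enumerate(runs):
        if i > 0:
            if mode == "proportional":
                # one space per space_width of gap, but never fewer than one
                parts.append(" " * max(1, round(run.gap_before / space_width)))
            elif mode == "single":
                parts.append(" ")
            elif mode == "preserve":
                # only spaces that really exist in the text layer survive
                if run.has_space_char:
                    parts.append(" ")
            else:
                raise ValueError(f"unknown mode: {mode}")
        parts.append(run.text)
    return "".join(parts)


line = [
    Run("While", 0.0, False),
    Run("many", 9.0, False),
    Run("of", 9.0, False),
    Run("the", 6.0, False),
]
for mode in ("proportional", "single", "preserve"):
    print(repr(copy_line(line, mode)))
```

With these made-up gaps, the first two modes hide the missing space characters by synthesizing spaces from the gaps, while the third one exposes them.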

Re: OCR adds another (new) image layer

Posted: Thu Jul 27, 2017 1:34 pm
by Will - Tracker Supp
Thanks Timur, I was using a different "Copy white space mode" - sorry, I've been away for 2 weeks and am still getting back into the 'swing of things' :wink:

I've reproduced this here and passed it along via a ticket (RT-3984).

Thanks,

Re: OCR adds another (new) image layer

Posted: Thu Aug 17, 2017 9:13 pm
by Timur Born
Timur Born wrote: I wonder why OCR adds another (new) image layer on top of the text layer when the option to preserve the original content is used. When I OCR a (scanned) image, I end up with two identical image layers.
This seems to have been fixed in the meantime?!
Will - Tracker Support wrote:Thanks Timur, I was using a different "Copy white space mode" - sorry, I've been away for 2 weeks and am still getting back into the 'swing of things' :wink:

I've reproduced this here and passed it along via a ticket (RT-3984).
It's worth mentioning that "Edit Content" always uses a single white space regardless of what is in the Copy Text options. I don't know if this is wanted behavior or not?!

Re: OCR adds another (new) image layer

Posted: Mon Aug 21, 2017 10:21 am
by Will - Tracker Supp
Hi Timur,
Timur Born wrote: This seems to have been fixed in the meantime?!
Awesome, glad to hear that!
Timur Born wrote: It's worth mentioning that "Edit Content" always uses a single white space regardless of what is in the Copy Text options. I don't know if this is wanted behavior or not?!
I believe that this is by design. Copying text using the Select Text Tool is more difficult than it would visually appear, because spaces as such do not exist in PDF files. We have to determine what constitutes a space by looking at the gap between characters, rather than actually looking for a space character. But I believe that when text is being edited, we're temporarily able to act like a text editor or word processor (e.g. Notepad, WordPad) that includes space characters, so it's much easier for us to copy text and those options are no longer necessary. I'll need to double check to confirm that, but the developer responsible is on holiday, so I won't hear back for a little while.
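
To illustrate the idea of finding spaces by looking at gaps, here is a minimal sketch. It is not how the Editor does it: the `Glyph` type, the `gap_ratio` threshold and the coordinates are assumptions made up for the example. A space is inserted wherever the gap between two neighboring characters exceeds a fraction of the mean character width.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points


def glyphs_to_text(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a line of text, inferring spaces from horizontal gaps."""
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    mean_width = sum(g.width for g in ordered) / len(ordered)
    threshold = gap_ratio * mean_width  # hypothetical heuristic
    parts = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > threshold:
            parts.append(" ")
        parts.append(cur.char)
    return "".join(parts)


# "to" followed by "a": the 3-point gap before "a" is read as a space
spaced = [Glyph("t", 0.0, 5.0), Glyph("o", 5.0, 5.0), Glyph("a", 13.0, 5.0)]
# the same characters set tightly: no gap is wide enough to count as a space
tight = [Glyph("t", 0.0, 5.0), Glyph("o", 5.0, 5.0), Glyph("a", 10.0, 5.0)]

print(repr(glyphs_to_text(spaced)))
print(repr(glyphs_to_text(tight)))
```

The choice of threshold is the hard part in practice: too low and letters within a word get split, too high and words run together.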

Thanks,

Re: OCR adds another (new) image layer

Posted: Mon Aug 21, 2017 2:50 pm
by Timur Born
He will still have to look into the "Preserve original white spaces" bug for the normal (non edit) text selection tool then. And the presence of this option seems to suggest that there are original white spaces in PDFs?! At least Adobe Reader has no problems copying these. ;)

Re: OCR adds another (new) image layer

Posted: Mon Aug 21, 2017 3:18 pm
by Will - Tracker Supp
Timur Born wrote: He will still have to look into the "Preserve original white spaces" bug for the normal (non edit) text selection tool then.
Absolutely!
Timur Born wrote: And the presence of this option seems to suggest that there are original white spaces in PDFs?! At least Adobe Reader has no problems copying these. ;)
Not exactly - the UI text is likely phrased that way to make more sense to users who wouldn't otherwise understand.

Cheers,