OCR of Selected Text Regions, with Close Cropping of Remaining Images

Forum for the PDF-XChange Editor - Free and Licensed Versions

Markt-a1b
User
Posts: 45
Joined: Sat Sep 07, 2019 7:10 pm

OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Markt-a1b »

Daniel, et al.:

I would like to edit the attached document, such that all scanned text within the white-backgrounded areas is converted to editable text via OCR, with all such white areas then having their unneeded background (bitmapped) data deleted. This would leave the (effectively) cropped images exactly where they are now, but with all white areas surrounding them cleared of file-size-increasing, unneeded data.

The "Editable Text and Images" and "Fine Page Content" alternatives within the "OCR Pages (Enhanced)" menu offer the choices "clears corresponding region" and "replaces the original page content," respectively. However, I still do not fully understand the workings of these alternative options, and whether one or the other of them is how I might achieve my goals with this file.

Perhaps alternatively, does - or can - the "OCR Selected Region" option likewise clear and/or replace unneeded bitmapped data, in just the white areas behind and surrounding the text?

The OCR Editor sub-module of ABBYY FineReader 15 handles these functions elegantly, but so far I am still unable to achieve the same results within XChange.

Any advice you can offer will be *most* appreciated, thanks! MarkT


Attachment: MU-2_Brochure-Optimized.pdf (3.16 MiB)
Last edited by Markt-a1b on Sat Feb 08, 2020 2:01 pm, edited 1 time in total.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

Hello Mark,

Currently it is not possible to remove the white space during an OCR operation. However, I discussed it with the Dev team, and they suggested that we create a formal feature request for it so that we can consider it in the future. For tracking, you can ask any member of our support team about the following ticket number:

RT#5095: FR: Remove White space from images

Unfortunately, neither option will currently achieve what you are looking for, nor will the "OCR Selected Region" function. Hopefully we can offer this in the future, but I cannot provide a timeline just yet.

As for the differences between the two functions, "Editable Text and Images" and "Fine Page Content", they are difficult to describe. I will ask our writer to add more detail to the manual, but here is an explanation from one of our developers:
EditableTextAndImages - keeps the original content (text, images, vector art), adds new text, and removes the parts of the original content that lie under the newly added text.

FinePagesContent - removes the original content completely and replaces it with new content: text from the OCR plus the rasterized OCR 'Graphic-Zones'.
The OCR marks some regions on the page as a 'Graphic-Zone'. We rasterize those regions and place the resulting images in the corresponding regions of the new content, around/under the text.

Note: sometimes the OCR reports 'you have one big graphic zone that takes up the whole page area, with some text zones over it'. In that case the result is visually very similar to that of EditableTextAndImages, because:
- EditableTextAndImages keeps all existing graphics (which may already be one big image for the whole page) and adds new text.
- FinePagesContent replaces the original content with "the same" big image (probably with a different resolution and colorspace), which also occupies the whole page, and adds new text.
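To make the distinction concrete, here is a toy Python model of the two modes as described above. It is only a sketch of that description, not the Editor's actual implementation: the names Item, editable_text_and_images and fine_pages_content are invented for illustration, and a page is reduced to a list of labelled boxes.

```python
from dataclasses import dataclass

# Toy model only. None of these names correspond to PDF-XChange Editor internals.

@dataclass(frozen=True)
class Item:
    kind: str    # "image", "text", "erase" or "raster"
    box: tuple   # (x0, y0, x1, y1) in arbitrary page units
    source: str  # "original" or "ocr"

def editable_text_and_images(page, text_zones):
    """Keep the original items, erase what lies under each recognized
    text zone, and add the recognized text on top."""
    result = list(page)
    for zone in text_zones:
        result.append(Item("erase", zone, "ocr"))
        result.append(Item("text", zone, "ocr"))
    return result

def fine_pages_content(page, text_zones, graphic_zones):
    """Discard the original items and rebuild the page from rasterized
    graphic zones plus the recognized text."""
    result = [Item("raster", zone, "ocr") for zone in graphic_zones]
    result += [Item("text", zone, "ocr") for zone in text_zones]
    return result

# A scanned page: one full-page image, one recognized text zone.
scan = [Item("image", (0, 0, 100, 100), "original")]
text_zones = [(10, 10, 90, 30)]
graphic_zones = [(0, 0, 100, 100)]  # the "one big graphic zone" case

editable = editable_text_and_images(scan, text_zones)
fine = fine_pages_content(scan, text_zones, graphic_zones)

print([item.kind for item in editable])  # ['image', 'erase', 'text']
print([item.kind for item in fine])      # ['raster', 'text']
```

In the "one big graphic zone" case both results cover the page with a single picture plus text, which is why they can look alike, but only the first still contains the original image object.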
I hope this helps!
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Markt-a1b
User
Posts: 45
Joined: Sat Sep 07, 2019 7:10 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Markt-a1b »

Daniel: "... I hope this helps."

Why, yes it does, Daniel, and thank you very much for your prompt and detailed response to my inquiry! Please also pass along my thanks to the developer who clearly explains the functions (and differences) of "Editable Text and Images" and "Fine Page Content." I will stand by for resolution of your ticket number "RT#5095: FR: Remove White space from images."

In the meantime, thank you again and best regards - and have a good weekend! MarkT
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

:D
Dan McIntyre - Support Technician
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

There is an important difference, which at least partly seems to be based on buggy behavior of FinePagesContent:

EditableTextAndImages keeps images intact and in their original unaltered state.

FinePagesContent alters images even when they are placed at their original position. The most obvious change is that anti-aliasing is being applied, leading to images being blurred, while sometimes being slightly enhanced.

In my short test I also saw it skew images despite the "fix skew" option being specifically disabled, especially if skew detection is enabled. This can also lead to artifacts of words/characters from the newly skewed image hanging over the newly created text area.

Overall I think that FinePagesContent needs more fixing before it is fully usable.
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

Timur Born wrote: Tue Feb 11, 2020 2:52 pm In my short test I also saw it skew images despite the "fix skew" option being specifically disabled, especially if skew detection is enabled. This can also lead to artifacts of words/characters from the newly skewed image hanging over the newly created text area.
It turns out that the (full-page-sized) images were not skewed by OCR, but cut off at the corners as if they had been skewed (in the opposite direction to actual skew fixing).
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

Hello Timur,

Your first point is not a bug; it is by design. As stated in my earlier post:
TrackerSupp-Daniel wrote: Fri Feb 07, 2020 11:11 pm FinePagesContent - removes the original content completely and replaces it with new content
This is what causes the "slight blurring" and slight enhancement you describe: the images are essentially being rasterized and recreated.

The deskew and page clipping issue you mentioned, however, is something we have been looking into. We are working on improving the detection and handling there, but if you could provide some sample documents that reproduce this error, it would certainly help.

Kind regards,
Dan McIntyre - Support Technician
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

TrackerSupp-Daniel wrote: This is what causes the "slight blurring" and slight enhancement you are speaking of, the images are essentially being rasterized and recreated.
Plus, the images are anti-aliased (smoothed) on top of the reprocessing, even when they are not rotated.

When skew detection is enabled but skew fix is disabled, Editor cuts the border of the image as *if* it were going to be rotated afterwards, even though it is not. So the bug is that Editor performs one step (cutting corners) even though the other step (skew fix) is disabled.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

Hello Timur,

I do not believe that we apply any anti-aliasing during the OCR process. Are you sure this is not simply part of the reprocessing? I have asked one of our Dev team to take a look and verify this, as I am not certain on that point.

As for the deskew issue, please send a sample document with which this can be reproduced. None of my samples here suffer from this issue, so I have nothing with which to create an official bug report.

Kind regards,
Dan McIntyre - Support Technician
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

I think both issues (anti-aliasing and cropping) may only happen when skewed text is present. I will test some more.
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

So here we go. Cropping and smoothing of images happen when "Fine Page" is used in combination with "Detect Skew" (not fix), but only when skew is actually detected. When the "Fix skew" option is enabled, "Editable" applies its fixes without smoothing, whereas "Fine Page" applies smoothing. "Fine Page" also applies smoothing when "Fix skew" is disabled, though: a positive detection alone is sufficient to trigger the issue.

I attach some examples at 600 dpi, deliberately not just black text on a white background, to better demonstrate the issue of cropped background (images). While this example scan is JPEG compressed (Auto) to limit its size, I tested the same using lossless FLATE compression instead.

Unfortunately the "Fine Page" examples exceed 11 MB, so I had to recompress them to JPEG Maximum via "Save as Optimized". It seems that Editor uses FLATE compression for its OCR "Fine" results, which makes the files too large to upload here.

Additionally, I attach another scan that is skewed but not detected as such by OCR. As a result, the "Fine Page" issues do not happen. Curiously, the "Deskew Pages Content" function does detect the skew properly and fixes it without smoothing. Once the document has been sent through "Fine Page" OCR, though, "Deskew Pages Content" can no longer detect/deskew it, whereas this still works after "Editable" OCR.
Attachments:
Skew_detect_failure.pdf (139.47 KiB)
Fine_detect_fix_Optimized.pdf (1.55 MiB)
Fine_detect_nofix_Optimized.pdf (1.55 MiB)
Fine_nodetect_nofix_Optimized.pdf (1.57 MiB)
Editable_detect_fix.pdf (3.5 MiB)
Editable_detect_nofix.pdf (3.5 MiB)
Editable_nodetect_nofix.pdf (3.77 MiB)
Original_Scan.pdf (4.74 MiB)
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

The "Fine page" examples also demonstrate the issue of OCR text being placed on top of original image text in several areas.
Markt-a1b
User
Posts: 45
Joined: Sat Sep 07, 2019 7:10 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Markt-a1b »

Timur wrote: "... EditableTextAndImages keeps images intact and in their original unaltered state.

"FinePagesContent alters images even when they are placed at their original position. The most obvious change is that anti-aliasing is being applied, leading to images being blurred, while sometimes being slightly enhanced...."
---------------------

Timur, thank you for the additional information on the performance and intended-versus-actual functions of EditableTextAndImages and FinePagesContent within XChange. Being able to closely pre-select irregularly-shaped images, and in the process distinguish them from text areas on which OCR operations are desired, makes a *huge* difference in the size and quality of the resulting (Optimized) file.

The original scan of the attached two-sided brochure, at what I believe was 1,200-dpi resolution, was nearly 78 MB in size. A 400-dpi, OCR'd Optimization in XChange yielded about 7.2 MB; a 300-dpi Optimization, which produced more image degradation than I was willing to accept, yielded a file size of approximately 3.2 MB (see the previously uploaded file attachment "MU-2_Brochure-Optimized.pdf").
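As a rough sanity check on those numbers, pixel count scales with the square of the resolution, so downsampling alone already accounts for much of the reduction. The short Python sketch below compares that geometric factor with the reported file sizes; it assumes the 1,200-dpi figure is correct, and it ignores OCR and compression settings, which also change the result.

```python
def pixel_ratio(src_dpi, dst_dpi):
    """How many times fewer pixels an image has after downsampling."""
    return (src_dpi / dst_dpi) ** 2

original_mb = 78.0  # reported size of the original scan

for dst_dpi, reported_mb in [(400, 7.2), (300, 3.2)]:
    ratio = pixel_ratio(1200, dst_dpi)
    size_ratio = original_mb / reported_mb
    print(f"{dst_dpi} dpi: {ratio:.0f}x fewer pixels, "
          f"reported file is {size_ratio:.1f}x smaller")
```

The files shrink somewhat more than the pixel count alone would predict (about 10.8x versus 9x, and 24.4x versus 16x), which suggests that optimization and compression contribute as well.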

However, by starting again with the 78-MB original scan, and then carefully designating and thus distinguishing images from text - including OCR text that is intended ultimately to lie above background images - the resulting 400-dpi Optimization is reduced to a quite workable 3.1 MB. See the attached file, "MU-2_Brochure-ABBYY-Opt.pdf."

If XChange could achieve the same or similar results, it would be a significant enhancement to its capabilities, IMO.

Thank you - and please do keep up the good work! MarkT

Attachment: MU-2_Brochure-ABBYY-Opt.pdf (2.99 MiB)
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

Unfortunately, it turns out that "Editable" does not leave the original images intact. All areas with text are replaced with blank pixels right in the original image. I had assumed that these were done as forms on top of the images, but it seems I was wrong. While this may decrease file size, it can become a problem when OCR detects words in images that do not contain any letters to begin with. The latter could be avoided with the "Selected Regions" feature you are suggesting.
editor_ocr_image.jpg
Markt-a1b wrote: ...OCR'd Optimization in XChange yielded about 7.2 MB...
It is important to underline the "Optimization" part here. If you use "Fine Page" without saving as optimized, even your 3 MB ABBYY file grows to over 45 MB again. That is because the images are then saved as FLATE/ZIP instead of JPEG.
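A small Python sketch of why lossless FLATE storage gets so large for scans: FLATE is the Deflate algorithm (Python's zlib), and it can only shrink data that repeats exactly. The near-white noise below is an assumed stand-in for a scanned paper background, not data taken from the attached files, and JPEG is not compared here because it is not in the standard library.

```python
import random
import zlib

random.seed(0)
n = 100_000  # one byte per grayscale pixel

# A synthetic, perfectly uniform white area.
clean_white = bytes([255] * n)

# A scanned "white" area is rarely uniform; model it as near-white noise.
noisy_white = bytes(random.randint(235, 255) for _ in range(n))

clean_size = len(zlib.compress(clean_white, 9))
noisy_size = len(zlib.compress(noisy_white, 9))

print(f"uniform white: {clean_size} bytes")
print(f"noisy white:   {noisy_size} bytes")
```

The uniform area compresses to almost nothing, while the noisy one stays at a large fraction of its raw size. A lossy codec such as JPEG can discard that noise, which fits the observation that the files only become small again after "Save as Optimized".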

One oddity I just found: "Fine Page" leaves all text on the first page (and part of the second page) of the ABBYY PDF empty when "Detect skew" is disabled while "Ignore existing text on page" is enabled.

All that being said, Editor's OCR cannot match the ABBYY result on page 1, because it cannot OCR white text on any background (not even black). So the whole "A totally new concept..." part will be excluded from OCR. But being able to mark regions for OCR and then only OCR the "Selected area" sure would be useful anyway.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

Hello Timur,

To start, I would like to clarify a point, as you seem to have misquoted our Development team earlier, and I would like to avoid further misunderstandings:
Timur Born wrote: Tue Feb 11, 2020 2:52 pm EditableTextAndImages keeps images intact and in their original unaltered state.
Timur Born wrote: Wed Feb 12, 2020 2:58 pm Unfortunately it turns out that "Editable" does not leave the original images intact.
TrackerSupp-Daniel wrote: Fri Feb 07, 2020 11:11 pm […] here is an explanation from one of our developers:
EditableTextAndImages - keeps the original content (text, images, vector art), adds new text, and removes the parts of the original content that lie under the newly added text.
Following that, I have received confirmation from the Dev team about the Fine Page Content function: NO ANTI-ALIASING is ever applied during this process. Fine Page Content removes the original content and recreates it, and part of this process may alter the dpi, angle, or other aspects of the image. This can appear as if some smoothing has been applied, but it has not.

This discussion has led to some internal discussion as well, and the Dev team is going to place some additional focus on the Fine Page Content functionality, but I cannot say exactly what that will mean at this juncture. Beyond that, the OCR function is still being actively worked on and will see improvements in the future, but it is a complex beast and will take time. Patience is all we ask.

Kind regards,
Dan McIntyre - Support Technician
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

TrackerSupp-Daniel wrote: Thu Feb 13, 2020 1:15 am […] here is an explanation from one of our developers:
EditableTextAndImages - keeps the original content (text, images, vector art), adds new text, and removes the parts of the original content that lie under the newly added text.
Thanks for the clarification. I did not misquote your development team, though; I revised my own statement. When I originally wrote "keeps images intact and in their original unaltered state", that was my own interpretation of what I saw in tests. At that time I had not noticed that "it removes parts of the original content". That last part is a real problem, because OCR likes to detect text where none is present and thus (unintentionally) destroys images.
TrackerSupp-Daniel wrote: Following that, I have received confirmation from the Dev team about the Fine Page Content function: NO ANTI-ALIASING is ever applied during this process. Fine Page Content removes the original content and recreates it, and part of this process may alter the dpi, angle, or other aspects of the image. This can appear as if some smoothing has been applied, but it has not.
I may have found out what is causing the dramatic decrease in perceived resolution, then. "Fine Page" + "Detect Skew" deskews the images even when "Fix Skew" is disabled, and afterwards skews them back to their original rotation. So (on top of decreased DPI) images are altered twice, once permanently and then again for display in Editor. This likely uses some CPU-inexpensive rotation algorithm, maybe on top of simple bilinear resampling.

Here is the original scan with original skew:
original_skew.png

Here is the permanently deskewed version that "Fine Page" creates despite skew fixing being disabled. This is the version you edit/copy out of Editor when "Edit/Save images without image transformations" is enabled:
fine_detect_nofix_deskew.png (112.01 KiB)

Here is the final reskewed version that "Fine Page" creates for viewing in Editor as part of the same process. This is the version you edit/copy out of Editor when "Edit/Save images without image transformations" is disabled:
fine_detect_nofix_reskew.png (101.47 KiB)
So the user gets to see the last version, which results in the worst possible quality of all. You cannot deny that the resulting image is seriously smoothed as a result of all the alterations. This also means that enabling "Fix skew" results in better visible results than disabling it, because disabling applies deskew + reskew while enabling only applies deskew.
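The effect of resampling twice can be reproduced outside Editor. The Python sketch below rotates a test pattern of one-pixel stripes by 3 degrees and back again with plain bilinear interpolation, then measures how much contrast survives. Bilinear resampling, the 3-degree angle, and the stripe pattern are all assumptions made for illustration; nothing here is taken from Editor's actual code.

```python
import math
from statistics import pstdev

def rotate_bilinear(img, degrees):
    """Rotate a square grayscale image about its centre using inverse
    mapping and bilinear interpolation (coordinates are clamped at edges)."""
    n = len(img)
    c = (n - 1) / 2
    cos_t = math.cos(math.radians(degrees))
    sin_t = math.sin(math.radians(degrees))
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Source position that maps onto output pixel (x, y).
            sx = c + (x - c) * cos_t + (y - c) * sin_t
            sy = c - (x - c) * sin_t + (y - c) * cos_t
            sx = min(max(sx, 0.0), n - 1.0)
            sy = min(max(sy, 0.0), n - 1.0)
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, n - 1), min(y0 + 1, n - 1)
            fx, fy = sx - x0, sy - y0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            out[y][x] = top * (1 - fy) + bottom * fy
    return out

def centre_contrast(img, margin=16):
    """Standard deviation of the pixel values away from the borders."""
    n = len(img)
    values = [img[y][x]
              for y in range(margin, n - margin)
              for x in range(margin, n - margin)]
    return pstdev(values)

n = 64
# One-pixel-wide black/white stripes: the finest detail a scan can hold.
original = [[255.0 if x % 2 == 0 else 0.0 for x in range(n)] for _ in range(n)]
deskewed = rotate_bilinear(original, 3.0)    # first resampling
reskewed = rotate_bilinear(deskewed, -3.0)   # rotated back: second resampling

contrast_original = centre_contrast(original)
contrast_deskewed = centre_contrast(deskewed)
contrast_reskewed = centre_contrast(reskewed)
print(f"original {contrast_original:.1f}, "
      f"deskewed {contrast_deskewed:.1f}, "
      f"reskewed {contrast_reskewed:.1f}")
```

Each pass loses fine detail, and rotating back does not restore it, so a deskew followed by a reskew ends up softer than a single deskew even though the geometry returns to where it started.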

Saving the resulting lower-quality images via FLATE compression is a waste of space then, because the dramatic quality loss already happens at an earlier stage. I am not sure whether using FLATE is done intentionally by Editor, though, or if it is a bug/oversight?!
TrackerSupp-Daniel wrote: ... Patience is all we ask.
I don't see myself using the "Fine Page" function anytime soon, at least not in its current state. So no hurry here. I do feel like your developers could have known and shared the processing, though, as it would have spared me the time to analyze all of this myself (including wrong interpretations along the way).
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

Any comment on deskew + reskew being applied with "Fine Page" OCR?
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

My apologies for the delay Timur,

The dev team is looking into this, and as before, they are planning to improve the "fine page content" function. This will likely mean revising most functions, including detect skew. I do not believe that it is intended for this to be applied twice, but the quality degradation does seem to indicate that this may be happening.

I cannot make any guarantees at this juncture, but we are working on it.

Kind regards,
Dan McIntyre - Support Technician
JohnCLeBlanc
User
Posts: 18
Joined: Tue Jul 01, 2014 1:02 am
Contact:

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by JohnCLeBlanc »

Is this skew problem what I'm seeing here? Note how the enhanced OCR plug-in converted $10.00 smoothly but altered its angle:
Image

Is there an existing setting that will prevent this?
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

Hi, JohnCLeBlanc

I am sorry to say it seems your image was not properly uploaded, so I am not sure. Could you try resending the image and then we can take a look?

Kind regards,
Dan McIntyre - Support Technician
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

Were the skew issues fixed with the latest V9 version of OCR engine?
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

Hi, Timur Born

V9 uses an entirely new OCR engine, so numerous issues that were reported are now null and void. In some cases issues that existed previously simply no longer do, and in other cases previously fixed issues may have reappeared. This is part of what happens when switching from one engine to another.
We have done our best to ensure that the majority of the more prominent bugs from the old engine are not present in the new one, but in some cases they may still be present. At the moment, our main focus is on reproducing the crashes/hangs that have been reported, and less critical issues like this one are on the back burner. Please do feel free to test it and let us know if you see a notable improvement (or degradation) in comparison on your end.

Kind regards,
Dan McIntyre - Support Technician
Timur Born
User
Posts: 874
Joined: Tue Jun 26, 2012 1:50 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by Timur Born »

I hoped I wouldn't have to check myself, but just get an update. So I did a quick test and skew/deskew quality looks good. Overall results are much better with the original flyer example of post #1.

The "Ignore Text in graphic" option helps to keep images intact, but strangely it also fixes some bad OCR results in text areas. In my example it fixes a headline where "schule" turns to "schu e" without the option enabled. In the original flyer example it fixes an instance where "cruises" turns to "cruiser" unless the option is enabled.

Curiously the progress bar window is not displayed unless I click on the Editor window. There doesn't seem to be much/any multi-threading going on.

And unfortunately the choice of fonts can be quite different from the originals and enabling "Ignore text in graphic" seems to make it worse in some areas (better in others). This is where Adobe Acrobat excels.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR of Selected Text Regions, with Close Cropping of Remaining Images

Post by TrackerSupp-Daniel »

Hi, Timur Born

Glad to hear that there was a noticeable improvement in the skew results on your end as well. I suggested the personal test because you are the best judge of what an "improvement" actually is on your own documents. I can say all I want about how much better or worse any feature is, but in the end it works on a case-by-case basis, and not everything will always be great.

As for the "Ignore text in graphic" option, this is one item where there will always be "good" and "bad" cases. That is why we offer the choice, so you can decide which is best for any given document. In the future there should be gradual improvements in this area.

The progress bar should appear atop the Editor window by default. I have not had any issues with it appearing behind on my end, but I will run a few more tests and see if I can reproduce it. And finally, multi-threading is currently not included in this new engine, but it is something we are looking into for the future (as I mentioned above, some old issues are fixed, and some previously fixed items may have come back).

Kind regards,
Dan McIntyre - Support Technician