How to get resulting PDF after OCR scan in XChange Editor?

Forum for the PDF-XChange Editor - Free and Licensed Versions

mattad
User
Posts: 143
Joined: Sat Nov 29, 2008 10:37 am


Post by mattad »

I am performing an OCR scan on a rasterized PDF through the menu

Convert--->OCR pages

As languages I selected German and English.

After clicking OK, the OCR scan started with the progress bar and finished (successfully?) without an error popup.

Fine.

But where is the resulting PDF?

Not in the original directory,
and not in the download directory.
Is there an auto-save feature at all?

I manually saved the PDF, and it seems to me that the original was overwritten (the timestamp changed).
But the new PDF still seems to be rasterized: I cannot select/highlight text.
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: How to get resulting PDF after OCR scan in XChange Editor?

Post by Tracker Supp-Stefan »

Hello mattad,

Our OCR tool adds the OCR text layer on top of the existing content in the file you already have. It is an invisible layer of text that you can now select with e.g. the "Text Selection" tool, and then copy and paste into other programs as needed.
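
For readers who want to see what an "invisible text layer" can look like inside the file: in PDF page content, text rendering mode 3 (written as `3 Tr`) means text that is neither filled nor stroked, and OCR layers are commonly written that way. Whether PDF-XChange Editor writes exactly this is an assumption here. The Python sketch below classifies an already decoded (uncompressed) page content stream with simple pattern matching. `describe_page` and the sample stream are illustrative only and not part of any PDF-XChange API.

```python
import re

# "3 Tr" style operator that sets the text rendering mode.
RENDER_MODE = re.compile(rb"(\d)\s+Tr\b")
# Text-showing operators Tj and TJ, e.g. "(Hello) Tj" or "[(He) 20 (llo)] TJ".
SHOWS_TEXT = re.compile(rb"\)\s*Tj\b|\]\s*TJ\b")
# An XObject being painted, e.g. "/Im0 Do" (for a scan, usually the page image).
PAINTS_XOBJECT = re.compile(rb"/[^\s/]+\s+Do\b")


def describe_page(content: bytes) -> str:
    """Roughly classify a decoded page content stream (heuristic sketch)."""
    has_text = bool(SHOWS_TEXT.search(content))
    has_xobject = bool(PAINTS_XOBJECT.search(content))
    modes = {int(m.group(1)) for m in RENDER_MODE.finditer(content)}
    # Text counts as invisible only if mode 3 is the sole mode ever set;
    # a stream that never sets a mode uses the default, visible mode 0.
    invisible = has_text and modes == {3}
    if invisible and has_xobject:
        return "image with invisible text layer"
    if invisible:
        return "invisible text only"
    if has_text:
        return "visible text"
    if has_xobject:
        return "image only"
    return "empty"


# Hypothetical sample: a full-page image followed by one invisible word.
sample = b"q 595 0 0 842 0 0 cm /Im0 Do Q BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET"
print(describe_page(sample))  # image with invisible text layer
```

A real check should use a proper PDF parser, since content streams are normally compressed and regular expressions can be fooled by text that happens to contain operator names.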

Regards,
Stefan

Post by mattad »

Hmm, this is NOT true, or at least it is not working.

Have a look at the attached PDF file and the snapshot of PDF XChange Editor.

I can NOT select or edit any text.

All submenus are disabled/greyed out.

So again: How can I either convert the rasterized pdf or select some text from it?
not selectable selection.png
sample rasterized pdf.pdf
(429.06 KiB)

Post by Tracker Supp-Stefan »

Hello mattad,

You need to run the OCR tool first, then select text (e.g. with the button next to the hand tool on the left). Only after that will the "Selection" menu have active entries, as it is actually a menu for modifying an already made selection:
content_transform.png
Regards,
Stefan
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Post by TrackerSupp-Daniel »

For more info on this process, see this KB article as well:
https://www.pdf-xchange.com/knowle ... -performed
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com

Post by mattad »

Tracker Supp-Stefan wrote: You need to run the OCR tool first
Hello Stefan,

I am still confused. You tell me "You need to run OCR Tool".

But HOW do I run OCR Tool?

Even after following Daniel's link I made no progress.
I select View--->Panes--->Content and click on "Page 1" on the left.
And then?

Wouldn't it be much more user friendly to offer a toolbar icon "convert image-to-text-based PDF"?

I appreciate your PDF-XChange products, but this OCR handling is not intuitive.
Willy Van Nuffel
User
Posts: 2348
Joined: Wed Jan 18, 2006 12:10 pm


Post by Willy Van Nuffel »

To run the OCR tool, just click the Convert tab > OCR pages > OK.

Once the OCR process has ended, there will be a transparent text layer on top of the scanned image.
After first clicking the Edit icon in the Home ribbon, you can select the text, but it is not "editable".

The goal of OCR is mainly to make the text "searchable".

If you really want to edit/modify the transparent layer, you have to do some additional manipulations, such as setting a text color (instead of transparent) and removing the original images and/or shapes (seen in the Content pane as "Path"):
https://www.pdf-xchange.com/knowle ... -performed

All the icons in your Selection menu are grayed out because you must first click the Edit icon (in the Home ribbon).

NOTE: Your example PDF seems to be something other than scanned text. Every single character can be selected as a separate 'shape' (via Edit > Shapes), but you can apply OCR to it without any problem. See the result in the attachment.
Attachments
sample rasterized.pdf
(13.28 KiB)
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Post by Will - Tracker Supp »

Thanks Willy :D
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com

Post by mattad »

Ok Willy. Thank you. We are approaching the final solution but still not finished.

I OCR scanned the page and switched to Edit mode. I can select individual text.
Wonderful so far.

BUT: Now, as the final step, I want to apply the transparent Edit layer to the underlying PDF and save the full PDF content (currently held in XChange Editor)
as a new PDF file WITH selectable/highlightable text.

So I click on the menu

File->Save As--->Browse

and select a directory. The current PDF is then saved, but in the same format as before.

Neither I nor other users can, for example, load the new PDF into XChange Viewer and highlight e.g. line 5 with a colored background.

So may I ask you again: How can I save the whole new, text-selectable document as a highlightable PDF?

Thank you

Post by Willy Van Nuffel »

Hello Mattad,

When you run the OCR process on a scanned text, under "Output type" you can choose to use the original PDF or to create a new PDF.
Are you sure you are saving the correct PDF (the one that includes the transparent layer)?

I have attached the resulting PDF, which still includes the original image of the scanned text as well as the transparent layer.
You will see that you can highlight the text in it perfectly.

NOTE: I see that you have not yet opened my previous example, where only the text layer has been saved and the image has been removed.

Best regards.
Attachments
OCR Pages.png
sample rasterized with highlight.pdf
(714.07 KiB)

Post by TrackerSupp-Daniel »

Thanks again for the concise and helpful descriptions, Willy!

Post by mattad »

Hello Willy,
thank you for your hints.
Switching the "Output type" drop-down was the key piece of information.
Your procedure works now.

But there are still some important questions:

1.) You are writing "....my previous example, where only the text layer has been saved and the image has been removed".

Where exactly do I tell XChange editor to save a pdf (a) WITH image or (b) WITHOUT image?


2.) Assume I have an unknown PDF file (in Windows Explorer) and load it into XChange Editor:
How can I find out whether this PDF file contains only the text version or the text PLUS image layer?

Can I later strip the image layer (from a text PLUS image file)?


3.) When I look at the content in XChange Editor (= left Content sub pane), I see that the OCR scan creates a new, individual PDF frame (or container entry) for
every word.

That seems rather inefficient to me, and probably space consuming.
Can I tell XChange Editor to "optimize" the OCR result,
that is, to group all word frames of a paragraph into ONE PDF frame?
Is this possible?

That would make it easier to edit larger parts of the text of the PDF file later.

Thank you

Post by TrackerSupp-Daniel »

Hello mattad,
Glad to hear it works. As for your questions:

1. You cannot currently automate removal of the background image. Following this article is currently the only method of removal after OCR: https://www.pdf-xchange.com/knowle ... -performed

2. The simplest way is to use the Select Text tool and drag a box around an image: if text gets selected, the page has a text layer on top of the image.
As for stripping the images, once again, this must be done manually, as per the above KB article.

3. Not yet. As it stands, other PDF software also usually treats all words as separate entities. A 'paragraph' from the Editor's (as well as much of the competition's) viewpoint is just a group of words that are close enough together to be handled as such (in many cases these are even handled as if each letter were its own object!).
However, this is an interesting feature request, so I will bring it to our dev team and see if they think it is something we could implement.
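
To illustrate the point about a 'paragraph' being just words that sit close together, here is a minimal Python sketch that groups separately positioned words into lines by their baseline position. The `Word` record, the `group_into_lines` helper and the 2-point tolerance are assumptions made for this example. It is not how the Editor itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF y grows upward)


def group_into_lines(words, y_tolerance=2.0):
    """Group words whose baselines are within y_tolerance into text lines."""
    lines = []
    # Visit words from the top of the page down, left to right.
    for word in sorted(words, key=lambda w: (-w.y, w.x)):
        if lines and abs(lines[-1][-1].y - word.y) <= y_tolerance:
            lines[-1].append(word)  # same line as the previous word
        else:
            lines.append([word])    # baseline jumped: start a new line
    # Within each line, order words by their left edge and join them.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x))
            for line in lines]


# Hypothetical OCR output: four words, each its own object.
words = [
    Word("world", 120, 700.5),
    Word("Hello", 72, 700),
    Word("Second", 72, 686),
    Word("line", 130, 686),
]
print(group_into_lines(words))  # ['Hello world', 'Second line']
```

Grouping lines into paragraphs would follow the same idea, using the vertical gap between consecutive lines instead of the baseline difference between words.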

Post by mattad »

Hello Daniel,

the article you referenced tells how to edit and remove parts of the TEXT layer.

This is not an answer to my question.

I want to do the opposite:

If I select for example the "image" component in the Content pane (see attached snapshot) and press delete then all text components disappear as well.

How do I remove the original image and leave the text?
Attachments
snapshot.png

Post by Willy Van Nuffel »

Hello mattad,

It really seems difficult for you to understand exactly how it works ...

In fact when the OCR feature has run, the resulting text is put over the image as an additional "layer".
The text itself is "transparent". This means that the characters have NO fill color and NO border color.
The text is there, but you DO NOT SEE IT.

This is the reason why - when you remove the image - it seems like "everything" disappears.
That is not true. The text stays there, but it is still NOT VISIBLE at that moment.

What you need to do now is:
1) First make sure that the "Content" pane and the "Properties" pane are both shown on your screen.
You can activate these panes via the View menu > Other panes.
2) Select all the text. You can do this via the Content pane: click on the first line with Text, then SHIFT + click on the last line with Text.
3) While all the text is selected, look in the Properties pane and change the "Fill Color" from 'None' to Black.
4) Finally, look in the Content pane, select everything that is "Path" and/or "Image", and delete it.

All that is left now is purely 'text'.

Preferably, click "Save As" to store this result as a new PDF.
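
Steps 3 and 4 can be pictured at the file level with a small Python sketch. It assumes the OCR text was written with text rendering mode 3 (`3 Tr`, invisible) and that the scan is painted as an image XObject (for example `/Im0 Do`). Both are common in OCR'd PDFs but are assumptions about this particular file. `reveal_text_drop_images` is a hypothetical helper working on a decoded content stream, not an Editor function, and it leaves "Path" items untouched.

```python
import re


def reveal_text_drop_images(content: bytes) -> bytes:
    """Make mode-3 (invisible) text filled and drop XObject paint operations."""
    # Step 3: "3 Tr" (invisible) becomes "0 Tr" (fill). The fill color is
    # black by default unless the stream sets another color.
    content = re.sub(rb"\b3(?=\s+Tr\b)", b"0", content)
    # Step 4: remove paint operations such as "/Im0 Do" (the scanned image).
    content = re.sub(rb"/[^\s/]+\s+Do\b\s*", b"", content)
    return content


# Hypothetical sample: a full-page image followed by one invisible word.
before = b"q 595 0 0 842 0 0 cm /Im0 Do Q BT 3 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET"
print(reveal_text_drop_images(before))
# b'q 595 0 0 842 0 0 cm Q BT 0 Tr /F1 12 Tf 72 700 Td (Hello) Tj ET'
```

This is only meant to show why the text "disappears" together with the image if the order is reversed: until the rendering mode or fill is changed, the text is present but paints nothing. For real files a proper PDF parser is the safer route.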

Post by TrackerSupp-Daniel »

Hello Mattad, Willy,

Thank you for the clarification, Willy. I hope that it is useful.

Mattad, I believe that you may be missing a step from the article I linked before. Clearly you are able to remove the image:
Image

But before that have you ensured that all the text has been first made visible by selecting it all in the content pane?
Image

I hope this helps!

Post by mattad »

Willy Van Nuffel wrote: In fact when the OCR feature has run, the resulting text is put over the image as an additional "layer".
The text itself is "transparent". This means that the characters have NO fill color and NO border color.
The text is there, but you DO NOT SEE IT.
Hello Willy,

thank you. THIS (!) is key information! How are users supposed to know this?
I wonder why an otherwise so comfortable program like XChange Editor does not provide an auto-fill-chars-with-black default option.
Which user needs transparent text?

Now I got a text-only PDF as the result.

However, if I save the text-only PDF, close the tab in the Editor and immediately open it again in the Editor, the text looks awful (see attached PDF).
It seems to me that the original font specification is NOT embedded in the PDF.
Even worse: the PDF cannot be displayed in XChange Viewer. Only Foxit Reader is able to show the text content ... somehow ... scrambled.
XChange Editor should have detected the correct font, since the text looked good after removal of the image layer.

How can I tell XChange Editor to add font specifications to saved PDFs?

The entry "Embedded" in the font details of the Properties pane cannot be changed from "no" to "yes".
New Document.pdf
(8.03 KiB)
pdf look after reload.png

Post by Tracker Supp-Stefan »

Hello mattad,

The reason why OCR normally places an invisible layer of text over the existing image is that the font of the original document, as seen in the image, cannot normally be matched exactly. So OCR places the invisible letters at the correct locations on the page, using whatever font and size make this possible, without worrying too much about whether they match the font and size in the image it is working on. As long as the text stays invisible, all is good: you can select it and paste it into e.g. Notepad, where a uniform font will be used.

However, when you remove the image and make the OCR font visible, the result is, as you have noticed, not ideal.

That is one of the main reasons why we do not yet offer an automated tool that will OCR and clear the image in one step.
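
The size fitting described above can be sketched in Python. The assumptions are loud ones: the font size is taken from the height of the recognized word's box, every glyph is treated as 0.5 em wide, and the text is then stretched or squeezed with PDF horizontal scaling (the `Tz` operator, a percentage). `fit_word` is a hypothetical helper. The actual OCR engine may choose fonts and sizes quite differently.

```python
def fit_word(text: str, box_width: float, box_height: float,
             avg_advance_em: float = 0.5) -> tuple[float, float]:
    """Return (font_size, horizontal_scaling_percent) for one OCR word."""
    if not text or box_width <= 0 or box_height <= 0:
        raise ValueError("need non-empty text and a positive box size")
    # Assumption: the font size simply equals the box height.
    font_size = box_height
    # Width the word would have at that size with the assumed glyph width.
    natural_width = len(text) * avg_advance_em * font_size
    # Percentage by which the text must be scaled to span the box exactly.
    scaling = 100.0 * box_width / natural_width
    return font_size, scaling


# A 5-letter word recognized in a 36 x 12 point box.
print(fit_word("Hello", 36.0, 12.0))  # (12.0, 120.0)
```

A scaling far from 100% is harmless while the text is invisible, because only its position and extent matter for selection. Once the text is made visible, the same stretch shows up as distorted lettering, which matches the "not ideal" result mentioned above.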

Regards,
Stefan

Post by Willy Van Nuffel »

Hello,

I opened the "New Document.pdf" from Mattad's latest post, and I do not know where that strange font (F0000000006AC3C30) comes from.
It seems that this PDF was made with PDF-XChange Editor, release 7.0.323.0 (30 Nov. 2017).
Just selecting all the text via the Content pane and changing it to (for example) Calibri gives a totally different view. Please try this too.

Now, with the latest release 7.0.325.1 of PDF-XChange Editor, I ran a new test with the "sample rasterized pdf.pdf" in Mattad's first post.
Instead of "OCR Page(s)...", I have used the new feature "Enhance Scanned Pages".
I have only activated "Recognize text" and selected "English" as language and "Medium" as accuracy.
By default, the resulting text is in "Arial Unicode MS" font.
I only changed the color of the text from "None" to Black and removed the original image.
The result is really good (see attachment).

I do not know if Tracker Software Development is still working on the "Enhance Scanned Pages" feature. If so, is there any chance of an option to colorize the resulting text and to remove the original image(s)? However, preserving the "real images" in the document remains a challenge. There would need to be some algorithm to recognize these and to copy them out of the original scans.

Best regards.
Attachments
sample rasterized pdf_WVN.pdf
(72.78 KiB)

Post by TrackerSupp-Daniel »

Hello all,
Yes Willy, development of the Enhanced OCR is still ongoing. It is something we hope can eventually replace the old OCR tool, but we have decided to keep the old tool for now, as the functions are still somewhat different.
As for the features you've requested, some of them are features we hope to implement, and we are always looking for other ways to improve our software, so any suggestions are appreciated.