How to get resulting PDF after OCR scan in XChange Editor?
I am performing an OCR scan on a rasterized PDF through the menu
Convert--->OCR pages
As languages I selected German and English.
After clicking OK, the OCR scan started with a progress bar and finished (successfully?) without an error popup.
Fine.
But where is the resulting PDF?
Not in the original directory.
and not in the download directory.
Is there an auto-save feature at all?
I manually saved the PDF, and it seems that the original was overwritten (the timestamp changed).
But the new PDF still seems to be rasterized: I cannot select or highlight text.
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello mattad,
Our OCR tool adds the OCR text layer on top of the existing content of the file you already have. It is an invisible layer of text that you can now select with e.g. the "Text Selection" tool and then copy and paste into other programs as needed.
Regards,
Stefan
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hmm, this is NOT true, or at least it is not working.
Have a look at the attached PDF file and the snapshot of PDF XChange Editor.
I can NOT select or edit any text.
All submenus are disabled/greyed out.
So again: How can I either convert the rasterized pdf or select some text from it?
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello mattad,
You need to run the OCR tool first and then select text (with the button next to the Hand tool on the left). Only after that will the "Selection" menu have active entries, as it is a menu for modifying a selection that has already been made.
Regards,
Stefan
- TrackerSupp-Daniel
- Site Admin
- Posts: 8440
- Joined: Wed Jan 03, 2018 6:52 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
For more info on this process, see this KB article as well:
https://www.pdf-xchange.com/knowle ... -performed
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Re: How to get resulting PDF after OCR scan in XChange Editor?
Tracker Supp-Stefan wrote: You need to run the OCR tool first
Hello Stefan,
I am still confused. You tell me "You need to run the OCR tool".
But HOW do I run the OCR tool?
Even after following Daniel's link I made no progress.
I select View--->Panes--->Content and click on "Page 1" on the left.
And then?
Wouldn't it be much more user-friendly to offer a toolbar icon "convert image-based to text-based PDF"?
I appreciate your PDF-XChange products, but this OCR handling is not intuitive.
-
- User
- Posts: 2348
- Joined: Wed Jan 18, 2006 12:10 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
To run the OCR tool, just click the Convert tab > OCR pages > OK.
Once the OCR process has ended, there will be a transparent layer upon the scanned image.
By first clicking the Edit icon in the Home ribbon, you can then select the text, but it is not "editable".
The goal of OCR is mainly to make the text "searchable".
If you really want to edit or modify the transparent layer, you have to do some additional manipulations, such as setting a text color (instead of transparent) and removing the original images and/or shapes (shown in the Content pane as "Path"):
https://www.pdf-xchange.com/knowle ... -performed
All the icons in your Selection menu are grayed out because you must first click the Edit icon (in the Home ribbon).
NOTE: Your example PDF seems to be something other than scanned text. Every single character can be selected as a separate 'shape' (via Edit > Shapes), but you can apply OCR to it without problems. See the result in the attachment.
- Attachments
-
- sample rasterized.pdf
- (13.28 KiB) Downloaded 99 times
- Will - Tracker Supp
- Site Admin
- Posts: 6815
- Joined: Mon Oct 15, 2012 9:21 pm
- Location: London, UK
- Contact:
Re: How to get resulting PDF after OCR scan in XChange Editor?
Thanks Willy
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Best regards
Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
Re: How to get resulting PDF after OCR scan in XChange Editor?
Ok Willy. Thank you. We are approaching the final solution but still not finished.
I OCR scanned the page and switched to Edit mode. I can select individual text.
Wonderful so far.
BUT: As the final step I now want to apply the transparent Edit layer to the underlying PDF and save the full PDF content (currently held in XChange Editor) as a new PDF file WITH selectable/highlightable text.
If I therefore click on the menu
File->Save As--->Browse
and select a directory, the current PDF is saved, but in the same format as before.
Neither I nor other users can load the new PDF into, for example, XChange Viewer and highlight e.g. line 5 with a colored background.
So may I ask you again: How can I save the whole new, text-selectable document as a highlightable PDF?
Thank you
-
- User
- Posts: 2348
- Joined: Wed Jan 18, 2006 12:10 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello Mattad,
When you run the OCR process on scanned text, under "Output type" you can choose to use the original PDF or to create a new PDF.
Are you sure you saved the correct PDF (the one that includes the transparent layer)?
I have attached the resulting PDF, which still includes the original image of the scanned text as well as the transparent layer.
You will see that you can highlight the text in it perfectly.
NOTE: I see that you have not yet opened my previous example, where only the text layer has been saved and the image has been removed.
Best regards.
- Attachments
-
- sample rasterized with highlight.pdf
- (714.07 KiB) Downloaded 95 times
- TrackerSupp-Daniel
- Site Admin
- Posts: 8440
- Joined: Wed Jan 03, 2018 6:52 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
Thanks again for the concise and helpful descriptions, Willy!
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello Willy,
thank you for your hints.
Switching the "Output type" drop-down was the key piece of information.
Your procedure works now.
But there are still some important questions:
1.) You are writing "....my previous example, where only the text layer has been saved and the image has been removed".
Where exactly do I tell XChange editor to save a pdf (a) WITH image or (b) WITHOUT image?
2.) Assume I have an unknown pdf file (in Windows Explorer) and load it into XChange Editor:
How can I find out if this pdf file contains only the text version or the text PLUS image layer?
Can I strip later the image layer (from a text PLUS image file)?
3.) When I look at the content in XChange Editor (the Content pane on the left), I see that the OCR scan creates a new, individual PDF frame or container entry for every word.
That seems rather inefficient to me and probably wastes space.
Can I tell XChange Editor to "optimize" the OCR result, that is, to group all word frames of a paragraph into ONE PDF frame?
Is this possible?
That would help with editing larger parts of the PDF's text later.
Thank you
- TrackerSupp-Daniel
- Site Admin
- Posts: 8440
- Joined: Wed Jan 03, 2018 6:52 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello mattad,
Glad to hear it works. As for your questions:
1. You cannot currently automate removal of the background image. Following this article is currently the only method of removal after OCR: https://www.pdf-xchange.com/knowle ... -performed
2. The simplest way is to use the Select Text tool and drag a box around an image; if text gets selected, there is a text layer over the image.
As for stripping the images, once again, this must be done manually, as per the above KB article.
3. Not yet. As it stands, other PDF software also usually treats each word as a separate entity. A 'paragraph', from the Editor's viewpoint (as well as that of much of the competition), is just a group of words that are close enough together to be handled as such (in many cases these are even handled as if each letter were its own object!).
However, this is an interesting feature request, so I will bring it to our dev team and see if they think it is something we could implement.
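As a rough illustration of question 2 outside the Editor, the following Python sketch classifies an already decoded page content stream by its drawing operators. It assumes the stream has been decompressed by some other tool, the function name classify_page_content is made up for this example, and the check is only a heuristic: the Do operator paints any XObject, which is usually but not always an image, and text shown with the ' or " operators is not detected. It is not how PDF-XChange Editor itself decides.

```python
import re


def classify_page_content(content: str) -> str:
    """Heuristically classify a decoded PDF page content stream.

    Looks for a text object (BT ... ET) that shows text with Tj or TJ,
    and for an XObject painted with the Do operator (often an image).
    """
    # A text object that actually shows a string.
    has_text = re.search(
        r"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", content, re.S
    ) is not None
    # A named XObject followed by the Do (paint) operator.
    has_xobject = re.search(r"/[^\s/\[\]()<>]+\s+Do\b", content) is not None

    if has_text and has_xobject:
        return "text and image"
    if has_text:
        return "text only"
    if has_xobject:
        return "image only"
    return "no text or image"


if __name__ == "__main__":
    # Hypothetical content of a scanned page after OCR:
    # one full-page image plus one (invisible) word on top of it.
    sample = (
        "q 612 0 0 792 0 0 cm /Im0 Do Q "
        "BT /F1 12 Tf 3 Tr 72 700 Td (Hello) Tj ET"
    )
    print(classify_page_content(sample))
```

A page that reports "image only" here would be a candidate for OCR, while "text and image" matches the layered result described in this thread.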
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello Daniel,
the article you referenced explains how to edit and remove parts of the TEXT layer.
This is not an answer to my question.
I want to do the opposite:
If I select for example the "image" component in the Content pane (see attached snapshot) and press delete then all text components disappear as well.
How do I remove the original image and leave the text?
-
- User
- Posts: 2348
- Joined: Wed Jan 18, 2006 12:10 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello mattad,
It really seems difficult for you to understand how exactly it works ...
In fact when the OCR feature has run, the resulting text is put over the image as an additional "layer".
The text itself is "transparent". This means that the characters have NO fill color and NO border color.
The text is there, but you DO NOT SEE IT.
This is the reason why - when you remove the image - it seems like "everything" disappears.
That is not true. The text stays there, but it is still NOT VISIBLE at that moment.
What you need to do now, is:
1) first make sure that the "Contents" pane and the "Properties" pane are both shown on your screen.
You can activate these panes via the View-menu > Other panes.
2) select all the text - you can do this via the Content pane - click on the first line with Text, then SHIFT + click on the last line with Text
3) while all the text is selected, look in the Properties pane and change the "Fill Color" from 'None' to Black
4) finally, look in the Contents pane, select everything that is "Path" and/or "Image" and delete it
All that is left now is purely 'text'.
Preferably, click "Save As" to store this result as a new PDF.
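For readers curious what "transparent" text can look like inside the file, here is a minimal Python sketch that builds a PDF content-stream fragment for one word. It relies on the PDF text rendering mode operator Tr, where mode 3 means neither fill nor stroke (invisible) and mode 0 means fill. That the Editor's OCR layer uses exactly this mechanism is an assumption, and the function name text_run and the font resource name /F1 are hypothetical.

```python
def text_run(word: str, x: float, y: float,
             size: float = 12.0, visible: bool = False) -> str:
    """Build a PDF content-stream fragment that draws one word.

    Text rendering mode 3 = neither fill nor stroke (invisible text),
    text rendering mode 0 = fill (ordinary visible text).
    """
    mode = 0 if visible else 3
    # Escape the characters that are special inside a PDF literal string.
    escaped = (
        word.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )
    return "\n".join([
        "BT",                    # begin text object
        f"/F1 {size:g} Tf",      # hypothetical font resource and size
        f"{mode} Tr",            # text rendering mode
        f"{x:g} {y:g} Td",       # move to the word's position
        f"({escaped}) Tj",       # show the word
        "ET",                    # end text object
    ])


if __name__ == "__main__":
    # The same word, first as an invisible OCR-style run, then visible.
    print(text_run("Hello", 72, 700))
    print(text_run("Hello", 72, 700, visible=True))
```

The two fragments differ only in the Tr line, which mirrors the observation above that the text is still present after the image is deleted and merely needs to be made visible.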
- TrackerSupp-Daniel
- Site Admin
- Posts: 8440
- Joined: Wed Jan 03, 2018 6:52 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello Mattad, Willy,
Thank you for the clarification, Willy. I hope that it is useful.
Mattad, I believe that you may be missing a step in the article I linked before, clearly you are able to remove the image:
But before that have you ensured that all the text has been first made visible by selecting it all in the content pane?
I hope this helps!
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
Re: How to get resulting PDF after OCR scan in XChange Editor?
Willy Van Nuffel wrote: In fact when the OCR feature has run, the resulting text is put over the image as an additional "layer".
The text itself is "transparent". This means that the characters have NO fill color and NO border color.
The text is there, but you DO NOT SEE IT.
Hello Willy,
thank you. THIS (!) is the key information! How are users supposed to know this?
I wonder why an otherwise so comfortable program as XChange Editor does not provide an auto-fill-chars-with-black default option.
Which user needs transparent text?
Now I have a text-only PDF as the result.
However, if I save the text-only PDF, close its tab in the Editor and immediately open it again in the Editor, the text looks awful (see attached PDF).
It seems to me that the original font specification is NOT embedded in the PDF.
Even worse: the PDF cannot be displayed in XChange Viewer. Only Foxit Reader is able to show the text content, and then only somewhat scrambled.
XChange Editor should have detected the correct font, since the text looks good after removal of the image layer.
How can I tell XChange Editor to add font specifications to saved PDFs?
The entry "Embedded" in the font details of the Properties pane cannot be changed from "no" to "yes".
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello mattad,
The OCR normally places an invisible layer of text over the existing image because the font seen in the image of the original document cannot normally be matched exactly. So OCR places the invisible letters at the correct locations on the page, but uses whatever font and size make this possible, without worrying too much about whether they match the font and size in the image it is working on. While the text is invisible, all is good: you can select it and paste it into e.g. Notepad, where a uniform font will be used.
However, when you remove the image and make the OCR text visible, the result is, as you have noticed, not ideal.
That is one of the main reasons why we do not yet offer an automated tool that will OCR and clear the image in one step.
Regards,
Stefan
-
- User
- Posts: 2348
- Joined: Wed Jan 18, 2006 12:10 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello,
I opened the "New Document.pdf" in Mattad's latest post, and I do not know where that strange font (F0000000006AC3C30) comes from.
It seems that this PDF was created with PDF-XChange Editor, release 7.0.323.0 (30 Nov. 2017).
Just selecting all the text via the Content pane and changing it to (for example) Calibri gives a totally different view. Please try this too.
Now, with the latest release 7.0.325.1 of PDF-XChange Editor, I ran a new test with the "sample rasterized pdf.pdf" in Mattad's first post.
Instead of "OCR Page(s)...", I have used the new feature "Enhance Scanned Pages".
I have only activated "Recognize text" and selected "English" as language and "Medium" as accuracy.
By default, the resulting text is in "Arial Unicode MS" font.
Myself, I only changed the color of the text from "None" to Black and removed the original image.
The result is really good (see attachment).
I do not know whether Tracker Software Development is still working on the "Enhance Scanned Pages" feature. If so, is there any chance of an option to colorize the resulting text and to remove the original image(s)? However, preserving the "real images" in the document remains a challenge. There would need to be some algorithm to recognize these and copy them out of the original scans.
Best regards.
- Attachments
-
- sample rasterized pdf_WVN.pdf
- (72.78 KiB) Downloaded 81 times
- TrackerSupp-Daniel
- Site Admin
- Posts: 8440
- Joined: Wed Jan 03, 2018 6:52 pm
Re: How to get resulting PDF after OCR scan in XChange Editor?
Hello all,
Yes Willy, development of the Enhanced OCR is still ongoing. It is something we hope can eventually replace the old OCR tool, but we have decided to keep the old tool for now, as the functions are still somewhat different.
As for those features you've requested, some of them are features we hope to implement, and we are always looking for other ways to improve our software, so any suggestions are appreciated.
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD