Font compatibility issue ?

Posted: Mon Feb 04, 2013 7:20 pm
by VincentL

First, I want to say that your tool is amazing ! You just need to build a version for Mac :D

So, I've OCR-ed a lot of PDF files and I'm very happy with the result.

But I've run into a bug. If I copy a PDF file to my Mac, there's no problem: I can search the text.
But if I modify the file on my Mac with Apple Preview (just rotating a page, for example), the whole text layer is corrupted.
I can still select the text, but copy/paste no longer works properly. I just get a lot of strange characters.

If I re-open the file with PDF X-Change Viewer, it's also corrupted.

I've just noticed that before the modification the font used in my pdf was : ArialMT (Embedded Subset).
And after the modification it is : font0000000016c03819 (Embedded Subset).
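
The font name change above can be checked mechanically when there are many files to go through. This is a minimal sketch, assuming the font names have already been read from each document's properties (as was done by hand here) and that a rewritten font is named `font` followed by hex digits, as in this single example. `looks_generated` is a hypothetical helper, not part of any PDF tool.

```python
import re

# Assumption: a rewritten font shows up as "font" followed by hex digits,
# as in the one example from this thread. Other tools may name fonts differently.
GENERATED_NAME = re.compile(r"^font[0-9a-f]{8,}$", re.IGNORECASE)


def looks_generated(font_name: str) -> bool:
    """Return True if the font name matches the generated-name pattern."""
    # Drop a subset prefix such as "ABCDEF+" if one is present.
    base = font_name.split("+", 1)[-1]
    return bool(GENERATED_NAME.match(base))


before = "ArialMT"
after = "font0000000016c03819"
print(looks_generated(before), looks_generated(after))
```

A match only suggests that the font was re-embedded by another program. It does not prove that the text layer is damaged.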

I don't know what to do to avoid this kind of trouble.

I tried installing the ArialMT font on my Mac, but it solved nothing.

This is how I OCR-ed the file:

Click the OCR button
Select all the pages
Language: English
Accuracy: High
PDF output type: Preserve original content & add text layer

Please help ... (and excuse my English).

Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 7:31 pm
by Will - Tracker Supp
Hi Vincent,

Thanks for the post - could I get you to send both the modified and original PDF? I'd like to take a look at them here. Please ensure that they are uploaded in a .zip folder, otherwise our system will not upload them.

Also, could you tell me what build and Version of PDF-XChange you are using, the OS and Architecture of the system that you are running our software on and also how you created the PDF?

Thanks, get back to me when you can with the above! :)

Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 8:39 pm
by VincentL
Thank you for your quick response.
I ran the test with one of your PDFs.

I OCR-ed it with PDF X-Change Viewer 2.5 (Build 208.0) on my Windows XP machine (the _before.pdf file).
Then I just rotated the page with Preview (Mac OS X 10.8.2 on my iMac) (the _after.pdf file).

I know it's quite strange to OCR a "text-mode" PDF, but I think the result is much the same.

I can do another test with a real "image-mode" PDF, but I can't say how that PDF was built.

Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 9:15 pm
by Walter-Tracker Supp
I'm using OS X 10.6.2 on my personal laptop and tried to reproduce this, but could not. I took your input file and used Preview's rotate function to generate a new page. I can select and copy text just fine, as shown in the attached screenshot, and the text matches that which I see in Windows / PDF-X Viewer.

I checked the output you provided from your version of OS X. Some of the text that the OCR engine tried to recognize in areas that aren't text (e.g. the menu bars and figures) is shown differently than in my own sample, but in both cases this is all just part of the "junk text". I'm not sure exactly why your version of Preview replaced the font with this one, but either way it doesn't impact searchability (you can still search for the actual text on the page).

Besides the fact that your version of Preview did some extended character font replacement, the core of the issue is that our recognition sometimes makes mistakes on non-text areas: it tries to recognize images and figures and outputs a bit of junk text in these regions. It's worth pointing out that the primary purpose of the existing OCR is to make documents searchable, and this functionality works very well: despite the occasional bit of junk from embedded images that weren't properly identified, you can still search for the actual text in the file. The sample you used is probably a good example of the worst-case scenario for our current OCR engine, because the lines and shapes in the menu snapshots look somewhat "text-like" but can't be properly recognized because they're part of fairly complex diagrams (from an OCR standpoint).

The next version of OCR we will be offering up shortly has significant improvements in this respect (much better differentiation of text vs. figure/image regions of a scanned page). You can look forward to trying it yourself when the newest PDF-X Editor is released in a couple of weeks.

Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 9:33 pm
by VincentL
It was a bad idea on my part to use that PDF.

I've done a new test with an "image-mode" PDF file.

File 1: Original, just two pages with no searchable text
File 2: File after OCR. Everything is correct
File 3: File after rotating a page. You can select some text areas, but if you try to paste ...

This test shows exactly what is happening to me.

Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 10:10 pm
by Walter-Tracker Supp
I can see the issue you're talking about, but I have no idea why your specific configuration of OS X is replacing the font with the unusual one. The font your system embeds is not a standard font (it renders as "Grey Alien" faces on my machine). Something unusual is happening with your system.

My OS X machine does not do this: I can take your second file (the OCR'd one), rotate pages in Preview, and extract text just fine.

For example, this text comes from your second (OCR'd but not manipulated) document, rotated myself in OS X Preview:

"Utilisez cette fonction pour cuire les aliments crus ou surgelés. La fonction Chaleur pulsée automatique ne peut être utilisée qu'avec les familles d'aIiments indiquées ci-dessous. Utilisez la fonction Chaleur pulsée ou Chaleur pulsée combinée pour les autres aliments ou poids non indiqués."

Can you try rotating the pages with PDF X Viewer after OCR, instead?


Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 10:14 pm
by Walter-Tracker Supp
Please make sure you have the latest viewer build, as well.

Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 10:35 pm
by VincentL
If I rotate a page with PDF X Viewer, it's OK. On the Mac I can copy/paste with no trouble.

When you rotate the file in OS X, do you save the file and open it again?
If I open file 2, rotate a page and select text, it's OK.
But if I save the file after rotating it and then re-open it, the text is corrupted.
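
The before/after comparison described here can be roughed out in code. This is only a sketch, assuming the text has already been copied out of each file by some other means, and assuming that a broken mapping shows up as private-use, unassigned, control, or replacement characters. `junk_ratio` is a hypothetical helper, and it will not catch text that was replaced by ordinary but wrong letters.

```python
import unicodedata

# Categories treated as suspicious (an assumption): private use (Co),
# unassigned (Cn), and control characters (Cc) other than common whitespace.
SUSPICIOUS = {"Co", "Cn", "Cc"}


def junk_ratio(text: str) -> float:
    """Fraction of characters that look like a failed glyph-to-text mapping."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary line breaks and tabs are not counted as junk
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS:
            bad += 1
    return bad / len(text)


# Sample strings: the first is taken from this thread, the second is made up.
clean = "Utilisez cette fonction pour cuire les aliments crus ou surgelés."
garbled = "\ue001\ue002\ufffd\x01 \ue010\ue011"
print(junk_ratio(clean), junk_ratio(garbled))
```

Running it on the text copied before saving and again after re-opening should give a low ratio for the first and a much higher one for the second if the text layer was damaged in this particular way.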

And I think I've the latest build of the viewer (Build 208.0)

Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 11:28 pm
by Walter-Tracker Supp
Can you try the same operation (rotating the page in OS X's Preview) with the attached document, then save and post the result?


Re: Font compatibility issue ?

Posted: Mon Feb 04, 2013 11:44 pm
by Walter-Tracker Supp
I have now reproduced the issue you are talking about, with your OCR files and with the one I supplied.

It appears to be a font substitution / Unicode issue, and the fault of OS X's Preview application. I guess that because the embedded font in the PDF is not available on the OS X system, when Preview creates a new PDF (which it does when it saves your rotated-page PDF), it generates some kind of embedded font and makes some mistakes. You would likely see this with any PDF that contains these kinds of Unicode symbols and embedded fonts, not just those created with our software.

I'm not sure if we can do anything about this for now - after all, the PDF is fine until OS X modifies it - but I will bring it up with some of the other developers and see if they have any ideas for workarounds.
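
To illustrate the kind of failure guessed at above, here is a toy model. It is not what Preview actually does, which is unknown here. It assumes a page stores glyphs under internal codes plus a separate table that maps those codes back to characters, and `reembed_without_mapping` is a hypothetical rewrite step that renumbers the glyphs and loses that table.

```python
# Toy model: the page draws glyphs by internal code, and a separate table
# maps those codes back to characters for copy/paste and search.
glyph_codes = [1, 2, 3, 3, 4]                    # codes drawn on the page
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}    # code -> character


def extract_text(codes, mapping):
    """Map glyph codes to text; codes without a mapping become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


def reembed_without_mapping(codes):
    """Hypothetical rewrite: renumber the glyphs but drop the character table."""
    renumber = {old: new for new, old in enumerate(sorted(set(codes)), start=100)}
    return [renumber[c] for c in codes], {}


print(extract_text(glyph_codes, to_unicode))     # Hello
new_codes, new_map = reembed_without_mapping(glyph_codes)
print(extract_text(new_codes, new_map))          # five replacement characters
```

In this model the page would still look the same, since the same glyphs are drawn in the same order. Only the route from glyph back to character is lost, which matches the symptom of text that can be selected but pastes as junk.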


Re: Font compatibility issue ?

Posted: Tue Feb 05, 2013 6:11 am
by VincentL

Here is the test you asked me to do:
I think it's quite curious because the embedded font is Arial, isn't it ?

Thanks Walter.


Re: Font compatibility issue ?

Posted: Tue Feb 05, 2013 6:07 pm
by Walter-Tracker Supp
It is an embedded subset of Arial Unicode MS, which is a font that covers a broad spectrum of the Unicode universe (many languages, etc.).