Current PDF application <Unknown>

Forum for the PDF-XChange Editor - Free and Licensed Versions

timtak
User
Posts: 52
Joined: Mon Mar 19, 2012 8:29 am

Post by timtak »

I upgraded from PDF-XChange 3 to 6 yesterday, for a bargain price, to use the Optimize functionality.
Yesterday I was able to use PDF-XChange and the super Optimize.

Either because
1) I had my old PDFX in D (because C is an SSD)
2) I used PDF-XChange for a while after the upgrade even though I was told to restart
3) My system is in Japanese
4) I cloned my C drive and left the old one in K
or for some other reason

My desktop shortcut goes to the old version of PDF-XChange.
In Edit > Preferences > File Associations the Current PDF application is given as <Unknown> and
in the file association "Application Details" dialogue only Adobe is offered as a program for viewing PDFs.
My browser says "application not found" when I attempt to view PDFs.
When I click PDF files in Explorer, too, there is no software associated with PDFs. I am given the
choice only of Adobe, LibreOffice, Firefox and Word; there are no other programs,
and when I browse to C/Program Files (where the new Tracker folder is) or
D (where the old Tracker folder is) I cannot find the Editor.exe.

In C there is a PDF Editor folder, the contents of which look like this:
Image
There is no PDF Editor folder in the Tracker Software folder in D
(which is strange since I can still start my old editor from the desktop)

Aha, the desktop icon is associated with
K:/Program Files/ Tracker Software/PDF Editor/PDFEdit.exe
which is on my old SSD, from before I cloned it and moved the operating system to C.
The icon is shown as the generic icon, and I am told that this path must be
changed, but when I click on change (amend?) path, PDF-XChange 3 starts up.

I think that most of my other software is now operating from C.

Tim

PS
I wish there were batch optimize and batch OCR (as requested on a different thread).
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm
Location: London, UK

Post by Will - Tracker Supp »

Hi Tim,

Thanks for the post - I'd recommend that you completely uninstall the software and then delete all of our folders left under Program Files. Once done, re-install Version 6 and there shouldn't be an issue.

Regarding the batch OCR - you'll need PDF-Tools, as it is our batch editing and manipulation utility. A license for PDF-Tools can be purchased on its own and includes both the Editor and XChange Lite. It also comes as part of PDF-XChange Pro. If you currently have an Editor license, you can upgrade to Tools or Pro via the Upgrade Options tab in your account page:
https://www.pdf-xchange.com/login

Thanks,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com

Post by timtak »

Thank you. I will try uninstalling everything and reinstalling now.

Yes, that appears to have worked.

I changed C to D in the installation too so I am not using SSD space.

Does PDF-Tools do batch Optimization too?

Tim

Post by Will - Tracker Supp »

Awesome! Glad to hear that :)
> Does PDF-Tools do batch Optimization too?
Yes, but you'll need to enable multi-file selection:
Image

Cheers,

Post by timtak »

Thank you.

That is the dialogue box for Split Merge files, but I presume it is the same for the Optimize tool too

I guess that there is a trial period and I may try that.

Thanks for your help, and great software.

Sincerely

Tim
Timothy Takemoto

Post by Will - Tracker Supp »

Hi Tim,

That's right - that is the same for virtually all of the tools in PDF-Tools.

There is no trial period per se: the trial is of unlimited duration and 100% of the features are available to try. However, using them will watermark documents if no license is present.

Thanks,

Post by timtak »

Thanks Will

I will try out PDF-Tools.

Thanks again,

Tim

Post by Will - Tracker Supp »

:D

Post by timtak »

Dear Will

Only if you have time.... and perhaps I should start (a) new thread(s).

I downloaded PDF-Tools and it looks really sweet.

I am now batch OCRing a couple of books,
one in Japanese the other English.

It looks good, and will overwrite the old files in the same folder (which is what I want).

1) I guess if I leave it running it will do everything in my data (Zotero) folder, but the
data folder contains subfolders. Is there a way of batch OCRing PDFs in a folder tree?

2) What happens when it comes across an already OCRed file?
I could have checked this myself but I forgot to add an already OCRed document.
I would like it to ignore already OCRed files. There does not seem to be an option
to set this but perhaps it does it automatically.

By the way, OCR generally adds more spaces than needed, and as a result,
the text layer of OCRed documents can often only be used for searching books
but not for quoting from them, since the OCRed text has too many spaces,
making retyping quicker than removing lo a ds of spa c es.

It would be nice if there were an option to
1) at least remove double spaces. This could be done quite quickly and easily I guess.
2) more difficult but...re-parse the document looking for wo rds that have a space inside them.
This may be almost impossible to do automatically since there will be "some times" and others
"sometimes".
3) manually edit the text layer so that, at least for often-quoted bits,
the tidied-up text with the spaces removed remains in the text layer of the OCRed file.
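
The first two wishes in the list can be sketched as a clean-up pass over text that has already been copied out of the OCR layer. This is only a minimal sketch under assumptions: the function names and the word list passed in are hypothetical, and nothing here is part of PDF-XChange or PDF-Tools. Two fragments are joined only when the joined form is a known word and at least one fragment is not a word by itself, so an ambiguous pair such as "some times" is left alone.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def rejoin_split_words(text: str, vocabulary: set[str]) -> str:
    """Join two adjacent fragments when the result is a known word.

    A pair is left untouched when both fragments are already words,
    because the intended reading cannot be decided automatically.
    """
    tokens = text.split(" ")
    out: list[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] and tokens[i + 1]:
            first, second = tokens[i], tokens[i + 1]
            joined = (first + second).lower()
            both_known = (
                first.lower() in vocabulary and second.lower() in vocabulary
            )
            if joined in vocabulary and not both_known:
                out.append(first + second)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)
```

The sketch only repairs words split into two pieces and ignores punctuation attached to a fragment, so a case like "lo a ds" would need a more thorough approach.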

Tim

Post by Will - Tracker Supp »

Hi Tim,
> 1) I guess if I leave it running it will do everything in my data (Zotero) folder, but the
> data folder contains subfolders. Is there a way of batch OCRing PDFs in a folder tree?
If you select the option to "Show extended dialog for file selecting":
Image

This will allow you to select a folder, and you will then be asked whether you would also like to process files in subfolders:
Image
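
For anyone scripting their own batches outside PDF-Tools, the same "include subfolders or not" choice can be sketched with the Python standard library. This is an illustration only: `find_pdfs` is a hypothetical helper name, and it merely lists files. It does not call PDF-Tools or run any OCR.

```python
from pathlib import Path


def find_pdfs(root: str, include_subfolders: bool = True) -> list[Path]:
    """Return the PDF files under root, optionally descending into subfolders."""
    base = Path(root)
    # rglob walks the whole folder tree; glob looks at the top level only.
    candidates = base.rglob("*") if include_subfolders else base.glob("*")
    # Compare the extension case-insensitively so "B.PDF" is found as well.
    return sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf"
    )
```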
> 2) What happens when it comes across an already OCRed file?
> I could have checked this myself but I forgot to add an already OCRed document.
> I would like it to ignore already OCRed files. There does not seem to be an option
> to set this but perhaps it does it automatically.
Unfortunately that's not possible, as there isn't any way to programmatically distinguish between 'normal' text and OCR text.
> By the way, OCR generally adds more spaces than needed, and as a result,
> the text layer of OCRed documents can often only be used for searching books
> but not for quoting from them, since the OCRed text has too many spaces,
> making retyping quicker than removing lo a ds of spa c es.
We are actually in the process of rewriting our OCR, which should improve on problems like that. This is slated for release in Version 6, within the next month or so.
> 1) at least remove double spaces. This could be done quite quickly and easily I guess.
Spaces don't actually exist, as such, in PDFs. PDFs are coordinate-based, so we essentially have to use our best judgment to determine how many 'spaces' should be present based on the distance between characters. This should also be improved in the new OCR.
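
To illustrate what deciding on spaces "based on the distance between characters" can look like, here is a minimal sketch of gap-based space inference for a single line of text. It is not Tracker's algorithm: the `Glyph` structure, the single-line assumption, and the 0.3 threshold (a gap wider than 30% of the average glyph width counts as one space) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # horizontal extent of the glyph


def line_to_text(glyphs: list[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text, inserting a space wherever the gap is wide."""
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    avg_width = sum(g.width for g in ordered) / len(ordered)
    pieces = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        # Distance from the right edge of the previous glyph to this one.
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * avg_width:
            pieces.append(" ")
        pieces.append(cur.char)
    return "".join(pieces)
```

With a threshold that is too low, ordinary letter spacing is read as word breaks, which produces the kind of "spa c es" output described above. With one that is too high, separate words run together.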
> 2) more difficult but...re-parse the document looking for wo rds that have a space inside them.
> This may be almost impossible to do automatically since there will be "some times" and others
> "sometimes".
Another thing that should be improved in the new OCR. However, it's worth noting that all of these things are very dependent on the document. If you don't notice any improvement in the next release, please let us know and provide samples.
> 3) The possibility of manually editing the text layer so at least, with often quoted bits,
> the space removed tidied up text remains in the text layer of the OCRed file.
PDF-Tools is not designed for content editing like this. You'll need to use the Editor:
https://www.pdf-xchange.com/knowle ... the-Editor
https://www.pdf-xchange.com/knowle ... the-Editor

Cheers,

Post by timtak »

Will

Sorry, I found the folder tree option too late. Thank you for pointing it out.

"Unfortunately that's not possible, as there isn't any way to programmatically distinguish between 'normal' text and OCR text."

Is it impossible to distinguish between pdf files that already have a text layer and those that do not?
I have more than 1000 pdfs. I am not sure what will happen if I ocr them all again even those with
text layers already.

I look forward to the new release with improved OCRing.

Oh, the editor can already edit the text layer? That is cool. I will look into it.

Tim

Post by Will - Tracker Supp »

Hi Tim,

No worries!
> Is it impossible to distinguish between pdf files that already have a text layer and those that do not?
> I have more than 1000 pdfs. I am not sure what will happen if I ocr them all again even those with
> text layers already.
We can distinguish between documents with text and those without it (the primary function of OCR software). However, it's not a good idea for us to exclude documents containing text: some pages may have text and some may not, and in that case we cannot tell automatically whether users want those pages OCR'd or not. Also, some users do OCR documents that already contain text.

Cheers,

Post by timtak »

Dear Will

I am not sure what you mean by "documents with text and those without".

I am not asking that your software distinguish between documents containing solely pictures, or
solely diagrams, and documents containing text by use of OCR functionality (by scanning the
document to see if there are text characters in the image), but merely that your wonderful
software would profitably detect the presence of a text layer, indicating that the file has already
been OCRed, or more importantly, *is a pdf that was generated from a (rtf/doc) text file*.

For my own purposes, it would be great if I could specify "ignore documents that have a text layer"
since those documents that do are generally well OCRed, or rather, they are journal articles that have
the original text from the authors, as well as the pdf image, already. This will not always be the case,
so it should definitely be an option but I think you will find that many users have "well OCRed" or rather
(since they were generated from text) "perfectly OCRed" pdfs with perfect text layers, that they want
excluded, imho, perhaps.

E.g. If I submit a MS Word file to a publisher then the publisher will NOT <<create an image of the Word
file and create the text layer by using OCR>> but will create the pdf by including the text from the Word file.

So if PDF-Tools were to OCR such a file, I presume that the result will be worse than the text
layer already present. But I may be wrong.

Am I making any sense?

Tim

Timothy Takemoto ne Williams
Tracker Supp-Stefan
Site Admin
Posts: 17810
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Post by Tracker Supp-Stefan »

Hello Timothy,

Will is out for his lunch break, so allow me to also join the discussion.
What he means is that if a file contains even one letter of "actual" text, we cannot distinguish whether it was created by OCR or by another process. There is simply no information about where the text came from, nor are there any provisions in the PDF specification for such info to be added.
So in a case where you have e.g. a merged document that consists of some pages that contain text and some that came from a scanner, we won't be able to tell whether you do not need the image-only pages OCRed because they are e.g. art, or whether two original source files were merged together.

I do realize that if your files are freshly scanned they will be image only, but there's no feature in the Editor and Tools to let you OCR only such files.

If you OCR a file already containing text - given that this text is curves and a fixed font without noise/errors - the OCR process will simply "double" it up, and you will have two layers of the same text on the page.

Regards,
Stefan

Post by timtak »

Dear Stefan
Thank you for joining in.
> I do realize that if your files are freshly scanned - they will be image only - but there's no feature in the Editor and Tools to let you only OCR such.
That is the feature I am humbly requesting.

It is only the "freshly scanned" files that I would like to have OCRed.

I have hundreds (at least ten hundred) of documents, but only
about 100 books, or maybe only 50 that I have freshly scanned. That is still
a weekend's work for PDF-Tools, but if I let it loose on all my pdfs then
it may take three weeks and I will have to stop it mid way, resulting perhaps
in triple ups, and it may create text that is worse than that which is already
present in the current text layer and I am not sure how I will distinguish
between the two doubled up layers of text when quoting a segment for
example. So a "only OCR freshly scanned" option would be great, and
as far as I am aware, essential, for me.

Sincerely

Tim
Timothy Takemoto
http://nihonbunka.com
Patrick-Tracker Supp
Site Admin
Posts: 1645
Joined: Thu Mar 27, 2014 6:14 pm
Location: Vancouver Island

Post by Patrick-Tracker Supp »

Hello Tim,

Thank you for your posts. One thing that must be understood is that the text layer placed when you run OCR is in no way different from other text objects within a PDF. There is currently an issue: if OCR is run on pages that have already been OCR'ed, or that contain text objects, the OCR module 'detects' the existing text objects and places new, duplicated text objects for them as well as for the image text. So the first time a user runs OCR, the page has the image text and one text object. The second time OCR is run on that page, the result is the image layer and three text objects! To reiterate - this will be fixed soon, the end result being that text objects will be ignored by the OCR module. This does mean that one additional text object layer will still be placed on the document each time it is OCR'ed, instead of two new object layers.

I would recommend that you set up Tools to save the OCR'ed documents to a subfolder. Once all the documents are OCR'ed under the same name you can move the new documents into the parent folder which will allow you to overwrite the originals.
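
The move-back step could be scripted along these lines. This is a sketch under assumptions: the function name and the output subfolder name used in the usage note (`ocr_out`) are hypothetical, and a file is moved only when a file with the same name already exists in the parent folder, so nothing unexpected is added there.

```python
from pathlib import Path


def promote_ocr_outputs(parent: str, subfolder: str) -> list[str]:
    """Move each PDF in parent/subfolder over the same-named original in parent."""
    parent_dir = Path(parent)
    replaced: list[str] = []
    # sorted() collects the file list first, so moving files is safe in the loop.
    for new_file in sorted((parent_dir / subfolder).glob("*.pdf")):
        original = parent_dir / new_file.name
        if original.is_file():
            # Path.replace overwrites the destination if it already exists.
            new_file.replace(original)
            replaced.append(new_file.name)
    return replaced
```

Calling `promote_ocr_outputs("D:/papers", "ocr_out")` would return the names of the originals that were overwritten. It is worth running it on a copy of the folder first, since the overwrite cannot be undone.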

I hope this helps!

Cheers,

Patrick Charest
Tracker Support North America

Post by timtak »

Thank you for your detailed explanation.

Alas my pdfs are already spread out within many sub-folders
in my Zotero data directory.

So I hope that you implement a "only freshly scanned" option one day.

I promise to purchase the software if you do. You are giving me a
generous academic discount though, so I am not suggesting it would
be worth your while.

Thanks for the great software.

Tim

Post by Will - Tracker Supp »

Hi Tim,
> So I hope that you implement a "only freshly scanned" option one day.
Again, this cannot and will not be done. It simply is not practical at all:
Image

There is also no way to tell if a document has been scanned, so we cannot detect that way either. This is a feature request that simply cannot be accommodated and you will find this with most if not all software.

Thanks,

Post by timtak »

Dear Will

I am afraid I fail to understand the difficulty.

In the three example files that you gave,
1) should not be scanned since it has a single character
2) should be scanned since it has not one single character (and nothing will be [correctly] found)
3) should be scanned since it has not one single character (and likewise nothing will be [correctly] found)

In point of fact, I do not have any files like the above as far as I am aware.

Either they have been scanned, or they have not, so I do not have any partially scanned files like 1 (afaik
but please see http://nihonbunka.com/temp/example.pdf )

I have very few unscanned pdf files which contain no text (i.e. pdf files containing non-text images like your
examples 2 and 3), and I would not mind if they were scanned along with the rest of my unscanned
text-image files.

I have about 1000 scanned text files (journal articles and books) and about 50 or 60 books
and some papers that are not scanned at all, with I presume not one text character inside them.
I have one or two image pdfs that would be scanned unnecessarily.

But perhaps my "freshly scanned" files do in fact have a few characters inside them?
Perhaps the copier inserts the time of scanning, or other text information.
This file is a few pages from a book.
http://nihonbunka.com/temp/example.pdf
Does it have any text characters?

Here are some programmers discussing ways of assessing whether a file contains text (e.g. by if it contains any
reference to "fonts")
https://stackoverflow.com/questions/602 ... 15#6553015
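
The idea mentioned there, checking whether the file refers to any fonts, can be sketched very crudely as a byte search. This is a heuristic under stated assumptions, not a PDF parser, and the function names are hypothetical. The search only sees the uncompressed parts of the file, so a PDF that keeps its page resources inside compressed object streams can be misreported as image-only, and a font entry does not guarantee that any text is actually drawn on a page.

```python
from pathlib import Path


def bytes_mention_fonts(data: bytes) -> bool:
    """Return True if the raw PDF bytes contain a /Font name anywhere."""
    # This also matches related names such as /FontDescriptor, which
    # likewise indicate that a font is declared in the file.
    return b"/Font" in data


def pdf_mentions_fonts(path: str) -> bool:
    """Apply the same check to a file on disk."""
    return bytes_mention_fonts(Path(path).read_bytes())
```

A file for which this returns False is only a candidate for "freshly scanned" and would still need checking by hand or with a real PDF library.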

Sincerely
Tim
Timothy Takemoto
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Post by Sasha - Tracker Dev Team »

Hello Tim,

Here's what can be done from what I see. There are two possible options:
1) Do not OCR the Document if it contains at least 1 text symbol.
2) Do not OCR the Document's page if it contains at least 1 text symbol.
If that would suffice then we can implement these.
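
The difference between the two options can be sketched as follows. The input is assumed to be a list holding the number of text symbols found on each page. How that count is obtained is outside the sketch, the function names are hypothetical, and page numbers are zero-based.

```python
def pages_to_ocr_document_rule(text_counts: list[int]) -> list[int]:
    """Option 1: skip the whole document if any page contains text."""
    if any(count > 0 for count in text_counts):
        return []
    return list(range(len(text_counts)))


def pages_to_ocr_page_rule(text_counts: list[int]) -> list[int]:
    """Option 2: skip only the individual pages that contain text."""
    return [i for i, count in enumerate(text_counts) if count == 0]
```

For a three-page text PDF whose middle page is blank (`[120, 0, 95]`), option 1 selects no pages, while option 2 still sends the blank page to OCR. For a document with no text at all, both options select every page.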

Cheers,
Alex
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ

Post by timtak »

Dear Sasha

Thank you very much indeed.

I think that either of those options would be fine by me but
preferably (1) since there are a lot of blank pages with
no text symbols in all those journal article pdfs that I have.
I don't think I have any part-scanned documents.

It is just the pure, virgin, freshly scanned pdfs that I am concerned with.

I have my wallet at the ready (but as i say, you are offering
an extremely generous discount)!

Thank you very much for your consideration.

Tim

Post by Tracker Supp-Stefan »

Hello Tim,

After speaking with Sasha - I've made this ticket:
#4045: FR: Tools - Ability to skip OCR-ing documents already containing text element.
This is so that we can add a feature to PDF-Tools to skip whole files when there is already "text" inside a PDF file. This way you will be able to batch process files quickly (Tools is a batch utility) and skip those that already have any text in them, without distinguishing between OCRed and non-OCRed files.

For the Editor he is already adding the feature "Do not OCR pages that already contain text content items." So there you will be able to OCR your file and skip any pages that already have text on them.

So the feature in Tools will be for whole files that do not have any text, and in the Editor for pages without any text.

Regards,
Stefan

Post by timtak »

Is there a way of being informed of the progress, completion, or resolution of

#4045: FR: Tools - Ability to skip OCR-ing documents already containing text element.


Tim

Post by Patrick-Tracker Supp »

Hi Tim,

We have a closed ticketing system, but you can ask for updates here or by emailing us with reference to the ticket number. When the ticket status is set to resolved, you will be emailed.

Cheers!

Post by timtak »

Thank you, I very much look forward to it.
Tim

Post by Tracker Supp-Stefan »

:)