Current PDF application <Unknown>
Current PDF application <Unknown>
I upgraded from PDF-XChange 3 to 6 yesterday, for a bargain price, to use the Optimize functionality.
Yesterday I was able to use PDF-XChange and the super Optimize.
Either because
1) I had my old PDF-XChange in D (because C is an SSD)
2) I used PDF-XChange for a while after the upgrade even though I was told to restart
3) My system is in Japanese
4) I cloned my C drive and left the old one in K
or for some other reason,
my desktop shortcut goes to the old version of PDF-XChange.
In Edit > Preferences > File Associations the Current PDF application is given as <Unknown>, and
in the File Associations "Application Details" dialogue only Adobe is offered as a program for viewing PDFs.
My browser says "application not found" when I attempt to view PDFs.
When I click PDF files in Explorer, too, there is no software associated with PDFs. I am given the
choice only of Adobe, LibreOffice, Firefox and Word; there are no other programs,
and when I browse to C:/Program Files (where the new Tracker folder is) or
D (where the old Tracker folder is) I cannot find the Editor.exe.
In C there is a PDF Editor folder, the contents of which look like this:
There is no PDF Editor folder in the Tracker Software folder in D
(which is strange, since I can still start my old editor from the desktop).
Aha, the desktop icon is associated with
K:/Program Files/ Tracker Software/PDF Editor/PDFEdit.exe
which is my old SSD from before I cloned it and moved the operating system to C.
The icon is shown as the generic icon, and I am told that this path must be
changed, but when I click on change (amend?) path, PDF-XChange 3 starts up.
I think that most of my other software is now operating from C.
Tim
PS
I wish there were batch optimize and batch OCR (as requested on a different thread).
- Will - Tracker Supp
- Site Admin
- Posts: 6815
- Joined: Mon Oct 15, 2012 9:21 pm
- Location: London, UK
- Contact:
Re: Current PDF application <Unknown>
Hi Tim,
Thanks for the post - I'd recommend that you completely uninstall our software and then delete all of our Program Files folders. Once done, reinstall Version 6 and there shouldn't be an issue.
Regarding the batch OCR - you'll need PDF-Tools, as it is our batch editing and manipulation utility. A license for PDF-Tools can be purchased on its own and includes both the Editor and XChange Lite. It also comes as part of PDF-XChange Pro. If you currently have an Editor license, you can upgrade to Tools or Pro via the Upgrade Options tab in your account page:
https://www.pdf-xchange.com/login
Thanks,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Best regards
Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
Re: Current PDF application <Unknown>
Thank you. I will try uninstalling everything and reinstalling now.
Yes, that appears to have worked.
I changed C to D in the installation too so I am not using SSD space.
Does PDF-Tools do batch Optimization too?
Tim
- Will - Tracker Supp
Re: Current PDF application <Unknown>
Awesome! Glad to hear that.
> Does PDF-Tools do batch Optimization too?
Yes, but you'll need to enable multi-file selection:
Cheers,
Re: Current PDF application <Unknown>
Thank you.
That is the dialogue box for Split/Merge files, but I presume it is the same for the Optimize tool too.
I guess that there is a trial period and I may try that.
Thanks for your help, and great software.
Sincerely
Tim
Timothy Takemoto
- Will - Tracker Supp
Re: Current PDF application <Unknown>
Hi Tim,
That's right - that is the same for virtually all of the tools in PDF-Tools.
There is no trial period per se; the trial is of unlimited duration and 100% of the features are available to try. However, using them will watermark documents if no license is present.
Thanks,
Re: Current PDF application <Unknown>
Thanks Will
I will try out PDF-Tools.
Thanks again,
Tim
Re: Current PDF application <Unknown>
Dear Will
Only if you have time.... and perhaps I should start (a) new thread(s).
I downloaded PDF-Tools and it looks really sweet.
I am now batch OCRing a couple of books,
one in Japanese the other English.
It looks good, and will overwrite the old files in the same folder (which is what I want).
1) I guess if I leave it running it will do everything in my data (Zotero) folder but the
data folder contains subfolders. Is there a way of batch OCRing PDFs in a folder tree?
2) What happens when it comes across an already OCRed file?
I could have checked this myself but I forgot to add an already OCRed document.
I would like it to ignore already OCRed files. There does not seem to be an option
to set this but perhaps it does it automatically.
By the way, OCR generally adds more spaces than needed, and as a result,
the text layer of OCRed documents can often only be used for searching books
but not for quoting from them, since the OCRed text has too many spaces,
making retyping quicker than removing lo a ds of spa c es.
It would be nice if there were an option to
1) at least remove double spaces. This could be done quite quickly and easily I guess.
2) more difficult but... re-parse the document looking for wo rds that have a space inside them.
This may be almost impossible to do automatically since there will be "some times" and others
"sometimes".
3) manually edit the text layer so that at least, with often quoted bits,
the tidied-up text with the spaces removed remains in the text layer of the OCRed file.
Tim
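For anyone who wants to try request (1) on text already copied out of an OCRed PDF, here is a minimal Python sketch. It is not a PDF-Tools feature; the function name `collapse_spaces` is made up for illustration, and it works on a plain string rather than on the PDF text layer itself.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces or tabs with a single space."""
    return re.sub(r"[ \t]{2,}", " ", text)


if __name__ == "__main__":
    print(collapse_spaces("too   many    spaces  here"))
```

This only covers runs of spaces. It leaves single stray spaces inside words untouched, which is the harder case described in (2).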
- Will - Tracker Supp
Re: Current PDF application <Unknown>
Hi Tim,
> 1) I guess if I leave it running it will do everything in my data (Zotero) folder but the
> data folder contains sub folders. Is there way of batch OCRing pdfs in a folder tree?
If you select the option to Show extended dialog for file selecting:
This will allow you to select a folder, which will then ask if you would like to also process files in subfolders:
> 2) What happens when it comes across an already OCRed file?
> I could have checked this myself but I forgot to add an already OCRed document.
> I would like it to ignore already OCRed files. There does not seem to be an option
> to set this but perhaps it does it automatically.
Unfortunately that's not possible, as there isn't any way to programmatically distinguish between 'normal' text and OCR text.
> By the way, OCR generally adds more spaces that needed, and as a result,
> the text layer of OCRed documents can often only be used for searching books
> but for not quoting from them since the OCRed text has too many spaces,
> making retyping quicker that removing lo a ds of spa c es.
We are actually in the process of re-writing our OCR and it should improve upon problems like that. This is slated for release in Version 6, within the next month or so.
> 1) at least remove double spaces. This could be done quite quickly and easily I guess.
Spaces don't actually exist, as such, in PDFs. PDFs are coordinate-based, so we essentially have to use our best judgment to determine how many 'spaces' should be present based on the distance between characters. This should also be improved in the new OCR.
> 2) more difficult but...re-parse the document looking for wo rds that have a space inside them.
> this may be almost impossible to do automatically since there will be "some times" and others
> "sometimes".
Another thing that should be improved in the new OCR. However, it's worth noting that all of these things are very dependent on the document. If you don't notice any improvement in the next release, please let us know and provide samples.
> 3) The possibility of manually editing the text layer so at least, with often quoted bits,
> the space removed tidied up text remains in the text layer of the OCRed file.
PDF-Tools is not designed for content editing like this. You'll need to use the Editor:
https://www.pdf-xchange.com/knowle ... the-Editor
Cheers,
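To illustrate the point that spaces are inferred from the distance between characters, here is a rough Python sketch. It is not how the Tracker OCR engine works; the `Glyph` class, the `join_glyphs` function and the `gap_ratio` threshold are all hypothetical, chosen only to show why the same page can yield too many or too few spaces.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    width: float  # horizontal extent of the glyph


def join_glyphs(glyphs, gap_ratio=0.3):
    """Rebuild one line of text, inserting a space wherever the gap between
    neighbouring glyphs exceeds gap_ratio times the average glyph width."""
    if not glyphs:
        return ""
    avg_width = sum(g.width for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * avg_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)
```

With a threshold that is too low, ordinary letter spacing is read as a space, which gives the "spa c es" effect. With one that is too high, real word breaks disappear.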
Re: Current PDF application <Unknown>
Will
Sorry, I found the folder tree option too late. Thank you for pointing it out.
"Unfortunately that's not possible, as there isn't any way to programmatically distinguish between 'normal' text and OCR text."
Is it impossible to distinguish between pdf files that already have a text layer and those that do not?
I have more than 1000 PDFs. I am not sure what will happen if I OCR them all again, even those with
text layers already.
I look forward to the new release with improved OCRing.
Oh, the editor can already edit the text layer? That is cool. I will look into it.
Tim
- Will - Tracker Supp
Re: Current PDF application <Unknown>
Hi Tim,
No worries!
> Is it impossible to distinguish between pdf files that already have a text layer and those that do not?
> I have more than 1000 pdfs. I am not sure what will happen if I ocr them all again even those with
> text layers already.
We can distinguish between those with text and those without it (the primary function of OCR software). However, it's not a good idea for us to exclude documents containing text, as some pages may have text and some may not, and that is where we are not able to distinguish automatically whether users want to have those pages OCR'd or not. Also, some users do OCR documents that already contain text.
Cheers,
Re: Current PDF application <Unknown>
Dear Will
I am not sure what you mean by "documents with text and those without".
I am not asking that your software distinguish between documents containing solely pictures, or
solely diagrams, and documents containing text by use of OCR functionality (by scanning the
document to see if there are text characters in the image), but merely that your wonderful
software would profitably detect the presence of a text layer, indicating that the file has already
been OCRed, or more importantly, *is a pdf that was generated from a (rtf/doc) text file*.
For my own purposes, it would be great if I could specify "ignore documents that have a text layer"
since those documents that do are generally well OCRed, or rather, they are journal articles that have
the original text from the authors, as well as the pdf image, already. This will not always be the case,
so it should definitely be an option but I think you will find that many users have "well OCRed" or rather
(since they were generated from text) "perfectly OCRed" pdfs with perfect text layers, that they want
excluded, imho, perhaps.
E.g. if I submit an MS Word file to a publisher then the publisher will NOT <<create an image of the Word
file and create the text layer by using OCR>> but will create the PDF by including the text from the Word file.
So if PDF-Tools were to OCR such a file, I presume that the result will be worse than the text
layer already present. But I may be wrong.
Am I making any sense?
Tim
Timothy Takemoto ne Williams
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: Current PDF application <Unknown>
Hello Timothy,
Will is out for his lunch break, so allow me to also join the discussion.
What he means is that if a file contains even one letter of "actual" text, we cannot distinguish whether it was created by OCR or by another process. There is simply no information about where this text came from, nor are there any provisions in the PDF specification for such info to be added.
So in a case where you have e.g. a merged document that consists of some pages that contain text and some that came from a scanner, we won't be able to tell whether the image-only pages are there because they are e.g. art that does not need OCR, or because two original source files were merged together.
I do realize that if your files are freshly scanned they will be image only, but there's no feature in the Editor and Tools to let you OCR only such files.
If you OCR a file already containing text, given that this text is curves and a fixed font without noise/errors, the OCR process will simply "double" it up, and you will have two layers of the same text on the page.
Regards,
Stefan
Re: Current PDF application <Unknown>
Dear Stefan
Thank you for joining in.
> I do realize that if your files are freshly scanned - they will be image only - but there's no feature in the Editor and Tools to let you only OCR such.
That is the feature I am humbly requesting.
It is only the "freshly scanned" that I would like to have OCRed.
I have hundreds (at least 10 hundred) of documents, but only
about 100 books, or maybe only 50 that I have freshly scanned. That is still
a weekend's work for PDF-Tools, but if I let it loose on all my pdfs then
it may take three weeks and I will have to stop it mid way, resulting perhaps
in triple ups, and it may create text that is worse than that which is already
present in the current text layer and I am not sure how I will distinguish
between the two doubled up layers of text when quoting a segment for
example. So a "only OCR freshly scanned" option would be great, and
as far as I am aware, essential, for me.
Sincerely
Tim
Timothy Takemoto
http://nihonbunka.com
- Patrick-Tracker Supp
- Site Admin
- Posts: 1645
- Joined: Thu Mar 27, 2014 6:14 pm
- Location: Vancouver Island
- Contact:
Re: Current PDF application <Unknown>
Hello Tim,
Thank you for your posts. One thing that must be understood is that the text layer placed when you run OCR is in no way different from other text objects within a PDF. There is currently an issue where, if OCR is run on pages that have already been OCR'ed or that contain text objects, the OCR module 'detects' the existing text objects and places new, duplicated text objects for the placed text layer as well as for the image text. So the first time a user runs OCR, there is the image text and one text object. The second time OCR is run on that page, the result is the image layer and three text objects! To reiterate, this will be fixed soon, with the end result that existing text objects are ignored by the OCR module. This does mean that one additional text object layer will still be placed on the document each time it is OCR'ed, instead of two new object layers.
I would recommend that you set up Tools to save the OCR'ed documents to a subfolder. Once all the documents are OCR'ed under the same names, you can move the new documents into the parent folder, which will overwrite the originals.
I hope this helps!
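The move-back step could be scripted along these lines. This is only a sketch under assumptions: the subfolder name `ocr_out` and the function `promote_staged` are made up for illustration, and it handles a single folder rather than a whole folder tree.

```python
import os
from pathlib import Path


def promote_staged(parent: Path, staging_name: str = "ocr_out") -> list[str]:
    """Move every PDF from parent/staging_name up into parent,
    overwriting any original file that has the same name."""
    staging = parent / staging_name
    moved = []
    for staged in sorted(staging.glob("*.pdf")):
        # os.replace overwrites an existing destination file.
        os.replace(staged, parent / staged.name)
        moved.append(staged.name)
    return moved
```

Checking a few of the staged files by hand before running this is sensible, since the originals are gone once they are overwritten.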
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Cheers,
Patrick Charest
Tracker Support North America
Re: Current PDF application <Unknown>
Thank you for your detailed explanation.
Alas my pdfs are already spread out within many sub-folders
in my Zotero data directory.
So I hope that you implement a "only freshly scanned" option one day.
I promise to purchase the software if you do. You are giving me a
generous academic discount though, so I am not suggesting it would
be worth your while.
Thanks for the great software.
Tim
- Will - Tracker Supp
Re: Current PDF application <Unknown>
Hi Tim,
> So I hope that you implement a "only freshly scanned" option one day.
Again, this cannot and will not be done. It simply is not practical at all:
There is also no way to tell if a document has been scanned, so we cannot detect that way either. This is a feature request that simply cannot be accommodated and you will find this with most if not all software.
Thanks,
Re: Current PDF application <Unknown>
Dear Will
I am afraid I fail to understand the difficulty.
In the three example files that you gave,
1) should not be scanned since it has a single character
2) should be scanned since it has not one single character (and nothing will be [correctly] found)
3) should be scanned since it has not one single character (and likewise nothing will be [correctly] found)
In point of fact, I do not have any files like the above as far as I am aware.
Either they have been scanned, or they have not, so I do not have any partially scanned files like 1 (afaik
but please see http://nihonbunka.com/temp/example.pdf )
I have very few unscanned pdf files which contain no text (i.e. pdf files containing non-text images like your
examples 2 and 3), and I would not mind if they were scanned along with the rest of my unscanned
text-image files.
I have about 1000 scanned text files (journal articles and books) and about 50 or 60 books
and some papers that are not scanned at all, with I presume not one text character inside them.
I have one or two image pdfs that would be scanned unnecessarily.
But perhaps my "freshly scanned" files do in fact have a few characters inside them?
Perhaps the copier inserts the time of scanning, or other text information.
This file is a few pages from a book.
http://nihonbunka.com/temp/example.pdf
Does it have any text characters?
Here are some programmers discussing ways of assessing whether a file contains text (e.g. by checking whether it contains any
reference to "fonts"):
https://stackoverflow.com/questions/602 ... 15#6553015
Sincerely
Tim
Timothy Takemoto
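The "reference to fonts" idea can be tried with a few lines of Python. This is a crude sketch, not the method from the linked discussion and not anything PDF-Tools does; the function names are made up. It only looks for the name `/Font` in the raw bytes, so it can miss fonts declared inside compressed object streams, and the presence of a font does not prove that a page has usable text.

```python
from pathlib import Path


def mentions_fonts(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF data contains the name /Font anywhere."""
    return b"/Font" in pdf_bytes


def file_mentions_fonts(path: Path) -> bool:
    """Same check, applied to a file on disk."""
    return mentions_fonts(path.read_bytes())
```

A freshly scanned, image-only file would usually be expected to fail this check, but that is an assumption worth testing on a few known files first.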
-
- User
- Posts: 5522
- Joined: Fri Nov 21, 2014 8:27 am
- Contact:
Re: Current PDF application <Unknown>
Hello Tim,
Here's what can be done from what I see. There are two possible options:
1) Do not OCR the Document if it contains at least 1 text symbol.
2) Do not OCR the Document's page if it contains at least 1 text symbol.
If that would suffice then we can implement these.
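As a rough illustration of how the two options differ (a sketch only, not Tracker's implementation), assume we already know how many text symbols each page contains. The function names and the list-of-counts input are hypothetical, and pages are numbered from 0.

```python
from typing import List


def pages_to_ocr_document_rule(text_counts: List[int]) -> List[int]:
    """Option 1: skip the whole document if any page has text."""
    if any(count > 0 for count in text_counts):
        return []
    return list(range(len(text_counts)))


def pages_to_ocr_page_rule(text_counts: List[int]) -> List[int]:
    """Option 2: skip only the pages that already have text."""
    return [i for i, count in enumerate(text_counts) if count == 0]


# A document that already has text on pages 0 and 2, with a blank page 1.
already_done = [120, 0, 340]
# A fresh scan with no text symbols on any page.
fresh_scan = [0, 0, 0]

print(pages_to_ocr_document_rule(already_done))  # []
print(pages_to_ocr_page_rule(already_done))      # [1]
print(pages_to_ocr_document_rule(fresh_scan))    # [0, 1, 2]
print(pages_to_ocr_page_rule(fresh_scan))        # [0, 1, 2]
```

The two rules agree on a fresh scan. They differ only when a document mixes pages with and without text: under option 2 the blank page is still sent to OCR, under option 1 it is not.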
Cheers,
Alex
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ
Re: Current PDF application <Unknown>
Dear Sasha
Thank you very much indeed.
I think that either of those options would be fine by me but
preferably (1) since there are a lot of blank pages with
no text symbols in all those journal article pdfs that I have.
I don't think I have any part-scanned documents.
It is just the pure, virgin, freshly scanned pdfs that I am concerned with.
I have my wallet at the ready (but as I say, you are offering
an extremely generous discount)!
Thank you very much for your consideration.
Tim
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: Current PDF application <Unknown>
Hello Tim,
After speaking with Sasha - I've made this ticket:
#4045: FR: Tools - Ability to skip OCR-ing documents already containing text element.
This is so that we can add a feature to PDF Tools to skip whole files when there is already "text" inside a PDF file. This way you will be able to batch process files quickly (Tools is a batch utility) and skip those that already have any text in them, without distinguishing between OCRed and not-OCRed files.
For the Editor, he is already adding the feature "Do not OCR pages that already contain text content items." There you will be able to OCR your file and skip any pages that already have text on them.
So the feature in Tools will apply to whole files that do not have any text, and the one in the Editor to pages without any text.
Regards,
Stefan
Re: Current PDF application <Unknown>
Is there a way of being informed of the progress, completion, or resolution of
#4045: FR: Tools - Ability to skip OCR-ing documents already containing text element.
Tim
- Patrick-Tracker Supp
- Site Admin
- Posts: 1645
- Joined: Thu Mar 27, 2014 6:14 pm
- Location: Vancouver Island
- Contact:
Re: Current PDF application <Unknown>
Hi Tim,
We have a closed ticketing system, but you can ask for updates here or by emailing us with reference to the ticket number. When the ticket status is set to resolved, you will be emailed.
Cheers!
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Cheers,
Patrick Charest
Tracker Support North America
Re: Current PDF application <Unknown>
Thank you, I very much look forward to it.
Tim
- Tracker Supp-Stefan
- Site Admin
- Posts: 17824
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact: