help: issues with converting graphic pdf to ebook

Forum for the PDF-XChange Editor - Free and Licensed Versions

Gorlash
User
Posts: 11
Joined: Fri Sep 23, 2016 8:20 pm

help: issues with converting graphic pdf to ebook

Post by Gorlash » Tue Dec 10, 2019 5:29 pm

I have a pdf book which consists of scanned pages. I want to convert it to text, then reformat and convert to ebook format (likely epub or mobi).

So I ran PDF-XChange Editor's OCR function on the book. It successfully converted the page images to text, but I still have a couple of problems which need to be solved; I'm hoping folks here can help me with these.

I did the original OCR pass using the free version of this program. I have since upgraded to the paid version, though not the Pro version.

My remaining issues for this task:

1. The OCR operation successfully converted most of the scanned text to real text, but it also created a blank image on each page, representing the blank background of the page (the book has 438 pages). Not only do these blank images create a *huge* epub file, but my Kindle Paperwhite reader is absolutely stupefied by the images, and it ends up creating pages which are completely unreadable!

Is there some way that I can either redo the OCR without generating the images, or subsequently have PXE strip the images out of the document??

2. The OCR operation also generated text where each displayed line is a separate line, ending in a newline character. We will certainly want all the lines of each paragraph to be joined together, so that the text can reflow when pages are resized on the Kindle. Is there any way that I can ask PXE to do this?

Thank you for any assistance that you can provide with these tasks!!

My system, in case it is useful here:
Windows 7 64-bit
GTX 1070ti video card
16GB RAM
plenty of disk space

Re: help: issues with converting graphic pdf to ebook

Post by Gorlash » Tue Dec 10, 2019 6:20 pm

Well, I found a related page which discusses the first issue. It is this thread:

viewtopic.php?f=62&t=30815&p=124201&hil ... es#p124201

One post gives the following instructions:
What you need to do now, is:
1) first make sure that the "Contents" pane and the "Properties" pane are both shown on your screen.
You can activate these panes via the View-menu > Other panes.
2) select all the text - you can do this via the Contents pane - click on the first line with Text + SHIFT click on the last line with Text
3) while all the text is selected, look into the Properties pane and change the "Fill Color" from 'None' to Black
4) finally look into the Contents pane, select everything that is "Path" and/or "Image" and delete it

All that is left now is purely 'text'.

//*************************************************
This... appears to work, once I realized that I had to select the text from the Contents pane, not from the displayed page (otherwise the Fill Color field in the Properties pane is greyed out).

This isn't quite ideal, though, because I will still have to go back and tweak each page individually to delete the images.
However, that's better than the alternative!

Now, if I can solve the join-lines-in-paragraph issue, I think I'll be set...

TrackerSupp-Daniel
Site Admin
Posts: 4063
Joined: Wed Jan 03, 2018 6:52 pm

Re: help: issues with converting graphic pdf to ebook

Post by TrackerSupp-Daniel » Tue Dec 10, 2019 10:07 pm

Hello Gorlash,

PDF, unlike docx or epub formats, is coordinate based, not flow based. This causes a number of issues when attempting to quickly convert from one to the other. My advice for you is that instead of trying to convert from PDF to epub, you export the document as a Word or txt file instead, and then convert that (more appropriately flow based) document into epub.
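
As a rough illustration of the coordinate-based versus flow-based distinction: a PDF page is closer to a list of text fragments with positions than to a stream of paragraphs. The Python sketch below is illustrative only. The `Fragment` type, the sample coordinates, and the `y_tolerance` parameter are assumptions made for the example, not PDF-XChange Editor internals. It groups fragments into lines by vertical position and then orders each line left to right, which is roughly the first step any PDF-to-text export has to take.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text placed at a page coordinate (hypothetical model)."""
    x: float
    y: float  # distance from the top of the page
    text: str


def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group positioned fragments into text lines, top to bottom."""
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # A fragment joins the current line if it sits at nearly the same height.
        if lines and abs(frag.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    # Within each line, read the fragments from left to right.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments are deliberately listed out of reading order.
page = [
    Fragment(120.0, 50.5, "brown fox"),
    Fragment(40.0, 50.0, "The quick"),
    Fragment(40.0, 64.0, "jumps over."),
]
print(fragments_to_lines(page))
```

Note that nothing in the positions says whether the second line continues the first paragraph or starts a new one; that information simply is not stored in this model.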

Regarding the issue with return characters after OCR, unfortunately there is not much that can be done. OCR works optically, and when you look at any text, visually the end of a line is simply the end of a line. We know that a sentence continues because we can interpret what the text says and recognize that it probably carries on to the next line. A computer does not have that level of understanding and interpretation at its disposal.
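
That said, simple heuristics applied to the exported text can recover many paragraph breaks after the fact. The Python sketch below is one such heuristic and not a feature of PDF-XChange Editor; the function names and the `short_ratio` threshold are assumptions for illustration. It treats a blank line, or a line that ends in sentence punctuation and is clearly shorter than the widest line, as the end of a paragraph. It also assumes a trailing hyphen means a word was split across lines, which will be wrong for genuinely hyphenated words.

```python
def ends_sentence(line):
    """True if the line ends in ., ! or ?, ignoring closing quotes/brackets."""
    return line.rstrip("\"')]").endswith((".", "!", "?"))


def unwrap_lines(text, short_ratio=0.8):
    """Join hard-wrapped lines into paragraphs using simple heuristics."""
    lines = [line.strip() for line in text.splitlines()]
    width = max((len(line) for line in lines), default=0)
    paragraphs, current = [], ""
    for line in lines:
        if not line:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Assumption: a trailing hyphen is a word split across lines.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
        # A short line that ends a sentence probably ends the paragraph.
        if ends_sentence(line) and len(line) < short_ratio * width:
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


sample = (
    "The scanner stored each printed line as\n"
    "a separate line, so a paragraph is bro-\n"
    "ken into pieces.\n"
    "A new paragraph starts on this line and\n"
    "ends here.\n"
)
for paragraph in unwrap_lines(sample):
    print(paragraph)
```

A paragraph whose last line happens to fill the full column width will be merged with the next one, so the output still needs proofreading.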

If you can send us copies of a few of the files you are converting, I can forward them to our OCR team for further investigation, to see whether there is a pattern we can look for that may minimize this. Any such improvement is unlikely to come soon, though.

Kind regards,
Daniel McIntyre
Support Technician
Tracker Software Products (Canada) LTD

Support: <Support@tracker-software.com>
Sales: +1 (250) 324-1621
Fax: +1 (250) 324-1623

Re: help: issues with converting graphic pdf to ebook

Post by Gorlash » Fri Dec 13, 2019 8:30 pm

Yeah, once I thought about it, I realized that the end-of-line issue is inherent in the OCR process.
We tried generating text output, but there was a *great* deal of garbage in the document, partly because the text on the back side of each page is slightly visible, and the OCR thinks there is something there to decode.
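
One way to reduce this kind of garbage after the fact is to drop lines that consist mostly of non-letter characters. A minimal Python sketch, assuming the OCR output has been saved as plain text; the function names and the 0.7 threshold are illustrative assumptions and would need tuning against the real book:

```python
def looks_like_text(line, min_letter_ratio=0.7):
    """Return True if most non-space characters in the line are letters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return True  # keep blank lines; they may separate paragraphs
    letters = sum(c.isalpha() for c in chars)
    return letters / len(chars) >= min_letter_ratio


def drop_noise(lines, min_letter_ratio=0.7):
    """Keep only the lines that pass the letter-ratio check."""
    return [ln for ln in lines if looks_like_text(ln, min_letter_ratio)]


ocr_lines = [
    "The ship left the harbour at dawn.",
    "~ .,; '1 |_ ,. :",   # typical shine-through debris
    "",
    "Chapter 2",
    "i| .' l; ,-",
]
for line in drop_noise(ocr_lines):
    print(line)
```

The obvious cost is that legitimate lines made up of digits or symbols (page numbers, dates, tables) are dropped too, so this only suits plain running prose.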

I also was not able to use the docx files which PXE is generating; for some reason, none of our Windows word processors can read them at all! So we are taking the converted PDF and manually translating it into an old-format .DOC file, which we can then convert to epub format.

There is probably more that I can do with this process; I'll be looking at it further in the coming weeks. It would be good if I could automate it, because it is common for us to find scanned PDF books on sites such as Internet Archive, and *their* epub generation is atrocious!

Re: help: issues with converting graphic pdf to ebook

Post by TrackerSupp-Daniel » Fri Dec 13, 2019 11:17 pm

Hello Gorlash,

If you are ending up with a number of fragments because of scanner shine through, you could try running OCR in "low accuracy" mode.

Regarding these .docx files we are generating that no word processor can read, could you please send me one of these, and the original PDF document before the conversion? We really cannot do anything to help with either of these issues unless we have some samples to work with. If the issue is that the documents contain sensitive information ill-suited for a public forum, you can email them directly to us via Support@tracker-software.com

Kind regards,
Daniel McIntyre

Re: help: issues with converting graphic pdf to ebook

Post by Gorlash » Mon Dec 16, 2019 2:00 pm

Okay, I'll need to re-generate the .docx files, since I've deleted them. As for 'no word processor can open them', I should have been more specific; I've already discussed this in a separate support ticket. One of our word processors is MS Word 2000, which doesn't support docx, and the other is Open Office, and as your support people said, I may have an older version of that program. I don't use it very often!

Thank you for the note on low-accuracy OCR, I'll try that and see how our results change!

Tracker Supp-Stefan
Site Admin
Posts: 14196
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: help: issues with converting graphic pdf to ebook

Post by Tracker Supp-Stefan » Mon Dec 16, 2019 2:52 pm

Hello Gorlash,

Thanks for the follow-up, and yes - Office 2000/2003 did not work with .docx files but with .doc (the non-XML format), and even Office 2007 used a different version of .docx, so even 2007 might not be able to open the .docx files we generate. All more recent versions of Office should.

Cheers,
Stefan

Re: help: issues with converting graphic pdf to ebook

Post by Gorlash » Mon Dec 16, 2019 3:58 pm

Okay, I *do* have a copy of the first docx that I generated with PXE. This is huge, because it still contains the blank image files for each of the 438 pages of the book.

My copy of Open Office is V4.1.6, which is very recent. When I try to open this docx in OpenOffice Writer, the program just hangs indefinitely, and I have to terminate it from Task Manager. I'm using Windows 7 64-bit

I will also upload what is the closest file that I have to the original, but I don't think it is *actually* the original, because I can copy and paste text out of it... I'm pretty sure the true original was graphics only, and I couldn't do that.

Sadly, I accidentally destroyed my true original pdf, when I was doing the conversion with PXE... I hit Control-S, which I did *not* want to do... I also did not make a backup copy of the original before starting to mess with it...

I originally obtained this document from Internet Archive, but it is no longer available there, for some reason.

Okay, I cannot upload the docx file here, it is 531MB even in zip format. I will email it to you.
Later: damn, I cannot send it via email either, it is too large. I'll try pushing it to dropbox.

Here is the dropbox link to the docx file:
https://www.dropbox.com/s/36e376nglendq ... .docx?dl=1

Re: help: issues with converting graphic pdf to ebook

Post by Tracker Supp-Stefan » Mon Dec 16, 2019 4:05 pm

Hello Gorlash,

You seem to have quoted yourself 3 times in the above posts?
Is it that you want to add something to them and are e.g. experiencing some technical difficulties?

I am downloading your .docx file now and will take a look at it once it finishes downloading.

Regards,
Stefan

Re: help: issues with converting graphic pdf to ebook

Post by Gorlash » Tue Dec 17, 2019 3:43 am

ummm... I have no idea why all the duplicate posts occurred... clearly I got confused about the interface!
Actually, I see what was happening... I was clicking 'reply with quote' instead of 'edit'...
I apologize for that... I was *very* frustrated with not being able to transfer that file via any normal means...
Can you delete the duplicates? It doesn't appear that I can...

Re: help: issues with converting graphic pdf to ebook

Post by Tracker Supp-Stefan » Tue Dec 17, 2019 3:13 pm

Hello Gorlash,

Yes - I just removed the duplicate posts!
As for the file itself - I did download it last night, and it opened correctly for me in Word.
I am using MS Office 2010 (a rather old version), and it still opened the file correctly.

Regards,
Stefan

Re: help: issues with converting graphic pdf to ebook

Post by Gorlash » Thu Dec 19, 2019 2:39 pm

Well, that's interesting... I don't suppose you have a copy of OpenOffice available anywhere?
It simply hangs while trying to open this file; even after 20 minutes of letting it grind, it had not finished.
However, the docx file *does* appear to be a perfectly valid .zip file, which I can extract using my unzip program, and the .xml contents all look perfectly valid.
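
For reference, this kind of check can be scripted. Below is a minimal Python sketch using only the standard library `zipfile` and `xml.etree.ElementTree` modules; `inspect_docx` is a hypothetical helper name, and the checks are a rough sanity test rather than full validation against the .docx format. It verifies the archive, tries to parse every XML part, and lists the largest members, which should show whether the per-page images account for the file size.

```python
import zipfile
import xml.etree.ElementTree as ET


def inspect_docx(path, top=5):
    """Check that a .docx is a readable zip with well-formed XML parts,
    and report its largest members by uncompressed size."""
    with zipfile.ZipFile(path) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = zf.testzip()
        malformed = []
        for name in zf.namelist():
            if name.endswith((".xml", ".rels")):
                try:
                    # Loads the whole part into memory; fine for a sketch.
                    ET.fromstring(zf.read(name))
                except ET.ParseError:
                    malformed.append(name)
        largest = sorted(zf.infolist(), key=lambda i: i.file_size, reverse=True)
        return {
            "corrupt_member": bad_member,
            "malformed_xml": malformed,
            "largest": [(i.filename, i.file_size) for i in largest[:top]],
        }
```

Calling `inspect_docx("book.docx")` (the file name is a placeholder) returns a small dictionary; an empty `malformed_xml` list and a `corrupt_member` of `None` only show that the container is sound, not that a given word processor can cope with its size.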

Anyway, I think we can suspend this issue for now. I no longer have access to the original fully-graphics .pdf file, and Internet Archive no longer distributes books in that format, so I probably won't need to solve this problem again anytime soon.

I appreciate all the effort that all of you made to try to resolve this issue here; I agree it is a rather obscure problem...

Re: help: issues with converting graphic pdf to ebook

Post by Tracker Supp-Stefan » Thu Dec 19, 2019 3:08 pm

Hello Gorlash,

I am afraid not - I do not have OpenOffice available - but as you said, the file appears valid, so it's most likely an issue that OpenOffice has with handling such a huge and heavy file.

If you are happy with stopping this case - so are we!

Season's greetings,
Stefan
