PDF-XChange - Tracker PDF Viewer - TIFF-XChange - Image-XChange - XMF-XChange - Raster-XChange - Support

 
arno.engelbrecht
User
Topic Author
Posts: 4
Joined: Tue Aug 12, 2014 7:37 am

How to determine if a PDF is searchable

Tue Aug 12, 2014 7:44 am

I have the PDF-XChange PRO SDK, which includes the OCR module. I can OCR documents, but I have a large number of documents: some are image-based and need to be OCR'ed, while others are already searchable and do not. Is there a way with the SDK to determine whether a document is already searchable?
 
Paul - Tracker Supp
User
Posts: 4713
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: How to determine if a PDF is searchable

Tue Aug 12, 2014 5:36 pm

Hi Arno,

thanks for the post,

I moved it from the End User OCR to the SDK OCR forum.

I am not personally sure how to do this and will have one of the development team advise when they have a spare moment.

regards
_________________
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
 
Vasyl-Tracker Dev Team
Site Admin
Posts: 1855
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: How to determine if a PDF is searchable

Tue Aug 12, 2014 6:01 pm

Hi, arno.engelbrecht.

One possible way is to check whether any page contains text:
PDFDocument hDoc;

// open document...

DWORD pagesNum = 0;
PXCp_GetPagesCount(hDoc, &pagesNum);

// check for existing text

PXCp_ET_Prepare(hDoc);

bool isSearchable = false;

for (DWORD i = 0; i < pagesNum; i++)
{
     // analyze page i, then ask how many text elements it contains
     PXCp_ET_AnalyzePageContent(hDoc, i);
     DWORD textElementsNum = 0;
     PXCp_ET_GetElementCount(hDoc, &textElementsNum);
     if (textElementsNum != 0)
     {
        isSearchable = true; // at least one page has text
        break;
     }
}

PXCp_ET_Finish(hDoc);


HTH
Vasyl Yaremyn
Tracker Software Products
Project Developer

 
arno.engelbrecht
User
Topic Author
Posts: 4
Joined: Tue Aug 12, 2014 7:37 am

Re: How to determine if a PDF is searchable

Mon Aug 18, 2014 7:33 am

Hi

Thanks a lot. Can I assume that a document is already searchable if I find any text in it, or should I require a minimum amount of text? Basically, I just want to make sure I am not misled by files that contain a few random characters but aren't actually searchable.
 
John - Tracker Supp
Site Admin
Posts: 8192
Joined: Tue Jun 29, 2004 10:34 am
Location: Vancouver Island - Canada

Re: How to determine if a PDF is searchable

Tue Aug 19, 2014 9:56 pm

Well, that would be down to you: analyse what's returned and decide whether it's usable or not ...

Best regards
Tracker Support
http://www.tracker-software.com
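
For anyone landing here with the same question, one way to "decide if it's usable" is to collect the per-page text element counts from the loop earlier in the thread and apply a threshold, rather than stopping at the first non-zero count. The following is only a rough sketch of that idea. The names LooksSearchable and SearchableThresholds and the default threshold values are hypothetical, not part of the SDK, and would need tuning against your own documents:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical thresholds (not SDK values); tune them on a sample of your files.
struct SearchableThresholds {
    unsigned minElementsPerPage = 5;   // fewer elements than this: treat the page as having no real text
    double   minTextPageRatio   = 0.5; // fraction of pages that must pass the check above
};

// elementsPerPage[i] is assumed to hold the count that PXCp_ET_GetElementCount
// reported for page i after PXCp_ET_AnalyzePageContent(hDoc, i).
bool LooksSearchable(const std::vector<unsigned>& elementsPerPage,
                     const SearchableThresholds& t = SearchableThresholds())
{
    if (elementsPerPage.empty())
        return false; // no pages, so nothing to search

    // Count the pages that carry enough text elements to be considered "text pages".
    std::size_t textPages = 0;
    for (unsigned count : elementsPerPage)
        if (count >= t.minElementsPerPage)
            ++textPages;

    const double ratio = static_cast<double>(textPages) / elementsPerPage.size();
    return ratio >= t.minTextPageRatio;
}
```

To use it, drop the early break from the loop, push each textElementsNum into a std::vector<unsigned>, and call LooksSearchable on the result after PXCp_ET_Finish. Documents that fail the check go to OCR. A per-page ratio is used here so that a scanned document with a stray page number or stamp on one page is not mistaken for a searchable one.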
