How to determine if a PDF is searchable

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide the developer with an expanding set of professional tools for Optical Character Recognition tasks.

Moderators: TrackerSupp-Daniel, Tracker Support, Sean - Tracker, Chris - Tracker Supp, Vasyl-Tracker Dev Team, Tracker Supp-Stefan

arno.engelbrecht
User
Posts: 4
Joined: Tue Aug 12, 2014 7:37 am

How to determine if a PDF is searchable

Post by arno.engelbrecht » Tue Aug 12, 2014 7:44 am

I have the PDF X-Change PRO SDK that includes the OCR module. I can OCR documents, but I have a large number of documents: some are image-based and thus need to be OCR'ed, while others are already searchable and do not need to be OCR'ed. Is there a way with the SDK to determine whether a document is already searchable?

Paul - Tracker Supp
Site Admin
Posts: 4848
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: How to determine if a PDF is searchable

Post by Paul - Tracker Supp » Tue Aug 12, 2014 5:36 pm

Hi Arno,

Thanks for the post.

I moved it from the End User OCR forum to the SDK OCR forum.

I am not personally sure how to do this, so I will have one of the development team advise when they have a spare moment.

regards
_________________
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com

Vasyl-Tracker Dev Team
Site Admin
Posts: 1916
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: How to determine if a PDF is searchable

Post by Vasyl-Tracker Dev Team » Tue Aug 12, 2014 6:01 pm

Hi, arno.engelbrecht.

One possible way is to check whether any page contains any text:


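PDFDocument hDoc;

// open document...

// get the number of pages in the document
DWORD pagesNum = 0;
PXCp_GetPagesCount(hDoc, &pagesNum);

// check for existing text

// start the text-extraction session for this document
PXCp_ET_Prepare(hDoc);

bool isSearchable = false;

for (DWORD i = 0; i < pagesNum; i++)
{
     // analyze page i, then ask how many text elements were found on it
     PXCp_ET_AnalyzePageContent(hDoc, i);
     DWORD textElementsNum = 0;
     PXCp_ET_GetElementCount(hDoc, &textElementsNum);
     if (textElementsNum != 0)
     {
        // at least one page carries text, so stop at the first hit
        isSearchable = true;
        break;
     }
}

// end the text-extraction session
PXCp_ET_Finish(hDoc);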
HTH
Vasyl Yaremyn
Tracker Software Products
Project Developer

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.

arno.engelbrecht
User
Posts: 4
Joined: Tue Aug 12, 2014 7:37 am

Re: How to determine if a PDF is searchable

Post by arno.engelbrecht » Mon Aug 18, 2014 7:33 am

Hi

Thanks a lot. Can I assume that if I find any text at all, the document is already searchable, or should I look for a minimum amount of text? Basically I just want to make sure that a few random characters in a file don't make it look searchable when it actually isn't.

John - Tracker Supp
Site Admin
Posts: 8201
Joined: Tue Jun 29, 2004 10:34 am
Location: Vancouver Island - Canada

Re: How to determine if a PDF is searchable

Post by John - Tracker Supp » Tue Aug 19, 2014 9:56 pm

Well, that would be down to you: analyse what's returned and decide whether it's usable or not.
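
As a rough illustration of what such a decision could look like, here is a minimal sketch that takes the per-page text element counts (for example, the values returned by PXCp_ET_GetElementCount for each page) and applies a simple threshold. The names SearchablePolicy and looksSearchable, and both threshold values, are hypothetical and not part of the SDK; what one "element" corresponds to depends on the SDK, so the numbers would need tuning against a sample of real documents.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical thresholds (assumptions, not SDK values); tune them on your own files.
struct SearchablePolicy {
    std::size_t minElementsPerPage = 5;  // pages with fewer elements count as "stray text"
    double minFractionOfPages = 0.5;     // share of pages that must carry enough text
};

// elementsPerPage[i] is the text element count reported for page i,
// e.g. the value filled in by PXCp_ET_GetElementCount after analysing that page.
bool looksSearchable(const std::vector<std::size_t>& elementsPerPage,
                     const SearchablePolicy& policy = SearchablePolicy{})
{
    if (elementsPerPage.empty())
        return false;  // no pages, nothing to search

    // Count the pages that reach the per-page minimum.
    std::size_t pagesWithText = 0;
    for (std::size_t count : elementsPerPage) {
        if (count >= policy.minElementsPerPage)
            ++pagesWithText;
    }

    // The document passes if a large enough share of its pages carry text.
    const double fraction = static_cast<double>(pagesWithText) /
                            static_cast<double>(elementsPerPage.size());
    return fraction >= policy.minFractionOfPages;
}
```

To use it, the page loop would collect every page's count into a vector instead of breaking at the first non-zero value, then pass that vector to looksSearchable. Requiring a share of pages rather than a single hit is one way to avoid treating a scanned document with a stamped page number or watermark as searchable.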

Best regards
Tracker Support
http://www.tracker-software.com
