How to determine if a PDF is searchable

Posted: Tue Aug 12, 2014 7:44 am
by arno.engelbrecht
I have the PDF-XChange PRO SDK, which includes the OCR module. I can OCR documents, but I have a large number of documents: some are image-based and need to be OCR'ed, while others are already searchable and do not. Is there a way with the SDK to determine whether a document is already searchable?

Re: How to determine if a PDF is searchable

Posted: Tue Aug 12, 2014 5:36 pm
by Paul - Tracker Supp
Hi Arno,

Thanks for the post.

I moved it from the End User OCR forum to the SDK OCR forum.

I am not personally sure how to do this, so I will have one of the development team advise when they have a spare moment.

regards

Re: How to determine if a PDF is searchable

Posted: Tue Aug 12, 2014 6:01 pm
by Vasyl-Tracker Dev Team
Hi, arno.engelbrecht.

One possible way: check whether any page contains any text, like this:

PDFDocument hDoc;

// open document...

DWORD pagesNum = 0;
PXCp_GetPagesCount(hDoc, &pagesNum);

// prepare the document for text extraction

PXCp_ET_Prepare(hDoc);

bool isSearchable = false;

for (DWORD i = 0; i < pagesNum; i++)
{
     // analyse page i, then ask how many text elements were found on it
     PXCp_ET_AnalyzePageContent(hDoc, i);
     DWORD textElementsNum = 0;
     PXCp_ET_GetElementCount(hDoc, &textElementsNum);
     if (textElementsNum != 0)
     {
        // at least one text element: treat the document as searchable
        isSearchable = true;
        break;
     }
}

PXCp_ET_Finish(hDoc);
HTH

Re: How to determine if a PDF is searchable

Posted: Mon Aug 18, 2014 7:33 am
by arno.engelbrecht
Hi

Thanks a lot. Can I assume that if I find any text, the document is already searchable, or should I check for a minimum amount of text? Basically, I just want to make sure I am not misled by a few random characters in files that aren't actually searchable.

Re: How to determine if a PDF is searchable

Posted: Tue Aug 19, 2014 9:56 pm
by John - Tracker Supp
Well, that would be down to you to analyse what's returned and decide if it's usable or not ...
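
As a rough illustration of that kind of analysis, here is a minimal sketch that takes the per-page text element counts (for example, the values PXCp_ET_GetElementCount reports for each page in the loop earlier in the thread) and applies a simple threshold instead of stopping at the first non-empty page. The SearchableCriteria struct, the looksSearchable function and both threshold values are hypothetical and not part of the SDK. Note also that a text element is a run of text rather than a single character, so the numbers would need tuning against a sample of real files.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical thresholds, not SDK values; tune them on a sample of your own files.
struct SearchableCriteria {
    unsigned minElementsPerPage = 20;  // fewer elements than this counts as "no real text"
    double   minPageFraction    = 0.5; // share of pages that must reach the count above
};

// elementsPerPage[i] is the text element count collected for page i.
// Returns true when enough pages carry enough text elements.
bool looksSearchable(const std::vector<unsigned>& elementsPerPage,
                     const SearchableCriteria& criteria = SearchableCriteria())
{
    assert(criteria.minPageFraction >= 0.0 && criteria.minPageFraction <= 1.0);

    if (elementsPerPage.empty())
        return false; // no pages, nothing to search

    std::size_t pagesWithText = 0;
    for (unsigned count : elementsPerPage)
        if (count >= criteria.minElementsPerPage)
            ++pagesWithText;

    return static_cast<double>(pagesWithText)
           >= criteria.minPageFraction * static_cast<double>(elementsPerPage.size());
}
```

To use it, the loop would push each page's count into a std::vector<unsigned> rather than breaking on the first hit, and then pass the vector to looksSearchable. A file with only a few stray elements on a couple of pages would then still be queued for OCR.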