How to determine if a PDF is searchable
Posted: Tue Aug 12, 2014 7:44 am
by arno.engelbrecht
I have the PDF X-Change PRO SDK, which includes the OCR module. I can OCR documents, but I have a large number of documents: some are image-based and need to be OCR'ed, while others are already searchable and do not. Is there a way with the SDK to determine whether a document is already searchable?
Re: How to determine if a PDF is searchable
Posted: Tue Aug 12, 2014 5:36 pm
by Paul - Tracker Supp
Hi Arno,
Thanks for the post.
I moved it from the End User OCR forum to the SDK OCR forum.
I am not personally sure how to do this and will have one of the development team advise when they have a spare moment.
Regards
Re: How to determine if a PDF is searchable
Posted: Tue Aug 12, 2014 6:01 pm
by Vasyl-Tracker Dev Team
Hi, arno.engelbrecht.
One possible way is to check whether any page contains any text:
PDFDocument hDoc;
// open document...
DWORD pagesNum = 0;
PXCp_GetPagesCount(hDoc, &pagesNum);
// check for existing text
PXCp_ET_Prepare(hDoc);
bool isSearchable = false;
for (DWORD i = 0; i < pagesNum; i++)
{
    // analyze page i, then ask how many text elements were found on it
    PXCp_ET_AnalyzePageContent(hDoc, i);
    DWORD textElementsNum = 0;
    PXCp_ET_GetElementCount(hDoc, &textElementsNum);
    if (textElementsNum != 0)
    {
        // at least one page has text, so stop looking
        isSearchable = true;
        break;
    }
}
PXCp_ET_Finish(hDoc);
HTH
Re: How to determine if a PDF is searchable
Posted: Mon Aug 18, 2014 7:33 am
by arno.engelbrecht
Hi
Thanks a lot. Can I assume that a document containing any text at all is already searchable, or should I require a minimum amount of text? Basically, I just want to make sure that a few random characters in a file that isn't actually searchable don't cause it to be skipped.
Re: How to determine if a PDF is searchable
Posted: Tue Aug 19, 2014 9:56 pm
by John - Tracker Supp
Well, that would be down to you: analyse what's returned and decide whether it's usable or not ...
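As a rough illustration of that kind of analysis, here is a minimal sketch of a threshold check. It assumes you have already collected the text element count for every page (for example, by storing each value returned by PXCp_ET_GetElementCount instead of stopping at the first non-zero one). The helper name looksSearchable and both thresholds are hypothetical and not part of the SDK; the right values depend on your documents.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the PDF-XChange SDK.
// elementsPerPage:    text element count gathered for each page.
// minElementsPerPage: a page only counts as "has text" at or above this count.
// minPageFraction:    share of pages (0.0 to 1.0) that must have text.
bool looksSearchable(const std::vector<unsigned>& elementsPerPage,
                     unsigned minElementsPerPage,
                     double minPageFraction)
{
    // A document with no pages has nothing to search.
    if (elementsPerPage.empty())
        return false;

    // Count the pages whose text element count reaches the threshold.
    std::size_t pagesWithText = 0;
    for (unsigned count : elementsPerPage)
    {
        if (count >= minElementsPerPage)
            ++pagesWithText;
    }

    const double fraction =
        static_cast<double>(pagesWithText) / elementsPerPage.size();
    return fraction >= minPageFraction;
}
```

With this, a file where only a page or two reports a handful of stray elements would still be sent to OCR, for instance by calling looksSearchable(counts, 20, 0.5). The numbers 20 and 0.5 are placeholders to tune against a sample of your own files.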