SDK to extract text, then search

Posted: Tue Dec 03, 2013 9:59 pm
by Archie
We write software for the Logistics industry.
We have an "Edoc" system that keeps shipping documents as PDFs.
Our customers are asking for the ability to search the recently received PDFs (say 3,000) for things like a company name.
I am thinking about using the PDF-XChange Viewer SDK to extract the text automatically in the background, with no user input, as each "Edoc" PDF is created.
I expect to create, for each Edoc PDF, a pdf.txt file containing the text extracted from the PDF.

After that I need to write a program that searches the pdf.txt files for the desired string, e.g. a company name.
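
The search step described above can be sketched in a few lines of Python. This is only an illustration, not part of any Tracker SDK: it assumes each extracted text file sits next to its PDF and is named `<name>.pdf.txt`, that the files are UTF-8, and that a case-insensitive substring match is good enough. The function name `find_documents` is made up for this example.

```python
from pathlib import Path


def find_documents(folder, needle):
    """Return the names of *.pdf.txt files in folder whose text contains needle.

    Assumption: one text file per PDF, named <name>.pdf.txt (hypothetical layout).
    The match is a case-insensitive substring test.
    """
    needle = needle.casefold()
    hits = []
    for txt_file in sorted(Path(folder).glob("*.pdf.txt")):
        # errors="replace" keeps the scan going if OCR output contains odd bytes
        text = txt_file.read_text(encoding="utf-8", errors="replace")
        if needle in text.casefold():
            hits.append(txt_file.name)
    return hits
```

The only user input is the search string; the folder would be fixed by the application. For a few thousand small text files a plain scan like this is probably fast enough, so no index is built.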

Question:
Does anyone know of an ActiveX module that I can use to do the search with as little user input as possible, just the string to search for?

Thanks
Archie

Re: SDK to extract text, then search

Posted: Tue Dec 03, 2013 10:13 pm
by Paul - Tracker Supp
Hi Archie,

Thanks for the post and welcome to the Tracker Forums.

I think you should look at the PDF-Tools SDK: https://www.pdf-xchange.com/product/pdf-tools-sdk Assuming your PDFs are being created as text-based PDFs and not image-based ones, you should be able to search for the strings directly in the PDF, without the txt file in between.

If it's image based then you'd need to OCR the PDF first then search the text.

All the SDKs are fully functional, even in 'Trial Mode', and you can test every aspect of your program before committing to a purchase. The caveat is that, until licensed, anything you do with the SDK will result in watermarked PDFs. Once you are happy that you have the right solution, simply purchase a license, inject the serial keys and dev code we give you into your source code, recompile and go...

I hope that helps. Do be sure to let us know if you have further questions.

regards

Re: SDK to extract text, then search

Posted: Tue Dec 03, 2013 10:33 pm
by Archie
Hi Paul

Thanks for the quick reply.
Some PDFs come from forms that have text, but many come from faxes attached to emails and are image based.
For those I will need the OCR stuff.

I presume your OCR stuff will allow me to create a pdf.txt file with the OCR produced text and that you have ActiveX modules that my programs can call to do it.

What I am looking for guidance on is the next step: an application program that asks the user for a search string and then searches all the pdf.txt files for it.

Question:
Does anyone know of an ActiveX module that I can use to do the search with as little user input as possible, just the string to search for?

Thanks

Re: SDK to extract text, then search

Posted: Wed Dec 04, 2013 4:52 pm
by John - Tracker Supp
Hi Archie,

Topic moved to the correct forum ...

Walter (our OCR specialist) will reply shortly regarding your question. As for licensing, you will need a PDF-XChange PRO SDK (not the PDF-Tools SDK as advised by Paul) to gain access to the Live OCR SDK functions.

HTH

Re: SDK to extract text, then search

Posted: Wed Dec 04, 2013 5:21 pm
by Walter-Tracker Supp
You can do this with the Pro Tools SDK, but it is not ActiveX; it is a native C++ DLL with a flat C-style API. We have functions to extract existing text, and an OCR component that lets you perform OCR and either create a searchable PDF or extract the text, which you can save to a text file if you wish.

We have wrappers for .NET and a few other languages, so you aren't restricted to C++, but it is not an ActiveX component.

We do have an ActiveX viewer component, but this is typically used to provide customized viewing (and annotating, etc.) capabilities within a custom application. You can't use it to automate text extraction.

-Walter

Re: SDK to extract text, then search

Posted: Wed Dec 04, 2013 10:30 pm
by Archie
We develop in a language called Visual DataFlex (VDF), which easily supports using ActiveX controls.

I went to the VDF forum and asked if your stuff would be usable.
The best reply I got was the following:
"Ask them how their exported functions are declared. If they use __STDCALL then you're all good. Each of their functions becomes an External_function statement in VDF."

My question, then, is the above. Are the exported functions declared using __STDCALL?

Thanks

Re: SDK to extract text, then search

Posted: Wed Dec 04, 2013 11:17 pm
by Walter-Tracker Supp
Yes, we use the __stdcall calling convention. You do not need to purchase the product to try it. There are some limitations (e.g. watermarks if you create documents, limits on the number of pages you can OCR, etc.), but you can try every feature out without purchasing a license.

-Walter

Re: SDK to extract text, then search

Posted: Wed Dec 04, 2013 11:22 pm
by Archie
Good news Walter.
Thanks for the info and the quick response.

Re: SDK to extract text, then search

Posted: Thu Dec 05, 2013 3:23 am
by John - Tracker Supp
Thanks Archie - do come back if you need any further info.

Re: SDK to extract text, then search

Posted: Tue Sep 15, 2015 10:13 am
by Peter2
This posting is nearly 2 years old, but I'm looking for a similar solution. So my question is:

Has something important changed since 2013? New features, new tools out-of-the-box, new SDKs??

Peter

Re: SDK to extract text, then search

Posted: Tue Sep 15, 2015 2:42 pm
by Will - Tracker Supp
Hi Peter,

Lots has changed since 2013, but nothing too dramatic in terms of the OCR's overall functionality. What, specifically, are you looking for?

Thanks,

Re: SDK to extract text, then search

Posted: Tue Sep 15, 2015 3:05 pm
by Peter2
Hi Will

This is the main thread, which contains the side discussion "Where is menu "text Props"?":
https://forum.pdf-xchange.com/ ... 62&t=24215

The (half-baked) idea is this. I'm asking because scanned drawings need:
- OCR
- finding the position (coordinates) of the new strings
- "transform" (in a way which needs to be found ...) the content and the position of the strings to a vector-drawing.

This is why I'm thinking about "which text is where".
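
One possible shape for the "position" part of that idea is sketched below in Python. It is not tied to any Tracker SDK: the `OcrWord` record and the function `to_drawing_mm` are hypothetical, and the sketch assumes the OCR step reports each string's position in pixels measured from the top-left corner of the scan, while the vector drawing wants millimetres measured from the bottom-left corner.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """Hypothetical OCR result: a string and its position on the scanned image."""
    text: str
    x_px: float  # left edge, pixels from the left of the scan
    y_px: float  # baseline, pixels from the TOP of the scan


def to_drawing_mm(word, dpi, image_height_px):
    """Map a pixel position (origin top-left) to millimetres (origin bottom-left)."""
    mm_per_px = 25.4 / dpi  # one inch is 25.4 mm
    x_mm = word.x_px * mm_per_px
    y_mm = (image_height_px - word.y_px) * mm_per_px  # flip the y axis
    return word.text, round(x_mm, 2), round(y_mm, 2)


# Example call: a label on a 300 dpi scan that is 3508 pixels tall
label = to_drawing_mm(OcrWord("TITLE", 300, 3208), dpi=300, image_height_px=3508)
```

The resulting (text, x, y) triples could then be written out as text entities in whatever vector format is chosen; rotation and text height are ignored here.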

Re: SDK to extract text, then search

Posted: Tue Sep 15, 2015 10:27 pm
by Ivan - Tracker Software
You can use the Editor SDK or the Core API SDK to retrieve the text from a page and search within it.

Please take a look at the IPXC_PageText interface: https://sdkhelp.pdf-xchange.com/vie ... C_PageText