SDK to extract text, then search
We write software for the Logistics industry.
We have an "Edoc" system that keeps shipping documents as PDFs.
Our customers are asking for the ability to search the recently received PDFs (say 3,000) for things like a company name.
I am thinking about using the PDF-XChange Viewer SDK to do that automatically in the background with no user input as the "Edoc" pdf is created.
I expect to create, for each Edoc PDF, a pdf.txt file containing the text extracted from the .pdf.
After that I need to write a program that searches the pdf.txt files for the desired string, e.g. a company name.
Question:
Does anyone know of an ActiveX module that I can use to do the search with as little user input as possible, just the string to search for?
Thanks
Archie
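The search step described in this post (scan every pdf.txt file for a string) does not strictly need a dedicated component. Below is a minimal sketch in Python using only the standard library. It assumes the extracted text already sits in files named `*.pdf.txt` in one folder; the function name `search_text_files` and the folder layout are illustrative assumptions, not part of any Tracker SDK.

```python
from pathlib import Path


def search_text_files(folder, needle, pattern="*.pdf.txt"):
    """Return (file name, line number, line) for each line containing needle.

    The match is case-insensitive. `folder` and `pattern` are assumptions
    about how the extracted text files are stored.
    """
    needle = needle.casefold()
    hits = []
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" keeps the scan going if OCR output has odd bytes
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if needle in line.casefold():
                    hits.append((path.name, line_no, line.strip()))
    return hits
```

A call such as `search_text_files("edocs", "acme freight")` would list every matching line together with the file it came from. For a few thousand small text files a plain scan like this is usually quick enough; a full-text index only becomes worth considering at much larger volumes.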
- Paul - Tracker Supp
- Site Admin
- Posts: 6844
- Joined: Wed Mar 25, 2009 10:37 pm
- Location: Chemainus, Canada
Re: SDK to extract text, then search
Hi Archie,
thanks for the post and welcome to the Tracker Forums.
I think you should look at the PDF-Tools SDK (https://www.pdf-xchange.com/product/pdf-tools-sdk). Assuming your PDFs are being created as text-based PDFs and not image-based ones, you should be able to search for the strings directly in the PDF, without the txt file in between.
If they are image-based, you'd need to OCR the PDF first and then search the text.
All the SDKs are fully functional, even in 'Trial Mode', and you can test every aspect of your program before committing to a purchase. The caveat is that, until licensed, anything you do with the SDK will result in watermarked PDFs. Once you are happy that you have the right solution, simply purchase a license, insert the serial keys and dev code we give you into your source code, recompile and go.
I hope that helps. Do be sure to let us know if you have further questions.
regards
Best regards
Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
Re: SDK to extract text, then search
Hi Paul
Thanks for the quick reply.
Some PDFs come from forms that have text, but lots come from faxes attached to emails, and those are image-based.
For those I will need the OCR stuff.
I presume your OCR stuff will allow me to create a pdf.txt file with the OCR-produced text, and that you have ActiveX modules that my programs can call to do it.
The question I am looking for guidance on relates to the next step, where I build an application program that asks the user for a search string and then searches all the pdf.txt files for it.
Question:
Does anyone know of an ActiveX module that I can use to do the search with as little user input as possible, just the string to search for?
Thanks
- John - Tracker Supp
- Site Admin
- Posts: 5219
- Joined: Tue Jun 29, 2004 10:34 am
- Location: United Kingdom
Re: SDK to extract text, then search
Hi Archie,
Topic moved to the correct forum ...
Walter (our OCR specialist) will reply shortly regarding your question. As regards licensing, you will need the PDF-XChange PRO SDK (not the PDF-Tools SDK, as advised by Paul) to gain access to the Live OCR SDK functions.
HTH
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Best regards
Tracker Support
http://www.tracker-software.com
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: SDK to extract text, then search
You can do this with the Pro Tools SDK, but it is not ActiveX; it is a native C++ DLL with a flat C-style API. We have functions to extract existing text, and an OCR component that lets you perform OCR and either create a searchable PDF or extract the text, which you can save to a text file if you wish.
We have wrappers for .NET and a few other languages, so you aren't restricted to C++, but it is not an ActiveX component.
We do have an ActiveX viewer component, but this is typically used to provide customized viewing (and annotating, etc.) capabilities within a custom application. You can't use it to automate text extraction.
-Walter
Re: SDK to extract text, then search
We develop in a language called Visual DataFlex (VDF), which easily supports using ActiveX controls.
I went to the VDF forum and asked if your stuff would be usable.
The best reply I got was the following:
"Ask them how their exported functions are declared. If they use __STDCALL then you're all good. Each of their functions becomes an External_function statement in VDF."
My question, then, is the above. Are the exported functions declared using __STDCALL?
Thanks
-
- User
- Posts: 381
- Joined: Mon Jun 13, 2011 5:10 pm
Re: SDK to extract text, then search
Yes, we use the __stdcall calling convention. You do not need to purchase the product to try it. There are some limitations (e.g. watermarks if you create documents, and limits on the number of pages you can OCR), but you can try out every feature without purchasing a license.
-Walter
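For readers wondering what a `__stdcall` export means from the caller's side: the caller only has to load the DLL with a loader that assumes that calling convention and declare each function's argument and return types. The sketch below is not the Tracker SDK. It calls `GetTickCount` from the standard Windows `kernel32` library through Python's `ctypes.WinDLL`, which loads a DLL using the stdcall convention. The helper name `tick_count_via_stdcall` is made up for illustration, and on non-Windows systems it simply returns `None`.

```python
import ctypes
import sys


def tick_count_via_stdcall():
    """Call kernel32's GetTickCount, a __stdcall export, via ctypes.WinDLL.

    Returns milliseconds since system start on Windows, or None elsewhere,
    because ctypes.WinDLL only exists on Windows.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32")  # WinDLL assumes the stdcall convention
    kernel32.GetTickCount.argtypes = []   # the function takes no arguments
    kernel32.GetTickCount.restype = ctypes.c_uint32  # DWORD return value
    return kernel32.GetTickCount()
```

Declaring a flat C-style export in another language (such as an External_function statement in VDF) follows the same pattern: name the DLL, name the function, and state its parameter and return types.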
Re: SDK to extract text, then search
Good news Walter.
Thanks for the info and the quick response.
- John - Tracker Supp
- Site Admin
- Posts: 5219
- Joined: Tue Jun 29, 2004 10:34 am
- Location: United Kingdom
Re: SDK to extract text, then search
Thanks Archie - do come back if you need any further info.
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Best regards
Tracker Support
http://www.tracker-software.com
Re: SDK to extract text, then search
This thread is nearly 2 years old, but I'm looking for a similar solution. So my question is:
Has anything important changed since 2013? New features, new out-of-the-box tools, new SDKs?
Peter
PDF-X-Change Pro German
- Will - Tracker Supp
- Site Admin
- Posts: 6815
- Joined: Mon Oct 15, 2012 9:21 pm
- Location: London, UK
Re: SDK to extract text, then search
Hi Peter,
Lots has changed since 2013, but nothing too dramatic in terms of the OCR's overall functionality. What, specifically, are you looking for?
Thanks,
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.
Best regards
Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
Re: SDK to extract text, then search
Hi Will
This is the main thread, with the side discussion "Where is menu "text Props"?":
https://forum.pdf-xchange.com/ ... 62&t=24215
The (half-baked) idea is:
The reason I'm asking is that scanned drawings need:
- OCR
- finding the position (coordinates) of the new strings
- "transforming" (in a way which still needs to be found ...) the content and the position of the strings into a vector drawing.
This is why I'm thinking about "which text is where".
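To make the "which text is where" idea concrete, here is a minimal sketch in Python of one way to hold recognized strings together with their coordinates and look them up. Everything in it is hypothetical: the `TextRun` record, its fields, and the dictionary shape returned by `to_vector_label` are illustrative assumptions and do not mirror any PDF-XChange API or any particular vector-drawing format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """One recognized string and its bounding box (hypothetical layout)."""
    text: str
    x: float       # left edge, in page units
    y: float       # baseline or bottom edge, in page units
    width: float
    height: float


def find_runs(runs, needle):
    """Return the runs whose text contains needle, ignoring case."""
    needle = needle.casefold()
    return [run for run in runs if needle in run.text.casefold()]


def to_vector_label(run):
    """Map a run to a simple text entity for a vector drawing (made-up shape)."""
    return {
        "type": "text",
        "value": run.text,
        "insert": (run.x, run.y),
        "height": run.height,
    }
```

With the runs for one page in a list, `find_runs(runs, "valve")` returns the matching strings along with their positions, and `to_vector_label` shows how each one could be handed on to whatever drawing format is eventually chosen.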
PDF-X-Change Pro German
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
- Contact:
Re: SDK to extract text, then search
You can use the Editor SDK or the Core API SDK to retrieve the text from a page and search in it.
Please take a look at the IPXC_PageText interface: https://sdkhelp.pdf-xchange.com/vie ... C_PageText
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.