Hello,
We want to analyze the text of PDFs that are generated by scanner software (Fujitsu ScanSnap).
We need to extract address data, telephone numbers and so on. The scanner software performs OCR and adds the text information to the PDF. The scanned documents don't have a fixed format, so we need all available information to get better analysis results.
With "GetAllText" in the PDF ActiveX Viewer it is possible to get all text from a PDF. But this function strips additional information such as positions (Tm operators), which I can see if I analyze the PDF with other software.
Example:
10.0551 0 0 9.7 140.6849 380.3401 Tm
(Informationen)Tj
Is there a way to extract this information with the ActiveX Viewer control?
Greetings
Hans-Peter
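For readers unfamiliar with the example above: in a PDF content stream, `a b c d e f Tm` sets the text matrix, where `e` and `f` are the translation (the x/y origin of the following text, before the page's current transformation matrix is applied) and `a` and `d` are the horizontal and vertical scale. A minimal Python sketch of reading such pairs from an already decoded content stream follows. It is not part of any Tracker SDK; the function name `extract_positioned_text` is hypothetical, and it assumes uncompressed input, simple literal strings without escapes or nested parentheses, and a `Tj` directly after each `Tm` (no `TJ` arrays, no font encoding handling).

```python
import re

# One PDF number: optional sign, digits with optional fraction, or a bare fraction.
NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"

# Six numbers, the Tm operator, then a literal string shown with Tj.
# Assumption: the string contains no parentheses or backslash escapes.
TM_TJ = re.compile(
    r"(" + r"\s+".join([NUM] * 6) + r")\s+Tm\s*\(([^()\\]*)\)\s*Tj"
)


def extract_positioned_text(content):
    """Return one dict per 'Tm ... (text)Tj' pair found in the stream text."""
    results = []
    for match in TM_TJ.finditer(content):
        a, b, c, d, e, f = (float(v) for v in match.group(1).split())
        results.append({
            "text": match.group(2),
            "x": e,          # horizontal translation from the text matrix
            "y": f,          # vertical translation from the text matrix
            "scale_x": a,    # horizontal scale (often carries the font size)
            "scale_y": d,    # vertical scale
        })
    return results


# The fragment quoted in the post above.
sample = "10.0551 0 0 9.7 140.6849 380.3401 Tm\n(Informationen)Tj"
for item in extract_positioned_text(sample):
    print(item)
```

Real scanner output usually stores content streams compressed and may encode text through font-specific mappings, so a sketch like this only illustrates what the positional data means; it is not a substitute for a proper extraction API.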
Extracting text-layer with additional informations (OCR)
-
- User
- Posts: 49
- Joined: Tue Dec 28, 2004 9:49 am
- Tracker Supp-Stefan
- Site Admin
- Posts: 17906
- Joined: Mon Jan 12, 2009 8:07 am
- Location: London
- Contact:
Re: Extracting text-layer with additional informations (OCR)
Hi Hans-Peter,
I don't believe this is possible with the Viewer AX SDK, but it certainly is with the Tools SDK One:
Text Extraction
Regards,
Stefan
-
- User
- Posts: 158
- Joined: Mon Aug 04, 2014 7:34 pm
Re: Extracting text-layer with additional informations (OCR)
Sorry to resurrect an old thread, but I recently began text extraction with the PDF Tools library. It seems to be working well; however, I have a question about the positional data as well.
Is it possible to get the relative positions of the text extracted? Right now it is just a flat string that is dumped out.
EDIT: The link above didn't seem to take me to anything helping with positional information of extracted text, and the example included with the SDK is just a flat file as well.
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
- Contact:
Re: Extracting text-layer with additional informations (OCR)
If you are using the PDF Tools library, you need to use the PXCp_ET_xxx functions: https://help.pdf-xchange.com/DEV/de ... getelement
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.