Extracting the text layer with additional information (OCR)

PDF-XChange Viewer SDK for Developer's
(ActiveX and Simple DLL Versions)


Post by prosozial_schmitt » Wed Nov 27, 2013 6:33 am

Hello,

We want to analyze the text of PDFs that are generated by scanner software (Fujitsu ScanSnap).
We need to extract address data, telephone numbers, and so on. The scanner software performs OCR and adds the text information to the PDF. The scanned documents don't have a fixed format, so we need all of the available information to get better analysis results.

With "GetAllText" in the PDF ActiveX Viewer it is possible to get all text from a PDF. But this function strips additional information such as positions (the Tm operators), which I can see if I analyze the PDF with other software.

Example:
10.0551 0 0 9.7 140.6849 380.3401 Tm
(Informationen)Tj
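
For readers unfamiliar with the operators in this example: in PDF content streams, `Tm` sets the text matrix from six operands `a b c d e f`, where `e` and `f` are the translation (here 140.6849 and 380.3401) and `a` and `d` are the horizontal and vertical scale, and `Tj` shows the string that follows. The Python sketch below is only an illustration of pulling those values out of an already decompressed content stream; it is not part of any Tracker SDK. The names `PositionedText` and `extract_positioned_text` are made up for this example, and the sketch assumes the simple "Tm followed directly by a literal-string Tj" pattern shown above. It ignores `Td`, `TJ` arrays, hex strings, font size, and the current transformation matrix, so the coordinates are in text space, not necessarily final page coordinates.

```python
import re
from typing import List, NamedTuple


class PositionedText(NamedTuple):
    """Hypothetical container for one Tm/Tj pair (not an SDK type)."""
    x: float        # operand e of Tm: horizontal translation
    y: float        # operand f of Tm: vertical translation
    x_scale: float  # operand a of Tm
    y_scale: float  # operand d of Tm
    text: str


# A PDF number: optional sign, digits with optional fraction, or a bare fraction.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"

# Six numbers, the Tm operator, then a literal string "(...)" and the Tj operator.
# The string part accepts backslash escapes but not nested unescaped parentheses.
_TM_TJ = re.compile(
    r"\s+".join(f"({_NUM})" for _ in range(6))
    + r"\s+Tm\s*\(((?:\\.|[^\\()])*)\)\s*Tj"
)


def extract_positioned_text(content: str) -> List[PositionedText]:
    """Return every 'a b c d e f Tm (text) Tj' pair found in a decoded stream."""
    results = []
    for match in _TM_TJ.finditer(content):
        a, _b, _c, d, e, f = (float(match.group(i)) for i in range(1, 7))
        # Undo only the three simplest escapes: \( \) and \\
        text = re.sub(r"\\([()\\])", r"\1", match.group(7))
        results.append(PositionedText(e, f, a, d, text))
    return results


if __name__ == "__main__":
    sample = "BT\n10.0551 0 0 9.7 140.6849 380.3401 Tm\n(Informationen)Tj\nET"
    for item in extract_positioned_text(sample):
        print(item.text, item.x, item.y)
```

A regular expression is enough for this fixed pattern, but a real content stream needs a proper tokenizer that tracks the graphics and text state.
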

Is there a way to extract this information with the ActiveX Viewer control?


Greetings
Hans-Peter


Post by Tracker Supp-Stefan » Wed Nov 27, 2013 11:41 am

Hi Hans-Peter,

I don't believe this is possible with the Viewer AX SDK, but it certainly is with the Tools SDK. See its Text Extraction documentation.

Regards,
Stefan


Post by DolphinMann » Fri Nov 14, 2014 6:26 pm

Sorry to resurrect an old thread, but I recently began doing text extraction with the PDF Tools library. It seems to be working well, but I also have a question about the positional data.

Is it possible to get the relative positions of the text extracted? Right now it is just a flat string that is dumped out.

EDIT: The link above didn't seem to take me to anything helping with positional information of extracted text, and the example included with the SDK is just a flat file as well.


Post by Ivan - Tracker Software » Fri Nov 14, 2014 6:51 pm

If you are using the PDF Tools library, you need to use the PXCp_ET_xxx functions: http://help.tracker-software.com/DEV/de ... getelement
Tracker Software (Project Director)
