Extracting the text layer with additional information (OCR)

Post by **prosozial_schmitt** » Wed Nov 27, 2013 6:33 am

Hello,

We want to analyze the text of PDFs generated by scanner software (Fujitsu ScanSnap).
We need to extract address data, telephone numbers and so on. The scanner software performs OCR and adds the recognized text to the PDF. The scanned documents don't follow a fixed layout, so we need all the information available to get better analysis results.

With "GetAllText" in the PDF ActiveX Viewer it is possible to get all the text from a PDF. But this function strips additional information such as positions (the Tm operators), which I can see when I inspect the PDF with other software.

Example:
10.0551 0 0 9.7 140.6849 380.3401 Tm
(Informationen)Tj

Is there a way to extract this information with the ActiveX Viewer control?

Greetings
Hans-Peter
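
For readers unfamiliar with the operators quoted in the example above: in a PDF content stream, `a b c d e f Tm` sets the text matrix, where `e` and `f` are the translation (the x and y origin of the text) and `a` and `d` are the horizontal and vertical scale; `(…)Tj` then shows a string at that position. The following is only a minimal Python sketch of how such a pair could be read from an already decoded content stream. It does not use the Viewer or Tools SDK, and the function name `extract_positioned_text` is made up for illustration. It assumes the stream is uncompressed, that text uses literal strings with `Tj`, and that no other transformation (CTM, `Td`, `TJ` arrays, font encodings) needs to be taken into account, which is often not true for real files.

```python
import re

# Matches "a b c d e f Tm" directly followed by a "(text)Tj" show operator.
# Assumption: literal strings only; hex strings and TJ arrays are not handled.
TM_TJ = re.compile(
    r"(?P<matrix>(?:[-+]?\d*\.?\d+\s+){6})Tm\s*"
    r"\((?P<text>(?:\\.|[^\\)])*)\)\s*Tj"
)


def extract_positioned_text(content: str):
    """Return each Tm/Tj pair as a dict with the text and its matrix values."""
    results = []
    for match in TM_TJ.finditer(content):
        a, b, c, d, e, f = (float(v) for v in match.group("matrix").split())
        results.append(
            {
                "text": match.group("text"),
                "x": e,        # horizontal translation of the text matrix
                "y": f,        # vertical translation of the text matrix
                "scale_x": a,  # horizontal scale
                "scale_y": d,  # vertical scale
            }
        )
    return results


# The fragment quoted in the post above.
sample = "10.0551 0 0 9.7 140.6849 380.3401 Tm\n(Informationen)Tj"
items = extract_positioned_text(sample)
for item in items:
    print(item)
```

For the quoted fragment this yields the string "Informationen" with its origin at x = 140.6849, y = 380.3401 in text-matrix units. A production solution would rely on a proper PDF library rather than a regular expression.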

Wed Nov 27, 2013 11:41 am

Hi Hans-Peter,

I don't believe this is possible with the Viewer AX SDK, but it certainly is with the Tools SDK. See its documentation section:
Text Extraction

Regards,
Stefan

Post by **DolphinMann** » Fri Nov 14, 2014 6:26 pm

Sorry to resurrect an old thread, but I recently began doing text extraction with the PDF Tools library. It seems to be working well; however, I also have a question about the positional data.

Is it possible to get the relative positions of the text extracted? Right now it is just a flat string that is dumped out.

EDIT: The link above didn't seem to lead to anything that helps with positional information for extracted text, and the example included with the SDK just produces flat text as well.

Fri Nov 14, 2014 6:51 pm

If you are using the PDF Tools library, you need to use the PXCp_ET_xxx functions: https://help.pdf-xchange.com/DEV/de ... getelement
