Extract Text from PDF

A forum for questions or concerns related to the PDF-XChange Core API SDK

Post by ddinnebeil » Fri Aug 17, 2018 2:47 pm

I am evaluating your Core API SDK and have downloaded the CoreAPIDemo from GitHub, which provides much useful insight and a framework within which I can do my evaluation. My primary interest is extracting the entire text from a PDF, paragraph by paragraph, because I will be doing further processing on the text of each paragraph. Can you point me toward useful resources or specific API calls to accomplish this? Thank you.

Re: Extract Text from PDF

Post by Sasha - Tracker Dev Team » Sat Aug 18, 2018 5:47 am

Hello ddinnebeil,

Please check out the "9.3. Convert from PDF to txt file" sample. It writes the text to a txt file laid out visually, the way you see it on screen.
You can also obtain the individual characters from the IPXC_PageText, using the per-line information, for example:

Text.GetChars(textsLineInfo[i].nFirstCharIndex, textsLineInfo[i].nCharsCount)
From this you can see where each end-of-line character falls, so you can build paragraphs and implement your own logic.

Cheers,
Alex
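
As a rough illustration of the "build paragraphs" step described above, here is a minimal sketch in Python. It does not call the SDK. It assumes the page's characters have already been read into a plain string and that each line is described by a first-character index and a character count, mirroring the nFirstCharIndex and nCharsCount fields in the snippet. The LineInfo class, both function names, and the rule that an empty line ends a paragraph are all assumptions made for illustration. Real PDFs may need a different rule, such as vertical gaps or indentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineInfo:
    # Hypothetical stand-in for one entry of textsLineInfo; only the two
    # fields named in the post are modelled here.
    first_char_index: int
    chars_count: int


def line_texts(all_chars: str, lines: List[LineInfo]) -> List[str]:
    """Slice each line's characters out of the page's full character run."""
    return [
        all_chars[li.first_char_index:li.first_char_index + li.chars_count]
        for li in lines
    ]


def build_paragraphs(lines: List[str]) -> List[str]:
    """Group lines into paragraphs.

    Assumed heuristic: an empty or whitespace-only line closes the current
    paragraph; the remaining lines are joined with single spaces.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # Blank line: flush whatever has been collected so far.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

In practice the string and the line records would come from the SDK's page text object. Only the grouping rule in build_paragraphs would need to change to suit the layout of a given document.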