
Extract Text from PDF

Posted: Fri Aug 17, 2018 2:47 pm
by ddinnebeil
I am evaluating your CoreAPI SDK and have downloaded the CoreAPIDemo from GitHub, which provides much useful insight and a framework within which I can do my evaluation. My primary interest is extracting the entire text from a PDF, paragraph by paragraph, because I will be doing further processing on the text of each paragraph. Can you point me toward useful resources or specific API calls to accomplish this? Thank you.

Re: Extract Text from PDF

Posted: Sat Aug 18, 2018 5:47 am
by Sasha - Tracker Dev Team
Hello ddinnebeil,

Please check out the "9.3. Convert from PDF to txt file" sample. It writes the text to a txt file laid out visually, the way you see it on screen.
You can also obtain the individual characters of each line from the IPXC_PageText, using the line information in a call such as:

Code:

Text.GetChars(textsLineInfo[i].nFirstCharIndex, textsLineInfo[i].nCharsCount)
From the returned characters you can see where the end-of-line character falls, so you can build paragraphs by implementing your own logic.
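
As a rough illustration of that kind of paragraph logic, here is a minimal sketch in Python that does not depend on the SDK. It assumes you have already copied each line's text and vertical position into your own structure. The `Line` class, its fields, and the `gap_factor` threshold are hypothetical names made up for this example and are not part of the Core API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical container for one extracted line (not an SDK type)."""
    text: str      # characters of the line, e.g. collected from the page text
    top: float     # assumed: y-coordinate of the line's top, growing downward
    height: float  # assumed: height of the line in the same units


def group_paragraphs(lines: List[Line], gap_factor: float = 0.6) -> List[str]:
    """Join consecutive lines into paragraphs.

    A new paragraph starts when a line is blank, or when the vertical gap
    to the previous line exceeds gap_factor * previous line height.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    prev: Optional[Line] = None

    for line in lines:
        stripped = line.text.strip()
        if not stripped:
            # A blank line closes the paragraph being built.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            prev = None
            continue

        if prev is not None and current:
            gap = line.top - (prev.top + prev.height)
            if gap > gap_factor * prev.height:
                paragraphs.append(" ".join(current))
                current = []

        current.append(stripped)
        prev = line

    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    Line("First line", 0.0, 10.0),
    Line("continues here.", 12.0, 10.0),
    Line("Second paragraph.", 40.0, 10.0),
]
result = group_paragraphs(sample)
```

The gap threshold is only a heuristic. Documents with indented first lines or multiple columns would likely need additional rules.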

Cheers,
Alex