Extract Text from PDF
Posted: Fri Aug 17, 2018 2:47 pm
by ddinnebeil
I am evaluating your CoreAPI SDK and have downloaded the CoreAPIDemo from GitHub, which provides much useful insight and a framework within which I can do my evaluation. My primary interest is extracting the entire text of a PDF, paragraph by paragraph, because I will be doing further processing on the text of each paragraph. Can you point me to useful resources or specific API calls to accomplish this? Thank you.
Re: Extract Text from PDF
Posted: Sat Aug 18, 2018 5:47 am
by Sasha - Tracker Dev Team
Hello ddinnebeil,
Please check out the "9.3. Convert from PDF to txt file" sample. It writes the text to a txt file laid out the way you see it on screen.
You can also obtain the individual characters of each line from the IPXC_PageText, using the line information in a call like this:
Code:
Text.GetChars(textsLineInfo[i].nFirstCharIndex, textsLineInfo[i].nCharsCount)
From those characters you can see where each end-of-line character falls, so you can build paragraphs and implement your own logic.
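As a rough illustration of the "own logic" part, here is a minimal sketch in Python that does not call the SDK at all. It assumes you have already reduced each extracted line to its text plus top and bottom positions measured downward from the top of the page. The Line class, the group_paragraphs function and the gap_factor threshold are hypothetical names chosen for this example, and the gap heuristic is only one possible way to detect paragraph breaks.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted text line (hypothetical shape, not an SDK type)."""
    text: str      # characters of the line, e.g. joined from the GetChars output
    top: float     # distance from the top of the page to the top of the line
    bottom: float  # distance from the top of the page to the bottom of the line


def group_paragraphs(lines, gap_factor=0.6):
    """Group lines (given in reading order) into paragraph strings.

    Assumption: a new paragraph starts when the vertical gap to the
    previous line exceeds gap_factor times the previous line's height.
    """
    paragraphs = []
    current = []
    prev = None
    for line in lines:
        if prev is not None and current:
            gap = line.top - prev.bottom
            height = prev.bottom - prev.top
            if gap > gap_factor * height:
                # Gap is large enough: close the paragraph collected so far.
                paragraphs.append(" ".join(current))
                current = []
        text = line.text.strip()
        if text:
            current.append(text)
        prev = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    Line("First line of", 0, 10),
    Line("paragraph one.", 12, 22),
    Line("Second paragraph.", 40, 50),
]
paragraphs = group_paragraphs(sample)
```

The threshold will likely need tuning per document, and this sketch does not handle indentation-based paragraphs, multiple columns or words hyphenated across lines.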
Cheers,
Alex