Get full text from PDF

Post by Tom Princen » Mon Jul 04, 2016 1:33 pm

I have this code:

IPXC_Document MydocSource = MyPXC.OpenDocumentFromFile(lsSourceFile, clbk);

for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(null);
    Docinfo2 = Docinfo2 + " " + MyPageText.GetChars(0, MyPageText.CharsCount);
}

It gets the text from all of the pages. The only problem is that there are no carriage returns at the end of each line.
How can I solve this?

Post by Sasha - Tracker Dev Team » Mon Jul 04, 2016 1:59 pm

Hello Tom,

The correct way of using the IPXC_PageText in your case would be to read each character separately:
http://sdkhelp.tracker-software.com/vie ... eText_Char
Then look for the TCF_LineBegin char flag to detect where each new line starts:
http://sdkhelp.tracker-software.com/vie ... _CharFlags

Also note that there can be null characters in the text.
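
A minimal sketch of the per-character approach described above, written against plain arrays instead of the SDK so that it can be run on its own. It assumes the character flags form a bit mask, so the line-begin flag is tested with a bitwise AND rather than with ==. The names PageTextSketch, BuildText and LineBeginFlag (and its value 0x1) are hypothetical stand-ins; in real code the characters and flags would come from IPXC_PageText and the flag value from PXC_TextCharFlags.TCF_LineBegin.

```csharp
using System;
using System.Text;

public static class PageTextSketch
{
    // Hypothetical value for this sketch; in real code use
    // (uint)PXC_TextCharFlags.TCF_LineBegin.
    public const uint LineBeginFlag = 0x1;

    // chars[i] and flags[i] describe the i-th character of one page.
    public static string BuildText(char[] chars, uint[] flags)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chars.Length; i++)
        {
            // Bitwise test: other flags may be set on the same character.
            bool startsLine = (flags[i] & LineBeginFlag) != 0;
            if (startsLine && sb.Length > 0)
                sb.Append('\n');

            // Skip embedded null characters.
            if (chars[i] != '\0')
                sb.Append(chars[i]);
        }
        return sb.ToString();
    }
}
```

Appending to a StringBuilder also avoids the repeated string concatenation that a += in a per-character loop would cause on large pages.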

Cheers,
Alex
Join us at Google+:
https://plus.google.com/+PDFXChangeEditorTS
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ

Post by Tom Princen » Mon Jul 04, 2016 2:11 pm

Any sample code?
How could it be so complex to just get a text of a PDF??

Post by Lzcat - Tracker Supp » Mon Jul 04, 2016 3:46 pm

Hi Tom.
"Any sample code?"
You will probably get there faster by writing it yourself. Just add one more loop to get each character and its flags. The first character of each line will have the TCF_LineBegin flag set.
Also, please note that text in a PDF file can contain arbitrary character codes, such as null terminators, carriage returns and so on, anywhere in a line, so you need to filter those out too.
"How could it be so complex to just get a text of a PDF??"
This is due to the nature of PDF: it does not contain text lines as you would expect. You can check the specification yourself.
HTH.
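
The filtering step mentioned above could look roughly like the following. This is only a sketch that works on an already extracted string and does not touch the SDK; the class TextSanitizer, the method StripControlChars and its keepTab option are hypothetical names chosen for illustration. It relies on char.IsControl, which covers null, carriage return, line feed and the other C0/C1 control codes.

```csharp
using System.Text;

public static class TextSanitizer
{
    // Removes control characters (null, CR, LF, ...) from the text of one line.
    // keepTab is an option of this sketch: when true, tab characters are preserved.
    public static string StripControlChars(string line, bool keepTab = true)
    {
        var sb = new StringBuilder(line.Length);
        foreach (char c in line)
        {
            if (c == '\t' && keepTab)
            {
                sb.Append(c);
                continue;
            }
            if (!char.IsControl(c))
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Running this on each line before adding your own line breaks keeps stray carriage returns inside the PDF text from being mistaken for real line ends.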
Victor
Tracker Software
Project manager

Post by Tom Princen » Mon Oct 24, 2016 8:02 pm

Two other questions concerning the GetText method:

1. Is there a way to control the order in which the characters are returned? It seems to be looping from bottom right to top left.
(The IPXC_GetPageTextOptions parameter is not really documented.)

2. Could you help me with the coordinates from get_CharRect? What are the units: pixels, mm?

Post by Sasha - Tracker Dev Team » Tue Oct 25, 2016 11:53 am

Hello Tom,

1. Please provide a piece of your code so that we can analyze it and assist further.
2. The get_CharRect method returns the character rectangle in page points.
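
For converting those values, here is a small sketch that assumes page points are the PDF default user-space unit of 1/72 inch (a page with a custom UserUnit would need an extra scale factor). The class PointUnits and its methods are hypothetical helpers, not part of the SDK.

```csharp
using System;

public static class PointUnits
{
    public const double PointsPerInch = 72.0;
    public const double MillimetersPerInch = 25.4;

    // Converts a length in points to millimeters.
    public static double ToMillimeters(double points)
    {
        return points / PointsPerInch * MillimetersPerInch;
    }

    // Converts a length in points to pixels at the given resolution,
    // e.g. dpi = 96 for a typical screen or 300 for a print raster.
    public static double ToPixels(double points, double dpi)
    {
        return points / PointsPerInch * dpi;
    }
}
```

For example, a rectangle that is 72 points wide is 25.4 mm wide, or 96 pixels wide when rendered at 96 dpi.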

Cheers,
Alex

Post by Tom Princen » Tue Oct 25, 2016 12:27 pm

for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    // Separate pages with a line break.
    if (i > 0)
    {
        Docinfo2 += System.Environment.NewLine;
    }

    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(IPXC_GetPageTextOptions.);

    for (uint j = 0; j < MyPageText.CharsCount; j++)
    {
        // Test the line-begin bit with a bitwise AND, since other flags may be set on the same character.
        bool lineBegin = (MyPageText.get_CharFlags(j) & (uint)PXC_TextCharFlags.TCF_LineBegin) != 0;
        if (lineBegin && j > 0)
        {
            Docinfo2 += System.Environment.NewLine;
        }

        Docinfo2 += MyPageText.GetChars(j, 1);
    }
}

Post by Sasha - Tracker Dev Team » Tue Oct 25, 2016 12:32 pm

Have you tried using the default behavior?

PDFXEdit.IPXC_PageText pText = page.GetText(null, false);

Post by Tom Princen » Tue Oct 25, 2016 12:39 pm

That was the code I was using before, and that is the code that reads the PDF from bottom to top...

I modified the code to check what's inside the parameters.

Post by Sasha - Tracker Dev Team » Tue Oct 25, 2016 1:53 pm

Well, the GetChars method returns the characters in the order in which they were added. It seems the document you are using has such a structure.
Try using these:
http://sdkhelp.tracker-software.com/vie ... locksCount
http://sdkhelp.tracker-software.com/vie ... _BlockInfo
Once you have the TextBlockInfo, you can get the ParaInfo from it:
http://sdkhelp.tracker-software.com/vie ... o_ParaInfo
From the paragraph info you can get information about the lines in the paragraph. Then you can use this method:
http://sdkhelp.tracker-software.com/vie ... t_LineInfo
With the line information, you can get the character indexes and construct the resulting string by using the GetChars method.

Cheers,
Alex

Post by Tom Princen » Tue Oct 25, 2016 9:42 pm

Sorry, but the blocks are also in the wrong order...

Post by Sasha - Tracker Dev Team » Wed Oct 26, 2016 8:59 am

Hello Tom,

We provide the bounding boxes of the paragraphs, lines and characters. My previous post describes how to get them. Judging by the files you are using, you will have to sort the text manually using the provided coordinates. That way you can get the result you need for any file you come across.
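
One possible way to do such a manual sort is sketched below. It is not SDK code: TextLine, ReadingOrder.Sort and rowTolerance are hypothetical names, and Top/Left are assumed to be filled from the rcBBox.top and rcBBox.left values of each line. The sketch also assumes the coordinates follow the PDF default user space, with the origin at the bottom-left corner and y growing upward, so a larger Top means higher on the page. Under that assumption, sorting Top in ascending order would produce a bottom-to-top result. Lines whose tops are within rowTolerance points of the first line in a row are treated as the same row and ordered left to right, which avoids comparing floating-point coordinates for exact equality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class TextLine
{
    public double Top;   // top edge of the line's bounding box, in points
    public double Left;  // left edge of the line's bounding box, in points
    public string Text;  // text of the line
}

public static class ReadingOrder
{
    // Returns the lines top-to-bottom, then left-to-right within a row,
    // assuming y grows upward (larger Top = higher on the page).
    public static List<TextLine> Sort(IEnumerable<TextLine> lines, double rowTolerance)
    {
        var result = new List<TextLine>();
        var row = new List<TextLine>();

        foreach (var line in lines.OrderByDescending(l => l.Top))
        {
            // Start a new row once this line sits clearly below the current row.
            if (row.Count > 0 && row[0].Top - line.Top > rowTolerance)
            {
                result.AddRange(row.OrderBy(l => l.Left));
                row.Clear();
            }
            row.Add(line);
        }

        // Flush the last row.
        result.AddRange(row.OrderBy(l => l.Left));
        return result;
    }
}
```

A tolerance of a few points is a reasonable starting value for body text; multi-column layouts would need an extra step that separates the columns before sorting.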

Cheers,
Alex

Post by Tom Princen » Wed Oct 26, 2016 1:56 pm

Sorry, I tried sorting based on the top and left positions of the blocks, lines, ...
But nothing gives a reasonable result.

(I added the PDF as an attachment.)

                MyLines = new LineOrder[NbLine];
                NbLine = 0;
                for (uint j = 0; j < MyPageText.BlocksCount; j++)
                {
                    for (uint k = 0; k < MyPageText.BlockInfo[j].ParaCount; k++)
                    {
                        for (uint l = 0; l < MyPageText.BlockInfo[j].ParaInfo[k].nLinesCount; l++)
                        {
                            MyLines[NbLine].BlockID = j;
                            MyLines[NbLine].ParaID = k;
                            MyLines[NbLine].LineID = l + MyPageText.BlockInfo[j].ParaInfo[k].nFirstLineIndex;

                            MyLines[NbLine].Top = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.top;
                            MyLines[NbLine].Left = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.left;

                            NbLine++;
                        }
                    }
                }

                // array sort:
                Array.Sort(MyLines, delegate(LineOrder x, LineOrder y)
                {
                    if (x.Top == y.Top) {
                        return x.Left.CompareTo(y.Left);
                    }
                    return x.Top.CompareTo(y.Top); 
                });

                // Output the lines in sorted order, one text row per line.
                for (uint j = 0; j < NbLine; j++)
                {
                    // Fetch the line info once instead of once per field.
                    var lineInfo = MyPageText.get_LineInfo(MyLines[j].LineID);
                    uint FirstChar = lineInfo.nFirstCharIndex;
                    uint CharCount = lineInfo.nCharsCount;

                    Docinfo2 += System.Environment.NewLine;

                    Console.WriteLine("j=" + j + " LineID: " + MyLines[j].LineID + " Top: " + MyLines[j].Top + " Left: " + MyLines[j].Left);

                    for (uint m = 0; m < CharCount; m++)
                    {
                        uint currentChar = m + FirstChar;
                        // Test the line-begin bit with a bitwise AND; skip the first character,
                        // because a line break was already added for this line above.
                        bool lineBegin = (MyPageText.get_CharFlags(currentChar) & (uint)PXC_TextCharFlags.TCF_LineBegin) != 0;
                        if (lineBegin && m > 0)
                        {
                            Docinfo2 += System.Environment.NewLine;
                        }

                        Docinfo2 += MyPageText.GetChars(currentChar, 1);
                    }
                }

Attachment: test.pdf (463.68 KiB)

Post by Sasha - Tracker Dev Team » Wed Oct 26, 2016 2:14 pm

I will experiment with this code and will reply with the results.

Cheers,
Alex