Get full text from PDF

Posted: Mon Jul 04, 2016 1:33 pm
by Tom Princen
I have this code:

IPXC_Document MydocSource = MyPXC.OpenDocumentFromFile(lsSourceFile, clbk);

for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(null);
    Docinfo2 = Docinfo2 + " " + MyPageText.GetChars(0, MyPageText.CharsCount);
}

It gets the text from all of the pages. The only problem is that there are no carriage returns at the end of each line.
How can I solve this?

Re: Get full text from PDF

Posted: Mon Jul 04, 2016 1:59 pm
by Sasha - Tracker Dev Team
Hello Tom,

The correct way of using the IPXC_PageText in your case would be to read each character separately:
http://sdkhelp.tracker-software.com/vie ... eText_Char
Then look for the TCF_LineBegin char flag to detect where each new line starts:
http://sdkhelp.tracker-software.com/vie ... _CharFlags

Also note that there can be null characters in the text.

Cheers,
Alex

Re: Get full text from PDF

Posted: Mon Jul 04, 2016 2:11 pm
by Tom Princen
Any sample code?
How can it be so complex just to get the text of a PDF?

Re: Get full text from PDF

Posted: Mon Jul 04, 2016 3:46 pm
by Lzcat - Tracker Supp
Hi Tom.
Tom Princen wrote: Any sample code?
You may do it faster yourself. Just add one more loop to get each character and its flags. The first character of each line will have the TCF_LineBegin flag set.
Also please note that text in a PDF file can contain arbitrary character codes, such as null terminators, carriage returns and so on, anywhere in a line, so you need to filter them too.
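A minimal sketch of that extra loop. To keep it runnable without the SDK, it assumes the characters and their flags have already been copied out of IPXC_PageText into plain arrays (for example with GetChars(j, 1) and get_CharFlags(j), as used elsewhere in this thread). The class name, the method name and the value of the LineBegin constant are hypothetical stand-ins; real code should use PXC_TextCharFlags.TCF_LineBegin. The flag is tested with a bitwise AND because other flag bits may be set on the same character.

```csharp
using System;
using System.Text;

public static class PageTextSketch
{
    // Hypothetical stand-in value; real code should use
    // (uint)PXC_TextCharFlags.TCF_LineBegin from the SDK.
    public const uint LineBegin = 0x1;

    // Builds a string from per-character data, starting a new line at every
    // character flagged as a line begin and dropping control characters.
    public static string JoinLines(char[] chars, uint[] flags)
    {
        if (chars.Length != flags.Length)
            throw new ArgumentException("chars and flags must have the same length");

        var sb = new StringBuilder();
        for (int j = 0; j < chars.Length; j++)
        {
            // Bitwise test: other flag bits may be set on the same character.
            bool lineBegin = (flags[j] & LineBegin) != 0;
            if (lineBegin && sb.Length > 0)
                sb.Append(Environment.NewLine);

            // Filter nulls, stray carriage returns and other control codes.
            if (!char.IsControl(chars[j]))
                sb.Append(chars[j]);
        }
        return sb.ToString();
    }
}
```

A StringBuilder is used instead of repeated string concatenation, which gets slow on pages with many characters.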
Tom Princen wrote: How could it be so complex to just get a text of a PDF??
This is because of the nature of PDF: it does not contain text lines as you might expect. You can check the specification yourself.
HTH.

Re: Get full text from PDF

Posted: Mon Oct 24, 2016 8:02 pm
by Tom Princen
Two other questions concerning the GetText method:

1. Is there a way to control the order in which the characters are looped over? It seems to loop from bottom right to top left.
(The IPXC_GetPageTextOptions parameter is not really documented.)

2. Could you help me with the coordinates from get_CharRect? What units are they in: pixels, mm?

Re: Get full text from PDF

Posted: Tue Oct 25, 2016 11:53 am
by Sasha - Tracker Dev Team
Hello Tom,

1. Please provide a piece of your code so that we can analyze it and assist further.
2. get_CharRect returns the character rectangle in page points.
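For reference, a point is 1/72 inch (assuming the page uses the default PDF user unit), and an inch is 25.4 mm, so the conversion is a single multiplication. A small sketch; the class and method names are made up for illustration, and the pixel conversion assumes you know the DPI you render at.

```csharp
public static class PointUnits
{
    // 1 pt = 1/72 inch and 1 inch = 25.4 mm.
    public static double ToMillimeters(double points) => points * 25.4 / 72.0;

    // Pixels depend on the rendering resolution (dots per inch).
    public static double ToPixels(double points, double dpi) => points * dpi / 72.0;
}
```

For example, PointUnits.ToPixels(rect.right - rect.left, 96) would give a character width in pixels at 96 DPI, where rect is whatever rectangle get_CharRect returned.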

Cheers,
Alex

Re: Get full text from PDF

Posted: Tue Oct 25, 2016 12:27 pm
by Tom Princen
for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    if (i > 0)
    {
        Docinfo2 += System.Environment.NewLine;
    }

    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(IPXC_GetPageTextOptions.);

    for (uint j = 0; j < MyPageText.CharsCount; j++)
    {
        // Test the flag bit rather than comparing the whole flags value.
        if ((MyPageText.get_CharFlags(j) & (uint)PXC_TextCharFlags.TCF_LineBegin) != 0 && j > 1)
        {
            Docinfo2 += System.Environment.NewLine;
        }

        Docinfo2 += MyPageText.GetChars(j, 1);
    }
}

Re: Get full text from PDF

Posted: Tue Oct 25, 2016 12:32 pm
by Sasha - Tracker Dev Team
Have you tried using the default behavior?

PDFXEdit.IPXC_PageText pText = page.GetText(null, false);

Re: Get full text from PDF

Posted: Tue Oct 25, 2016 12:39 pm
by Tom Princen
That was the code I was using before, and that's the code that reads the PDF from bottom to top...

I modified the code to check what's inside the parameters.

Re: Get full text from PDF

Posted: Tue Oct 25, 2016 1:53 pm
by Sasha - Tracker Dev Team
Well, the GetChars method returns the characters in the order in which they were added to the page. It seems your document stores them in that order.
Try using these:
http://sdkhelp.tracker-software.com/vie ... locksCount
http://sdkhelp.tracker-software.com/vie ... _BlockInfo
Then by having the TextBlockInfo you can get the ParaInfo from it:
http://sdkhelp.tracker-software.com/vie ... o_ParaInfo
From the paragraph info you can get the information about the lines in the paragraph. Then you can use this method:
http://sdkhelp.tracker-software.com/vie ... t_LineInfo
Then having the information about the line, you can get the character indexes from it and construct your resulting string by using the GetChars method.
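The index arithmetic behind those steps can be sketched as follows. This is not SDK code: the tuples are hypothetical plain-data mirrors of the index fields that appear later in this thread (nFirstLineIndex/nLinesCount for a paragraph, nFirstCharIndex/nCharsCount for a line), and Substring stands in for GetChars so that the sketch runs on its own.

```csharp
using System;
using System.Collections.Generic;

public static class LineWalkSketch
{
    // paras: for each paragraph, the page-level index of its first line and
    //        the number of lines it contains.
    // lines: for each page-level line, the index of its first character and
    //        the number of characters it contains.
    public static List<string> ExtractLines(
        string allChars,
        IReadOnlyList<(int FirstLine, int LineCount)> paras,
        IReadOnlyList<(int FirstChar, int CharCount)> lines)
    {
        var result = new List<string>();
        foreach (var para in paras)
        {
            for (int l = 0; l < para.LineCount; l++)
            {
                // A paragraph-relative line number becomes a page-level index.
                var line = lines[para.FirstLine + l];
                // Stand-in for GetChars(firstChar, count) on the page text.
                result.Add(allChars.Substring(line.FirstChar, line.CharCount));
            }
        }
        return result;
    }
}
```

The lines come back in the order the structure lists them, which is not necessarily the visual reading order.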

Cheers,
Alex

Re: Get full text from PDF

Posted: Tue Oct 25, 2016 9:42 pm
by Tom Princen
Sorry but the blocks are also in the wrong order...

Re: Get full text from PDF

Posted: Wed Oct 26, 2016 8:59 am
by Sasha - Tracker Dev Team
Hello Tom,

We provide the bounding boxes of the paragraphs, lines and characters. My previous post describes how to get them. Judging by the files you are using, you will have to sort the text manually using the provided coordinates. That way you can get the result you require for any file you come across.
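A minimal sketch of such a manual sort, under two assumptions that are not confirmed in this thread: that the bounding boxes use PDF page coordinates with the origin at the bottom left (so a larger top value is higher on the page), and that lines whose tops differ by no more than a small tolerance belong to the same visual row. The LineBox type, the method name and the default tolerance are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder for a line index and the top/left of its bounding box.
public sealed class LineBox
{
    public uint LineId;
    public double Top, Left;
    public LineBox(uint lineId, double top, double left)
    {
        LineId = lineId; Top = top; Left = left;
    }
}

public static class ReadingOrderSketch
{
    // Orders lines top-to-bottom, then left-to-right within each row.
    // Assumes the y axis points up, so the highest line has the largest Top.
    public static List<LineBox> ReadingOrder(IEnumerable<LineBox> lines, double rowTolerance = 2.0)
    {
        var byTop = lines.OrderByDescending(b => b.Top).ToList();
        var result = new List<LineBox>();
        var row = new List<LineBox>();
        foreach (var box in byTop)
        {
            // Start a new row once this box sits clearly below the row's first box.
            if (row.Count > 0 && row[0].Top - box.Top > rowTolerance)
            {
                result.AddRange(row.OrderBy(b => b.Left));
                row.Clear();
            }
            row.Add(box);
        }
        result.AddRange(row.OrderBy(b => b.Left));
        return result;
    }
}
```

Rows are grouped in a separate pass rather than inside a tolerance-based comparer, because "equal within a tolerance" is not transitive and can make a sort comparer inconsistent. If the y axis turns out to point down in your data, replace OrderByDescending with OrderBy and flip the subtraction.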

Cheers,
Alex

Re: Get full text from PDF

Posted: Wed Oct 26, 2016 1:56 pm
by Tom Princen
Sorry, I tried sorting based on the top and left positions of the blocks, lines, ...
but nothing gives a reasonable result.

(I added the PDF as attachment.)

                MyLines = new LineOrder[NbLine];
                NbLine = 0;
                for (uint j = 0; j < MyPageText.BlocksCount; j++)
                {
                    for (uint k = 0; k < MyPageText.BlockInfo[j].ParaCount; k++)
                    {
                        for (uint l = 0; l < MyPageText.BlockInfo[j].ParaInfo[k].nLinesCount; l++)
                        {
                            MyLines[NbLine].BlockID = j;
                            MyLines[NbLine].ParaID = k;
                            MyLines[NbLine].LineID = l + MyPageText.BlockInfo[j].ParaInfo[k].nFirstLineIndex;

                            MyLines[NbLine].Top = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.top;
                            MyLines[NbLine].Left = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.left;

                            NbLine++;
                        }
                    }
                }

                // array sort:
                Array.Sort(MyLines, delegate(LineOrder x, LineOrder y)
                {
                    if (x.Top == y.Top) {
                        return x.Left.CompareTo(y.Left);
                    }
                    return x.Top.CompareTo(y.Top); 
                });

                for (uint j = 0; j < NbLine ; j++)
                {
                    uint FirstChar = MyPageText.get_LineInfo(MyLines[j].LineID).nFirstCharIndex;
                    uint CharCount = MyPageText.get_LineInfo(MyLines[j].LineID).nCharsCount;

                    Docinfo2 += System.Environment.NewLine;

                    Console.WriteLine("j=" + j + " LineID: " + MyLines[j].LineID + " Top: " + MyLines[j].Top + " Left: " + MyLines[j].Left);

                    for (uint m = 0; m < CharCount; m++)
                    {
                        uint currentChar = m + FirstChar;
                        if ((MyPageText.get_CharFlags(currentChar) & (uint)PXC_TextCharFlags.TCF_LineBegin) != 0)
                        {
                            Docinfo2 += System.Environment.NewLine;
                        }

                        Docinfo2 += MyPageText.GetChars(currentChar, 1);
                    }
                }

Re: Get full text from PDF

Posted: Wed Oct 26, 2016 2:14 pm
by Sasha - Tracker Dev Team
I will experiment with this code and will reply with the results.

Cheers,
Alex