Get full text from PDF

A forum for questions or concerns related to the PDF-XChange Core API SDK

Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Get full text from PDF

Post by Tom Princen »

I have this code:

IPXC_Document MydocSource = MyPXC.OpenDocumentFromFile(lsSourceFile, clbk);

// Append the text of every page to Docinfo2.
for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(null);
    Docinfo2 = Docinfo2 + " " + MyPageText.GetChars(0, MyPageText.CharsCount);
}

It gets the text from all of the pages. The only problem is that there are no carriage returns at the end of each line.
How can I solve this?
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Re: Get full text from PDF

Post by Sasha - Tracker Dev Team »

Hello Tom,

The correct way to use IPXC_PageText in your case is to read each character separately:
https://sdkhelp.pdf-xchange.com/vie ... eText_Char
Then check for the TCF_LineBegin character flag to detect where each new line starts:
https://sdkhelp.pdf-xchange.com/vie ... _CharFlags

Also note that the text can contain null characters.

Cheers,
Alex
Subscribe at:
https://www.youtube.com/channel/UC-TwAMNi1haxJ1FX3LvB4CQ
Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Re: Get full text from PDF

Post by Tom Princen »

Any sample code?
How could it be so complex to just get a text of a PDF??
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: Get full text from PDF

Post by Lzcat - Tracker Supp »

Hi Tom.
"Any sample code?"
You can probably do it faster yourself. Just add one more loop to get each character and its flags. The first character of each line will have the TCF_LineBegin flag set.
Also please note that text in a PDF file can contain arbitrary character codes, such as null terminators, carriage returns and so on, anywhere in a line, so you need to filter them out too.
"How could it be so complex to just get a text of a PDF??"
This is because of the nature of PDF: it does not contain text lines as you might expect. You can check the specification yourself.
HTH.
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
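
As a rough illustration of the loop described above, here is a minimal, self-contained sketch. It does not call the SDK: the characters and their flags are passed in as plain pairs, and the `LineBegin` constant is a hypothetical stand-in for `(uint)PXC_TextCharFlags.TCF_LineBegin`, whose real value should be taken from the SDK. It also assumes the flags form a bit mask, so the bit is tested rather than compared with `==`.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextSketch
{
    // Hypothetical stand-in value; in real code use
    // (uint)PXC_TextCharFlags.TCF_LineBegin from the SDK.
    public const uint LineBegin = 0x1;

    // Builds a string from (character, flags) pairs: starts a new line at each
    // character flagged as a line beginning and drops control characters.
    public static string BuildText(IReadOnlyList<(char Ch, uint Flags)> chars)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chars.Count; i++)
        {
            // Assumed bit mask: other flags may be set on the same character.
            bool lineBegin = (chars[i].Flags & LineBegin) != 0;
            if (lineBegin && sb.Length > 0)
                sb.Append(Environment.NewLine);

            // Skip nulls, stray CR/LF and other control characters (tabs included).
            if (!char.IsControl(chars[i].Ch))
                sb.Append(chars[i].Ch);
        }
        return sb.ToString();
    }
}
```

With the real API, the pairs would presumably come from reading one character at a time with GetChars(j, 1) and get_CharFlags(j), as in the code posted later in this thread.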
Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Re: Get full text from PDF

Post by Tom Princen »

Two other questions concerning the GetText method:

1. Is there a way to control the order in which the characters are looped over? It seems to loop from bottom right to top left.
(The IPXC_GetPageTextOptions parameter is not really documented.)

2. Could you help me with the coordinates from get_CharRect? What are the units: pixels, mm?
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Re: Get full text from PDF

Post by Sasha - Tracker Dev Team »

Hello Tom,

1. Please provide a piece of your code so that we can analyze it and assist further.
2. get_CharRect returns the character rectangle in page points.

Cheers,
Alex
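
For reference, a small sketch of converting such values to other units. It assumes "page points" means the standard PDF unit of 1/72 inch with no additional page scaling; the class and method names are made up for illustration and are not part of the SDK.

```csharp
public static class PointUnits
{
    public const double PointsPerInch = 72.0;
    public const double MillimetersPerInch = 25.4;

    // 72 points = 1 inch.
    public static double ToInches(double points) => points / PointsPerInch;

    // 72 points = 25.4 mm.
    public static double ToMillimeters(double points) =>
        points / PointsPerInch * MillimetersPerInch;

    // Pixel size depends on the resolution (dots per inch) used for rendering.
    public static double ToPixels(double points, double dpi) =>
        points / PointsPerInch * dpi;
}
```

For example, a rectangle 72 points wide is 25.4 mm wide, and would cover 96 pixels when rendered at 96 dpi.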
Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Re: Get full text from PDF

Post by Tom Princen »

for (int i = 0; i < MydocSource.Pages.Count; i++)
{
    // Separate pages with a line break.
    if (i > 0)
    {
        Docinfo2 += System.Environment.NewLine;
    }

    IPXC_PageText MyPageText = MydocSource.Pages[(uint)i].GetText(IPXC_GetPageTextOptions.);

    for (uint j = 0; j < MyPageText.CharsCount; j++)
    {
        // Test the flag as a bit, since other flags may be set on the same character.
        if (j > 0 && (MyPageText.get_CharFlags(j) & (uint)PXC_TextCharFlags.TCF_LineBegin) != 0)
        {
            Docinfo2 += System.Environment.NewLine;
        }

        Docinfo2 += MyPageText.GetChars(j, 1);
    }
}
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Re: Get full text from PDF

Post by Sasha - Tracker Dev Team »

Have you tried using the default behavior?

PDFXEdit.IPXC_PageText pText = page.GetText(null, false);
Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Re: Get full text from PDF

Post by Tom Princen »

That was the code I was using before, and that is the code that reads the PDF from bottom to top...

I modified the code to check what's inside the parameters.
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Re: Get full text from PDF

Post by Sasha - Tracker Dev Team »

Well, the GetChars method returns the characters in the order in which they were added to the page. It seems the document you are using was built in that order.
Try using these:
https://sdkhelp.pdf-xchange.com/vie ... locksCount
https://sdkhelp.pdf-xchange.com/vie ... _BlockInfo
Once you have the TextBlockInfo, you can get the ParaInfo from it:
https://sdkhelp.pdf-xchange.com/vie ... o_ParaInfo
The paragraph info tells you which lines belong to the paragraph. You can then use this method:
https://sdkhelp.pdf-xchange.com/vie ... t_LineInfo
With the line information you can get the character indexes of the line and build your resulting string with the GetChars method.

Cheers,
Alex
Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Re: Get full text from PDF

Post by Tom Princen »

Sorry, but the blocks are also in the wrong order...
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Re: Get full text from PDF

Post by Sasha - Tracker Dev Team »

Hello Tom,

We provide the bounding boxes of the paragraphs, lines and characters. My previous post describes how to get them. Judging by the files you are using, you will have to take those coordinates and sort the text manually. That way you can get the result you need for any file you come across.

Cheers,
Alex
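
One possible way to do such a manual sort is sketched below. This is not SDK code: `LineBox` is a hypothetical holder for a line's text and the top/left values of its bounding box. It assumes the coordinates follow the usual PDF convention (origin at the bottom left, so a larger top value is higher on the page) and that lines whose tops differ by no more than a small tolerance belong to the same visual row. Both assumptions should be checked against the actual values returned for your files.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder for one line of text and its bounding box position.
public sealed class LineBox
{
    public double Top { get; }
    public double Left { get; }
    public string Text { get; }

    public LineBox(double top, double left, string text)
    {
        Top = top;
        Left = left;
        Text = text;
    }
}

public static class ReadingOrder
{
    // Returns the lines top to bottom, and left to right within a row.
    public static List<LineBox> Sort(IEnumerable<LineBox> lines, double tolerance = 2.0)
    {
        // Highest line first: with a bottom-left origin, a larger Top is higher up.
        var byTop = lines.OrderByDescending(l => l.Top).ToList();

        var rows = new List<List<LineBox>>();
        foreach (var line in byTop)
        {
            // Start a new row when this line lies more than `tolerance`
            // below the first line of the current row.
            if (rows.Count == 0 || rows[rows.Count - 1][0].Top - line.Top > tolerance)
                rows.Add(new List<LineBox>());
            rows[rows.Count - 1].Add(line);
        }

        // Within each row, read left to right.
        return rows.SelectMany(r => r.OrderBy(l => l.Left)).ToList();
    }
}
```

Grouping into rows first avoids comparing top values for exact equality, which rarely holds for floating-point coordinates, and keeps the ordering consistent where a tolerance-based comparer passed directly to Array.Sort would not be.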
Tom Princen
User
Posts: 83
Joined: Wed Mar 25, 2015 10:15 am

Re: Get full text from PDF

Post by Tom Princen »

Sorry, I tried sorting based on the top and left positions of the blocks, lines, ...
But nothing gives a reasonable result.

(I added the PDF as an attachment.)

                MyLines = new LineOrder[NbLine];
                NbLine = 0;
                for (uint j = 0; j < MyPageText.BlocksCount; j++)
                {
                    for (uint k = 0; k < MyPageText.BlockInfo[j].ParaCount; k++)
                    {
                        for (uint l = 0; l < MyPageText.BlockInfo[j].ParaInfo[k].nLinesCount; l++)
                        {
                            MyLines[NbLine].BlockID = j;
                            MyLines[NbLine].ParaID = k;
                            MyLines[NbLine].LineID = l + MyPageText.BlockInfo[j].ParaInfo[k].nFirstLineIndex;

                            MyLines[NbLine].Top = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.top;
                            MyLines[NbLine].Left = MyPageText.get_LineInfo(MyLines[NbLine].LineID).rcBBox.left;

                            NbLine++;
                        }
                    }
                }

                // array sort:
                Array.Sort(MyLines, delegate(LineOrder x, LineOrder y)
                {
                    if (x.Top == y.Top) {
                        return x.Left.CompareTo(y.Left);
                    }
                    return x.Top.CompareTo(y.Top); 
                });

                for (uint j = 0; j < NbLine ; j++)
                {
                    uint FirstChar = MyPageText.get_LineInfo(MyLines[j].LineID).nFirstCharIndex;
                    uint CharCount = MyPageText.get_LineInfo(MyLines[j].LineID).nCharsCount;

                    Docinfo2 += System.Environment.NewLine;

                    Console.WriteLine("j=" + j + " LineID: " + MyLines[j].LineID + " Top: " + MyLines[j].Top + " Left: " + MyLines[j].Left);

                    for (uint m = 0; m < CharCount; m++)
                    {
                        uint currentChar = m + FirstChar;
                        if (MyPageText.get_CharFlags(currentChar) == (uint)PXC_TextCharFlags.TCF_LineBegin)
                        {
                            Docinfo2 += System.Environment.NewLine;
                        }

                        Docinfo2 += MyPageText.GetChars(currentChar, 1);
                    }
                }
Attachments
test.pdf
(463.68 KiB) Downloaded 164 times
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Re: Get full text from PDF

Post by Sasha - Tracker Dev Team »

I will experiment with this code and reply with the results.

Cheers,
Alex