Query PDF elements within an area

lidds
User
Posts: 510
Joined: Sat May 16, 2009 1:55 pm

Post by lidds »

I am allowing the user to draw a rectangle on a PDF, and I then extract all the text within the bounds of the annotation rectangle. This all works fine. However, what I now want to do is extract the PDF elements that are within the bounds of the annotation rectangle.

e.g. in the example below, I would like to get information on the Square and Circle, as they are within the bounds of the annotation rectangle. Is this possible?
ExtractPDFElements.png
Thanks

Simon
Sasha - Tracker Dev Team
User
Posts: 5522
Joined: Fri Nov 21, 2014 8:27 am

Post by Sasha - Tracker Dev Team »

Hello Simon,

You will have to run through the content items and check whether each item's coordinates intersect with the rectangle - that's the only way.
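
A minimal sketch of the rectangle checks this implies, written in plain Python so that it does not depend on the SDK. The Rect type, the function names and the sample boxes are all illustrative assumptions, not SDK API. It assumes PDF page coordinates (origin at the bottom-left, so top > bottom) and shows both an "intersects" test and the stricter "lies entirely inside" test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    # Hypothetical helper type. PDF page coordinates are assumed:
    # origin at the bottom-left, so top > bottom.
    left: float
    bottom: float
    right: float
    top: float


def intersects(a: Rect, b: Rect) -> bool:
    """True when the rectangles overlap or touch at an edge."""
    return (a.left <= b.right and b.left <= a.right
            and a.bottom <= b.top and b.bottom <= a.top)


def contains(outer: Rect, inner: Rect) -> bool:
    """True when inner lies entirely inside outer."""
    return (outer.left <= inner.left and inner.right <= outer.right
            and outer.bottom <= inner.bottom and inner.top <= outer.top)


# Made-up sample data: a selection and three item bounding boxes.
selection = Rect(left=100, bottom=100, right=300, top=250)
boxes = {
    "square": Rect(120, 120, 180, 180),    # fully inside the selection
    "circle": Rect(280, 200, 340, 260),    # straddles the selection edge
    "triangle": Rect(400, 400, 450, 450),  # completely outside
}

hits = [name for name, box in boxes.items() if intersects(selection, box)]
inside = [name for name, box in boxes.items() if contains(selection, box)]
```

Whether to use intersects or contains depends on whether an item that only partly overlaps the selection should count as selected.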

Cheers,
Alex

Post by lidds »

Thanks Alex,

So I have quickly done the following, which loops through the PDF content.

Code: Select all

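  ' Prints the type of every content item whose bounding box lies entirely
  ' inside the given area. The area is assumed to be in PDF page coordinates
  ' (origin at the bottom-left, so top > bottom).
  Private Sub getContent(ByVal areaLeft As Double, ByVal areaTop As Double, ByVal areaRight As Double, ByVal areaBottom As Double, ByVal pageNumber As Integer)
    Dim curPage As IPXC_Page = Me.docPreview.Doc.CoreDoc.Pages(pageNumber)
    Dim content As IPXC_Content = curPage.GetContent(PXC_ContentAccessMode.CAccessMode_Readonly)
    Dim items As IPXC_ContentItems = content.Items

    ' Count is unsigned, so "Count - 1" would overflow on an empty page.
    If items.Count = 0 Then Return

    For itemIndex As UInt32 = 0 To items.Count - 1
      Dim item As IPXC_ContentItem = items.Item(itemIndex)
      Dim itemType As PXC_CIType = item.Type

      Dim itemBox As PXC_Rect = item.BBox

      ' Containment test: every edge of the item box must be inside the area.
      ' (The original comparisons tested whether the item box enclosed the area.)
      If (itemBox.left >= areaLeft) AndAlso (itemBox.top <= areaTop) AndAlso (itemBox.right <= areaRight) AndAlso (itemBox.bottom >= areaBottom) Then
        Console.WriteLine(itemType.ToString)
      End If
    Next
  End Sub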
However, in the item type list (https://sdkhelp.pdf-xchange.com/vi ... PXC_CIType) I can't see an element type for Box, Circle, etc. I assume all shapes are placed as CIT_Path items?

If I explain what I am trying to do, then you may be able to suggest a way of achieving it. I am allowing the user to select an area, as described in my first post. From that selection I want to retain information about the PDF content items within the selected area, e.g. shape, size, etc. Then I want to scan the whole PDF and find other content items that match the same pattern as the one previously retained.

So basically a compare type tool. Any help and ideas on how to do this would be really appreciated.

Thanks

Simon

Post by Sasha - Tracker Dev Team »

Hello Simon,

Unfortunately, the PDF Specification holds no additional information about Path items - they are just path items that consist of basic elements (lines, arcs), that's all.
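
Since a path is only a list of basic drawing commands, any notion of "square" or "circle" has to be inferred from that list. The sketch below is one hedged heuristic in plain Python, not anything the SDK provides. The command names ("MoveTo", "LineTo", "CurveTo", "ClosePath") and the (command, points) layout are assumptions made for illustration. It treats a closed path of four axis-aligned line edges as a rectangle and a closed path of four curves as ellipse-like.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Hypothetical layout: (command name, points carried by that command).
Segment = Tuple[str, List[Point]]


def _axis_aligned(corners: List[Point], tol: float) -> bool:
    """True when every edge (including the closing one) is horizontal or vertical."""
    n = len(corners)
    for i in range(n):
        (x0, y0), (x1, y1) = corners[i], corners[(i + 1) % n]
        if abs(x1 - x0) > tol and abs(y1 - y0) > tol:
            return False
    return True


def classify(path: List[Segment], tol: float = 1e-6) -> str:
    """Guess a shape name from the command sequence; a heuristic only."""
    ops = [op for op, _ in path]
    if not ops or ops[0] != "MoveTo":
        return "unknown"
    closed = ops[-1] == "ClosePath"
    body = path[1:-1] if closed else path[1:]
    body_ops = [op for op, _ in body]
    if closed and body_ops == ["CurveTo"] * 4:
        return "ellipse-like"
    if body_ops and all(op == "LineTo" for op in body_ops):
        corners = [path[0][1][-1]] + [pts[-1] for _, pts in body]
        if closed and len(corners) == 4 and _axis_aligned(corners, tol):
            return "rectangle"
        return "polyline"
    return "unknown"


# Made-up sample paths.
rect_path = [("MoveTo", [(0, 0)]), ("LineTo", [(10, 0)]), ("LineTo", [(10, 5)]),
             ("LineTo", [(0, 5)]), ("ClosePath", [])]
circle_path = [("MoveTo", [(5, 0)])] + [("CurveTo", [(0, 0), (0, 0), (0, 0)])] * 4 \
              + [("ClosePath", [])]
line_path = [("MoveTo", [(0, 0)]), ("LineTo", [(3, 4)])]
```

Real documents may draw the same shape in other ways (for example, a circle built from more than four curves), so a classifier like this can only ever be a starting point.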

Cheers,
Alex

Post by lidds »

Alex,

So just to clarify, I can't get any more information about the path elements, e.g. coordinates, or whether it is a line or an arc, etc.? Basically, I want to make a digital fingerprint of the selected PDF elements so that I can find any others that match in the PDF. To do that, I obviously need as much information about the path elements as possible.

Thanks

Simon

Post by Sasha - Tracker Dev Team »

Hello Simon,

Just thought of a way that can help you. You can get the path data from the needed Path content items as follows:

Code: Select all

// Read-only access to the content of the first page
IPXC_Content content = pdfCtl.Doc.CoreDoc.Pages[0].GetContent(PXC_ContentAccessMode.CAccessMode_Readonly);
IPXC_ContentItems items = content.Items;
for (uint i = 0; i < items.Count; i++)
{
	IPXC_ContentItem item = items[i];
	// Only path items carry path data
	if (item.Type != PXC_CIType.CIT_Path)
		continue;
	// aCmds receives the path commands, aPts the flattened X/Y coordinates
	Array aCmds, aPts;
	item.Path_GetDataSA(out aCmds, out aPts);
	for (int j = 0; j < aCmds.GetLength(0); j++)
	{
		PXC_CI_PathCommands type = (PXC_CI_PathCommands)aCmds.GetValue(j);
		// The j * 2 indexing assumes one X/Y pair per command
		Console.WriteLine("Path Command: " + type.ToString() + " X:" + aPts.GetValue(j * 2).ToString() + " Y:" + aPts.GetValue(j * 2 + 1).ToString());
	}
}
The Path_GetDataSA method gives you an array of the commands used for the path and an array of the corresponding points for those commands. That can greatly help with the comparison - though if you need to find the same blocks on a page, you will have to measure additional path data along with the command types for a correct comparison. For example, say you have two commands, LineTo and MoveTo, with their corresponding point coordinates. You will have to store the line information (length and angle, or deltas), then find the content items with the same command count (two), and then check whether they are LineTo and MoveTo commands. If so, measure the same data that you stored - if it matches the information you are searching for, you have found your item.
Basically, for any given IPXC_ContentItem, first check for the same command array size, then check index by index whether the commands in that content item are the same as the ones you stored. If so, compare the measured data - this should give you the result.
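
The comparison steps above can be sketched as follows. This is plain Python rather than SDK code, and the names (fingerprint, matches), the (command, x, y) tuple layout with one point per command, and the tolerance value are assumptions chosen for illustration. It stores the offsets (deltas) between consecutive points, so two paths match regardless of where they sit on the page.

```python
import math
from typing import List, Tuple

# Hypothetical layout: one (command name, x, y) tuple per path command.
PathData = List[Tuple[str, float, float]]


def fingerprint(path: PathData):
    """Return the command names and the deltas between consecutive points."""
    cmds = [cmd for cmd, _, _ in path]
    deltas = [(x1 - x0, y1 - y0)
              for (_, x0, y0), (_, x1, y1) in zip(path, path[1:])]
    return cmds, deltas


def matches(a: PathData, b: PathData, tol: float = 0.01) -> bool:
    """Compare command count, then commands index by index, then deltas."""
    cmds_a, deltas_a = fingerprint(a)
    cmds_b, deltas_b = fingerprint(b)
    if len(cmds_a) != len(cmds_b):
        return False
    if cmds_a != cmds_b:
        return False
    return all(math.isclose(ax, bx, abs_tol=tol) and math.isclose(ay, by, abs_tol=tol)
               for (ax, ay), (bx, by) in zip(deltas_a, deltas_b))


# Made-up sample data: the stored line, a shifted copy, and a longer line.
stored = [("MoveTo", 10.0, 10.0), ("LineTo", 60.0, 10.0)]
shifted_copy = [("MoveTo", 200.0, 300.0), ("LineTo", 250.0, 300.0)]
longer_line = [("MoveTo", 0.0, 0.0), ("LineTo", 80.0, 0.0)]
```

Deltas make the match independent of position but not of scale or rotation. Matching scaled or rotated copies would need normalised lengths and relative angles instead.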

Cheers,
Alex

Post by lidds »

Alex,

Thank you so much for your code snippet and advice on how to accomplish this. I will be looking into this properly tomorrow.

I do have one quick question though: when running your code, the X and Y coordinates of the Path Command seem not to be in page coordinates, e.g. some of the Path Command coordinates are larger than the size of the PDF page.

Is there a way to turn these into actual page coordinate values?

Thanks

Simon

Post by lidds »

Alex,

Don't worry, I have sorted this by using the BBox of the item.
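
The thread does not show how the BBox was used. One possible approach, sketched here in plain Python under stated assumptions, is to rescale the raw path points so that their extents line up with the item's bounding box. The function name and the (left, bottom, right, top) tuple layout are hypothetical. The sketch assumes the path is neither rotated nor mirrored and that the box tightly encloses the points, which will not hold for curve control points or thick strokes.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_page_space(points: List[Point],
                  bbox: Tuple[float, float, float, float]) -> List[Point]:
    """Linearly map raw points so their extents match bbox (left, bottom, right, top)."""
    left, bottom, right, top = bbox
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    def scale(v, lo, hi, out_lo, out_hi):
        # A zero-width extent (e.g. a vertical line) maps to the box centre.
        if hi == lo:
            return (out_lo + out_hi) / 2
        return out_lo + (v - lo) * (out_hi - out_lo) / (hi - lo)

    return [(scale(x, min_x, max_x, left, right),
             scale(y, min_y, max_y, bottom, top)) for x, y in points]


# Made-up sample data: raw points far outside the page, and the item's box.
raw = [(1000.0, 2000.0), (3000.0, 2000.0), (3000.0, 4000.0), (1000.0, 4000.0)]
page_points = to_page_space(raw, (100.0, 100.0, 200.0, 150.0))
```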

Thanks

Simon

Post by Sasha - Tracker Dev Team »

:)

Post by lidds »

Hi Alex,

I am having a bit of an issue with matching the PDF elements. I have written an application that allows you to draw a rectangle on a PDF document. It then finds the PDF content items within that boundary and interrogates each content item, as you suggested previously. This builds a shape fingerprint, as I call it, and you can then loop through all the PDF content and match it against the shape fingerprint data. This works with simple shapes, but starts to give inconsistent results with complex shapes. I was hoping you could have a look at the code to see if I am doing something silly.

The link below contains a video explaining what I am doing, plus example PDF files and the application (obviously with no license code).

https://ticodi.sharefile.com/d-s41513a16ba142a38

Thanks

Simon

Post by lidds »

I was just wondering if anyone had had a chance to have a look at this for me?

Thanks

Simon

Post by lidds »

Alex,

I know you guys are busy, but is there any chance someone could have a look at this for me, please?

Thanks

Simon

Post by Sasha - Tracker Dev Team »

Hello Simon,

Sorry for the delay - we are indeed busy - the release should be out soon.
It's hard for me to debug VB.Net, but it seems that you are calling drawShapeBoundary twice - the first time, the width and height are correct, and the second time they are greater than they need to be. It seems that you have missed a size recalculation somewhere. Also, the code looks the same in the if/else statements at lines 515 and 492 - does it need to be like that?

Cheers,
Alex