NEWB Question-Region Selection
Moderators: TrackerSupp-Daniel, Tracker Support, Paul - Tracker Supp, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Ivan - Tracker Software, Tracker Supp-Stefan
Forum rules
DO NOT post your license/serial key, or your activation code - these forums, and all posts within, are public and we will be forced to immediately deactivate your license.
When experiencing some errors, use the IAUX_Inst::FormatHRESULT method to see their description and include it in your post along with the error code.
NEWB Question-Region Selection
My sincerest apologies as I am just getting started with the SDK. I have spent a couple of days already and I am just not making progress. Problem: I want to select a region within the PDF and put it into a memory stream to print as its own PDF document. This is a precursor step to sending it to my own custom OCR engine. My problem is that I cannot seem to figure out how to extract a selected region from within the PDF. My ultimate goal is to pull searchable text from a PDF table based on a region selection.
I am feeling so clueless as this is probably super simple but I just don't see how to make this work after reading through the FullDemo code. Is there other developer documentation besides (https://sdkhelp.pdf-xchange.com/view/PXV:IUIX_Cmd)?
thank you!
-Eric
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
- Contact:
Re: NEWB Question-Region Selection
Based on your description, one of the DrawXXX functions should help.
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
Re: NEWB Question-Region Selection
Thank you for that reference. Do you have any actual implementation insight or suggested approach that has been successfully attempted before? I am sure I am not asking a unique question ... "How to select a region and retrieve all text and/or all image data within that region and save it to a memory stream?" Unfortunately, I do not find any help documentation, forum posts, or sample reference material for this particular question. But that's why I am sure it is a newb question because I am still learning the SDK documentation. So if you could provide some greater insight that would be very helpful. Thank You!!
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
- Contact:
Re: NEWB Question-Region Selection
Can you please describe your task in a bit more detail?
You say "How to select a region...". How do you expect this selection should happen? By your user on a rendered page? Programmatically?
We can try to help or point you in the right direction, but we need to understand what exactly you are trying to achieve.
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
Re: NEWB Question-Region Selection
Ivan - very fair question! Specifically, there are 2 use cases. In each case, the user will perform a selection by using their mouse to select a region on the PDF through the PDF Viewer. The information being extracted should be in tabular form and can break across pages.
Case 1: The PDF is searchable with text in a tabular format.
Case 2: The PDF is an image and the text is in a tabular format.
In both cases: The text data needs to be extracted based on the region selected using positional reference data such that headers and starting column row text descriptors can be identified as part of the extraction process. If the data breaks across pages then the data can be extracted as a complete table, column, or row set. The number of rows in the data is NOT consistent but the number of columns per table and the table titles are consistent for each type of PDF report.
The users are working with consistent PDF form layouts that they first map out the regions for the data extraction. These mappings are saved within our database so that when a user wants to extract tabular data in the future they select the appropriate mapping layout and then I use the SDK to pull the associated data sets. "IF" the data breaks across pages then I will pull the data until the table or column is completely extracted.
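For the searchable-text case, a saved region mapping like the one described above could be applied once the words and their page positions are available. The following is a minimal sketch in plain C#, not SDK code: `RegionRect`, `Word`, and `RegionTable.Extract` are hypothetical names, and it assumes the words and their bounding boxes (in PDF page coordinates, with y increasing upward) have already been obtained by some other means. It keeps the words whose centre lies inside the saved region and groups them into rows by vertical position.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper types for illustration; none of them come from the SDK.
public struct RegionRect
{
    public double Left, Bottom, Right, Top;   // page coordinates, y grows upward

    public RegionRect(double left, double bottom, double right, double top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    // True when the point (x, y) lies inside the rectangle, edges included.
    public bool Contains(double x, double y)
    {
        return x >= Left && x <= Right && y >= Bottom && y <= Top;
    }
}

public sealed class Word
{
    public string Text;
    public RegionRect Box;
    public Word(string text, RegionRect box) { Text = text; Box = box; }
}

public static class RegionTable
{
    // Keeps the words whose box centre is inside the region, groups them into
    // rows from top to bottom, and orders each row from left to right.
    public static List<List<string>> Extract(
        IEnumerable<Word> words, RegionRect region, double rowTolerance)
    {
        var inside = words
            .Where(w => region.Contains((w.Box.Left + w.Box.Right) / 2,
                                        (w.Box.Bottom + w.Box.Top) / 2))
            .OrderByDescending(w => w.Box.Bottom)   // higher on the page first
            .ToList();

        var rows = new List<List<Word>>();
        foreach (var w in inside)
        {
            // A word joins the current row when its bottom edge is within
            // rowTolerance of the bottom edge of that row's first word.
            if (rows.Count > 0 &&
                Math.Abs(rows[rows.Count - 1][0].Box.Bottom - w.Box.Bottom) <= rowTolerance)
                rows[rows.Count - 1].Add(w);
            else
                rows.Add(new List<Word> { w });
        }

        return rows
            .Select(r => r.OrderBy(w => w.Box.Left).Select(w => w.Text).ToList())
            .ToList();
    }
}
```

A call such as `RegionTable.Extract(words, savedRegion, 2.0)` returns one list of word texts per row. The same saved region could be applied to each following page when a table continues across a page break; splitting a row into columns would still need the column boundaries from the mapping.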
I have my own OCR engine where I can send it a PDF image and it can convert it to text. But in both the searchable text and the PDF image scenarios, I cannot seem to pull the necessary information from the viewer control. I have not figured out which method I can use that will return the necessary RECTANGLE properties for me to pull the information. I know this must be super simple but I am just stuck and I am concerned that my ignorance of the library is pointing me in the wrong direction. I know I have to compensate for viewer zoom, page orientation, and page relations. Let me know if you have any other questions as this is super important for me to understand as I get going!
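The zoom compensation mentioned above is, at its simplest, plain arithmetic. Here is a minimal sketch, assuming an unrotated page, a known screen position for the page's top-left corner, and a viewer that draws 72 points per inch scaled by zoom and screen DPI. `PageRect` and `ScreenToPage.Convert` are hypothetical names and not part of the SDK, which may well offer its own conversion.

```csharp
using System;

// Hypothetical result type: a rectangle in PDF points, y growing upward.
public struct PageRect
{
    public double Left, Bottom, Right, Top;
}

public static class ScreenToPage
{
    // Converts a mouse rectangle given in client pixels (y growing downward)
    // into page coordinates. Rotation is deliberately ignored in this sketch.
    public static PageRect Convert(
        int pxLeft, int pxTop, int pxRight, int pxBottom,
        double pageOriginX, double pageOriginY,   // pixel position of the page's top-left corner
        double zoom, double dpi,                  // zoom 1.0 means 100%
        double boxLeft, double boxTop)            // left and top edges of the page box, in points
    {
        double pixelsPerPoint = zoom * dpi / 72.0;
        if (pixelsPerPoint <= 0)
            throw new ArgumentException("zoom and dpi must be positive");

        var r = new PageRect();
        // Normalise first so that dragging in any direction gives the same result.
        r.Left = boxLeft + (Math.Min(pxLeft, pxRight) - pageOriginX) / pixelsPerPoint;
        r.Right = boxLeft + (Math.Max(pxLeft, pxRight) - pageOriginX) / pixelsPerPoint;
        // Screen y grows downward while page y grows upward, hence the subtraction.
        r.Top = boxTop - (Math.Min(pxTop, pxBottom) - pageOriginY) / pixelsPerPoint;
        r.Bottom = boxTop - (Math.Max(pxTop, pxBottom) - pageOriginY) / pixelsPerPoint;
        return r;
    }
}
```

Storing the result in page coordinates rather than pixels keeps a saved mapping independent of the zoom level at which the user drew it.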
- Chris - Tracker Supp
- Site Admin
- Posts: 795
- Joined: Tue Apr 14, 2009 11:33 pm
Re: NEWB Question-Region Selection
Hi Lambchop,
I sent you the sample app yesterday that our developers prepared for you.
Hope that helps
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Chris Attrell
Tracker Sales & Support North America
http://www.tracker-software.com
-
- User
- Posts: 1370
- Joined: Thu Sep 05, 2019 12:35 pm
Re: NEWB Question-Region Selection
Hi Chris,
Can you share this sample somewhere?
-žarko
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
- Contact:
Re: NEWB Question-Region Selection
Yes, we will. We are working on adding this kind of sample to our github repository.
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
Re: NEWB Question-Region Selection
Ivan - Thank you tons for the sample code. I notice that you used a few compressed coding lines that are just not making sense as I cannot convert them cleanly into steps. When you can... please break out these lines with their associated references:
line 1: tagPOINT pt = pEvent.get_Pos();
issue: get_Pos() does not exist off of IUIX_Event
line 2: m_curPageBBox = pView.Doc.CoreDoc.Pages[(uint)editPageIndex].get_Box(PXC_BoxType.PBox_ViewBox);
issue: get_Box does not exist off of IPXC_Page or PXC_Rect
FYI ... the ambiguous use of implements vs. inherits in C# is just too fun
- Vasyl-Tracker Dev Team
- Site Admin
- Posts: 2352
- Joined: Thu Jun 30, 2005 4:11 pm
- Location: Canada
Re: NEWB Question-Region Selection
What programming language are you using?
Vasyl Yaremyn
Tracker Software Products
Project Developer
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Re: NEWB Question-Region Selection
Visual Studio 2019
I code in VB.NET but the class extensions are not recognized in the C# either.
- Vasyl-Tracker Dev Team
- Site Admin
- Posts: 2352
- Joined: Thu Jun 30, 2005 4:11 pm
- Location: Canada
Re: NEWB Question-Region Selection
That C#-sample project we provided - can you compile it on your side?
Vasyl Yaremyn
Tracker Software Products
Project Developer
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
- Vasyl-Tracker Dev Team
- Site Admin
- Posts: 2352
- Joined: Thu Jun 30, 2005 4:11 pm
- Location: Canada
Re: NEWB Question-Region Selection
I just made a simple VB.NET app (WinForm, Desktop), added to its main form the Editor's ActiveX control, and I can see this:
I'm not so experienced in VB.NET, but it all seems to look very similar to the corresponding stuff in the C# project...
Vasyl Yaremyn
Tracker Software Products
Project Developer
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Re: NEWB Question-Region Selection
I think we are not communicating about the same thing ... what I posted was about "get_Pos()" and "get_Box". Something is really odd ... I set the project reference to the same PDXEdit.dll and it shows this code for the IUIX_Event from a NEW Project:
Namespace PDFXEdit
    <ComConversionLoss>
    <Guid("482E54A4-F8E8-4C78-8472-1FA890ED1C3A")>
    <TypeLibTypeAttribute(4288)>
    Public Interface IUIX_Event
        <DispId(1610743809)>
        Property Code As Integer
        <DispId(1610743811)>
        Property Handled As Boolean
        <ComAliasName("PDFXEdit.PARAM_T")>
        <DispId(1610743813)>
        Property Param1 As <ComAliasName("PDFXEdit.PARAM_T")> UInteger
        <ComAliasName("PDFXEdit.PARAM_T")>
        <DispId(1610743815)>
        Property Param2 As <ComAliasName("PDFXEdit.PARAM_T")> UInteger
        <ComAliasName("PDFXEdit.PARAM_T")>
        <DispId(1610743817)>
        Property Result As <ComAliasName("PDFXEdit.PARAM_T")> UInteger
        <DispId(1610743820)>
        Property Pos As tagPOINT
        <DispId(1610743808)>
        Function _GetRawPtr() As IntPtr
        <DispId(1610743819)>
        Function Clone() As IUIX_Event
    End Interface
End Namespace
************************
Notice that it does not have any reference to the set_Pos and get_Pos methods. Your SDK Help https://sdkhelp.pdf-xchange.com/view/PXV:IUIX_Event#Methods also does not show Get_Pos method. I am confused between the sample project code and the SDK documentation and the VS class definitions. If you create a NEW project the DLL pulls a different class definition for the IUIX_Event.
I think this is an issue of exposing managed vs unmanaged code base.
Current FULL DEMO Project Class Definition for IUIX_Event: notice ... it has the code reference for set_Pos and get_Pos
namespace PDFXEdit
{
    [ComConversionLoss]
    [Guid("482E54A4-F8E8-4C78-8472-1FA890ED1C3A")]
    [TypeLibType(4288)]
    public interface IUIX_Event
    {
        [DispId(1610743808)]
        IntPtr _GetRawPtr();
        [DispId(1610743819)]
        IUIX_Event Clone();
        [DispId(1610743820)]
        tagPOINT get_Pos();
        [DispId(1610743820)]
        void set_Pos(ref tagPOINT stPos);
        [DispId(1610743809)]
        int Code { get; set; }
        [DispId(1610743811)]
        bool Handled { get; set; }
        [ComAliasName("PDFXEdit.PARAM_T")]
        [DispId(1610743813)]
        uint Param1 { get; set; }
        [ComAliasName("PDFXEdit.PARAM_T")]
        [DispId(1610743815)]
        uint Param2 { get; set; }
        [ComAliasName("PDFXEdit.PARAM_T")]
        [DispId(1610743817)]
        uint Result { get; set; }
        [DispId(1610743820)]
        tagPOINT Pos { get; set; }
    }
}
- Vasyl-Tracker Dev Team
- Site Admin
- Posts: 2352
- Joined: Thu Jun 30, 2005 4:11 pm
- Location: Canada
Re: NEWB Question-Region Selection
It is a known 'issue'... When you have a C# project and add a reference to the ActiveX control, it imports all interfaces and types from it. And for some internal purposes a usual COM property, like IUIX_Event::Pos, can be imported as a pair of simple get_Pos()/set_Pos() functions.
However, in VB.NET the standard COM importer interprets it as a natural get/set Pos property of the IUIX_Event object.
With the IPXC_Page::Box get/set property we have a slightly different case: Box is a property with a parameter. So, to get/set the value you need to specify the parameter too (VB.NET):
Code: Select all
Dim page As PDFXEdit.IPXC_Page
Dim curViewBox As PDFXEdit.PXC_Rect
Dim newCropBox As PDFXEdit.PXC_Rect
...
curViewBox = page.Box(PDFXEdit.PXC_BoxType.PBox_ViewBox)
page.Box(PDFXEdit.PXC_BoxType.PBox_CropBox) = newCropBox
And, while VB.NET supports properties with parameters, C# doesn't! So in this case, with C# you have only one way to use such properties, via the corresponding get/set functions:
Code: Select all
curViewBox = page.get_Box(PDFXEdit.PXC_BoxType.PBox_ViewBox);
page.set_Box(PDFXEdit.PXC_BoxType.PBox_CropBox, ref newCropBox);
Please note: here we are talking about an SDK based on ActiveX technology that provides special COM interfaces to share its API. And according to the COM standard, this SDK has a built-in description of all public interfaces/features it carries inside (TypeLibrary).
So any programming language that 'understands' such COM stuff may automatically import that API description into your project, in terms of your programming language, when you just add the reference to the SDK. And the COM importer on your side (in your IDE) has its own rules on how to import native COM interfaces from an external SDK into your project and your programming language. It is not controlled by the SDK at all.
But typically it is easier than it may look:
- in VB.NET: all COM properties are imported as natural properties
- in C#:
a) when a COM property has a simple type (int, double, string, interface) - it will be imported as a natural C# property
b) when a COM property sets/gets a structure - it will be imported as a pair of functions
c) when a COM property has a parameter - it will be imported as a pair of functions
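The three C# import rules above can be illustrated without the SDK. The sketch below uses mock classes only: every `Mock…` name is hypothetical and exists just for this example. It mirrors the member shapes described in the post, with `Code` as an ordinary property and `get_Pos`/`set_Pos` and `get_Box`/`set_Box` as function pairs, so the call sites look like the ones in the C# sample.

```csharp
using System;
using System.Collections.Generic;

// Mock stand-ins, NOT the SDK's types. They only mimic the member shapes
// that the C# importer is described as producing.
public enum MockBoxType { ViewBox, CropBox }

public struct MockRect { public double left, top, right, bottom; }

public struct MockPoint { public int x, y; }

public sealed class MockEvent
{
    private MockPoint pos;

    // Rule (a): a simple-typed COM property stays an ordinary C# property.
    public int Code { get; set; }

    // Rule (b): a structure-typed COM property becomes a get_/set_ pair.
    public MockPoint get_Pos() { return pos; }
    public void set_Pos(ref MockPoint value) { pos = value; }
}

public sealed class MockPage
{
    private readonly Dictionary<MockBoxType, MockRect> boxes =
        new Dictionary<MockBoxType, MockRect>();

    // Rule (c): a COM property with a parameter becomes a get_/set_ pair
    // that takes the parameter as its first argument.
    public MockRect get_Box(MockBoxType type) { return boxes[type]; }
    public void set_Box(MockBoxType type, ref MockRect value) { boxes[type] = value; }
}

public static class ImportRulesDemo
{
    public static void Main()
    {
        var evt = new MockEvent { Code = 1 };
        var pt = new MockPoint { x = 10, y = 20 };
        evt.set_Pos(ref pt);

        var page = new MockPage();
        var crop = new MockRect { left = 0, bottom = 0, right = 612, top = 792 };
        page.set_Box(MockBoxType.CropBox, ref crop);

        MockRect back = page.get_Box(MockBoxType.CropBox);
        Console.WriteLine("{0} {1} {2}", evt.Code, evt.get_Pos().x, back.right);
    }
}
```

In VB.NET the same calls would read as plain property accesses, which is why the sample's `get_Pos()` and `get_Box(...)` lines have no direct counterpart there.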
Vasyl Yaremyn
Tracker Software Products
Project Developer
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
Re: NEWB Question-Region Selection
WOW! Great explanation ... I learned something new on this one. Thank you for all your patience!
- Chris - Tracker Supp
- Site Admin
- Posts: 795
- Joined: Tue Apr 14, 2009 11:33 pm
NEWB Question-Region Selection
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Chris Attrell
Tracker Sales & Support North America
http://www.tracker-software.com