PXC_TextOutA and PCX_PointF

This Forum is for the use of Software Developers requiring help and assistance for Tracker Software's PDF-Tools SDK of Library DLL functions(only) - Please use the PDF-XChange Drivers API SDK Forum for assistance with all PDF Print Driver related topics or PDF-XChange Viewer SDK if appropriate.

jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

PXC_TextOutA and PCX_PointF

Post by jeffp »

It looks like the y coordinate of PXC_PointF starts at the bottom of the page and increases upward, unlike most coordinates I deal with, which start at the top and increase downward.

Is there any way to make PXC_PointF.y = 0 mean the top of the page and not the bottom of the page?

The problem with PXC_PointF.y = 0 meaning the bottom of the page is that I always have to know the overall page height to convert the coordinates given to me by my OCR engine or other apps.
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: PXC_TextOutA and PCX_PointF

Post by jeffp »

Ok. Now I'm more confused. In the PXCp_PageGetBox example below, the height is calculated as Bottom - Top, which implies that the y coordinate starts at 0 at the top and not the bottom. However, when I run the procedure I get Bottom = 0 and Top = 792, which would indicate the opposite.

I'm using Delphi. How is the y coordinate calculated?

Code:

PXCp_PageGetBox

  PDFDocument  hDocument;
  PXC_RectF    MediaBox = {0};

  // Retrieve the width and height of the first page in the document:
  HRESULT res = PXCp_PageGetBox(hDocument, 0, PB_MediaBox, &MediaBox);
  if (IS_DS_FAILED(res))
  {
      // Report an error
  }

  double width  = MediaBox.right - MediaBox.left;
  double height = MediaBox.bottom - MediaBox.top;   // <-- the line in question
  ...
  // Clean up
  PXCp_Delete(hDocument);


Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PXC_TextOutA and PCX_PointF

Post by Walter-Tracker Supp »

I will assume that is an error in the help document, as PDF coordinates do have their origin at the bottom left (like ordinary mathematical / Cartesian coordinates, and the opposite of Windows API screen coordinates, which begin at 0,0 in the top left and increase as you go down). The function PXC_GetPageSize() returns the same dimensions as PXCp_PageGetBox() does when you pass PB_MediaBox, as in the example.
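
To make the convention concrete, here is a minimal sketch in plain C rather than SDK code. RectF is a hypothetical stand-in for PXC_RectF (only the field names left, top, right and bottom are taken from the help example), and the helper names are made up for illustration. It assumes the box comes back with Bottom = 0 and Top = 792, as reported above, so the height is top minus bottom, and a y value measured down from the top edge is flipped by subtracting it from top.

```c
/* Hypothetical stand-in for PXC_RectF; only the four field names are
   taken from the help example quoted earlier in the thread. */
typedef struct {
    double left, top, right, bottom;
} RectF;

/* With a bottom-left origin, right > left and top > bottom. */
static double page_width(const RectF *box)
{
    return box->right - box->left;
}

static double page_height(const RectF *box)
{
    return box->top - box->bottom;
}

/* Convert a y value measured downward from the top edge of the box
   (in points) into a PDF y value measured upward from the origin. */
static double top_down_to_pdf_y(const RectF *box, double y_from_top)
{
    return box->top - y_from_top;
}
```

For a box with bottom = 0 and top = 792, a point 100 pt below the top edge lands at y = 692. The 612 pt width used when trying this out is just an assumed US Letter page, not something stated in the thread.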

Note that there is a function, OCRp_RasterRectToPDF(), that converts output from other OCRp_ functions to PDF coordinates, if that is what you are using. It accounts for the difference in origin and Y-axis conventions, and also for the scaling (and non-integer coordinates) needed to go from whatever DPI was used for OCR rasterization to PDF page coordinates in points (72.0 per inch). In fact, the OCR_RasterPageSettings structure returned from (e.g.) OCRp_Page already contains the page-size information it needs, so you can pass that structure in directly.
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: PXC_TextOutA and PCX_PointF

Post by jeffp »

Ok. I can deal with the bottom left PDF coordinate system, but here's a strange one.

I've got a PDF that I ran through your OCR engine and I then open it in the other PDF DLLs to extract the text elements.

If I use PXCp_ET_GetElement to get both PXP_TextElement.Matrix.e and PXP_TextElement.Matrix.f (the X and Y coordinates, respectively) and then try to create a new PDF containing just the text, using PXC_TextOutA and setting

PXC_PointF.x = PXP_TextElement.Matrix.e
PXC_PointF.y = PXP_TextElement.Matrix.f

All the text boxes in the new PDF are placed a bit lower than they were in the original PDF. The offset appears to be about the height of the text element, but I can't find anything in the Matrix that gives me that height adjustment.

I would have expected to perfectly reproduce the text element positions with the settings indicated above.

Why is it off?

Basically, I want to prove to myself that I can extract existing text elements from a PDF and rebuild them exactly in a new PDF.

Is there any way you can post the code of OCRp_RasterRectToPDF? I'd like to write it in Delphi, since my example above is independent of ocrtools.dll. It deals with converting the coordinates of other OCR engines into PDF coordinates. Those engines use (0,0) as the top of the page.
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: PXC_TextOutA and PCX_PointF

Post by jeffp »

Walter,

Have you or anyone else been able to take a look at my reply here about not being able to grab and then reset the text boxes exactly?

Jeff
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PXC_TextOutA and PCX_PointF

Post by Walter-Tracker Supp »

Here's the calculation to convert a RECT SourceRect (in image coordinates: top left origin) to PDF coordinates (double variables left, right, top and bottom), in a nutshell:

Code:

	left = (double)SourceRect.left / RasterSettings.scalefactor;
	right = (double)SourceRect.right / RasterSettings.scalefactor;
	top = (double)((RasterSettings.imgheight-1)-SourceRect.top) / RasterSettings.scalefactor;
	bottom = (double)((RasterSettings.imgheight-1)-SourceRect.bottom) / RasterSettings.scalefactor;

The scale factor is just imageDPI / 72.0.
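
The calculation above can be wrapped in a small function. This is only a sketch: RasterRect, PdfRect and RasterSettings are hypothetical stand-ins for the SDK's own types (only the field names imgheight and scalefactor come from the snippet), and it is not the actual source of OCRp_RasterRectToPDF().

```c
/* Hypothetical stand-ins for the SDK types; only the field names
   imgheight and scalefactor are taken from the snippet above. */
typedef struct { long left, top, right, bottom; } RasterRect;   /* pixels, top-left origin */
typedef struct { double left, top, right, bottom; } PdfRect;    /* points, bottom-left origin */
typedef struct { long imgheight; double scalefactor; } RasterSettings;

/* Pixels per point: the image DPI divided by 72 points per inch. */
static double scale_factor_from_dpi(double image_dpi)
{
    return image_dpi / 72.0;
}

/* Flip the y axis around the last pixel row (imgheight - 1) and scale
   from pixels to points. */
static PdfRect raster_rect_to_pdf(RasterRect src, RasterSettings rs)
{
    PdfRect out;
    out.left   = (double)src.left  / rs.scalefactor;
    out.right  = (double)src.right / rs.scalefactor;
    out.top    = (double)((rs.imgheight - 1) - src.top)    / rs.scalefactor;
    out.bottom = (double)((rs.imgheight - 1) - src.bottom) / rs.scalefactor;
    return out;
}
```

At 144 DPI the scale factor is 2.0, so on a 1584-pixel-high image a rectangle with top = 100 and bottom = 200 maps to top = 741.5 and bottom = 691.5 in points. Note that top ends up larger than bottom, as expected with the origin at the bottom left.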

As for the PXCp functions and placement of text I will have to look into that in a bit more detail and get back to you.

-Walter
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: PXC_TextOutA and PCX_PointF

Post by jeffp »

Ok. That's what I was doing, so good.

But why don't these come out equal?

PXC_PointF.x = PXP_TextElement.Matrix.e
PXC_PointF.y = PXP_TextElement.Matrix.f
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PXC_TextOutA and PCX_PointF

Post by Walter-Tracker Supp »

It sounds to me as if you are using the wrong text placement option. You can set this with a call to PXC_SetTextOptions():

Code:

HRESULT PXC_SetTextOptions(_PXCContent* content, const PXC_TextOptions* options);

Parameters:
content [in] Specifies the identifier of the page content to which the function will be applied.
options [in] Pointer to a PXC_TextOptions structure that specifies the text options.

content is the same as your first argument to PXC_TextOutA(). options is a pointer to a PXC_TextOptions struct (see the SDK help for a full list of members), which contains a member called "nTextPosition".

Possible settings (from the SDK help) are:

Code:

TextPosition_Top = 0 - The reference point will be on the top edge of the bounding rectangle.
TextPosition_Baseline = 1 -  The reference point will be on the base line of the text.
TextPosition_Bottom = 2 - The reference point will be on the bottom edge of the bounding rectangle.
jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Re: PXC_TextOutA and PCX_PointF

Post by jeffp »

Ok. Using TextPosition_Baseline did the trick.

However, what exactly does this mean? I understand _Top and _Bottom. What exactly is the base line of the text?

--Jeff
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: PXC_TextOutA and PCX_PointF

Post by Walter-Tracker Supp »

jeffp wrote:Ok. Using TextPosition_Baseline did the trick.

However, what exactly does this mean? I understand _Top and _Bottom. What exactly is the base line of the text?

--Jeff
It's a font / typography thing.

Basically, the baseline is the line that the letters sit on: it runs along the bottom of the letters that don't have descenders (i.e., letters other than g, p, q, y, j, although it depends on the font), whereas the bounding box covers the entire word / line, descenders included. So in "elephantitis", the baseline runs along the bottom of e, l, e, h, a, n, t, i, t, i, s, and the bottom of the bounding box is at the bottom of the "p".

There's a good summary here on wikipedia: http://en.wikipedia.org/wiki/Baseline_%28typography%29
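
As a rough illustration of how the three reference points relate, here is a small sketch in plain C. It is not SDK code: FontMetrics and both helpers are hypothetical, the metric values are made up, and it assumes the bounding rectangle extends from the font's ascent above the baseline to its descent below it, which the thread does not confirm. The y axis points up, as in PDF page coordinates.

```c
/* Hypothetical font metrics in 1/1000 of the font size; real values
   would have to come from the font itself. */
typedef struct {
    double ascent;   /* distance from the baseline up to the top edge */
    double descent;  /* distance from the baseline down to the bottom edge (positive) */
} FontMetrics;

/* The baseline lies below the top edge by the scaled ascent. */
static double baseline_from_top(double top_y, FontMetrics m, double font_size)
{
    return top_y - m.ascent * font_size / 1000.0;
}

/* The bottom edge lies below the baseline by the scaled descent. */
static double bottom_from_baseline(double baseline_y, FontMetrics m, double font_size)
{
    return baseline_y - m.descent * font_size / 1000.0;
}
```

With an assumed ascent of 750 and descent of 250 at 12 pt, a top edge at y = 700 gives a baseline at y = 691 and a bottom edge at y = 688. Under the same assumption, treating a baseline coordinate as the top reference point would draw the text lower by about the ascent, which would be consistent with the offset described earlier in the thread.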