How do I select text while accounting for ligatures?

PDF-XChange Editor SDK for Developers

Post by Woodgnome » Wed Mar 25, 2020 4:38 pm

Attachment: ae.pdf (213.24 KiB)

The use case:

  1. Extract text using IPXC_Page.GetText(null).GetChars(nFirstCharIndex, nCharsCount)
  2. Do something with the extracted text (in this case parse the text and determine a word to highlight).
  3. Highlight the word using the following code:

Code:

// Create a standard text selection for the document.
IPXV_TextSelection textSelection = (IPXV_TextSelection)IPXV_Document.CreateStdSel((uint)IPXV_Inst.Str2ID("selection.text"));
// Get the per-page text selection for the target page.
IPXV_PageTextSelection pageTextSelection = textSelection.GetSel(pageIndex, true);
// Select the character range of the word to highlight.
pageTextSelection.SelectChars(charIndex, length);
textSelection.OnAdd(IPXV_Document);
// Make it the document's active selection and show it.
IPXV_Document.ActiveSel = textSelection;
textSelection.Show(true);

This works fine as long as GetText() is called with null options (i.e. ligatures are left as separate characters). However, I am interested in getting the text with ligatures, like so:

Code:

IPXC_GetPageTextOptions getPageTextOptions = IPXC_Inst.CreateGetPageTextOptions(2);
getPageTextOptions.Flags = 2; // With ligatures
IPXC_Page.GetText(getPageTextOptions).GetChars(nFirstCharIndex, nCharsCount);

When I extract the text with ligatures this way, the character indices no longer match those expected by the selection methods whenever the text actually contains ligatures.

For example:

  • Text with separate characters: "Jeg spiser aebler"
    • I want to select the word "aebler", so I select with charIndex = 11 and length = 6. This works fine.
  • Text with ligatures: "Jeg spiser æbler"
    • I want to select the word "æbler", so I select with charIndex = 11 and length = 5. This incorrectly selects only the letters "æble".
How can I avoid this problem while still extracting the text with ligatures?

This is a made-up example, but I've attached a PDF with ligatures. Attempting to highlight the word "ægtefæller" in the first line reproduces the issue.
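
One possible workaround, sketched under assumptions rather than confirmed by the SDK: if the selection methods count characters in the expanded (non-ligature) text, a range found in the ligature text can be converted before it is passed to SelectChars. The class name LigatureIndexMap, the method ToExpandedRange and the expansion table below are hypothetical; which characters the SDK actually expands is not known from this thread, so the table would need to be adjusted to match.

```csharp
using System;
using System.Collections.Generic;

public static class LigatureIndexMap
{
    // Hypothetical expansion table: an assumption, not taken from the SDK.
    public static readonly Dictionary<char, string> Expansions = new Dictionary<char, string>
    {
        { '\u00E6', "ae" },  // æ
        { '\u00C6', "AE" },  // Æ
        { '\u0153', "oe" },  // œ
        { '\u0152', "OE" },  // Œ
        { '\uFB00', "ff" },  // ﬀ
        { '\uFB01', "fi" },  // ﬁ
        { '\uFB02', "fl" },  // ﬂ
        { '\uFB03', "ffi" }, // ﬃ
        { '\uFB04', "ffl" }  // ﬄ
    };

    // Number of expanded characters that one character of the ligature text stands for.
    private static int Width(char c)
    {
        string expanded;
        return Expansions.TryGetValue(c, out expanded) ? expanded.Length : 1;
    }

    // Converts a (start, length) range in the ligature text into the
    // equivalent range in the expanded text.
    public static (int Start, int Length) ToExpandedRange(string ligatureText, int start, int length)
    {
        if (ligatureText == null)
            throw new ArgumentNullException(nameof(ligatureText));
        if (start < 0 || length < 0 || start + length > ligatureText.Length)
            throw new ArgumentOutOfRangeException(nameof(start));

        // Every ligature before the range shifts the start to the right.
        int expandedStart = 0;
        for (int i = 0; i < start; i++)
            expandedStart += Width(ligatureText[i]);

        // Every ligature inside the range lengthens it.
        int expandedLength = 0;
        for (int i = start; i < start + length; i++)
            expandedLength += Width(ligatureText[i]);

        return (expandedStart, expandedLength);
    }
}
```

For the example above, ToExpandedRange("Jeg spiser æbler", 11, 5) yields a start of 11 and a length of 6, which are the values that worked with the non-ligature text. The converted pair would then be passed to SelectChars in place of the original charIndex and length. This only holds if the two extractions differ solely in how ligatures are represented.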

Post by Sasha - Tracker Dev Team » Thu Mar 26, 2020 8:14 am

Hello Woodgnome,

This setting should give you an idea of why it behaves this way:
(Attached image: image.png)
Cheers,
Alex

Post by Woodgnome » Thu Mar 26, 2020 9:18 am

As far as I can tell, this setting only affects copying text from the viewer, not programmatic text extraction or setting a selection.

See https://www.youtube.com/watch?v=UIgeUEiEimg
