Cyrillic text not searchable with some fonts in generated PDF

Post by jacekP »

Hello,

We're using pxclib40.dll (4.0.201.0) to generate PDF documents. Some time ago we had a problem with Cyrillic text not being searchable in the generated PDF. I found a solution on this forum, calling
PXC_SetEmbeddingOptions(mydoc, TRUE, TRUE, TRUE); and it did help.
Unfortunately, I have now received a report that it doesn't work with some fonts (e.g. the GOST_A font).
Is there any additional setting that I should use?

I've attached a generated PDF with two texts exported with exactly the same settings (one with the MS Arial Unicode font, the other with the GOST Type A font), along with the GOST_A font file:
CyrrylicText_GOST_A_font.7z
This is roughly what our code responsible for generating text looks like (we have a C# wrapper for the pxclib40 library):

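// Set the embedding options for the whole document.
int eCode = PDFWrapper.PXC_SetEmbeddingOptions(this.pdfPtr, true, true, true);
if (PdfHelper.IS_DS_FAILED(eCode))
{
    return;
}

// Force embedding of this particular font family.
eCode = PDFWrapper.PXC_SetFontEmbeddW(this.pdfPtr, font.TTFFileKey.FamilyName, PDFWrapper.PXC_EmbeddType.EmbeddType_ForceEmbedd);
if (PdfHelper.IS_DS_FAILED(eCode))
{
    return;
}

// Register the font with the document and obtain its ID.
eCode = PDFWrapper.PXC_AddFontW(this.pdfPtr, tm.tmWeight, font.TTFFileKey.IsItalic, font.TTFFileKey.FamilyName, out fntID);
if (PdfHelper.IS_DS_FAILED(eCode))
{
    return;
}

// Read the current text options, then switch them to the new font.
// Assumption: the wrapper's PXC_GetTextOptions returns an error code like the calls above.
PDFWrapper.PXC_TextOptions newTextOpt;
eCode = PDFWrapper.PXC_GetTextOptions(this.pdfPage, out newTextOpt);
if (PdfHelper.IS_DS_FAILED(eCode))
{
    return;
}
newTextOpt.fontID = fntID;
newTextOpt.nTextPosition = PDFWrapper.PXC_TextPosition.TextPosition_Baseline;
newTextOpt.fontSize = PdfHelper.MM2PsP(lenX);
PDFWrapper.PXC_SetTextOptions(this.pdfPage, ref newTextOpt);

// Draw the text at the given origin (-1: use the whole string).
PDFWrapper.PXC_TextOutW(this.pdfPage, ref origin, charTxt, -1);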

Post by TrackerSupp-Daniel »

Hello, jacekP

Thank you for the report. I am afraid that this topic goes beyond my personal knowledge, but I have asked our Dev team to take a look. Someone should post here today or tomorrow to help with this.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Post by Ivan - Tracker Software »

I'm afraid that the problem is with the font, not with the library.

For some reason, this font maps two ranges of codes to Cyrillic characters, as shown below:
[Screenshot: "GOST type A" characters map]
A non-Unicode range was used when the text was rendered.
For example, when you copy text from your PDF file and paste it into Notepad with the Arial font selected, you will see this:
[Screenshot: font "Arial" used]
But once you change the font in Notepad to "GOST type A", you will see readable text:
[Screenshot: font "GOST type A" used]
At the moment I'm not ready to answer why the incorrect code range was used, and I'm afraid there is no simple solution for that.
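
One way to confirm which code range ended up in the PDF's text layer is to dump the code points of the text copied from it. Below is a minimal C# sketch; the class and method names are hypothetical and are not part of pxclib40. Characters from U+0400 to U+04FF belong to the Cyrillic block, so anything outside it (such as U+00C1) suggests the text was stored using the other code range.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical diagnostic helper, not part of the PDF-Tools SDK.
public static class CopiedTextCheck
{
    // True if the UTF-16 code unit lies in the basic Cyrillic block (U+0400..U+04FF).
    public static bool IsCyrillic(char c) => c >= '\u0400' && c <= '\u04FF';

    // Returns one line per character: code point, the character, and the block check.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        foreach (char c in text)
        {
            string block = IsCyrillic(c) ? "Cyrillic" : "not Cyrillic";
            lines.Add($"U+{(int)c:X4} '{c}' {block}");
        }
        return lines;
    }
}
```

Paste the text copied from the PDF into a string, pass it to CopiedTextCheck.Describe, and print the returned lines. Searchable Cyrillic text should report "Cyrillic" for every letter.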
Tracker Software (Project Director)

Post by Lzcat - Tracker Supp »

Hi, jacekP
The problem is that this font has an incorrect character mapping. As you can see in the screenshot below, the font has 3 cmap subtables with different mappings. Windows and most programs use subtable 3.1 when possible, so we look at it first. This table has an unusual mapping: in many cases two characters are mapped to the same glyph. For example, the Cyrillic letter 'Б' (U+0411) is mapped to glyph 120 (0x78), but the character Aacute 'Á' (U+00C1) is mapped to the same glyph. So if you type either of them, you will see the Cyrillic letter 'Б' in both cases.
To display an embedded font properly, a program only has to embed the correct glyphs. To make the text searchable, it may also embed additional information that maps the embedded glyphs to the correct Unicode characters. Normally this is not a problem, but with this font some glyphs are mapped from two Unicode characters, and the program has to choose one of them. In such cases our software uses the first mapping, which here is Aacute 'Á' (U+00C1). I'm afraid this cannot be changed, because there are other fonts where two or more characters are mapped to the same glyph, and in many cases selecting the first mapping is correct.
Even if we try to find a workaround for your font, there is another problem: all the other cmap subtables, and even the post table, state that glyph 120 (0x78) corresponds to the character Aacute 'Á' (U+00C1). So yes, this font has three tables with an incorrect mapping, and one tricky table with a double mapping, which allows the 'correct' mapping to work too.
image.png
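
The effect of choosing the first mapping can be illustrated with a small C# sketch. This is not the library's actual code: the names are hypothetical, the cmap is reduced to the two entries mentioned above, and it assumes that "first" means the lowest code point in the subtable.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a "first mapping wins" reverse lookup.
public static class ReverseCmapSketch
{
    // Toy cmap (Unicode code point -> glyph index) holding only the two
    // entries quoted in this thread; a real subtable has many more.
    public static SortedDictionary<int, int> SampleCmap() =>
        new SortedDictionary<int, int>
        {
            { 0x00C1, 120 }, // Aacute
            { 0x0411, 120 }, // Cyrillic capital Be
        };

    // Builds glyph index -> Unicode code point, keeping the first code point
    // encountered for each glyph. SortedDictionary enumerates in ascending
    // key order, so the lowest code point wins.
    public static Dictionary<int, int> BuildGlyphToUnicode(SortedDictionary<int, int> cmap)
    {
        var glyphToUnicode = new Dictionary<int, int>();
        foreach (KeyValuePair<int, int> entry in cmap)
        {
            if (!glyphToUnicode.ContainsKey(entry.Value))
            {
                glyphToUnicode[entry.Value] = entry.Key;
            }
        }
        return glyphToUnicode;
    }
}
```

With the sample data, glyph 120 resolves to U+00C1 rather than U+0411, which matches the Latin characters seen when the copied text is pasted with the Arial font selected.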
Kind regards,
Lzcat - Tracker Supp
Victor
Tracker Software
Project manager
