ContentItem.Text_GetData[2|SA]()


Post by JesseH » Wed Nov 09, 2016 7:26 pm

Hi,

I'm trying to extract text from a content item using the Text_GetData methods. It works for some text, but returns garbled data for other text.

For instance, in the attached PDF there are two lines of text. When extracting the text using PageText, it reads like this:

Code:

Footer around this locationThis is a second line.

But when I either construct the string byte-by-byte using Text_GetDataSA() or just get the string using Text_GetData2().GetString(), the first line reads:

Code:

"\0'\0P\0P\0U\0F\0S\0\0B\0S\0P\0V\0O\0E\0\0U\0I\0J\0T\0\0M\0P\0D\0B\0U\0J\0P\0O"
Is there a flag or encoder/decoder that needs to be set here?

The (C#) code I'm currently working on is a simple analysis tool that uses CoreAPI to extract various info from the PDF document. Here is the code I'm using to (try to) get the text from the ContentItem interface:

Code:

// Requires: using System; using System.Linq;
public static string GetContentText(IPXC_ContentItem pSrcContent)
{
    var sRet = "";
    Array oByteBuffer = null;

    // Simpler code to get the text, but has same results as original code.
    sRet = pSrcContent.Text_GetData2().GetString();
    var oFlags = pSrcContent.Text_GetData2().GetStringFlags();

    // Original workaround for the missing Text_GetText method.
    // Note: these bytes are appended to the string already read above.
    pSrcContent.Text_GetDataSA(out oByteBuffer);
    var oByteList = oByteBuffer.OfType<byte>().ToList();
    for (int i = 0; i < oByteList.Count; i++)
    {
        // Each raw byte is widened to a char without any decoding.
        sRet = sRet + (char)oByteList[i];
    }

    return sRet;
}

In case it's at all helpful, here is the output of my tool. The first run lists all the content items in the file, and the second run just dumps the PageText from each page.

Code:

C:\Temp\PDF Editing Sandbox\Active>PDFAnalysis.exe -c
Extracting ContentItems from %d documents
##### Processing file [C:\Temp\PDF Editing Sandbox\Active\ImageTest.pdf]#####################################################################
-------------------- Page: [1] ------------------------------------------------------------------------

--- Content Item 0 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 1 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 2 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 3 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 4 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 5 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 6 -----
  Type: [CIT_Text]
  BBox: [45.327, 72.060, 205.524, 61.515]
  Value: [ ' P P U F S ☺ B S P V O E ☺ U I J T ☺ M P D B U J P O]
--- Content Item 7 -----
  Type: [CIT_Text]
  BBox: [30.927, 71.916, 172.752, 47.115]
  Value: [ 5 I J T ☺ J T ☺ B ☺ T F D P O E ☺ M J O F ☼]
--- Content Item 8 -----
  Type: [CIT_Image]
  BBox: [730.488, 428.972, 531.932, 762.888]
 Height: [135]
 Width:  [429]


-------------------- Page: [2] ------------------------------------------------------------------------

--- Content Item 0 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 1 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 2 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 3 -----
  Type: [CIT_BeginContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 4 -----
  Type: [CIT_XForm]
  BBox: [0.000, 0.000, 0.000, 0.000]
--- Content Item 5 -----
  Type: [CIT_EndContainer]
  BBox: [0.000, 0.000, 0.000, 0.000]



C:\Temp\PDF Editing Sandbox\Active>PDFAnalysis.exe -t
Extracting Text from %d documents
##### Processing file [C:\Temp\PDF Editing Sandbox\Active\ImageTest.pdf]#####################################################################
--------------------[0]------------------------------------------------------------------------

Footer around this locationThis is a second line.

--------------------[1]------------------------------------------------------------------------





C:\Temp\PDF Editing Sandbox\Active>
Attachments
ImageTest.pdf
(56.48 KiB)

Re: ContentItem.Text_GetData[2|SA]()

Post by Lzcat - Tracker Supp » Thu Nov 10, 2016 6:43 am

Hi JesseH.
Actually, text in PDF files is stored as multibyte strings, and how to interpret that data depends on which font is used. You are trying to get the text using low-level functions and are therefore receiving raw data, which must then be translated to real Unicode characters. For now you have two options:
1. Use higher-level functions to deal with text on the page (or in the content): the IPXC_Page::GetText (IPXC_Content::GetText) methods.
2. Read section 9 "Text" of the PDF specification, especially subsection 9.10 "Extraction of Text Content", and learn how to interpret the raw data.
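
To give a rough idea of what the second option involves, here is a minimal C# sketch of the translation step. It assumes the raw bytes are 2-byte big-endian character codes (as the dump in the first post suggests) and that a code-to-Unicode table is available. In a real document that table has to come from the font itself, for example its ToUnicode CMap or encoding. The names RawTextDecoder, SplitCodes, Decode and BuildSampleTable are hypothetical helpers, not part of the Core API, and the fixed 0x1F offset in BuildSampleTable is only an observation about the bytes posted above for ImageTest.pdf, not a general rule.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RawTextDecoder
{
    // Splits raw text bytes into 2-byte big-endian character codes.
    // Assumption: the font uses 2-byte codes; other fonts may use 1-byte codes.
    public static List<int> SplitCodes(byte[] raw)
    {
        if (raw.Length % 2 != 0)
            throw new ArgumentException("Expected an even number of bytes.", nameof(raw));

        var codes = new List<int>(raw.Length / 2);
        for (int i = 0; i < raw.Length; i += 2)
            codes.Add((raw[i] << 8) | raw[i + 1]);
        return codes;
    }

    // Translates codes to text through a caller-supplied table, which stands in
    // for the mapping defined by the font. Unknown codes become U+FFFD.
    public static string Decode(byte[] raw, IReadOnlyDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (int code in SplitCodes(raw))
            sb.Append(toUnicode.TryGetValue(code, out var text) ? text : "\uFFFD");
        return sb.ToString();
    }

    // Illustrative table for the posted sample only: each observed code plus 0x1F
    // equals the Unicode value (0x01 maps to space, 0x27 maps to 'F', and so on).
    public static Dictionary<int, string> BuildSampleTable()
    {
        var table = new Dictionary<int, string>();
        for (int code = 0x01; code <= 0x5F; code++)
            table[code] = ((char)(code + 0x1F)).ToString();
        return table;
    }
}
```

With that sample table, the first six codes from the dump (0x0027, 0x0050, 0x0050, 0x0055, 0x0046, 0x0053) decode to "Footer". For any other font the table would have to be built from the font's own data, which is why the first option is usually the easier route.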
HTH.
Victor
Tracker Software
Project manager
