Removing Text Elements

This Forum is for the use of Software Developers requiring help and assistance for Tracker Software's PDF-Tools SDK of Library DLL functions(only) - Please use the PDF-XChange Drivers API SDK Forum for assistance with all PDF Print Driver related topics or PDF-XChange Viewer SDK if appropriate.


jeffp
User
Posts: 914
Joined: Wed Sep 30, 2009 6:53 pm

Removing Text Elements

Post by jeffp »

I'm looking through your Text Extraction functions to see if I can remove a Text Element.

I need the ability to remove or delete a text element from a PDF document. Is this possible?

Here's my situation: I'm creating a PDF document using OCR Text. I place the OCR text onto a page using PXC_TextOutA. I then merge this PDF using PXCp_PlaceContents with the original image PDF thus producing a searchable PDF. HOWEVER, my users often want to Re-OCR the document, in which case I want to remove all the text elements that I placed in the document the first time around, and then start over.

Thanks.
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: Removing Text Elements

Post by Lzcat - Tracker Supp »

Hi.
Unfortunately, none of our libraries can remove part of a page's content (at least not without additional coding). We may add such functionality in V5 or a later version (it is a complex task, much more complex than text extraction).
There are several possible workarounds:
1. If your original PDF contains only images, you can extract the images with their positions using xcpro40 and then create a new PDF file from these images using pxclib40.

Options 2 and 3 require knowledge of the Low-Level API (and thus of the PDF Specification).
2. It is possible to extract all page content in "native" PDF format, modify it, and replace the original. You may also modify the Resources dictionary to remove unused objects (removing an embedded font may seriously reduce file size).
3. Since you are using PXCp_PlaceContents, we may assume that you only need to remove the last content stream from the Contents array (and perhaps also remove the XObject used by that content from Resources).

HTH
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.

Re: Removing Text Elements

Post by jeffp »

I'd be in favor of your adding Text Element removal in a later version!

As to the workarounds, I'm somewhat familiar with your low level functions. Could you give me a simple code snippet showing which functions to use for #2, to get me started?

Thanks.

Re: Removing Text Elements

Post by Lzcat - Tracker Supp »

You may start from PXCp_llGetPageByIndex - it will give you a handle to the page object.
Then you should find two items in the page dictionary - Contents and Resources.
Contents is a single stream or an array of streams where the actual page content is stored.
Resources is a resource dictionary that lists the named objects used on the page - fonts, XObjects (see below), etc. For example, page content can contain a record like /F1 10 Tf. It means that subsequent text will be shown using the font listed under the name F1 in the Font subdictionary of Resources, at a size of 10 pt (the actual size may differ depending on the CTM and the text matrix).
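To make the /F1 10 Tf example concrete, here is a minimal sketch in Python (not SDK code) that scans an already decoded content stream for Tf operators and reports the font resource name and size of each. It assumes the stream bytes are uncompressed, and it uses a simple pattern match rather than a full content-stream parser, so text inside string operands could produce false matches. The function name font_selections and the sample stream are made up for illustration.

```python
import re

# "/Name size Tf": a PDF name token, a number, then the Tf operator.
# The trailing lookahead rejects longer tokens that merely start with "Tf".
TF_PATTERN = re.compile(
    rb"/([^\s/\[\]<>(){}%]+)\s+([-+]?\d*\.?\d+)\s+Tf(?![^\s/\[\]<>(){}%])"
)


def font_selections(content: bytes):
    """Return (font_resource_name, size) for each Tf operator found."""
    return [
        (m.group(1).decode("latin-1"), float(m.group(2)))
        for m in TF_PATTERN.finditer(content)
    ]


# Hypothetical content stream with two font selections.
sample = b"BT /F1 10 Tf 72 700 Td (Hello) Tj ET BT /F2 8.5 Tf (x) Tj ET"
print(font_selections(sample))
```

Each name returned this way (F1, F2) is the key to look up in the Font subdictionary of the page's Resources.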
Another important content operator is Do - it places an Image or XForm object on a page. The first is simply a raster image, but the second is a container for the same kind of content as a page, so it may contain images, drawings, text - anything. You can read more about this in the PDF Reference. For example, in PDF Reference 1.7 you should read sections 3.7 Content Streams and Resources and 4.7 External Objects, and take a look at sections 3.6 Document Structure and 4 Graphics.
For your specific files you can seriously simplify your code. PXCp_PlaceContents places content over the page as an XForm object, so it adds one more content stream to the page, containing a single Do operator (preceded by a cm operator to set size and position), and one record to the XObject subdictionary of Resources. To effectively remove all added content you should remove the last content stream from the Contents array (this prevents the added content from being drawn, but does not actually remove it from the file!) and the corresponding XObject from the Resources dictionary (this is needed to reduce file size - just remove the reference to the object and it will be skipped during saving).
Unfortunately the name of the XObject may vary, so you will need to read and analyse the last stream (the one you are removing from Contents). Find the Do operator, which will look like <XObject_Name> Do, and remember <XObject_Name> - this is the required name. Then simply clear the element with this name from the XObject subdictionary of Resources. But if you have other files (or are not sure that the last modification was made using PXCp_PlaceContents) - you will need a more general (and more complex) algorithm.
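The logic of this simplified removal can be sketched in Python on a plain in-memory model. This is not SDK code: the nested dict is only a hypothetical stand-in for the page dictionary, the streams are assumed to be already decoded bytes, and the names remove_placed_content, Im0 and Xf7 are invented for the example. With the real Low-Level API the same steps would be carried out through the dictionary, array and stream functions.

```python
import re

# "/Name Do": a PDF name token followed by the Do operator.
DO_PATTERN = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do(?![^\s/\[\]<>(){}%])")


def remove_placed_content(page: dict) -> str:
    """Drop the last content stream and the XObject it draws; return its name."""
    contents = page["Contents"]
    if not isinstance(contents, list) or len(contents) < 2:
        raise ValueError("expected Contents to be an array with an appended stream")
    # Read the last stream first: it holds the name of the XObject to remove.
    match = DO_PATTERN.search(contents[-1])
    if match is None:
        raise ValueError("last content stream has no Do operator")
    name = match.group(1).decode("latin-1")
    xobjects = page["Resources"]["XObject"]
    if name not in xobjects:
        raise KeyError(name)
    # Only modify the page once every check has passed.
    contents.pop()        # the added content is no longer drawn
    del xobjects[name]    # the form is no longer referenced from the page
    return name


# Hypothetical page: one original stream plus one appended by a merge step.
page = {
    "Contents": [b"q 612 0 0 792 0 0 cm /Im0 Do Q", b"q 1 0 0 1 0 0 cm /Xf7 Do Q"],
    "Resources": {"XObject": {"Im0": "scanned image", "Xf7": "OCR text form"}},
}
print(remove_placed_content(page))
```

The checks run before anything is changed, so a page that does not match the expected layout (for example, one whose last modification was not made with PXCp_PlaceContents) is left untouched.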
HTH.
Victor
Tracker Software
Project manager


Re: Removing Text Elements

Post by jeffp »

I'm afraid this may be too low level for me. I got as far as the code below.

What would it take for you to add a new Text Extraction function that deletes a text element by index? Something like this:

HRESULT PXCp_ET_DeleteElement(PDFDocument pDocument, DWORD index)

Also, is there a way to change the RenderingMode value of an existing Text Element?


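procedure TPDFLibEx.Test;
var
  i: Integer;
  hObject: HPDFOBJECT;
  hDict: HPDFDICTIONARY;
  ACount: DWORD;
  hKeyName: HPDFSTRING;
  hVariant: HPDFVARIANT;
begin
  // Note: return values are not checked anywhere in this test code.
  // Get the object for the first page (index 0) and its dictionary.
  PXCp_llGetPageByIndex(FDocID, 0, @hObject);
  PXCp_ObjectGetDictionary(hObject, @hDict);
  // Walk every key/value pair in the page dictionary.
  PXCp_DictionaryGetCount(hDict, @ACount);
  // Cast so the loop bound is signed and an empty dictionary is skipped.
  for i := 0 to Integer(ACount) - 1 do
  begin
    // A new string is created on each pass and never released here.
    hKeyName := PXCp_StringCreate;
    PXCp_DictionaryGetPair(hDict, i, hKeyName, @hVariant);
    // ?? (unfinished: the Contents and Resources entries still need to be located)
  end;
end;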

Re: Removing Text Elements

Post by Lzcat - Tracker Supp »

Sorry, but it is impossible to add such functions in the current implementation. Reading a PDF is much easier than modifying it (and creating a new one is much easier too).

Regarding your code - you don't need to enumerate the entire page dictionary.

As a first step, I recommend taking a look at the PDF basics - syntax and file structure.

Deep knowledge is not needed, but the basics will help you understand what to do with the Low-Level API.
Victor
Tracker Software
Project manager
