Removing Text Elements

Post by **jeffp** » Thu Nov 12, 2009 8:09 pm

I'm looking through your Text Extraction functions to see if I can remove a Text Element.

I need the ability to remove or delete a text element from a PDF document. Is this possible?

Here's my situation: I'm creating a PDF document using OCR Text. I place the OCR text onto a page using PXC_TextOutA. I then merge this PDF using PXCp_PlaceContents with the original image PDF thus producing a searchable PDF. HOWEVER, my users often want to Re-OCR the document, in which case I want to remove all the text elements that I placed in the document the first time around, and then start over.

Thanks.

Fri Nov 13, 2009 7:41 am

Hi.
Unfortunately, none of our libraries can remove part of the page content (at least not without additional coding). We may add such functionality in V5 or a later version (it is a complex task, much more complex than text extraction).
There are several possible workarounds:
1. If your original PDF contains only images, you can extract the images with their positions using xcpro40 and then create a new PDF file from these images using pxclib40.

Ways 2 and 3 require knowledge of the Low-Level API (and thus of the PDF Specification).
2. It is possible to extract all of the page content in "native" PDF format, modify it, and replace the original. You may also modify the Resources dictionary to remove unused objects (removing an embedded font may seriously reduce the file size).
3. Since you are using PXCp_PlaceContents, we may assume that you only need to remove the last content stream from the Contents array (and perhaps also remove the XObject used by that content from Resources).

HTH

Post by **jeffp** » Fri Nov 13, 2009 4:38 pm

I'd be in favor of your adding Text Element removal in a later version!

As to the workarounds, I'm somewhat familiar with your low level functions. Is there any way you can give me a simple code snippet showing which functions to use for #2 to get me started?

Thanks.

Sat Nov 14, 2009 9:20 am

You may start with PXCp_llGetPageByIndex - it will give you the handle of the page object.
Then you should find two items in the page dictionary - Contents and Resources.
Contents is a single stream or an array of streams where the actual page content is stored.
Resources is a resource dictionary - it lists the named objects used on the page: fonts, XObjects (see below), etc. For example, the page content can contain a record like /F1 10 Tf. It means that the following text will be shown using the font listed under the name F1 in the Font subdictionary of Resources, at a size of 10 pt (the actual size may differ depending on the CTM and the text matrix).
Another important content operator is Do - it places an Image or an XForm object on the page. The first is simply a raster image, but the second is a container for the same kind of content as a page, so it may contain images, drawings, text - anything. You may read more about this in the PDF Reference. For example, in PDF Reference 1.7 you should read sections 3.7 Content Streams and Resources and 4.7 External Objects, and take a look at sections 3.6 Document Structure and 4 Graphics.
For your specific files you can simplify your code considerably. PXCp_PlaceContents places content over the page as an XForm object, so it adds one more content stream to the page, containing a single Do operator (preceded by a cm operator to set size and position), and one record to the Resources XObject subdictionary. To effectively remove all the added content, you should remove the last content stream from the streams array (this prevents the added content from being drawn, but does not actually remove it from the file!) and the corresponding XObject from the Resources dictionary (this is needed to reduce the file size - just remove the reference to the object and it will be skipped during saving). Unfortunately the name of the XObject may vary, so you will need to read and analyse the content of the last stream (the one you are removing from Contents). Find the Do operator, which will look like <XObject_Name> Do, and remember <XObject_Name> - this is the required name. Then simply clear the element with this name from the Resources XObject subdictionary. But if you have other files (or are not sure that the last modification was made using PXCp_PlaceContents), you will need a more general (and more complex) algorithm.
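The "find the Do operator" step could be sketched in Delphi roughly as follows. This is only an illustration, not library code: FindLastDoName and IsPdfSpace are hypothetical helper names, the XObject name Fm0 in the sample is made up, and the sketch assumes you have already read and decoded the last content stream into a string. It splits on whitespace only, which should be enough for a stream that contains nothing but cm and Do, but it would not cope with string literals or escaped names in a general content stream.

```pascal
program FindDoName;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// True for the white-space characters defined by the PDF syntax.
function IsPdfSpace(C: Char): Boolean;
begin
  case C of
    #0, #9, #10, #12, #13, #32: Result := True;
  else
    Result := False;
  end;
end;

// Hypothetical helper: returns the name (without the leading slash) that
// precedes the last "Do" operator in an already decoded content stream,
// or '' if no "/Name Do" pair is found.
function FindLastDoName(const Content: string): string;
var
  Tokens: TStringList;
  Cur: string;
  i: Integer;
begin
  Result := '';
  Tokens := TStringList.Create;
  try
    // Split the stream into whitespace-separated tokens.
    Cur := '';
    for i := 1 to Length(Content) do
      if IsPdfSpace(Content[i]) then
      begin
        if Cur <> '' then
          Tokens.Add(Cur);
        Cur := '';
      end
      else
        Cur := Cur + Content[i];
    if Cur <> '' then
      Tokens.Add(Cur);

    // Scan backwards for "Do" and take the name operand just before it.
    for i := Tokens.Count - 1 downto 1 do
      if (Tokens[i] = 'Do') and (Length(Tokens[i - 1]) > 1) and
         (Tokens[i - 1][1] = '/') then
      begin
        Result := Copy(Tokens[i - 1], 2, MaxInt);
        Exit;
      end;
  finally
    Tokens.Free;
  end;
end;

begin
  // Made-up sample of the shape described above: cm followed by Do.
  WriteLn(FindLastDoName('q 1 0 0 1 0 0 cm /Fm0 Do Q')); // prints Fm0
end.
```

The returned name is the key you would then clear from the Resources XObject subdictionary. An empty result means the stream does not look like the output of PXCp_PlaceContents, in which case it is safer to leave the page untouched.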
HTH.

Post by **jeffp** » Wed Nov 18, 2009 6:46 am

I'm afraid this may be too low level for me. I got as far as the code below.

What would it take for you to include a new Text Extraction function that deletes a text element by index? Something like this:

HRESULT PXCp_ET_DeleteElement(PDFDocument pDocument, DWORD index)

Also, is there a way to change the RenderingMode value of an existing Text Element?

Code:

procedure TPDFLibEx.Test;
var
  i: Integer;
  hObject: HPDFOBJECT;
  hDict: HPDFDICTIONARY;
  ACount: DWORD;
  hKeyName: HPDFSTRING;
  hVariant: HPDFVARIANT;
begin
  // Get the first page object and its dictionary
  PXCp_llGetPageByIndex(FDocID, 0, @hObject);
  PXCp_ObjectGetDictionary(hObject, @hDict);
  PXCp_DictionaryGetCount(hDict, @ACount);
  // Walk every key/value pair of the page dictionary.
  // Cast to Integer so that ACount = 0 yields an empty loop.
  for i := 0 to Integer(ACount) - 1 do
  begin
    hKeyName := PXCp_StringCreate;
    PXCp_DictionaryGetPair(hDict, i, hKeyName, @hVariant);
    // ??
  end;
end;

Wed Nov 18, 2009 9:56 am

Sorry, but such functions cannot be implemented in the current architecture. Reading a PDF is much easier than modifying it (and creating a new one is also much easier).

Regarding your code - you don't need to enumerate the entire page dictionary.

As a first step, I recommend taking a look at the PDF basics - syntax and file structure.

Deep knowledge is not needed, but the basics will help you understand what to do with the Low-Level API.
