Merging original PDF with text-only OCR pdf

GeoffW · Tue Jul 24, 2018 6:55 am

As per the subject line, I've been trying to merge text-only pages created with a separate OCR process with the original document. The page sizes are identical between the two.

I found an example on here from 2012, but it uses PDF Tools v4, a totally different API.

I thought this would be simple with the Core API SDK: just grab the content from the text-only page, and:

Code:

ASrcPage.PlaceContent(textcontent, PlaceContent_After);

But this produces an invalid PDF (in PDF Editor the added text content does not show in the content view at all).

Is there an example somewhere on how to do this with Core API SDK?

Alex · Wed Jul 25, 2018 9:22 am

Hello GeoffW,

From what you describe, everything should be working OK. Please provide a fully working sample along with the resulting file so that we can investigate further.

Cheers,
Alex

GeoffW · Wed Jul 25, 2018 2:14 pm

Not sure what limits you may have on attachments to this forum, but here goes ... okay so I had to split it.

pdftest1.zip holds the project and sample pdf files needed to reproduce the problems. (I have removed the serial number so you need to insert your own in the source.)

pdftest2.zip holds two result files from running this project on my system. Why two result files? Well...

As part of producing this demonstration I found out that the problem is different between production and evaluation modes. Full details are in the _README_.txt file in pdftest1, but in short: in production mode (with a serial number) the PDF result is invalid and has no selectable text; in evaluation mode (with no serial number) the PDF is valid and has selectable text but the text is not fully consistent with the starting OCR text overlay (in Adobe it is particularly bad - all characters spaced).

While I was producing a demo, I also included the IPXC_Document double-delete problem from my other thread.

Alex · Fri Jul 27, 2018 2:10 pm

Hello GeoffW,

Your mistake is that you copied the content directly between the documents, and that's not allowed. The correct procedure is to insert the needed pages from the source document into the destination document, copy the content from those inserted pages onto the target pages, and then remove the inserted pages. Then everything should work correctly.
https://sdkhelp.pdf-xchange.com/vi ... gesFromDoc

Cheers,
Alex

GeoffW · Fri Jul 27, 2018 4:26 pm

I had no idea that I could not copy content between documents. I thought content should be content, and the worst that might happen would be that it would not fit on the page. I dare say there are probably good reasons for the limitation but it certainly wasn't obvious to me.

Anyway, you're quite right. When I do what you suggest everything works smoothly and consistently in both evaluation and production modes. Thanks very much.

In the hope that it helps others I have included the working code below. (This can replace the function in the demonstration project linked above.)

Alex, in this code I chose to insert all pages at once, do the content copy, and then delete the extra pages. I could have done things one page at a time. Do you know if it matters much which way it is done, in terms of performance and memory use for large PDFs?

Code:

procedure TPdfCoreApiTestMainForm.MergeTextWithOriginalBtnClick(
  Sender: TObject);
var
  origfn, textfn, resultfn: string;
  origdoc, textdoc: IPXC_Document;
  origpage, textpage: IPXC_Page;
  textcontent: IPXC_Content;
  AuxInst: IAUX_Inst;
  origpagebits: IBitSet;
  undodata: IPXC_UndoRedoData;
  i, j, origpc, textpc: DWORD;

  exceptionMask: TFPUExceptionMask; // bug in PDFX
begin
  origfn := BasePath + 'Sample_pages.pdf';
  textfn := BasePath + 'Sample_pages_text.pdf';
  resultfn := BasePath + 'Sample_pages_result.pdf';

  origdoc := PdfXInst.OpenDocumentFromFile(PChar(origfn), nil, nil, 0, 0);
  if not Assigned(origdoc) then
    raise Exception.Create('Failed to open original PDF document.');

  textdoc := PdfXInst.OpenDocumentFromFile(PChar(textfn), nil, nil, 0, 0);
  if not Assigned(textdoc) then
    raise Exception.Create('Failed to open text-only PDF document.');

  if origdoc.Pages.Get_Count(origpc) <> 0 then
    raise Exception.Create('Failed to get page count from original PDF document.');

  if textdoc.Pages.Get_Count(textpc) <> 0 then
    raise Exception.Create('Failed to get page count from text-only PDF document.');

  // We could check page counts here, or we could just let it catch in the loop.
  //if origpc <> textpc then raise Exception.Create('Page count mismatch.');

  // I think page-flags 0 will give us everything except bookmarks which seems
  // appropriate, although perhaps redundant in this case.
  origdoc.Pages.InsertPagesFromDoc(textdoc, origpc, 0, textpc, 0, nil);

  j := origpc; // index of text page in origdoc that should match i.
  for i := 0 to origpc - 1 do
  begin
    if origdoc.Pages.Get_Item(i, origpage) <> 0 then
      raise Exception.CreateFmt('Failed to load page %d from original document.', [i]);
    if origdoc.Pages.Get_Item(j, textpage) <> 0 then
      raise Exception.CreateFmt('Failed to load page %d from original document (text for page %d).', [j, i]);

    // I'm assuming a weak-clone will be sufficient for this
    if textpage.GetContent(CAccessMode_WeakClone, textcontent) <> 0 then
      raise Exception.CreateFmt('Failed to obtain content for page %d of original document (text for page %d).', [j, i]);
    if origpage.PlaceContent(textcontent, PlaceContent_After) <> 0 then
      raise Exception.CreateFmt('Failed to place text content on page %d of original document', [i]);

    Inc(j);
  end;

  // Everything seems to work without clearing these, but it seems like a bad
  // idea to leave live references to things that should soon be disappearing.
  textcontent := nil;
  origpage := nil;
  textpage := nil;

  // Don't forget to delete all those pages we inserted but don't really want.
  // There's no need to make this too easy, we're forced to build a bit map to
  // indicate which pages we want deleted.
  AuxInst := PdfXInst.GetExtension('AUX') as IAUX_Inst;
  if not Assigned(AuxInst) then
    raise Exception.Create('Failed to load PDF-XChange Auxiliary Extension.');
  origpagebits := AuxInst.CreateBitSet(origpc + textpc);
  origpagebits.Set_(origpc, textpc, true);
  undodata := nil; // needed to match "out" parameter even if not used.
  if origdoc.Pages.DeletePages(origpagebits, nil, undodata) <> 0 then
      raise Exception.CreateFmt('Failed to delete redundant text pages (%d+) from original document.', [origpc]);

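  // Workaround for the PDFX bug noted above: mask divide-by-zero and
  // invalid-operation FPU exceptions while saving, then restore the mask.
  exceptionMask := GetExceptionMask;
  SetExceptionMask(exceptionMask + [exZeroDivide, exInvalidOp]);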
  try
    origdoc.WriteToFile(PChar(resultfn), nil, 0);
  finally
    SetExceptionMask(exceptionMask);
  end;
  textdoc.Close(0);
  origdoc.Close(0);

end;
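
The index bookkeeping in the procedure above can be checked on its own, without the SDK. The following is only a minimal sketch under the same assumption the code makes: the text pages are appended directly after the original pages, so the text page for original page i sits at index origpc + i, and the pages to delete afterwards start at index origpc. `TextPageIndex` and the sample page counts are hypothetical, chosen for illustration; nothing here calls PDF-XChange.

```pascal
program MergeIndexPlan;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper (not part of any SDK): once the text pages have been
// appended after OrigCount original pages, the text page that belongs to
// original page I is found at index OrigCount + I.
function TextPageIndex(OrigCount, I: Integer): Integer;
begin
  Result := OrigCount + I;
end;

var
  OrigCount, TextCount, I: Integer;
begin
  // Sample values only; in the real procedure these come from Get_Count.
  OrigCount := 3;
  TextCount := 3;

  // The check that is commented out in the procedure above.
  if OrigCount <> TextCount then
  begin
    WriteLn('Page count mismatch.');
    Halt(1);
  end;

  // Pair each original page with its appended text page.
  for I := 0 to OrigCount - 1 do
    WriteLn(Format('original page %d <- text page %d',
      [I, TextPageIndex(OrigCount, I)]));

  // The appended pages form one contiguous range to delete afterwards.
  WriteLn(Format('delete %d pages starting at index %d',
    [TextCount, OrigCount]));
end.
```

Doing the page-count comparison before the insert step would presumably turn a mismatch into one clear error up front, rather than a failed page lookup partway through the loop.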

Alex · Sat Jul 28, 2018 5:38 am

Hello GeoffW,

Well, my colleague suggested copying all of the pages - thus your algorithm should be OK.

Cheers,
Alex

GeoffW · Sat Jul 28, 2018 6:02 am

Thanks. My guess is that it probably won't make a huge difference either way, but I suppose I should test some large PDFs before going into production.

