OCR: Output file does not contain all pages

HiThere44 · Tue Apr 06, 2021 1:42 pm

Hello,
I need help with the following request:

I want to apply OCR to a PDF document and save the result as a new PDF document.
The problem is that the output file contains only those pages of the original that had no text. Pages that already contain text are skipped by the OCR and are therefore also missing from the output file.

What I need is the following behavior:
- OCR is applied to a document
- Pages that already contain text are excluded from OCR
- The output file must contain all pages of the original, the only difference being that previously textless pages now contain text.

My settings look like this:

Code:

Private Sub SetOptionsNode(ByRef optionNode As ICabNode, ByVal argInfo As ArgumentInfo)
    optionNode("PagesRange.Type").v = "All"
    optionNode("ExtParams.Accuracy").v = argInfo.dpi
    optionNode("ExtParams.Language").v = argInfo.languages
    optionNode("ExtParams.AutoDeskew").v = argInfo.autoDeskew
    optionNode("OCRNoTextPagesOnly").v = argInfo.ocrNoTextPagesOnly
    optionNode("OutputType").v = 1
    optionNode("OutputDPI").v = 0
End Sub

I have already searched this forum for relevant posts, but have not found anything that helps me.
The difference from the other OCR posts is the OutputType setting. So far I have only found posts where OutputType is set to 0, but OutputType = 1 is needed to be able to skip pages with text. Accordingly, in my case "IOperation.Params.Root("Output").v" is written as the output file.

I hope you can help me.

YouTube · Wed Apr 07, 2021 6:23 am

Hello HiThere44,

We'll look into whether there is a setting you can use for this.
In the meantime, you can manually check which pages contain at least one text content item, and then insert those pages at the appropriate positions once recognition is done.

Cheers,
Alex

HiThere44 · Mon May 10, 2021 1:51 pm

Well, I got the problem solved and thought I'd post the solution here, in case someone else is facing the same problem:

Code:

Imports System.Collections.Generic
Imports System.Runtime.InteropServices

' PDFXEditFactory, Options and GetAppLogger are helpers that are not defined in this post.
Public Class MissingsOutputPagesInserter
    Private Shared ReadOnly _lock As New Object()

    Private ReadOnly _pdfx As PDFXEditFactory
    Private ReadOnly _input As IPXC_Document
    Private ReadOnly _output As IPXC_Document

    Private Sub New(input As IPXC_Document, output As IPXC_Document)
        _pdfx = PDFXEditFactory.GetInstance()
        _input = input
        _output = output
    End Sub

    Public Shared Function GetInstance(input As IPXC_Document, output As IPXC_Document) As MissingsOutputPagesInserter
        ' A cached singleton would keep the documents from the very first call,
        ' so every call returns an instance bound to the documents passed in.
        Return New MissingsOutputPagesInserter(input, output)
    End Function

    Public Sub CompareAndMergeToOutput()
        ' The shared lock keeps merges serialized across all instances.
        SyncLock _lock
            Try
                Dim textPages = GetInputPagesWithText()
                InsertMissingPagesFromInputToOutput(textPages)
            Catch comEx As COMException
                Dim message = _pdfx.auxInst.FormatHRESULT(comEx.HResult)
                GetAppLogger.Error(message, comEx)
            Catch ex As Exception
                GetAppLogger.Error(ex)
            End Try
        End SyncLock
    End Sub

    ' Returns the zero-based indices of all input pages that already contain text.
    Private Function GetInputPagesWithText() As List(Of UInteger)
        Dim result As New List(Of UInteger)
        If _input Is Nothing OrElse _output Is Nothing Then Return result

        Dim inputCount = _input.Pages.Count
        ' Nothing to do if the input is empty or no pages were skipped.
        If inputCount = 0 OrElse inputCount = _output.Pages.Count Then Return result

        Dim pagetextoptions = Options.CreatePageTextOptions(_pdfx)
        For pagenumber As UInteger = 0 To inputCount - CUInt(1)
            Dim page As IPXC_Page = _input.Pages.Item(pagenumber)
            ' ReleaseComObject throws on Nothing, so skip missing pages first.
            If page Is Nothing Then Continue For

            Dim pagetext As IPXC_PageText = page.GetText(pagetextoptions, False)
            Marshal.ReleaseComObject(page)
            If pagetext?.LinesCount > 0 Then
                result.Add(pagenumber)
            End If
        Next

        Return result
    End Function

    Private Sub InsertMissingPagesFromInputToOutput(pages As List(Of UInteger))
        Dim outputPages As IPXC_Pages = _output.Pages
        ' The indices are in ascending order, so inserting each page at its
        ' original index puts it back at its original position.
        For i As Integer = 0 To pages.Count - 1
            outputPages.InsertPagesFromDoc(_input, pages.Item(i), pages.Item(i), 1,
                                           CInt(PXC_InsertPagesFlags.IPF_Annots_Copy) Or CInt(PXC_InsertPagesFlags.IPF_Widgets_Copy))
        Next
        Marshal.ReleaseComObject(outputPages)
    End Sub

End Class

I determined the missing pages by identifying all pages of the input document that contain text. I then inserted these pages back into the output document at their original positions.
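
For anyone wondering why a single pass in ascending page order is enough, here is a minimal sketch of the same merge without the SDK, using plain lists of strings in place of PDF pages. The module name, the function name and the sample page labels are made up for illustration; only the insertion logic is meant to mirror the class above.

```vbnet
Imports System
Imports System.Collections.Generic

Module MergeSketch
    ' Re-inserts skipped pages into the OCR output at their original indices.
    ' skippedIndices holds zero-based indices into "original", sorted ascending.
    Public Function MergeSkippedPages(original As List(Of String),
                                      ocrOutput As List(Of String),
                                      skippedIndices As List(Of Integer)) As List(Of String)
        Dim merged As New List(Of String)(ocrOutput)
        For Each index As Integer In skippedIndices
            ' All pages that come before this one are already in place,
            ' so the original index is also the correct insert position.
            merged.Insert(index, original(index))
        Next
        Return merged
    End Function

    Sub Main()
        ' Hypothetical sample: pages 1, 3 and 4 already had text and were skipped.
        Dim original As New List(Of String) From {"scan-0", "text-1", "scan-2", "text-3", "text-4"}
        Dim ocrOutput As New List(Of String) From {"scan-0 (OCR)", "scan-2 (OCR)"}
        Dim skipped As New List(Of Integer) From {1, 3, 4}

        Dim merged = MergeSkippedPages(original, ocrOutput, skipped)
        ' Prints: scan-0 (OCR), text-1, scan-2 (OCR), text-3, text-4
        Console.WriteLine(String.Join(", ", merged))
    End Sub
End Module
```

The order matters: if the indices were processed in descending or random order, an insert position could point past the end of the shorter output list or land between the wrong neighbors.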

Maybe it will help someone.

YouTube · Tue May 11, 2021 3:20 pm