OCR only Image on PDF and Insert OCR'ed text back to original co-ordinates on original PDF

A forum for questions or concerns related to the PDF-XChange Core API SDK

Rajas
User
Posts: 2
Joined: Fri Sep 23, 2022 4:40 am

Post by Rajas »

Hello,

We want to OCR only the images in a PDF document and then insert the recognized text back into the original PDF at each image's original coordinates (position).

So far we have managed to parse the page content using IPXC_ContentItems and extract the images together with their positional data. We are now stuck on the next two steps: running OCR on only these images, and then inserting the recognized text back into the original PDF using that positional data.

Any help would be appreciated, thanks!
Vasyl-Tracker Dev Team
Site Admin
Posts: 2245
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Post by Vasyl-Tracker Dev Team »

Hi Rajas.

Sorry for the delayed answer. Here is code that uses the EditorSDK, its OCRPlugin.pvp plugin, and the related "op.document.OCRPages2" operation to do what you want:

Code: Select all

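using System;                          // Math
using System.Runtime.InteropServices;  // Marshal
using PDFXEdit;                        // lets the unqualified PXC_* / IPXC_* names below resolve

// Euclidean distance between two points.
double CalcDist(PXC_Point p1, PXC_Point p2)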
{
    double dx = p2.x - p1.x;
    double dy = p2.y - p1.y;
    return Math.Sqrt(dx * dx + dy * dy);
}

void OcrImagesOnPageRecursive(PDFXEdit.IPXC_Document doc, PDFXEdit.IPXC_ContentCreator conCreator, PDFXEdit.IPXC_Content pageCon, PXC_Matrix ctm, PDFXEdit.IPXC_Content con, ref bool bTextAdded)
{
    if (con == null)
        return;

    PDFXEdit.IPXC_ContentItems items = con.Items;
    uint itemsCnt = items.Count;
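    // pdfCtl is the Editor control object used in the FullDemo sample.
    PDFXEdit.IPXV_Inst inst = pdfCtl.Inst;
    // auxInst is not declared in this snippet; it is assumed to be the IAUX_Inst
    // obtained elsewhere, e.g. (PDFXEdit.IAUX_Inst)inst.GetExtension("AUX").
    PDFXEdit.IMathHelper mh = auxInst.MathHelper;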
    int ocrOpId = inst.Str2ID("op.document.OCRPages2");

    for (uint i = 0; i < itemsCnt; i++)
    {
        PDFXEdit.IPXC_ContentItem item = items[i];
        PDFXEdit.PXC_CIType itemType = item.Type;

        if ((PDFXEdit.PXC_CIType.CIT_Image == itemType) || (PDFXEdit.PXC_CIType.CIT_InlineImage == itemType))
        {
            PDFXEdit.PXC_Rect bbox = item.BBox;
            double bboxWidth = bbox.right - bbox.left;
            double bboxHeight = bbox.top - bbox.bottom;
            if (bboxWidth < 10 || bboxHeight < 10)
                continue; // skip too small images

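            // Item CTM first, then the parent's CTM (same order as in the XForm branch below).
            // At the top level ctm is the identity, so the order only matters inside XForms.
            PDFXEdit.PXC_Matrix image2page = mh.Matrix_Multiply(item.GetCTM(), ctm);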

            // Three corners of the unit image square; transformed below into page space.
            PXC_Point a1, b1, c1;

            a1.x = 0;
            a1.y = 0;
            b1.x = 0;
            b1.y = 1;
            c1.x = 1;
            c1.y = 0;

            mh.Point_Transform(image2page, ref a1);
            mh.Point_Transform(image2page, ref b1);
            mh.Point_Transform(image2page, ref c1);

            double imageHeight = CalcDist(b1, a1);
            double imageWidth = CalcDist(c1, a1);
            if (imageWidth < 10 || imageHeight < 10)
                continue; // skip too small images

            PDFXEdit.IIXC_Page image = item.Image_CreateIXCPage(true);
            if (image == null)
                continue;

            PDFXEdit.IOperation op = inst.CreateOp(ocrOpId);
            {
                op.Params.Root["Input"].v = doc;
                PDFXEdit.ICabNode opts = op.Params.Root["Options"];
                opts["Image"].v = image;
                opts["ImageViewWidth"].v = imageWidth;
                opts["ImageViewHeight"].v = imageHeight;
                opts["MultiThreaded"].v = false;
                opts["Languages"].v = "eng";
            }

            PDFXEdit.IPXC_Content text = null;
            try
            {
                op.Do(0);
                text = (PDFXEdit.IPXC_Content)op.Params.Root["Output"].v;
            }
            catch
            {
                // OCR failed for this image: text stays null and the image is skipped
            }

            Marshal.FinalReleaseComObject(image); // release the unused image to avoid overusing memory

            if (text != null)
            {
                if (!bTextAdded)
                {
                    bTextAdded = true;
                    conCreator.Attach(pageCon.Clone(false));
                    conCreator.ResetAllStatesToDefault();
                }

                PXC_Rect textBBox;
                textBBox.left = textBBox.bottom = 0;
                textBBox.right = imageWidth;
                textBBox.top = imageHeight;
                text.set_BBox(textBBox);

                IPXC_XForm xf = doc.CreateNewXForm(textBBox);
                xf.SetContent(text);
                xf.set_BBox(textBBox);

                PXC_Point a2, b2, c2;
                a2.x = 0;
                a2.y = 0;
                b2.x = 0;
                b2.y = imageHeight;
                c2.x = imageWidth;
                c2.y = 0;

                PXC_Matrix text2page = mh.Matrix_ParlToParl(a2, b2, c2, a1, b1, c1);

                conCreator.SaveState();
                conCreator.ConcatCS(text2page);
                conCreator.PlaceXForm(xf, "");
                conCreator.RestoreState();

                Marshal.FinalReleaseComObject(text); // release the unused text content to avoid overusing memory
            }
        }
        else if (PDFXEdit.PXC_CIType.CIT_XForm == itemType)
        {
            // go deep into the XForm-content
            PDFXEdit.PXC_Rect bbox = item.BBox;
            double bboxWidth = bbox.right - bbox.left;
            double bboxHeight = bbox.top - bbox.bottom;
            if (bboxWidth < 10 || bboxHeight < 10)
                continue; // skip too small XForms

            IPXC_XForm xf = doc.GetXFormByHandle(item.XForm_Handle);
            if (xf != null)
            {
                IPXC_Content xcon = xf.GetContent();
                if (xcon != null)
                {
                    PXC_Matrix m = mh.Matrix_Multiply(xf.get_Matrix(), item.GetCTM());
                    PXC_Matrix xctm = mh.Matrix_Multiply(m, ctm);
                    OcrImagesOnPageRecursive(doc, conCreator, pageCon, xctm, xcon, ref bTextAdded);
                }
            }
        }
    }
}

void OcrImagesOnPage(PDFXEdit.IPXC_Page page)
{
    PDFXEdit.IPXC_Document doc = page.Document;
    PDFXEdit.IPXC_Content con = page.GetContent(PXC_ContentAccessMode.CAccessMode_ReadOnly);
    if (con == null)
        return;

    PDFXEdit.IPXC_ContentItems items = con.Items;
    uint itemsCnt = items.Count;
    if (itemsCnt == 0)
        return;

    PDFXEdit.IPXC_ContentCreator conCreator = doc.CreateContentCreator();

    PXC_Matrix ctm;
    ctm.a = ctm.d = 1;
    ctm.b = ctm.c = ctm.e = ctm.f = 0;

    bool bTextAdded = false;
    OcrImagesOnPageRecursive(doc, conCreator, con, ctm, con, ref bTextAdded);

    if (bTextAdded)
    {
        PDFXEdit.IPXC_Content newCon = conCreator.Detach();
        page.PlaceContent(newCon, (uint)PXC_PlaceContentFlags.PlaceContent_Replace);
    }
}

================================

PXV_Inst inst;

...

inst.StartLoadingPlugins();
inst.AddPluginFromFile(@"<PluginsFolder>\OCRPlugin.pvp"); // verbatim string: "\O" is not a valid C# escape
inst.FinishLoadingPlugins();

...

uint pagesCnt = doc.Pages.Count;
for (uint i = 0; i < pagesCnt; i++)
   OcrImagesOnPage(doc.Pages[i]);
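
A note on the geometry used above, in case it helps. The code transforms three corners of the unit image square into page space, measures the displayed width and height from them, and then asks Matrix_ParlToParl for a matrix that places the text box onto the same corners. As far as I understand it (this is an assumption, not taken from the SDK documentation), that call returns the affine matrix mapping one parallelogram onto another. Below is a minimal sketch of that math only. Pt, Mtx, and MapTriple are hypothetical stand-ins written for this illustration; they are not SDK types or SDK code.

```csharp
using System;

// Stand-in point type for illustration only (not the SDK's PXC_Point).
public struct Pt
{
    public double X, Y;
    public Pt(double x, double y) { X = x; Y = y; }
}

// Stand-in PDF-style affine matrix: x' = A*x + C*y + E, y' = B*x + D*y + F.
public struct Mtx
{
    public double A, B, C, D, E, F;

    public Pt Apply(Pt p)
    {
        return new Pt(A * p.X + C * p.Y + E, B * p.X + D * p.Y + F);
    }
}

public static class AffineSketch
{
    // Euclidean distance, as in CalcDist above.
    public static double Dist(Pt p, Pt q)
    {
        double dx = q.X - p.X, dy = q.Y - p.Y;
        return Math.Sqrt(dx * dx + dy * dy);
    }

    // Hypothetical helper: returns the affine matrix that maps
    // s0 -> d0, s1 -> d1 and s2 -> d2.
    public static Mtx MapTriple(Pt s0, Pt s1, Pt s2, Pt d0, Pt d1, Pt d2)
    {
        // Edge vectors of the source parallelogram.
        double sux = s1.X - s0.X, suy = s1.Y - s0.Y;
        double svx = s2.X - s0.X, svy = s2.Y - s0.Y;
        // Edge vectors of the destination parallelogram.
        double dux = d1.X - d0.X, duy = d1.Y - d0.Y;
        double dvx = d2.X - d0.X, dvy = d2.Y - d0.Y;

        double det = sux * svy - svx * suy;
        if (Math.Abs(det) < 1e-12)
            throw new ArgumentException("Source points are collinear.");

        // Linear part = [dest edges] * inverse([source edges]).
        var m = new Mtx();
        m.A = (dux * svy - dvx * suy) / det;
        m.C = (dvx * sux - dux * svx) / det;
        m.B = (duy * svy - dvy * suy) / det;
        m.D = (dvy * sux - duy * svx) / det;
        // Translation chosen so that s0 lands exactly on d0.
        m.E = d0.X - (m.A * s0.X + m.C * s0.Y);
        m.F = d0.Y - (m.B * s0.X + m.D * s0.Y);
        return m;
    }
}
```

For example, if the image corners land at (50, 300), (50, 400), and (250, 300), the displayed size is 200 x 100, and mapping a 200 x 100 text box onto those corners gives a pure translation by (50, 300). For a rotated or skewed image the same three-point mapping also carries the rotation or skew, which is why the code uses it instead of just offsetting the text by the bounding box origin.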
HTH
Vasyl Yaremyn
Tracker Software Products
Project Developer

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.

Post by Rajas »

Hello Vasyl,

Thank you for your reply. I went through the code snippet and ran into an issue in OcrImagesOnPageRecursive() at the line PDFXEdit.IPXV_Inst inst = pdfCtl.Inst;, where the compiler reports that "pdfCtl" does not exist in the current context. Could you please let me know how it should be declared?

Also, for most of the functions we get errors about ambiguous references between PDFXCoreApi and PDFXEdit, as well as 'cannot convert from PDFXCoreApi to PDFXEdit'. Could you please let us know how to resolve this?

Regards,
Rajas

Post by Vasyl-Tracker Dev Team »

The EditorSDK already includes the whole CoreAPI SDK (as its PXC/PXS sublayer), so your app does not need two independent references to the two SDKs. A single reference to the EditorSDK is enough.

As for the pdfCtl object, please look at our FullDemo SDK example(s), where you will see how it is used. Alternatively, you can use the EditorSDK without the Editor's UI: properly initialize the PXV_Inst object (see the C# FullDemo project), then get the PXC_Inst via PXV_Inst::GetExtension("PXC") and open any document with it.