Unable to get much text from OCR

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

Moderators: TrackerSupp-Daniel, Tracker Support, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Tracker Supp-Stefan

whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Unable to get much text from OCR

Post by whoit »

Hi -

I'm using the sample OCR code provided with v7 of Pro SDK,
and when I test either of the two PDFs, the result is about 10% text recognition.
I used the defaults set in the sample app, but changed the DPI to 600.

I've tested the same files using a competitor's app and they get near 100% (I'm pretty certain they use your ocrtools.dll)

Can you review the PDFs, dwg, and resulting files and tell me why my results are so bad?

(I would prefer to email the files and not post them here - where can I send them?)

Thanks!
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Hello whoit,
We would be happy to review them for you. You can send them via email to support@pdf-xchange.com

We will take a look as soon as we can.
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Re: Unable to get much text from OCR

Post by whoit »

OK, I just sent a zip file ~4mb...
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Hello Whoit,

It appears to me that we are not OCR'ing the text because the color is too light. I will bring the file to the developers as an example so that they can look for solutions for you. In the meantime, can you try making the text a darker color and let us know if the OCR results improve?
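
One way to try the "darker text" suggestion on a rasterized page is to threshold the image before it goes to OCR, so that light strokes become solid black. The following is only a sketch on plain grayscale values (0 = black, 255 = white); the function name `darken_light_text` and the threshold of 200 are assumptions for illustration, and it does not use any PDF-XChange SDK call.

```python
def darken_light_text(gray_rows, threshold=200):
    """Binarize a grayscale image given as rows of 0-255 values.

    Every pixel darker than `threshold` becomes black (0); everything
    else becomes white (255). The threshold value is an assumption and
    would need tuning per document.
    """
    return [[0 if px < threshold else 255 for px in row] for row in gray_rows]


# Light-gray strokes (170) on a white background (255).
page = [
    [255, 170, 255],
    [255, 170, 255],
]
print(darken_light_text(page))
```

A threshold that is too low would drop the light strokes entirely, so it is worth comparing OCR results at a few values.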

Hope to hear back soon!
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Re: Unable to get much text from OCR

Post by whoit »

Hi Daniel -

I can make it darker but one of my concerns that I mentioned in the original post is that I
have results from another competitor that are nearly 100% using the exact same drawing, and
no changes in color.
I'm pretty sure they are using OCRTools too....
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Hello Whoit,
As mentioned before, I've brought this to the Development team, so they are now aware of this issue and will begin working to resolve it.
If you are finding different results between the two applications, it is unlikely that they are using our handlers: the same handler should produce the same results and have the same restrictions.

How did the process go with the darker text?
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Re: Unable to get much text from OCR

Post by whoit »

I was able to get better results, however still not as good as I was hoping.
"They" are getting about 95% of the words using a color image, and the resulting
fidelity is still great.
I am getting about 80% of the words but sacrificing fidelity greatly.

I can typically get more words at 200 dpi than 300 dpi, for example,
however the resulting image is too low-fidelity to be used - particularly
since the original PDF is vector, and using OCR seems to change it to raster.
(Another problem since we lose image fidelity)
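
For reference, the raster size that results from a given DPI can be estimated directly, which shows why fidelity and file size move so sharply between 200, 300 and 600 DPI. This is a rough sketch: the 34 x 22 inch sheet is an assumed example size (the actual drawing size is not stated here), and `raster_size` is a hypothetical helper, not an SDK function.

```python
def raster_size(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page rasterized at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed sheet size of 34 x 22 inches, for illustration only.
for dpi in (200, 300, 600):
    w, h = raster_size(34, 22, dpi)
    print(f"{dpi} dpi: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

The pixel count grows with the square of the DPI, so going from 200 to 600 DPI means nine times as many pixels.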

Some info:
Theirs: 1,423 words
Mine: 759 words

Text from the same locations:
Theirs:
1. ALL PIPING TO BE STAINLESS STL SA312-TP304.
2. SOCKET WELDS TO HAVE 1/16" MIN. CLEAR SPACE BETWEEN END OF PIPE & BOTTOM OF SOCKET.
3. OIL RETURN LINES TO BE PITCHED IN DIRECTION OF OIL FLOW. 1" PER FOOT MINIMUM.
4. ROUTING OF PIPING MAY BE CHANGED AT THE DISCRETION OF THE ERECTOR TO CLEAR
LOCAL OBSTRUCTIONS. FINAL ROUTING IS THE RESPONSIBILITY OF THE ERECTOR.
5. ERECTOR TO LOCATE, CUT & DRILL SUPPORTS AS TO ELIMINATE ALL STRAIN & LOAD
ON EQUIPMENT CONNECTIONS.
6. ALL PIPE & FITTINGS SHOULD BE CLEANED OF DEBRIS BEFORE CONNECTING.
7. FOR PRESSURE TO PRESSURE FIELD SCHEDULE SEE DRAWING 357573C.

Mine (missing one entire line):
ALL PLPLNC T0 BE STALNLESS STL SASTZrTPwA.
0R0UN0 EL
EL L00‘70'
4; T
55/
WLu. W T“ e
0LL RETURN LLNES To BE PLTCHE0 LN 0LRECTL0N 0E 0LL EL0w L" PER E00T MTNTMUM
. RoUTLNC 0E PLPLNC MAY BE CHANGED AT THE 0LSCRETL0N 0E THE ERECT0R T0 CLEAR
L00AL 0BSTRUCTL0NS ELNAL R0UTLNC LS THE RESPONSTBTUW OE THE ERECTOR
0ENRECET0O0RLPNTE0NTLOCCA0TNEN,ECCTULT0NLS2. DRLLL SUPPORTS AS TO EUMTNATE ALL STRALN & L0A0
ALL RLPE a: ELTTLNCS SHOULD BE CLEANED 0E DEBRLS BEEORE CONNECTLNC
EOR PRESSURE T0 PRESSURE ELELD SCHEDULE SEE 0RAWLNO 3575730.
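
To put a number on the gap between the two outputs, a simple word-level recall can be computed: the share of words in the reference text that also appear, spelled identically, in the OCR text. This is a minimal sketch; `word_recall` is a hypothetical helper, and exact-match counting is an assumption (an edit-distance metric would be more forgiving of single-character errors).

```python
import re
from collections import Counter


def word_recall(reference, ocr_text):
    """Fraction of reference words that appear verbatim in the OCR text."""
    tokenize = lambda s: re.findall(r"[A-Za-z0-9]+", s.upper())
    ref_counts = Counter(tokenize(reference))
    ocr_counts = Counter(tokenize(ocr_text))
    matched = sum((ref_counts & ocr_counts).values())  # multiset intersection
    total = sum(ref_counts.values())
    return matched / total if total else 0.0


# Last line of each sample above.
theirs = "7. FOR PRESSURE TO PRESSURE FIELD SCHEDULE SEE DRAWING 357573C."
mine = "EOR PRESSURE T0 PRESSURE ELELD SCHEDULE SEE 0RAWLNO 3575730."
print(f"{word_recall(theirs, mine):.0%}")
```

Note that the list number "7" counts as a reference word here, so the score is slightly lower than if numbering were stripped first.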


I've got a lot more info if necessary.

Thanks.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Hello Whoit,
I've gone ahead and created an internal development ticket for this.
I cannot promise it will be done soon, but the devs have taken interest in this thread and told me that it is something we will continue trying to improve.

Have a good day!
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Re: Unable to get much text from OCR

Post by whoit »

Thanks Daniel...
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Re: Unable to get much text from OCR

Post by whoit »

One more thing - I've noticed that the original PDF output from AutoCAD has all of the text
available as annotations - you can roll your cursor over any text and see the popup.

Can the annotations be transformed into a searchable text layer?

Thanks.
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Hi Whoit,
Currently we do not have much control over how CAD software creates text.
In most cases the text appears as Curve objects, which makes it difficult to edit because it is not really text.

You could try performing the same OCR function as above, and then removing the objects from behind the text. It may work well, depending on the font and font size you originally used.
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Re: Unable to get much text from OCR

Post by whoit »

I think you are focusing on "autocad"
when the real question was whether or not you can convert annotations to a searchable text layer....
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Hi Whoit,
You are right, I had assumed that AutoCAD was a key part of that question because of your example.
My train of thought from your example was that each annotation is, in that case, a separate letter (object), which would not be searchable. So if your case is similar to the output of CAD software, the above would apply.

If they are annotations such as those we can place (typewriter, text box, sticky note, etc.), you could use the Summarize Comments function to create a duplicate of the document with all the annotations written out as easily searchable text.
You should also note that most of our commenting tools are already searchable when placed from within the Editor.
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

whoit
User
Posts: 269
Joined: Tue Jul 07, 2015 3:30 pm

Re: Unable to get much text from OCR

Post by whoit »

Summarize Comments?
Not aware of that one....does it just extract all the comment text into a string?

Typically we would want something like the results from OCR - "invisible" text boxes over the
top of existing text that could be selected and/or searched....

(We only use the CoreAPI - not the editor)
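
To make the request concrete: in PDF terms, an "invisible" text layer is ordinary text drawn with text rendering mode 3 (the `3 Tr` operator), which is how OCR output is usually stored. The sketch below only builds the content-stream operators from a list of annotation rectangles and contents; it is not Core API code. The names `invisible_text_ops` and `escape_pdf_string`, the font resource `/F1`, and the font size are all assumptions, and the resulting operators would still have to be appended to each page by whatever PDF library is in use.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(annotations, font="F1", size=10):
    """Build content-stream operators that place invisible text.

    `annotations` is a list of (rect, contents) pairs, where rect is
    (llx, lly, urx, ury) as in an annotation's /Rect entry. The font
    resource name and size are assumed; "3 Tr" selects the invisible
    text rendering mode.
    """
    lines = []
    for rect, contents in annotations:
        x, y = rect[0], rect[1]  # lower-left corner of the annotation
        lines.append(
            f"BT /{font} {size} Tf 3 Tr {x} {y} Td "
            f"({escape_pdf_string(contents)}) Tj ET"
        )
    return "\n".join(lines)


print(invisible_text_ops([((72, 700, 200, 712), "NOTE (1)")]))
```

This handles only simple Latin text; other character sets would need an appropriate font encoding.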
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Well, in that case, running an OCR pass over it should indeed help.
The Core API should still be functionally the same as the Editor from that standpoint, unless you have made drastic changes.
Again though, if these comments were created within our products, they should be searchable by default.
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Unable to get much text from OCR

Post by TrackerSupp-Daniel »

Also,
regarding Summarize Comments: there are a few options for it, and it is located under the Comment tab.
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
