PDF-XChange - Tracker PDF Viewer - TIFF-XChange - Image-XChange - XMF-XChange - Raster-XChange - Support

 
whoit
User
Topic Author
Posts: 213
Joined: Tue Jul 07, 2015 3:30 pm

Unable to get much text from OCR

Thu Apr 26, 2018 12:59 pm

Hi -

I'm using the sample OCR code provided with v7 of Pro SDK,
and when I test either of the two PDFs, the result is about 10% text recognition.
I used the defaults set in the sample app, but changed the DPI to 600.

I've tested the same files using a competitor's app and they get near 100% (I'm pretty certain they use your ocrtools.dll)

Can you review the PDFs, dwg, and resulting files and tell me why my results are so bad?

(I would prefer to email the files and not post them here - where can I send them?)

Thanks!
 
TrackerSupp-Daniel
Site Admin
Posts: 692
Joined: Wed Jan 03, 2018 6:52 pm

Thu Apr 26, 2018 5:16 pm

Hello whoit,
We would be happy to review them for you. You can send them via email to Support@tracker-software.com

We will take a look as soon as we can.
Daniel McIntyre
Support Technician
Tracker Software Products (Canada) LTD

Sales: +1 (250) 324-1621
Fax: +1 (250) 324-1623
 
whoit

Thu Apr 26, 2018 5:37 pm

OK, I just sent a zip file ~4mb...
 
TrackerSupp-Daniel

Thu Apr 26, 2018 7:13 pm

Hello Whoit,

It appears to me that we are not OCR'ing the text because the color is too light. I will bring the file to the developers as an example so that they can look for solutions for you. In the meantime, can you try making the text a darker color and let us know if the OCR results improve for you?

Hope to hear back soon!
 
whoit

Thu Apr 26, 2018 8:17 pm

Hi Daniel -

I can make it darker but one of my concerns that I mentioned in the original post is that I
have results from another competitor that are nearly 100% using the exact same drawing, and
no changes in color.
I'm pretty sure they are using OCRTools too....
 
TrackerSupp-Daniel

Thu Apr 26, 2018 10:46 pm

Hello Whoit,
As mentioned before, I've brought this to the Development team, so they are now aware of this issue and will begin working to resolve it.
If you are finding different results between software packages, it is unlikely that they are using our handlers. If it were the same handler, it would have the same results and restrictions.

How did the process go with the darker text?
 
whoit

Fri Apr 27, 2018 3:23 pm

I was able to get better results, however still not as good as I was hoping.
"They" are getting about 95% of the words using a color image, and the resulting
fidelity is still great.
I am getting about 80% of the words but sacrificing fidelity greatly.

For example, I can typically get more words at 200 dpi than at 300 dpi,
but the resulting image is too low-fidelity to be used, particularly
since the original PDF is vector and running OCR seems to convert it to raster
(another problem, since we lose image fidelity).

Some info:
Theirs: 1,423 words
Mine: 759 words

Text from the same locations:
Theirs:
1. ALL PIPING TO BE STAINLESS STL SA312-TP304.
2. SOCKET WELDS TO HAVE 1/16" MIN. CLEAR SPACE BETWEEN END OF PIPE & BOTTOM OF SOCKET.
3. OIL RETURN LINES TO BE PITCHED IN DIRECTION OF OIL FLOW. 1" PER FOOT MINIMUM.
4. ROUTING OF PIPING MAY BE CHANGED AT THE DISCRETION OF THE ERECTOR TO CLEAR
LOCAL OBSTRUCTIONS. FINAL ROUTING IS THE RESPONSIBILITY OF THE ERECTOR.
5. ERECTOR TO LOCATE, CUT & DRILL SUPPORTS AS TO ELIMINATE ALL STRAIN & LOAD
ON EQUIPMENT CONNECTIONS.
6. ALL PIPE & FITTINGS SHOULD BE CLEANED OF DEBRIS BEFORE CONNECTING.
7. FOR PRESSURE TO PRESSURE FIELD SCHEDULE SEE DRAWING 357573C.

Image

Mine (missing one entire line):
ALL PLPLNC T0 BE STALNLESS STL SASTZrTPwA.
0R0UN0 EL
EL L00‘70'
4; T
55/
WLu. W T“ e
0LL RETURN LLNES To BE PLTCHE0 LN 0LRECTL0N 0E 0LL EL0w L" PER E00T MTNTMUM
. RoUTLNC 0E PLPLNC MAY BE CHANGED AT THE 0LSCRETL0N 0E THE ERECT0R T0 CLEAR
L00AL 0BSTRUCTL0NS ELNAL R0UTLNC LS THE RESPONSTBTUW OE THE ERECTOR
0ENRECET0O0RLPNTE0NTLOCCA0TNEN,ECCTULT0NLS2. DRLLL SUPPORTS AS TO EUMTNATE ALL STRALN & L0A0
ALL RLPE a: ELTTLNCS SHOULD BE CLEANED 0E DEBRLS BEEORE CONNECTLNC
EOR PRESSURE T0 PRESSURE ELELD SCHEDULE SEE 0RAWLNO 3575730.
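
The percentages and word counts quoted above can be put on a common footing by scoring one OCR output against the other as a reference. A minimal sketch, assuming Python and only the standard library; the function name `word_recall` is hypothetical and not part of any Tracker SDK. It aligns the two word lists in order and reports the fraction of reference words the candidate reproduced exactly:

```python
from difflib import SequenceMatcher


def word_recall(reference: str, candidate: str) -> float:
    """Fraction of reference words that appear, in order, in the candidate."""
    ref_words = reference.split()
    cand_words = candidate.split()
    if not ref_words:
        return 0.0
    # autojunk=False so frequent words are never discarded by the heuristic.
    matcher = SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# First note from each result quoted in this post.
theirs = "1. ALL PIPING TO BE STAINLESS STL SA312-TP304."
mine = "ALL PLPLNC T0 BE STALNLESS STL SASTZrTPwA."

print(f"word recall: {word_recall(theirs, mine):.2f}")
```

Exact word matching is deliberately strict here: `T0` for `TO` counts as a miss, which is roughly how a plain text search over the resulting text layer would treat it.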

Image


I've got a lot more info if necessary.

Thanks.
 
TrackerSupp-Daniel

Fri May 04, 2018 7:52 pm

Hello Whoit,
I've gone ahead and created an internal development ticket for this.
I cannot promise it will be done soon, but the devs have taken interest in this thread and told me that it is something we will continue trying to improve.

Have a good day!
 
whoit

Fri May 04, 2018 10:20 pm

Thanks Daniel...
 
TrackerSupp-Daniel

Fri May 04, 2018 10:49 pm

:)
 
whoit

Tue May 08, 2018 4:27 pm

One more thing - I've noticed that the original PDF output from AutoCAD has all of the text
available as annotations - you can roll your cursor over any text and see the popup.

Can the annotations be transformed into a searchable text layer?

Thanks.
 
TrackerSupp-Daniel

Tue May 08, 2018 4:55 pm

Hi Whoit,
Currently we do not have much control over the creation of text from CAD software.
In most cases the text appears as Curve objects, making it difficult to edit because it is not really text.

You could try performing the same OCR function as above, and then removing the objects from behind the text. It may work well, depending on the font and font size you had originally used.
 
whoit

Tue May 08, 2018 5:11 pm

I think you are focusing on "autocad"
when the real question was whether or not you can convert annotations to a searchable text layer....
 
TrackerSupp-Daniel

Tue May 08, 2018 5:53 pm

Hi Whoit,
You are right, I had assumed that AutoCAD was a key part of that question due to your example.
My train of thought from your example was that each annotation is, in that case, a separate letter (object), which would not be searchable. So if the case is similar to the output of CAD software, the above would apply.

If they are annotations such as those we can place (i.e. typewriter, text box, sticky note, etc.), you could use the Summarize Comments function to create a duplicate of the document with all the annotations written out as easily searchable text.
You should also note that most of our commenting tools are already searchable when placed from within the Editor.
 
whoit

Tue May 08, 2018 6:44 pm

Summarize Comments?
Not aware of that one....does it just extract all the comment text into a string?

Typically we would want something like the results from OCR - "invisible" text boxes over the
top of existing text that could be selected and/or searched....

(We only use the CoreAPI - not the editor)
 
TrackerSupp-Daniel

Tue May 08, 2018 7:28 pm

Well, in that case, running an OCR pass over it should indeed help.
The Core API should still be functionally the same as the Editor from that standpoint, unless you have made drastic changes.
Again though, if these comments have been created within our products, they should be searchable by default.
 
TrackerSupp-Daniel

Tue May 08, 2018 7:31 pm

Also,
regarding Summarize Comments: there are a few options for it, and it is located under the Comment tab.
Image
 
whoit

Mon May 21, 2018 6:38 pm

Hi -
I'm trying to resolve my issue by reading each comment, then creating a TextBlock to represent it, including position, etc.

My code is working fine, however, it is painfully slow.
It takes 2+ minutes to process a single page with 568 comments.

I've tested the same process by reusing my code with a competing library and it processes in under 2 seconds.

When I analyze my code performance in Visual Studio, the slowdown occurs with this statement which takes over 85%
of the code processing time:

myPage.PlaceContent(contentCreator.Detach(), (UInt32)PXC_PlaceContentFlags.PlaceContent_After);


Why is this so slow?
How can I speed this up?
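
One thing that may be worth measuring, offered as an untested assumption rather than a confirmed explanation: if `PlaceContent` is called once per comment, any fixed per-call cost is paid 568 times on that page. The sketch below is plain Python with stand-in classes (`FakePage`, `FakeContentCreator`, and all of their methods are hypothetical and are not the PDF-XChange Core API). It only illustrates the difference between placing content once per comment and accumulating every text block in one creator, then placing once per page.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x, y, text) for one comment: a stand-in for an annotation's position and contents.
Comment = Tuple[float, float, str]


@dataclass
class FakeContentCreator:
    """Hypothetical stand-in: collects text blocks until detach() is called."""
    blocks: List[Comment] = field(default_factory=list)

    def show_text(self, x: float, y: float, text: str) -> None:
        self.blocks.append((x, y, text))

    def detach(self) -> List[Comment]:
        # Hand over the accumulated blocks and start again with an empty list.
        content, self.blocks = self.blocks, []
        return content


@dataclass
class FakePage:
    """Hypothetical stand-in: counts how often content is placed on the page."""
    placed: List[Comment] = field(default_factory=list)
    place_calls: int = 0

    def place_content(self, content: List[Comment]) -> None:
        self.place_calls += 1
        self.placed.extend(content)


def place_per_comment(page: FakePage, comments: List[Comment]) -> None:
    creator = FakeContentCreator()
    for x, y, text in comments:
        creator.show_text(x, y, text)
        page.place_content(creator.detach())  # one placement per comment


def place_batched(page: FakePage, comments: List[Comment]) -> None:
    creator = FakeContentCreator()
    for x, y, text in comments:
        creator.show_text(x, y, text)
    page.place_content(creator.detach())  # one placement for the whole page


comments = [(10.0, 20.0 * i, f"note {i}") for i in range(568)]
slow_page, fast_page = FakePage(), FakePage()
place_per_comment(slow_page, comments)
place_batched(fast_page, comments)
print(slow_page.place_calls, fast_page.place_calls)
```

Both variants end up with the same content on the page; only the number of placement calls differs. Whether that matters with the real SDK depends on what `PlaceContent` does internally, which this thread does not say, so profiling both variants would be the way to find out.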
