Hi
I have a PDF document that has been processed with OCR. In the content view I can see the text inside the container, as shown below.
Container<Div> : Text("t h i s i s a t e s t d o c u m e n t")
Tt t
Tt h
Tt i
Tt s
Tt i
Tt s
Tt t
Tt e
Tt s
Tt t
Tt d
Tt t.......
This text is displayed in the PDF as "this is a test document".
I am reading this text from a C# application (Text_GetText()), but because the text is stored character by character I am unable to reconstruct the whole text correctly, since I am supposed to add a space between each text element.
So can anyone let me know how I can read the container text in one go, without iterating through each text element inside the container?
Thank you
Prasantha
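Editor's sketch of the problem described above: when every character is its own text element, a space should be inserted only where there is a visible horizontal gap, not between every element. The Fragment type, its X and Width fields, and the gapThreshold parameter are hypothetical stand-ins introduced for illustration; they are not part of the PDF-XChange SDK, and the positions would have to come from whatever bounding-box information your reader exposes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one text element: its text plus the left edge
// and width of its bounding box, in page units. Not an SDK type.
public sealed class Fragment
{
    public Fragment(string text, double x, double width)
    {
        Text = text;
        X = x;
        Width = width;
    }

    public string Text { get; }
    public double X { get; }
    public double Width { get; }
}

public static class FragmentJoiner
{
    // Joins the fragments of a single line from left to right and inserts a
    // space only where the gap to the previous fragment exceeds gapThreshold.
    public static string Join(IEnumerable<Fragment> fragments, double gapThreshold)
    {
        var ordered = new List<Fragment>(fragments);
        ordered.Sort((a, b) => a.X.CompareTo(b.X));

        var sb = new StringBuilder();
        bool first = true;
        double previousRight = 0;
        foreach (var fragment in ordered)
        {
            // Gap between this fragment's left edge and the previous right edge.
            if (!first && fragment.X - previousRight > gapThreshold)
            {
                sb.Append(' ');
            }
            sb.Append(fragment.Text);
            previousRight = fragment.X + fragment.Width;
            first = false;
        }
        return sb.ToString();
    }
}
```

A reasonable starting value for gapThreshold is a fraction of the font size, but that is a tuning assumption and will depend on the document.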
PDF text with character by character
Moderators: TrackerSupp-Daniel, Tracker Support, Paul - Tracker Supp, Vasyl-Tracker Dev Team, Chris - Tracker Supp, Sean - Tracker, Ivan - Tracker Software, Tracker Supp-Stefan
Forum rules
DO NOT post your license/serial key, or your activation code - these forums, and all posts within, are public and we will be forced to immediately deactivate your license.
When experiencing some errors, use the IAUX_Inst::FormatHRESULT method to see their description and include it in your post along with the error code.
- Ivan - Tracker Software
- Site Admin
- Posts: 3549
- Joined: Thu Jul 08, 2004 10:36 pm
- Location: Vancouver Island - Canada
Re: PDF text with character by character
Can you send us the document?
Tracker Software (Project Director)
When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
Re: PDF text with character by character
Hi, I cannot provide the original document because of its sensitivity, but I have attached a sample document here. It does not have character-by-character text, but it contains text in which words are split.
1) As shown in the image, the word "large" is split into two different text segments, "...... l" and "arge".
The file is attached as 1.pdf.
2) As shown in the image, the word is split into two separate words.
I have attached the file for this as well (2.pdf).
I tried to retrieve the whole text using GetPageText of the page; in this case it omitted all the line breaks, so it concatenated the two lines.
But as an end user in PDF-XChange Viewer, I can select the page text and copy the whole text as expected without any issue (copy and paste into Notepad works as expected).
Can you please look into this?
- Vasyl-Tracker Dev Team
- Site Admin
- Posts: 2353
- Joined: Thu Jun 30, 2005 4:11 pm
- Location: Canada
Re: PDF text with character by character
Hi Prasantha.
You may try the IPXC_Page::GetText feature. Please look here:
https://github.com/tracker-software/PDF ... RoboReader
Tip: please also look at IPXC_PageText::LinesCount and IPXC_PageText::LineInfo; they will allow you to get information about the text lines composed by the Editor from the PDF content.
HTH.
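Editor's sketch of what "text lines composed from the PDF content" means in practice: runs that share a baseline are merged into one line, and a line break is kept between lines. This is only an illustration of the idea under stated assumptions. TextRun, BaselineY, and yTolerance are hypothetical names, not SDK types or parameters, and the sketch does not call IPXC_Page::GetText or IPXC_PageText, whose exact signatures should be taken from the SDK documentation and the linked sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned text run. Not an SDK type.
public sealed class TextRun
{
    public TextRun(string text, double x, double baselineY)
    {
        Text = text;
        X = x;
        BaselineY = baselineY;
    }

    public string Text { get; }
    public double X { get; }
    public double BaselineY { get; }
}

public static class LineBuilder
{
    // Groups runs whose baselines differ by at most yTolerance into one line,
    // orders lines top to bottom (PDF y grows upward), concatenates the runs
    // of each line from left to right, and joins the lines with '\n'.
    public static string BuildText(IEnumerable<TextRun> runs, double yTolerance)
    {
        var lines = new List<List<TextRun>>();
        foreach (var run in runs.OrderByDescending(r => r.BaselineY))
        {
            var last = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (last != null && Math.Abs(last[0].BaselineY - run.BaselineY) <= yTolerance)
            {
                last.Add(run); // same baseline: part of the current line
            }
            else
            {
                lines.Add(new List<TextRun> { run }); // new line starts here
            }
        }

        return string.Join("\n", lines.Select(
            line => string.Concat(line.OrderBy(r => r.X).Select(r => r.Text))));
    }
}
```

Runs on the same line are concatenated without an added space, so a word stored as "...... l" followed by "arge" comes back as one word, while the line break between two lines is preserved.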
Vasyl Yaremyn
Tracker Software Products
Project Developer
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.