' Assumes: Imports System.Text, System.Text.RegularExpressions and PDFXCoreAPI
Dim myDoc As PDFXCoreAPI.IPXC_Document = g_Inst.OpenDocumentFromFile(Me.TextBox1.Text, Nothing)
Try
    Dim bHasDoc As Boolean = myDoc IsNot Nothing
    Dim docStringBuilder As New StringBuilder
    If bHasDoc Then
        For pageNum As UInteger = 0 To CUInt(myDoc.Pages.Count - 1)
            Dim curPage As IPXC_Page = myDoc.Pages(pageNum)
            Dim MyPageText As IPXC_PageText
            MyPageText = curPage.GetText(Nothing, False)
            Dim FirstChar As UInteger = 0
            Dim CharCount As UInteger = 0
            ' Append each text line in the order the page text object reports it
            For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
                FirstChar = MyPageText.LineInfo(i).nFirstCharIndex
                CharCount = MyPageText.LineInfo(i).nCharsCount
                ' Collapse runs of spaces into a single space
                Dim pdfWord As String = Regex.Replace(MyPageText.GetChars(FirstChar, CharCount), " {2,}", " ")
                docStringBuilder.AppendLine(pdfWord)
            Next
        Next
    End If
    ' Using closes the writer even if the write throws
    Using file As New System.IO.StreamWriter("C:\temp\PDFExport.txt", False)
        file.WriteLine(docStringBuilder.ToString())
    End Using
    docStringBuilder.Clear()
Catch ex As Exception
    Console.WriteLine(ex)
End Try
The issue I have is that the text in the exported file is sometimes out of order; please see the screenshot below:
Is there a way to resolve this, or a way that I can maybe use the text line Y position to output a correctly ordered text file?
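For reference, ordering by Y position could be sketched roughly as below. This is only an illustration, not part of the SDK: `TextLine`, `OrderForReading` and the `tolerance` value are hypothetical names, and it assumes each line's Top and Left are already expressed in one common page coordinate space where Y grows upward.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module LineOrderSketch
    ' Hypothetical holder for one extracted line and its position on the page
    Public Class TextLine
        Public Property Text As String
        Public Property Top As Double
        Public Property Left As Double
    End Class

    ' Returns the lines top-to-bottom, and left-to-right within a visual row.
    ' Lines whose Top differs from the first line of the current row by at most
    ' "tolerance" are treated as belonging to that row (an assumed heuristic).
    Public Function OrderForReading(lines As IEnumerable(Of TextLine),
                                    Optional tolerance As Double = 2.0) As List(Of TextLine)
        ' Larger Top first, because Y is assumed to grow upward
        Dim byTop As List(Of TextLine) =
            lines.OrderByDescending(Function(l As TextLine) l.Top).ToList()
        Dim result As New List(Of TextLine)
        Dim row As New List(Of TextLine)
        For Each ln As TextLine In byTop
            ' A jump in Top larger than the tolerance starts a new row
            If row.Count > 0 AndAlso Math.Abs(row(0).Top - ln.Top) > tolerance Then
                result.AddRange(row.OrderBy(Function(l As TextLine) l.Left))
                row.Clear()
            End If
            row.Add(ln)
        Next
        ' Flush the last row
        result.AddRange(row.OrderBy(Function(l As TextLine) l.Left))
        Return result
    End Function
End Module
```

The tolerance matters for pages where text on the same visual row has slightly different baselines; a value that is too large will merge neighbouring rows.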
I'm being a bit lazy as I'm away from my computer, but I wanted to work on this over the weekend. Do you have an example of how to get the Y position of each line of text?
' Assumes: Imports System.Text, System.Text.RegularExpressions and PDFXCoreAPI
Dim nowTime As DateTime = DateTime.Now
Console.WriteLine("Start: " & nowTime.ToLongTimeString & ":" & nowTime.Millisecond.ToString)
Dim myDoc As PDFXCoreAPI.IPXC_Document = g_Inst.OpenDocumentFromFile(Me.TextBox1.Text, Nothing)
Try
    Dim bHasDoc As Boolean = myDoc IsNot Nothing
    Dim docStringBuilder As New StringBuilder
    If bHasDoc Then
        For pageNum As UInteger = 0 To CUInt(myDoc.Pages.Count - 1)
            Dim curPage As IPXC_Page = myDoc.Pages(pageNum)
            Dim MyPageText As IPXC_PageText
            MyPageText = curPage.GetText(Nothing, False)
            Dim FirstChar As UInteger = 0
            Dim CharCount As UInteger = 0
            For i As UInteger = 0 To CUInt(MyPageText.LinesCount - 1)
                FirstChar = MyPageText.LineInfo(i).nFirstCharIndex
                CharCount = MyPageText.LineInfo(i).nCharsCount
                ' Collapse runs of spaces into a single space
                Dim pdfWord As String = Regex.Replace(MyPageText.GetChars(FirstChar, CharCount), " {2,}", " ")
                ' Append the line text followed by its bounding box as reported by LineInfo
                docStringBuilder.AppendLine(pdfWord &
                    " Top: " & MyPageText.LineInfo(i).rcBBox.top.ToString &
                    " Bottom: " & MyPageText.LineInfo(i).rcBBox.bottom.ToString &
                    " Left: " & MyPageText.LineInfo(i).rcBBox.left.ToString &
                    " Right: " & MyPageText.LineInfo(i).rcBBox.right.ToString)
            Next
        Next
    End If
    ' Using closes the writer even if the write throws
    Using file As New System.IO.StreamWriter("C:\temp\PDFExport.txt", False)
        file.WriteLine(docStringBuilder.ToString())
    End Using
    docStringBuilder.Clear()
Catch ex As Exception
    Console.WriteLine(ex)
End Try
nowTime = DateTime.Now
Console.WriteLine("End: " & nowTime.ToLongTimeString & ":" & nowTime.Millisecond.ToString)
And I have attached the PDF I am using along with the text output file.
Of course the coordinates would look like that in your case: they are the coordinates of the text in the line's own coordinate system. To convert them into visual (page) coordinates, the line and page matrices should be used:
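To illustrate just the arithmetic involved, here is a minimal, self-contained sketch that does not call PDFXCoreAPI at all. `Affine`, `Box`, `Concat` and `TransformBox` are hypothetical names, and it assumes the usual PDF matrix convention (a b c d e f) with the line matrix applied first and the page matrix second; how the two matrices are obtained from the SDK is not shown here.

```vbnet
Imports System

Module BBoxTransformSketch
    ' Hypothetical stand-in for a PDF-style affine matrix (a b c d e f):
    '   x' = a*x + c*y + e
    '   y' = b*x + d*y + f
    Public Structure Affine
        Public a As Double
        Public b As Double
        Public c As Double
        Public d As Double
        Public e As Double
        Public f As Double
        Public Sub New(a As Double, b As Double, c As Double,
                       d As Double, e As Double, f As Double)
            Me.a = a : Me.b = b : Me.c = c
            Me.d = d : Me.e = e : Me.f = f
        End Sub
    End Structure

    ' Hypothetical stand-in for a bounding box
    Public Structure Box
        Public Left As Double
        Public Bottom As Double
        Public Right As Double
        Public Top As Double
    End Structure

    ' Combined matrix that applies m1 first and m2 second
    Public Function Concat(m1 As Affine, m2 As Affine) As Affine
        Return New Affine(
            m1.a * m2.a + m1.b * m2.c,
            m1.a * m2.b + m1.b * m2.d,
            m1.c * m2.a + m1.d * m2.c,
            m1.c * m2.b + m1.d * m2.d,
            m1.e * m2.a + m1.f * m2.c + m2.e,
            m1.e * m2.b + m1.f * m2.d + m2.f)
    End Function

    ' Transforms all four corners and returns the axis-aligned box around them
    Public Function TransformBox(r As Box, m As Affine) As Box
        Dim xs() As Double = {r.Left, r.Right, r.Right, r.Left}
        Dim ys() As Double = {r.Bottom, r.Bottom, r.Top, r.Top}
        Dim minX As Double = Double.MaxValue, maxX As Double = Double.MinValue
        Dim minY As Double = Double.MaxValue, maxY As Double = Double.MinValue
        For i As Integer = 0 To 3
            Dim x As Double = m.a * xs(i) + m.c * ys(i) + m.e
            Dim y As Double = m.b * xs(i) + m.d * ys(i) + m.f
            minX = Math.Min(minX, x) : maxX = Math.Max(maxX, x)
            minY = Math.Min(minY, y) : maxY = Math.Max(maxY, y)
        Next
        Dim result As Box
        result.Left = minX : result.Bottom = minY
        result.Right = maxX : result.Top = maxY
        Return result
    End Function
End Module
```

With something like this, each line's box would be passed through `TransformBox(box, Concat(lineMatrix, pageMatrix))` before comparing Y positions across lines. Taking the min/max of all four corners keeps the result valid when the matrix rotates or flips the text.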