Extracting Text from Forms

This Forum is for the use of Software Developers requiring help and assistance for Tracker Software's PDF-Tools SDK of Library DLL functions(only) - Please use the PDF-XChange Drivers API SDK Forum for assistance with all PDF Print Driver related topics or PDF-XChange Viewer SDK if appropriate.

Moderators: Tracker Support, TrackerSupp-Daniel, Chris - Tracker Supp, Vasyl-Tracker Dev Team, Sean - Tracker, Tracker Supp-Stefan

Curt
User
Posts: 6
Joined: Mon Nov 06, 2006 7:13 pm

Extracting Text from Forms

Post by Curt » Wed Oct 08, 2008 6:57 am

I want to use the PDF-to-text functions in Pro SDK 4 to extract text entered on fillable forms. These are standard fillable forms, and we have no control over their design. They are filled out in another company's program. We use PDF-XChange to convert that program's printer output to a flat PDF file.
We want to use your PDF-to-text code to extract the entered text and use the extracted data to fill in fields in our database. I am including a sample form page together with the text file I get when I run your example PDF-to-text program on the form.

As you can see, separating the entries is a problem. Look at the Make, Model, Body Type, etc. lines. For example, the words in the Make field are separated by a single space, and there is also only a single space between the Make field and the Body Type field.

If there were at least 2 spaces between each entry field, then I would know where to separate each field’s entry. By the way, I selected the proportional spacing option but it still only put a single space between entry fields.

Optimally, the PDF converter would insert a special symbol for the vertical lines that separate each entry field. Next best would be having words in an entry field separated by single spaces (as they are) with multiple spaces between entry fields along a line. Can this be done?
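To illustrate the "next best" option: if the extracted text did have at least two spaces between entry fields and single spaces within a field, separating the entries would be straightforward. This is only a minimal sketch under that assumption. The function name `split_fields` and the sample line are hypothetical and are not taken from the attached form.

```python
import re


def split_fields(line):
    """Split one extracted line into field entries.

    Assumes words inside a field are separated by a single space and
    that fields are separated by two or more spaces.
    """
    return [part for part in re.split(r" {2,}", line.strip()) if part]


# Hypothetical extracted line, not from the attached sample form.
sample = "FORD F150   PICKUP   2004"
print(split_fields(sample))
```

With the current output, where every gap is a single space, this split cannot tell where one field ends and the next begins.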

I would also like to examine the text extraction facility of your Viewer. However, it only seems to work for AcroForms and not for flat PDF forms. Is this correct, or am I missing something in the Viewer?

I appreciate your assistance on these issues.
Attachment: Sutter Files.zip (67.33 KiB)

Ivan - Tracker Software
Site Admin
Posts: 3591
Joined: Thu Jul 08, 2004 10:36 pm
Location: Vancouver Island - Canada
Contact:

Re: Extracting Text from Forms

Post by Ivan - Tracker Software » Wed Oct 08, 2008 8:01 am

Curt wrote: Optimally, the PDF converter would insert a special symbol for the vertical lines that separate each entry field.
I'm afraid that is impossible. Because the form fields were flattened, the extraction algorithm knows nothing about the fields or where they start and finish.

The best way to export form field values from a PDF is to use the Viewer and its "Export PDF form to FDF file" function.

In build 40 or 41, the Viewer will support exporting form fields to an XFDF file (XML-based), which is a bit easier to parse.
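As a rough idea of what parsing such an export could look like, here is a minimal sketch using Python's standard library. It assumes the export follows the usual XFDF layout (`field` elements with a `name` attribute and a `value` child, in the `http://ns.adobe.com/xfdf/` namespace). The function name `read_xfdf_fields` and the sample data are hypothetical, and nested (hierarchical) fields are not handled. It only applies to PDFs that still contain real form fields, not to flattened ones.

```python
import xml.etree.ElementTree as ET

# Namespace used by XFDF documents.
XFDF_NS = {"x": "http://ns.adobe.com/xfdf/"}


def read_xfdf_fields(xfdf_text):
    """Return a dict of top-level field names to values from XFDF text."""
    root = ET.fromstring(xfdf_text)
    values = {}
    for field in root.findall("./x:fields/x:field", XFDF_NS):
        value = field.find("x:value", XFDF_NS)
        has_text = value is not None and value.text is not None
        values[field.get("name")] = value.text if has_text else ""
    return values


# Hypothetical XFDF content for illustration only.
SAMPLE_XFDF = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <fields>
    <field name="Make"><value>FORD F150</value></field>
    <field name="BodyType"><value>PICKUP</value></field>
    <field name="Notes"><value></value></field>
  </fields>
</xfdf>"""

for name, value in read_xfdf_fields(SAMPLE_XFDF).items():
    print(f"{name}: {value}")
```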

You can also use JavaScript in the Viewer to get the field values.
Tracker Software (Project Director)

When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.

Curt
User
Posts: 6
Joined: Mon Nov 06, 2006 7:13 pm

Re: Extracting Text from Forms

Post by Curt » Wed Oct 08, 2008 7:16 pm

Could you take the PDF form I sent, run it through the Export PDF form function in the new Viewer, and send me the resulting FDF file so I can see whether it is usable for our needs?

Ivan - Tracker Software
Site Admin
Posts: 3591
Joined: Thu Jul 08, 2004 10:36 pm
Location: Vancouver Island - Canada
Contact:

Re: Extracting Text from Forms

Post by Ivan - Tracker Software » Mon Oct 13, 2008 8:20 am

To be honest, the document you sent doesn't contain any fields; all of them have been flattened.
Tracker Software (Project Director)
