Two consecutive words not found if line break present

Forum for the PDF-XChange Editor - Free and Licensed Versions

Seeker45
User
Posts: 162
Joined: Wed Dec 18, 2013 2:32 pm
Location: Germany

Two consecutive words not found if line break present

Post by Seeker45 »

Hello,

I have noted that two consecutive words are not found if a line break is present between these two words.

E.g., I am searching for "both aerobic" in the following text:

The antimicrobial activity of ... under both 
aerobic and anaerobic conditions was ...
grown in biofilm under both aerobic and anaerobic ...
This will return only one match on the last line, even though - from the human perspective - the first occurrence should surely be a valid match as is the second occurrence.
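
The report can be illustrated outside the Editor with plain strings. The following is a minimal sketch, not the Editor's search code: `TEXT` copies the sample above, and the helper names `count_literal` and `count_after_collapsing` are made up for this illustration. It assumes the line break is stored as a newline character, which may not be how a PDF text layer represents it.

```python
import re

# Sample text from the post; the first occurrence is split by a line break.
TEXT = (
    "The antimicrobial activity of ... under both \n"
    "aerobic and anaerobic conditions was ...\n"
    "grown in biofilm under both aerobic and anaerobic ..."
)


def count_literal(text: str, phrase: str) -> int:
    """Count exact substring matches; a line break is not a single space."""
    return text.count(phrase)


def count_after_collapsing(text: str, phrase: str) -> int:
    """Collapse every run of whitespace to one space, then count matches."""
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.count(phrase)


print(count_literal(TEXT, "both aerobic"))           # 1
print(count_after_collapsing(TEXT, "both aerobic"))  # 2
```

The literal count corresponds to the single match reported here; the collapsed count is what a human reader would expect.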

Thanks!

Ralf
Tracker Supp-Stefan
Site Admin
Posts: 17824
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Two consecutive words not found if line break present

Post by Tracker Supp-Stefan »

Hi Ralf,

Depending on the structure of the PDF file, these two words might be in separate "objects" and as such won't be recognized as consecutive. Please try setting Proximity in the advanced search options to e.g. "Same Paragraph" or "Same Page".

In a document I tested where the text is in a single block - these two words are properly recognized (check the attachment).

Regards,
Stefan
Attachments
two_line_search.zip
(142.62 KiB) Downloaded 148 times

Re: Two consecutive words not found if line break present

Post by Seeker45 »

Hi Stefan,

Thanks for your reply. I noticed something by accident: searching for [both aerobic] without the quotes does not return a match across lines in my document. However, searching for ["both aerobic"] with the quotes finds the desired matches. I kept the setting "Proximity -> Only adjacent words" for both searches.

While I have a solution now, this behaviour does not seem very logical to me. Or is there a good explanation?

Thank you.

Best regards
Ralf
Vasyl-Tracker Dev Team
Site Admin
Posts: 2352
Joined: Thu Jun 30, 2005 4:11 pm
Location: Canada

Re: Two consecutive words not found if line break present

Post by Vasyl-Tracker Dev Team »

Hi Ralf.

I tried to reproduce your issue but couldn't. It looks like the issue depends on the document. Can you send a simple example document that reproduces it? You may want to extract just one page.

Best
Regards.
Vasyl Yaremyn
Tracker Software Products
Project Developer

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.

Re: Two consecutive words not found if line break present

Post by Seeker45 »

Sure. Please find attached.

Ralf
Attachments
Testpage.zip
(51.92 KiB) Downloaded 113 times

Re: Two consecutive words not found if line break present

Post by Tracker Supp-Stefan »

Thanks for the sample Ralf.

It will need Vasyl's attention to say exactly why this is happening, but I see that this is a scanned file with an OCR text layer on top. What I noticed is that if I rotate the OCR text object containing the words "both" and "aerobic", the Editor finds both occurrences without the quotation marks, whereas with the quotes it only finds them when they are on the same line. That is the opposite of what is observed with your original file.

Regards,
Stefan

Re: Two consecutive words not found if line break present

Post by Vasyl-Tracker Dev Team »

Hi, Ralf.

The problem has been reproduced and a ticket has been created in our internal system (RT#2246). We will investigate what is wrong.
Thanks for the sample document.

Best
Regards.
Vasyl Yaremyn

Re: Two consecutive words not found if line break present

Post by Seeker45 »

Hi,

I looked into this again in the latest version, and the issue seems to have been fixed. Is that also what your ticket shows? Thank you.

Cheers
Ralf

Re: Two consecutive words not found if line break present

Post by Tracker Supp-Stefan »

Hi Ralf,

Glad to hear it's working.
There wasn't much info in the ticket, so I've requested an update.

Regards,
Stefan
ironick
User
Posts: 27
Joined: Mon Oct 28, 2019 9:40 pm

Re: Two consecutive words not found if line break present

Post by ironick »

I am having the same or similar issue with PDF-Xchange v 9.4, build 363.0. For example, suppose my pdf has the following lines on a page:
LINE 1:...word1
LINE 2:word2...

Here are the two kinds of searches I do:
Search 1: Proximity: "Words from the Same Page", Search Field: ["word1 word2"]
Search 2: Proximity: "Only Adjacent Words", Search Field: [word1 word2]

Search 1 returns no results, but Search 2 returns one result.

IMO, both kinds of searches should manifest the same behavior, ie find the two-word phrase even though its words are split across lines. I use Search-1-style searches all the time because I only want the search results to list one page per hit even if the phrase appears multiple times on a single page.
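
The "one page per hit" expectation can be sketched as follows. This is an illustration only, not how the Editor works: `pages` is a hypothetical list holding the extracted text of each page, and `count_phrase` and `pages_with_phrase` are names invented here. Words are compared after splitting on any whitespace, so a line break between the two words does not prevent a match.

```python
def count_phrase(words: list[str], phrase_words: list[str]) -> int:
    """Count positions where phrase_words occurs as a consecutive run in words."""
    n = len(phrase_words)
    return sum(words[i:i + n] == phrase_words for i in range(len(words) - n + 1))


def pages_with_phrase(pages: list[str], phrase: str) -> list[int]:
    """Return 1-based page numbers containing the phrase, one entry per page."""
    target = phrase.lower().split()
    if not target:
        return []
    hits = []
    for number, text in enumerate(pages, start=1):
        # str.split() with no argument treats spaces, tabs and line breaks alike.
        words = text.lower().split()
        if count_phrase(words, target) > 0:
            hits.append(number)
    return hits


# Hypothetical page texts; page 1 contains the phrase twice, once across a line break.
pages = [
    "On human\nnature and human nature again.",
    "Nothing relevant here.",
    "A study of human nature",
]
print(pages_with_phrase(pages, "human nature"))  # [1, 3]
```

Page 1 is listed once even though the phrase occurs twice on it. Punctuation stays attached to words in this sketch, so "nature," would not equal "nature".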

Note that I've looked into the Contents tab and examined how the words are broken up into text objects and this behavior seems to have nothing to do with how the words are broken up into text objects.

I've attached a sample document. Try Search 1 and Search 2 with word1=human and word2=nature. Search 1 returns 2 results, and Search 2 returns 3 results.

Re: Two consecutive words not found if line break present

Post by Vasyl-Tracker Dev Team »

Hi ironick.

We tried to reproduce the issue you described, but couldn't:

1. Proximity: "Words from the Same Page":
image.png
- 5 entries (you have 2)

2. Proximity: "Only Adjacent Words":
image(1).png
- 3 entries (you also have 3)

We also made a simple example with a line break:
TheHumanNatureIs.pdf
(4.17 KiB) Downloaded 32 times
It gives us the same result for cases 1 and 2, as it should. Just search for the simple "human nature" string.

Cheers.
Vasyl Yaremyn
rakunavi
User
Posts: 871
Joined: Sat Sep 11, 2021 5:04 am

Re: Two consecutive words not found if line break present

Post by rakunavi »

Hi Vasyl,

Please note the double quotes in Search 1 criteria shown by ironick. You should be able to reproduce it.

1. Proximity: "Words from the Same Page":
Search_1.png
- 2 entries

2. Proximity: "Only Adjacent Words":
Search_2.png
- 3 entries

I am currently waiting for improvements to the issue reported in the following topic, so I keep an eye out for topics with related content. Hoping that the above information will be of some help to you.

  • RT#6135: Japanese text "search" does not find multi-line results.
    https://forum.pdf-xchange.com/viewtopic.php?p=161068
Best regards,
rakunavi

Re: Two consecutive words not found if line break present

Post by ironick »

Hi Vasyl,

As rakunavi has pointed out, your first search, Proximity: "Words from the Same Page", left out the double quotes around the search phrase, ie the search box should contain ["human nature"], not [human nature]. That's the bug: putting double quotes around a phrase, which should be interpreted by the search engine as "Only Adjacent Words", fails to find adjacent words separated by a line break.

-- Nick

PS Thanks rakunavi for chiming in!
TrackerSupp-Daniel
Site Admin
Posts: 8440
Joined: Wed Jan 03, 2018 6:52 pm

Re: Two consecutive words not found if line break present

Post by TrackerSupp-Daniel »

Hello, ironick

Placing double quotes around a phrase forces the search function to match every character within the string, including invisible characters such as spaces. As such, if you use double quotes, it is expected that matching across word wrapping will not work as desired, because there is no "space" character between the words: there is either the end of a text box or a return character in its place, and neither is matched by the space in the quoted phrase.

If you need to search for specific adjacent items through line breaks, the most reliable method would be to instead use the advanced search criteria, use adjacent words, and define the words in the "all of these words" field:
[Attached screenshot: image.png]
This should allow you to find two specific words, when side by side (in any order) regardless of the presence of a space, return, or end of block.
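
The "adjacent words, in any order" behaviour described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the Editor's implementation: the function name `adjacent_pairs` is invented, and the text is assumed to be available as one string in which line breaks and block ends appear as ordinary whitespace.

```python
def adjacent_pairs(text: str, first: str, second: str) -> list[tuple[str, str]]:
    """Return neighbouring word pairs that consist of the two given words, in either order."""
    # Splitting on any whitespace makes a space and a line break equivalent.
    words = text.lower().split()
    wanted = {first.lower(), second.lower()}
    return [(a, b) for a, b in zip(words, words[1:]) if {a, b} == wanted]


sample = "under both\naerobic conditions, and aerobic both ways"
print(adjacent_pairs(sample, "both", "aerobic"))
# [('both', 'aerobic'), ('aerobic', 'both')]
```

Because the comparison ignores order, this kind of search cannot tell "both aerobic" from "aerobic both", which is why it does not replace an exact-phrase search.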

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com

Re: Two consecutive words not found if line break present

Post by ironick »

Hi Daniel,

Thanks for your quick response. While I see the logic of the behavior, it certainly violates the design principle of least surprise: https://www.wikiwand.com/en/Principle_of_least_astonishment

Typical (non-developer) search engines interpret the spaces in a quoted phrase as generic whitespace, so that the spaces will match tabs, line breaks, etc. So the PDF-XChange Editor search engine does not do what the vast majority of people "expect". Is there any chance that the search engine behavior will be changed to better match users' expectations?
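
Treating the spaces of a quoted phrase as generic whitespace can be sketched with a regular expression. The helper names `phrase_pattern` and `find_phrase` are assumptions made for this example, and it says nothing about how the PDF-XChange Editor or any particular search engine is implemented. Each gap between words becomes `\s+`, which matches one or more spaces, tabs or line breaks while keeping the word order fixed.

```python
import re


def phrase_pattern(phrase: str) -> re.Pattern:
    """Build a pattern in which every gap between words matches any whitespace run."""
    words = phrase.split()
    return re.compile(r"\s+".join(re.escape(w) for w in words), re.IGNORECASE)


def find_phrase(text: str, phrase: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of each in-order match in the original text."""
    return [m.span() for m in phrase_pattern(phrase).finditer(text)]


text = "human\nnature, human\tnature and human   nature"
print(find_phrase(text, "human nature"))
# [(0, 12), (14, 26), (31, 45)]
```

Unlike collapsing whitespace first, this keeps offsets into the original text, which is what a viewer would need in order to highlight a match that spans two lines.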

As it is, PDF-XChange now lacks any way to search for a series of words separated by whitespace in the exact order they occur. Again, many search engines either use quoted phrases for this or offer an advanced search option such as "this exact phrase". See this example from Twitter:
Twitter Adv Search - Screenshot 2022-08-17 113803.jpg
Is there any chance that "This exact phrase" will be added as an Advanced Criteria?

In the meantime, since this behavior is so contrary to most user expectations, I suggest that it be documented here: https://help.pdf-xchange.com/pdfxe9/index.html?search_ed_2.html

Thanks,

-- Nick

Re: Two consecutive words not found if line break present

Post by TrackerSupp-Daniel »

Hello, ironick

I've spoken with the team on this again to confirm, and it seems my understanding was incorrect.
You are correct that the search should work fine if the text is well formatted in your document, which it appears to be from my investigation. Vasyl will be doing some further testing with your document to see what he can find, and why the space is not being correctly treated as word wrapping in this instance, as should have been expected.

Following that, your request for "exactly this phrase" is already present: it is the simple search with quotation marks in place.

Kind regards,
Dan McIntyre - Support Technician

Re: Two consecutive words not found if line break present

Post by ironick »

Thank you, Daniel. Glad to hear it.

Two consecutive words not found if line break present

Post by TrackerSupp-Daniel »

:)
Dan McIntyre - Support Technician

Re: Two consecutive words not found if line break present

Post by Vasyl-Tracker Dev Team »

Hi Nick.
ironick wrote: As it is, PDF-XChange now lacks any way to search for a series of words separated by whitespace in the exact order they occur. Again, many search engines either use quoted phrases for this or offer an advanced search option such as "this exact phrase".
You already have this possibility. Just put your exact phrase in double quotes in the first edit box, the one that says 'Enter word or phrase', OR put the same in the Advanced > "All of these words" edit box. Both ways should work as expected.

BUT I agree that using double quotes isn't an obvious way. So it seems you are right, and adding an additional "This exact phrase" edit box is a good idea, thanks.

P.S. The 'Two consecutive words not found if line break present' issue will be fixed in the next build (RT 2246, 6135).

Cheers.
Vasyl Yaremyn

Re: Two consecutive words not found if line break present

Post by ironick »

Vasyl,

I think you've misunderstood something in this thread of replies.

Searching for "human nature" (with double quotes) in the first editbox does NOT match instances where human is at the end of a line and nature is at the beginning of the following line. "human nature" in double quotes ONLY matches instances where there is a SPACE between human and nature. It does NOT match instances where there is a line break between human and nature. That's what I mean by WHITESPACE, ie a space, tab, line break, multiple spaces, etc.

So I'm pretty sure my claim stands: "PDF-XChange now lacks any way to search for a series of words separated by WHITESPACE (including line breaks) in the exact order they occur."

I hope this clears up any confusion regarding the current search behavior. I'm glad to hear this will be fixed in an upcoming release.

Thanks again,

-- Nick

Re: Two consecutive words not found if line break present

Post by TrackerSupp-Daniel »

Hello, ironick

Vasyl understood; he was reiterating that in most cases it already works as you described (for example, in his sample document the same search, with quotes, does find the phrase across the line break). Your document seems to have some special aspect that breaks this logic, which is where the note at the end of his post comes in:
Vasyl-Tracker Dev Team wrote: Thu Aug 18, 2022 1:38 am P.S. The 'Two consecutive words not found if line break present' issue will be fixed in the next build (RT 2246, 6135).
It is already in progress: he has spotted the cause of the issue and is working on a solution for the next release.

Kind regards,
Dan McIntyre - Support Technician