Issues with search engine bundled with DMC

Useful stuff for users and developers of the XML based dynamic publishing system "Arbortext". All information in this wiki came from posts to the Arbortext Adepters mailing list that were identified by members of that list as being extremely useful because the mailing list has moved around over the years. For more information see the group FAQ.


Postby liz » Thu Jan 12, 2017 7:12 pm

Updated: 2/2012

Question:

Some high-level testing shows that search results in the DMC are very poor.

While both * and ? return some results, they do not always work. When they do work, they work in both alphabetic and numeric searches.

From the DMC Search Help:

* (a wildcard character that represents zero or more characters)

part number 12-*

Returns files that include part numbers that begin with 12-, such as part number 12-344, part number 12-78884, or part number 12-1.

The * wildcard cannot be used in the following circumstances:
  • As the first character in a search string
  • In phrase searches
  • In the same search as the ? wildcard.

? (a wildcard character that represents one character)

part number 1?3

Returns files with part numbers that begin with 1 and end with 3, such as part number 123, part number 113, or part number 103.

The ? wildcard cannot be used in the following circumstances:
  • As the first or last character in a search string
  • In phrase searches
  • In the same search as the * wildcard.
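The rules quoted above from the DMC Search Help can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the DMC's actual code: the function names are hypothetical, a phrase search is assumed to be a query wrapped in double quotes, and the wildcards are allowed to match any character (the real engine most likely works on individual indexed terms).

```python
import re


def validate_query(query):
    """Return the documented wildcard rules that a query breaks (empty list if none)."""
    problems = []
    has_star = "*" in query
    has_qmark = "?" in query
    # Assumption: a phrase search is a query wrapped in double quotes.
    is_phrase = len(query) >= 2 and query.startswith('"') and query.endswith('"')

    if has_star and has_qmark:
        problems.append("* and ? cannot be used in the same search")
    if is_phrase and (has_star or has_qmark):
        problems.append("wildcards cannot be used in phrase searches")
    if query.startswith("*"):
        problems.append("* cannot be the first character")
    if query.startswith("?") or query.endswith("?"):
        problems.append("? cannot be the first or last character")
    return problems


def wildcard_to_regex(query):
    """Translate * (zero or more characters) and ? (exactly one character) to a regex."""
    parts = []
    for ch in query:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


if __name__ == "__main__":
    for q in ["part number 12-*", "part number 1?3", "*12", "12?", '"part 1?3"']:
        print(q, "->", validate_query(q) or "ok")
    pattern = wildcard_to_regex("part number 1?3")
    print(bool(pattern.fullmatch("part number 123")))
```

A query that passes `validate_query` should, according to the help text, behave like the regular expression from `wildcard_to_regex`. The "pa?ty" case in the question passes these checks and still returns nothing, which is what makes it a real issue rather than a documented limitation.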

You can search successfully for fa?m (farm), but not for pa?ty (party).

Discussion:

The DMC uses the Apache Lucene engine for indexing and, presumably, for executing searches, but with a lot of custom code wrapped around it. We should research the use cases highlighted in this ticket and determine, for each one, which of two situations applies. Either there is a way to achieve it that is simply not obvious to users (i.e., Lucene and the DMC's wrapper around it support some syntax that the users just need to understand), or there is a way to do it that Lucene supports but the wrapper hides, or at least does not expose (a subtle, but important, nuance).

Beyond this, a quick look over the Lucene documentation on the Apache site turned up a couple of its selling points that might bear some investigation, to see if the DMC's search (and update) processes could make better use of the engine. Specifically, it mentions being able to search multiple indexes and produce merged results. The current update process appears to rebuild a monolithic index across both the baseline and the updated content. It could instead build a separate index for each update pack, with occasional consolidation of the indexes over multiple packs if needed to maintain performance when a large number of updates are applied over a given baseline.
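To make the per-pack idea concrete, here is a minimal sketch using plain Python dictionaries as stand-ins for indexes. It does not use Lucene's API, and all names (`build_index`, `search_all`, `consolidate`) are hypothetical. It also assumes an update pack only adds documents; a pack that replaces or deletes baseline documents would need extra bookkeeping that is not shown.

```python
def build_index(docs):
    """Build a tiny inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search_all(indexes, term):
    """Search every index separately and merge the hits."""
    hits = set()
    for index in indexes:
        hits |= index.get(term.lower(), set())
    return hits


def consolidate(indexes):
    """Fold several per-pack indexes into one, e.g. after many updates."""
    merged = {}
    for index in indexes:
        for term, doc_ids in index.items():
            merged.setdefault(term, set()).update(doc_ids)
    return merged


if __name__ == "__main__":
    baseline = build_index({"base-1": "pump assembly", "base-2": "valve assembly"})
    pack_1 = build_index({"upd-1": "pump seal kit"})  # only the new content is indexed
    print(search_all([baseline, pack_1], "pump"))
    print(consolidate([baseline, pack_1])["pump"])
```

Applying an update then costs only the indexing of the new pack, and consolidation becomes an optional maintenance step rather than part of every update.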

Another feature worth some consideration is the capability for simultaneous updating and searching. This suggests that indexing, whether triggered by updates or by the occasional reindexing mentioned above, could run in the background during normal DMC use, possibly at a lower processing priority. That would shorten the update process and make reindexing a more casual, when-you-have-time-but-not-while-I'm-busy sort of affair. Any "searching won't turn up new stuff" issues caused by this delayed indexing could be mitigated by showing an indicator when the indexes are out of date and offering the user a button to "index everything now at highest priority", if they're willing to have their system tied up for a bit in order to get the latest and greatest in their searches.
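A rough sketch of that workflow follows, again with a dictionary standing in for the real index. The class and method names are hypothetical, and the pause between documents is only a stand-in for "lower processing priority", since Python threads have no priority setting.

```python
import threading


class LazyIndex:
    """Toy index that accepts updates immediately and indexes them later."""

    def __init__(self):
        self._lock = threading.Lock()
        self._index = {}    # term -> set of document ids
        self._pending = []  # documents received but not yet indexed

    def add_update(self, doc_id, text):
        with self._lock:
            self._pending.append((doc_id, text))

    @property
    def is_stale(self):
        """True while some content is not searchable yet (drives the indicator)."""
        with self._lock:
            return bool(self._pending)

    def index_one(self):
        """Index a single pending document; return False if nothing was pending."""
        with self._lock:
            if not self._pending:
                return False
            doc_id, text = self._pending.pop(0)
            for term in text.lower().split():
                self._index.setdefault(term, set()).add(doc_id)
            return True

    def index_everything_now(self):
        """The 'index everything now' button: drain the queue in the foreground."""
        while self.index_one():
            pass

    def search(self, term):
        with self._lock:
            return set(self._index.get(term.lower(), set()))

    def start_background(self, pause=0.05):
        """Index one document at a time, pausing in between so searches stay responsive."""
        stop = threading.Event()

        def worker():
            while not stop.is_set():
                self.index_one()
                stop.wait(pause)

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return stop, thread
```

Searches keep working against whatever has been indexed so far, `is_stale` tells the user interface when to show the out-of-date indicator, and `index_everything_now` is the opt-in way to catch up immediately.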

PTC Case #C10581645

Regarding "fa?m" vs "pa?ty":

1) I don’t understand how the word party would be a stemmed word. Do they stem everything that ends in a y since it may be parties, partying as well?

Words that can take endings are stemmed so that all variations of a word are indexed under the same stem. In the original content, all instances of "party", "parties", "partied", "partying", etc. are indexed as instances of "parti". For searches to work, the search terms also have to be stemmed, so when the user searches for "party" the search algorithm knows to search for "parti" under the hood. That way, a search for "party" also returns hits for "parties", "partying", etc. The process breaks down when you search for "p*rty": the stemming algorithm doesn't understand wildcards, so it doesn't convert the term to "p*rti", which is what you would need to get the desired results.
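The mismatch can be reproduced with a toy model. Everything here is an assumption made for illustration: `toy_stem` is a crude stand-in that only handles the "party" family (it is not the stemmer the DMC uses), the index is a plain dictionary, and wildcard terms are assumed to reach the index unstemmed, as described above.

```python
from fnmatch import fnmatchcase


def toy_stem(word):
    """Crude stand-in for a real stemmer: maps party/parties/partied/partying to 'parti'."""
    for suffix in ("ying", "ies", "ied", "y"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + "i"
    return word


def build_index(docs):
    """Index every word under its stem: stem -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index


def search(index, term):
    """Stem plain terms; pass wildcard terms through unstemmed (the assumed behavior)."""
    has_wildcard = "*" in term or "?" in term
    pattern = term if has_wildcard else toy_stem(term)
    hits = set()
    for indexed_term, doc_ids in index.items():
        if fnmatchcase(indexed_term, pattern):
            hits |= doc_ids
    return hits


if __name__ == "__main__":
    docs = {1: "the party starts", 2: "two parties merged", 3: "the farm is open"}
    index = build_index(docs)
    for query in ["party", "p*rty", "p*rti", "fa?m", "pa?ty"]:
        print(query, "->", sorted(search(index, query)))
```

In this model "fa?m" finds "farm" because "farm" is stored unchanged, while "pa?ty" and "p*rty" find nothing because the index only contains "parti". Searching for the stem directly ("p*rti") does match, which might be worth trying in the DMC as a workaround.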

2) Which character or groups of letters would induce the tool to stem the word?

It's a fairly sophisticated algorithm (at least for English) that watches for certain patterns of consonants and vowels, particular endings like "-tion", and various other things. If you want the gory details, the algorithm is available at http://programmingpraxis.com/2009/09/08 ... temming/2/.

3) Why does p*rty return 0 results, but party returns 400+? Wouldn’t the stemming mean that p*rty returns at least as many results as “party”?

As I indicated above, the problem is that the wildcard prevents the stemmer on the searching side from properly stemming the word, so it can't match the terms that were stemmed during indexing. The rule for stemming a word that ends in "y" depends on there being another vowel before the "y". The stemmer doesn't see the wildcard as a (potential) vowel, so it doesn't stem the word.
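The rule described here can be written out as a small function. This is a simplified sketch of just that one rule (a final "y" becomes "i" when the rest of the word contains a vowel), not the full stemming algorithm and not the DMC's code. The function names are hypothetical.

```python
VOWELS = set("aeiou")


def contains_vowel(stem):
    """True if the stem has a vowel; 'y' counts as a vowel only after a consonant."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def stem_final_y(word):
    """Turn a final 'y' into 'i' when the part before it contains a vowel."""
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


if __name__ == "__main__":
    for word in ["party", "p*rty", "sky"]:
        print(word, "->", stem_final_y(word))
```

For "party" the part before the "y" is "part", which contains "a", so the word becomes "parti". For "p*rty" it is "p*rt": the wildcard is not a vowel, so the rule does not fire and the term stays "p*rty", which never appears in the index.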
