After some high-level testing, search results in the DMC are very poor.
Both * and ? return some results, but neither works reliably. When they do work, they work in both alphabetic and numeric searches.
From the DMC Search Help:
* (a wildcard character that represents zero or more characters)
part number 12-*
Returns files that include part numbers that begin with 12-, such as part number 12-344, part number 12-78884, or part number 12-1.
The * wildcard cannot be used in the following circumstances:
- As the first character in a search string
- In phrase searches
- In the same search as the ? wildcard.
? (a wildcard character that represents one character)
part number 1?3
Returns files with part numbers that begin with 1 and end with 3, such as part number 123, part number 113, or part number 103.
The ? wildcard cannot be used in the following circumstances:
- As the first or last character in a search string
- In phrase searches
- In the same search as the * wildcard.
You can search successfully for fa?m (farm), but not for pa?ty (party).
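The restrictions quoted from the Search Help can be checked mechanically before a query is submitted. The following is a minimal sketch, not DMC code: the function name `wildcard_rule_violations` is hypothetical, and treating a query wrapped in double quotes as a "phrase search" is an assumption, since the help text does not say how phrases are written.

```python
def wildcard_rule_violations(query: str) -> list[str]:
    """Return the documented wildcard restrictions that a query breaks."""
    problems = []
    text = query.strip()
    has_star = "*" in text
    has_qmark = "?" in text

    if has_star and has_qmark:
        problems.append("* and ? cannot appear in the same search")
    if text.startswith("*"):
        problems.append("* cannot be the first character")
    if text.startswith("?") or text.endswith("?"):
        problems.append("? cannot be the first or last character")
    # Assumption: a phrase search is a query wrapped in double quotes.
    is_phrase = len(text) >= 2 and text[0] == '"' and text[-1] == '"'
    if is_phrase and (has_star or has_qmark):
        problems.append("wildcards cannot be used in phrase searches")
    return problems


if __name__ == "__main__":
    for q in ["part number 12-*", "part number 1?3", "*12", "12?", "pa?ty"]:
        print(q, "->", wildcard_rule_violations(q) or "no documented rule broken")
```

Note that pa?ty passes every documented rule, so the failure reported above is not explained by the Search Help restrictions alone.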
The DMC uses the Apache Lucene engine for indexing and, presumably, for executing searches, but with a lot of custom code wrapped around it. We should research the use cases highlighted in this ticket and determine, for each one, whether there is a way to achieve it that is simply not obvious to users (i.e., Lucene and the DMC's wrapper support some syntax that users just need to understand), or whether Lucene supports it but the wrapper hides it, or at least does not expose it (a subtle, but important, nuance).
Beyond this, a quick look over the Lucene documentation on the Apache site turned up a couple of its selling points that might bear some investigation, to see if the DMC's search (and update) processes could make better use of the engine. Specifically, it mentions being able to search multiple indexes and produce merged results. The current update process appears to rebuild a monolithic index across both the baseline and the updated content. It could instead build a separate index for each update pack, with occasional consolidation of the indexes over multiple packs if that is needed to maintain performance when a large number of updates have been applied to a given baseline.
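To make the per-pack idea concrete, here is a toy sketch in plain Python. It does not use Lucene and does not reflect how the DMC stores its indexes. Every name in it (`build_index`, `search_all`, `consolidate`, the document ids) is hypothetical. It only shows the shape of the proposal: one small index per update pack, a search that merges hits across all of them, and an optional consolidation step.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Build a toy inverted index: term -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search_all(term: str, indexes: list[dict[str, set[str]]]) -> list[str]:
    """Search every index and return one merged, sorted list of hits."""
    hits = set()
    for index in indexes:
        hits |= index.get(term.lower(), set())
    return sorted(hits)


def consolidate(indexes: list[dict[str, set[str]]]) -> dict[str, set[str]]:
    """Fold several per-pack indexes into a single index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return dict(merged)


if __name__ == "__main__":
    baseline = build_index({"base-1": "pump assembly", "base-2": "valve housing"})
    pack_1 = build_index({"upd-1": "pump seal kit"})  # indexed on its own
    print(search_all("pump", [baseline, pack_1]))
    print(search_all("pump", [consolidate([baseline, pack_1])]))
```

The sketch ignores one thing a real design would have to handle: an update pack that replaces or deletes baseline content, which would leave stale entries in the baseline index until consolidation.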
Another feature worth some consideration is the capability for simultaneous updating and searching. This seems to suggest that indexing, whether triggered by updates or by the occasional reindexing mentioned above, could run in the background during normal DMC use, possibly at a lower processing priority. That would shorten the update process and make reindexing a more casual, when-you-have-time-but-not-while-I'm-busy sort of affair. Any "searching won't turn up new stuff" issues caused by this delayed indexing approach could be mitigated by showing an indicator when indexes are out of date and offering the user a button to "index everything now at highest priority", if they're willing to have their system tied up for a bit in order to make sure they get the latest and greatest in their searches.
PTC Case #C10581645
Regarding "fa?m" vs "pa?ty":
1) I don’t understand how the word party would be a stemmed word. Do they stem everything that ends in a y since it may be parties, partying as well?
Words that can take endings are stemmed so that all variations of the word are encoded as the stem. So, in the original content, all instances of "party", "parties", "partied", "partying", etc. would be indexed as instances of "parti". In order for searches to work, the search terms also have to be stemmed, so when the user searches for "party" the search algorithm knows to search for "parti" under the hood. That way, when you search for "party", you also get hits for instances of "parties", "partying", etc. The process falls down when you search for "p*rty": the stemming algorithm doesn't understand wildcards, so it doesn't convert the term to "p*rti", which is what you really need to get the desired results.
2) Which character or groups of letters would induce the tool to stem the word?
It's a fairly sophisticated algorithm (at least for English) that watches for certain patterns of consonants and vowels, particular endings like "-tion", and various other things. If you want the gory details, the algorithm is available at http://programmingpraxis.com/2009/09/08 ... temming/2/.
3) Why does p*rty return 0 results, but party returns 400+? Wouldn’t the stemming make it so that p*rty returns at least as many results as “party”?
As I indicated above, the problem is that the wildcard prevents the stemmer on the searching side from properly stemming the word, so it can't match the terms that were stemmed during indexing. The rule for when to stem a word that ends in "y" depends on there being another vowel before the "y". But the stemmer doesn't see the wildcard as a (potential) vowel, so it doesn't end up stemming the word.
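The mismatch described in these answers can be reproduced with a toy model. This is a sketch under stated assumptions, not the DMC's or Lucene's actual code: `stem_final_y` is a hypothetical, heavily simplified stand-in for the single "final y after a vowel" rule described above, the index is a plain set of stemmed words, and Python's `fnmatchcase` stands in for wildcard matching.

```python
from fnmatch import fnmatchcase

VOWELS = set("aeiou")


def stem_final_y(term: str) -> str:
    """Toy rule: a final 'y' becomes 'i' if an earlier character is a vowel.

    A wildcard character is not a vowel, so it never satisfies the rule.
    """
    if term.endswith("y") and any(ch in VOWELS for ch in term[:-1]):
        return term[:-1] + "i"
    return term


def search(query: str, indexed_terms: set[str]) -> list[str]:
    """Stem the query the same way as the index, then match with wildcards."""
    pattern = stem_final_y(query.lower())
    return sorted(t for t in indexed_terms if fnmatchcase(t, pattern))


# Index side: words are stemmed before being stored, so "party" -> "parti".
index = {stem_final_y(word) for word in ["party", "farm", "form"]}

if __name__ == "__main__":
    for q in ["party", "fa?m", "p*rty", "p*rti"]:
        print(q, "->", search(q, index))
```

In this toy, "party" is stemmed to "parti" and matches, and "fa?m" matches "farm" because no stemming is involved. "p*rty" has no vowel before the "y", stays unstemmed, and finds nothing, while the hand-stemmed "p*rti" does match. Whether typing the stemmed form works in the real DMC search is untested here.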