4 What languages does ayfie work in, and what are the options for languages that are not included?
Ayfie works in all major European languages, among them English, German, Spanish and French. We are continuously working on expanding language support even further. In the case of an unsupported language, ayfie falls back to a more basic matching algorithm that recognizes fewer variants.
5 How is this process different from LSI?
In theory, LSI applies a singular value decomposition (SVD) to the matrix of documents and the terms therein. In practice, the high dimensionality of the initial matrix means that heavy prefiltering and reduction of the term space are usually applied to make it computationally feasible for larger document sets.
As a result, it usually misses many obvious variants of words (often not even collapsing inflectional forms into one concept) while conflating many terms that are unrelated. Additionally, it normally works on the space of single terms across all documents, while much of the salient vocabulary comes in the form of multi-term expressions. LSI will therefore normally not treat "reimbursement and disclosure agreement" as a dimension, but look at the individual words instead.
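The matrix LSI starts from can be sketched as follows; the documents here are invented for illustration. A real LSI implementation would then apply an SVD to this matrix and truncate it to a small number of latent dimensions.

```python
from collections import Counter

# Toy corpus; in LSI every distinct single term becomes a matrix dimension.
docs = [
    "reimbursement and disclosure agreement",
    "disclosure of the reimbursement terms",
    "service agreement terms",
]

# Vocabulary of single terms across all documents.
vocab = sorted({t for d in docs for t in d.split()})

# Document-term count matrix: one row per document, one column per term.
matrix = [[Counter(d.split())[t] for t in vocab] for d in docs]

for term, counts in zip(vocab, zip(*matrix)):
    print(f"{term:14} {list(counts)}")

# Note how the multi-term expression "reimbursement and disclosure
# agreement" is split into four separate columns: LSI only ever sees
# the individual words. The SVD is then applied to this matrix.
```

This also makes the dimensionality problem concrete: three short documents already yield eight term columns, and real collections yield hundreds of thousands, which is why the prefiltering mentioned above becomes necessary.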
ayfie's term space is made up of the (multi-term) expressions that were found in certain linguistic contexts in the documents. Variant reduction and synonym expansion are applied to these entities rather than to individual terms.
The effect of LSI is largely dependent on term distribution in the document space. LSI might therefore work very well for one specific use case and fail opaquely in another setting. ayfie does not depend on statistical properties of the documents at all.
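ayfie's actual extraction rules are not spelled out here, but the idea of harvesting multi-term expressions (rather than single terms) can be sketched with a simple stand-in that chunks text at stopwords; the stopword list and the input sentence are invented for illustration.

```python
import re

# Hypothetical stand-in for ayfie's linguistic contexts: split the text
# at stopwords and keep the spans in between as candidate multi-term
# expressions (the real extraction rules are not public).
STOPWORDS = {"the", "of", "to", "a", "in", "is", "for", "this"}

def candidate_expressions(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    chunks, current = [], []
    for tok in tokens:
        if tok in STOPWORDS:
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(candidate_expressions(
    "This is the reimbursement and disclosure agreement for the new service"
))
```

Unlike the single-term columns of an LSI matrix, the whole phrase "reimbursement and disclosure agreement" survives here as one unit.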
6 How can it perform the same functions as an LSI engine?
Where an LSI engine deals with the "latent semantics" of documents, ayfie is based on 30 years of research into the actual semantics of documents. By applying codified linguistic knowledge to the document set, it can detect term variants without even computing a document-term matrix.
An additional advantage of this approach is its transparency: it is possible to document why a certain document matched by reporting all the processing stages that led to the match.
Additionally, the linguistic knowledge can continuously be extended to yield higher precision or recall, depending on the requirements of the specific use case, while LSI is essentially untunable.
7 What is the fuzzy logic, or the logic that looks for misspellings, based on?
After extracting the salient vocabulary from the individual documents, a similarity matrix is computed across the full document set, considering misspellings based on Levenshtein distance, known inflectional forms, stopwords and (optionally) synonyms. Based on that matrix, all variants are folded into their most dominant representatives, which are then used for all further computations.
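The core of this step can be illustrated with a minimal sketch: standard Levenshtein edit distance plus a greedy fold of each variant into its most frequent close neighbour. The real computation (the full similarity matrix, inflections, stopwords, synonyms) is of course more involved; the terms below are invented examples.

```python
from collections import Counter

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fold_variants(terms, max_dist=1):
    # Fold every term into its most frequent close neighbour,
    # i.e. its "most dominant representative".
    counts = Counter(terms)
    reps = {}
    for term in counts:
        close = [t for t in counts if levenshtein(term, t) <= max_dist]
        reps[term] = max(close, key=lambda t: counts[t])
    return reps

terms = ["agreement", "agreement", "agreemnt", "disclosure"]
print(fold_variants(terms))
# the misspelling "agreemnt" folds into "agreement"
```

All later computations then operate on the representatives only, so "agreemnt" and "agreement" count as one concept.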
8 How are inflection forms identified and calculated?
We have compiled large dictionaries formalizing the inflectional classes and the vocabulary for each language we support. Each of those dictionaries has millions of entries but takes only milliseconds to apply to a document, thanks to a proprietary finite-state representation of those dictionaries. Additionally, we have implemented heuristics for dealing with special phenomena in certain languages, such as compound decomposition in German and Norwegian.
These resources also include synonym lists and taxonomies that can be used if broader matching is required.
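The finite-state representation itself is proprietary, but the lookup idea can be sketched with a plain character trie mapping inflected forms to their lemmas; the entries below are invented stand-ins for the real dictionaries, which number millions of entries.

```python
# Minimal trie mapping inflected forms to lemmas -- a toy stand-in for
# the finite-state dictionaries described above.
def build_trie(entries):
    root = {}
    for form, lemma in entries:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node["$"] = lemma  # end-of-word marker holding the lemma
    return root

def lookup(trie, form):
    # Walk the trie one character at a time; lookup cost depends only
    # on the length of the word, not on the size of the dictionary.
    node = trie
    for ch in form:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

trie = build_trie([
    ("agreements", "agreement"),
    ("agreement", "agreement"),
    ("went", "go"),
])
print(lookup(trie, "agreements"))  # -> agreement
print(lookup(trie, "wint"))        # -> None (unknown form)
```

This is why dictionary size barely affects runtime: lookup time grows with word length, not with the number of entries.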
9 How is the “weight” that is assigned to each term group calculated?
The weight corresponds to the average salience of that group in the content. For every group and document, a salience factor is calculated that describes the importance of that group for the current document. When a search is run, the average of those factors is used to weight each term group occurring in the query.
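A minimal sketch of this averaging follows; the term groups and per-document salience factors are invented (how the salience factor itself is computed is not specified here).

```python
# Hypothetical per-document salience factors for two term groups,
# on an invented 0..1 scale; one value per document the group occurs in.
salience = {
    "reimbursement agreement": [0.9, 0.7, 0.8],
    "service":                 [0.2, 0.4],
}

def group_weight(group):
    # A group's weight is the average of its salience factors
    # across the documents it occurs in.
    factors = salience[group]
    return sum(factors) / len(factors)

def query_weights(query_groups):
    # At search time, each term group in the query is weighted
    # by its average salience.
    return {g: group_weight(g) for g in query_groups}

print(query_weights(["reimbursement agreement", "service"]))
```

So a group that is consistently central to its documents outweighs one that appears only as background vocabulary, regardless of raw frequency.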