Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others


algorithm - How do I select the best match for a string among multiple documents when the score is equal for two of them?

I have implemented an algorithm in Elm that compares a sentence (the user input) against multiple other sentences (the data). The user input and each data sentence are split into words, and the comparison is done word by word. The algorithm marks the data sentence that shares the most words with the user input as the best match.

On the first pass, the first sentence in the data is taken as the best match. The algorithm then moves on to the second sentence and counts its matches. If that count is greater than the previous one, the second sentence becomes the best match; otherwise the previous one is kept.

If two sentences have an equal number of matches, I currently compare their sizes and select the smaller one as the best match.
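
The procedure described so far can be sketched roughly as follows. This is written in Python rather than Elm, purely for illustration; the function names are hypothetical, and two details are assumptions because the post does not specify them: matches are counted over distinct words, and "size" is taken to mean the number of words.

```python
def tokenize(sentence):
    """Lowercase a sentence and split it on whitespace."""
    return sentence.lower().split()


def match_count(query_words, candidate_words):
    """Count distinct candidate words that also occur in the query (assumption)."""
    return len(set(query_words) & set(candidate_words))


def best_match(query, candidates):
    """Return the candidate sharing the most words with the query.

    Ties are broken in favour of the candidate with fewer words
    (assumed meaning of "smaller size"). Returns None for an empty list.
    """
    query_words = tokenize(query)
    best, best_score, best_len = None, -1, 0
    for candidate in candidates:
        words = tokenize(candidate)
        score = match_count(query_words, words)
        # A strictly higher score wins; on an equal score the shorter one wins.
        if score > best_score or (score == best_score and len(words) < best_len):
            best, best_score, best_len = candidate, score, len(words)
    return best
```

For example, `best_match("the cat sat down", ["the cat sat on the mat", "the cat sat"])` ties on three shared words and picks the shorter candidate, `"the cat sat"`.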

There is no semantic meaning involved, so is picking the smaller sentence the best way to break the tie, or are there better options? I have looked for scientific references but couldn't find any.

Edit:

To summarize: suppose you compare one sentence to two other sentences based on word occurrences. If both of those sentences contain the same number of words that also appear in the sentence you are comparing against, which one should be marked as the most similar? Which methods are used to determine this similarity?



1 Answer


Some factors you can add to improve the comparison:

  • String similarity (e.g. Levenshtein, Jaro-Winkler, ...)
  • Account for sentence length by applying a linear or geometric penalty for a difference in length (at either the character or the word level)
  • Clean the strings (remove stopwords, special characters, etc.)
  • Take the sequence (position) of words into account, i.e. which word comes before or after another.
  • Use sentence embeddings for similarity to also capture some semantics (https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/)
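
A few of these factors can be combined in a small sketch. The Python code below is only one possible arrangement, not a recommended implementation: the stopword list, the `length_weight` value, and all function names are assumptions made for illustration. It cleans the strings, subtracts a linear penalty for a difference in word count, and breaks remaining ties with a word-level Levenshtein distance, which also reflects word order.

```python
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "on"}  # illustrative list only


def content_words(sentence):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?\"'()") for w in sentence.lower().split())
    return [w for w in words if w and w not in STOPWORDS]


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def score(query, candidate, length_weight=0.1):
    """Shared-word count minus a linear penalty on the word-count difference."""
    q, c = content_words(query), content_words(candidate)
    overlap = len(set(q) & set(c))
    return overlap - length_weight * abs(len(q) - len(c))


def rank(query, candidates):
    """Sort by score (highest first); break ties by word-level edit distance."""
    q = content_words(query)
    return sorted(
        candidates,
        key=lambda c: (-score(query, c), levenshtein(q, content_words(c))),
    )
```

With the query `"dog bites man"`, the candidates `"man bites dog"` and `"dog bites man"` get the same overlap score, and the edit distance on the word sequences then puts the candidate with the matching word order first. Because `levenshtein` works on any sequence, it can be applied to characters as well as to word lists.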

Finally, there will always be some sentences that are at the same distance from your input even though they differ from each other. That's OK, as long as they really are equally different from your input sentence.

