
Finding the closest matching sentence from a list of sentences

Fun with matching sentences using Sentence-Transformers.

6 February 2022, updated 6 February 2022
Photo: https://www.pexels.com/nl-nl/@andrew

For a project I was looking for a way to match an incoming sentence against a fixed list of sentences. This is a complex subject, but while searching the internet I stumbled upon the amazing project Sentence-Transformers.

I modified one of the examples on the Sentence-Transformers site to read the sentences from a text file, and typed some input sentences as a demo. Then I downloaded a text file from the internet, 'Choosing a Cat, by R. Roger Breton and Nancy J. Creek', and was amazed by the matches found.

I am not going to say more. I just put the code below in case you want to try this yourself. Put in your favorite text and let the fun begin!

The code

Below is the code. Installing takes time and disk space. Sentence-Transformers is a deep learning library, but the script only uses a pretrained model, so nothing is trained here. The first time you run it, the model is downloaded. On every run the sentences from the text file are encoded into embeddings, which can take several minutes if the text file is over 100 kB. NLTK is used to split the text file into sentences.

# based on the example on this page:
# https://www.sbert.net/docs/quickstart.html
from sentence_transformers import SentenceTransformer, util
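import nltk
from nltk.tokenize import sent_tokenize
import re

# sent_tokenize needs the NLTK 'punkt' tokenizer data, download it once
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')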

# for the list of sentences I used the following file, change it and have more fun:
# Choosing a Cat, by R. Roger Breton and Nancy J. Creek 
# http://www.textfiles.com/fun/choosing.cat
sentences_file = 'choosing.cat'

# get list of sentences from file
pc = re.compile(r'[a-zA-Z0-9 \.]*$')
sentences = []
with open(sentences_file) as fo:
    tokens = sent_tokenize(fo.read())
    for line in tokens:
        line = line.replace('\n', ' ')
        line = re.sub(r' \s+', ' ', line)
        if(pc.match(line)):
            sentences.append(line)
print('sentences =\n{}'.format(sentences))

model = SentenceTransformer('all-MiniLM-L6-v2')

# encode all sentences
embeddings = model.encode(sentences)

# match loaded sentences with these (add/change)
input_sentences = [
    'Have you seen my red shirt?',
    'No way sister.',
    'Every once in a while we go shopping for cheap meat.',
    'Carmaker fined for gas emissions cover-up.',
    'This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.',
    'You should never jump into the swimming pool.',    
    'Welcome to Hotel California.',
    'There is no spoon.',
    'I am looking for new shoes.',
    'The remote of my tv is not working.',
    'I cleaned my living room.',
    'I think it is time to read a book.',
    'Do you think I am a robot?',
]

print('\nResults:')
for sentence in input_sentences:
    emb = model.encode(sentence)
    # cosine similarity of the input sentence against every loaded sentence
    cos_sim = util.cos_sim(emb, embeddings)[0]

    # indices of the loaded sentences, best match first
    ranked = sorted(range(len(sentences)), key=lambda i: float(cos_sim[i]), reverse=True)
    # show the three best matches
    print('> {}'.format(sentence))
    for i in ranked[:3]:
        print('  {:.4f} {}'.format(float(cos_sim[i]), sentences[i]))
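
One side effect of the filter pattern pc in the script above is worth knowing: only sentences made up entirely of letters, digits, spaces and periods are kept, so anything containing a comma, an apostrophe or a question mark is silently dropped. Below is a minimal sketch of that behaviour. The helper name keep_sentence and the sample sentences are made up for this illustration and are not part of the script.

```python
import re

# same pattern as in the script above
pc = re.compile(r'[a-zA-Z0-9 \.]*$')

def keep_sentence(line):
    """Return True if the script's filter would keep this sentence (hypothetical helper)."""
    return pc.match(line) is not None

# made-up sample sentences, for illustration only
samples = [
    'Cleanliness is critical.',
    'Neither you, nor the cat, will be happy.',
    "Don't forget the water.",
    'Is it a cat?',
]
kept = [s for s in samples if keep_sentence(s)]
for s in samples:
    print('{:5} {}'.format(str(keep_sentence(s)), s))
```

If you want to keep more of your text, one option is to widen the character class in the pattern, for example by adding the comma and the apostrophe.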

And the results:

Results:
> Have you seen my red shirt?
  0.2009 A visible or exposed haw indicates illness.
  0.1890 These little flakes of skin are dander.
  0.1780 The anus and vulva together form an inverted exclamation point.
> No way sister.
  0.2256 In housecats the odds are about twelve percent for death from this cause.
  0.2095 Neither you nor the cat will be happy in the long run.
  0.2006 Under no circumstances should a kitten be taken from its mother and littermates before it is six weeks old.
> Every once in a while we go shopping for cheap meat.
  0.2960 A corner of the kitchen is usually satisfactory.
  0.2396 Beware of the cat with tender feet.
  0.2136 We feel this to be too great a sacrifice to ask of anyone if there is any alternative at all.
> Carmaker fined for gas emissions cover-up.
  0.1645 Some shelters also ship excess animals to research laboratories.
  0.1529 Such people are not veterinarians and may not legally call themselves such.
  0.1327 Cleanliness is critical.
> This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
  0.1565 The anus and vulva together form an inverted exclamation point.
  0.1220 It really is rather simple.
  0.0933 A cardboard box with ample air holes can be used in an emergency.
> You should never jump into the swimming pool.
  0.3076 Always provide plenty of water.
  0.2973 There should be no hesitation or uncertainty in its movements even though the surface is irregular.
  0.2454 Neither you nor the cat will be happy in the long run.
> Welcome to Hotel California.
  0.2468 To be so selected is an honor.
  0.2449 Always provide plenty of water.
  0.2432 Cleanliness is critical.
> There is no spoon.
  0.2984 The tip is sometimes visible.
  0.2301 It should not be cleft.
  0.2156 Inexpensive hard plastic dishes such as those designed for babies are excellent.
> I am looking for new shoes.
  0.2407 Beware of the cat with tender feet.
  0.2169 Cats normally have five toes on each front foot and four on each rear.
  0.1960 All are suitable if large enough for the cat.
> The remote of my tv is not working.
  0.2256 Check the scratches again in six to eight hours.
  0.1441 Be certain the litterbox is sufficiently large for your cat.
  0.1323 Your cat should be able to enter the box and comfortably turn around in it.
> I cleaned my living room.
  0.3920 Cleanliness is critical.
  0.2711 Check the scratches again in six to eight hours.
  0.2536 Be certain the litterbox is sufficiently large for your cat.
> I think it is time to read a book.
  0.2390 Before taking the plunge there are a few things to take into account.
  0.2345 They should be well on the way to healing by then.
  0.2027 A lack of interest may indicate an ill or jaded animal.
> Do you think I am a robot?
  0.1979 Neither you nor the cat will be happy in the long run.
  0.1683 Be certain the litterbox is sufficiently large for your cat.
  0.1629 Usage is the only certainty.
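
The number in front of each match is the cosine similarity between the embedding of the input sentence and the embedding of a sentence from the file. Values close to 1 mean the two vectors point in nearly the same direction, values around 0 mean they have little in common. The sketch below shows the calculation and the top 3 selection in plain Python. The three-dimensional vectors and their labels are invented for illustration; real embeddings come from the model and are much longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query, corpus, k=3):
    """Return the k best (score, label) pairs, highest score first."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in corpus]
    return sorted(scored, reverse=True)[:k]

# invented toy vectors standing in for sentence embeddings
corpus = [
    ('same direction', [2.0, 0.0, 0.0]),
    ('45 degrees off', [1.0, 1.0, 0.0]),
    ('perpendicular', [0.0, 1.0, 0.0]),
    ('opposite', [-1.0, 0.0, 0.0]),
]
query = [1.0, 0.0, 0.0]

results = top_matches(query, corpus)
for score, label in results:
    print('  {:.4f} {}'.format(score, label))
```

Note that the length of a vector does not matter, only its direction: the first toy vector is twice as long as the query and still scores as a perfect match.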

Links / credits

Sentence-Transformers
https://www.sbert.net

textfiles.com
http://www.textfiles.com
