
Finding the closest sentence by meaning in a list of sentences

Have fun with sentences using Sentence-Transformers.

6 February 2022

For a project, I was looking for a way to match an incoming sentence against a fixed list of sentences. This is a hard problem, but while searching the internet I came across a wonderful project: Sentence-Transformers.

I modified one of the examples on that site to use a text file of sentences, and typed in a few input sentences as a demonstration. I then downloaded a text file from the internet, 'Choosing a Cat, by R. Roger Breton and Nancy J. Creek', and was amazed at the matches it found.

I won't say more; I have simply put the code below in case you want to try this yourself. Drop in your favourite text and let the fun begin!


The code is below. Installation takes time and disk space. Sentence-Transformers is a deep learning library, and 'all-MiniLM-L6-v2' is a pre-trained model, so nothing is trained here: the first run downloads the model, and the script then computes an embedding for every sentence in the text file. That encoding step can take a few minutes if the text file is larger than 100 kB. NLTK is used to split the text file into sentences.
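Once the text has been split into sentences, each one still needs some cleanup before it is worth embedding. The sketch below shows one way to do that with the standard library only: collapse whitespace, then keep a sentence only if it consists of letters, digits, spaces and periods, which is the character set of the `pc` pattern in the script. Using the pattern as a filter is an assumption (none of the matches in the results contain a comma or other punctuation, which is consistent with it), and the `clean_sentences` helper and the sample input are made up for illustration.

```python
import re

# Same character whitelist as `pc` in the script: letters, digits, spaces, periods.
ALLOWED = re.compile(r'[a-zA-Z0-9 \.]*$')


def clean_sentences(raw_sentences):
    """Collapse whitespace and keep only sentences made of whitelisted characters."""
    cleaned = []
    for raw in raw_sentences:
        # Newlines, tabs and runs of spaces all become a single space.
        line = re.sub(r'\s+', ' ', raw).strip()
        if line and ALLOWED.match(line):
            cleaned.append(line)
    return cleaned


# Hypothetical input, standing in for the output of sent_tokenize().
raw = [
    'Cleanliness is\ncritical.',
    'Always   provide plenty of water.',
    'Is this a question, or what?',
]
print(clean_sentences(raw))
```

The third sentence is dropped because it contains a comma and a question mark. Loosening the whitelist keeps more of the text at the cost of noisier sentences. The full script follows.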

# based on the example on this page:
from sentence_transformers import SentenceTransformer, util
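import nltk
from nltk.tokenize import sent_tokenize
import re

# sent_tokenize needs NLTK's Punkt tokenizer data; if it is missing, download it once,
# e.g. nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead)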

# for the list of sentences I used the following file; change it and have more fun:
# Choosing a Cat, by R. Roger Breton and Nancy J. Creek
sentences_file = ''

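# get list of sentences from file
# pc accepts only sentences made of letters, digits, spaces and periods
pc = re.compile(r'[a-zA-Z0-9 \.]*$')
sentences = []
with open(sentences_file) as fo:
    # split the whole text into sentences
    tokens = sent_tokenize(fo.read())
    for line in tokens:
        # collapse newlines and repeated whitespace into single spaces
        line = re.sub(r'\s+', ' ', line).strip()
        if line and pc.match(line):
            sentences.append(line)
print('sentences =\n{}'.format(sentences))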

model = SentenceTransformer('all-MiniLM-L6-v2')

# encode all sentences
embeddings = model.encode(sentences)

# match loaded sentences with these (add/change)
input_sentences = [
    'Have you seen my red shirt?',
    'No way sister.',
    'Every once in a while we go shopping for cheap meat.',
    'Carmaker fined for gas emissions cover-up.',
    'This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.',
    'You should never jump into the swimming pool.',    
    'Welcome to Hotel California.',
    'There is no spoon.',
    'I am looking for new shoes.',
    'The remote of my tv is not working.',
    'I cleaned my living room.',
    'I think it is time to read a book.',
    'Do you think I am a robot?',
]

for sentence in input_sentences:
    emb = model.encode(sentence)
    # cosine similarity between every loaded sentence and the input sentence, shape (N, 1)
    cos_sim = util.cos_sim(embeddings, emb)

    # pair each sentence index with its score and sort, best match first
    sentence_combinations = []
    for i in range(len(cos_sim)):
        sentence_combinations.append([i, float(cos_sim[i][0])])
    sorted_sentence_combinations = sorted(sentence_combinations, key=lambda x: x[1], reverse=True)
    # show the three best matches
    print('> {}'.format(sentence))
    for idx, score in sorted_sentence_combinations[0:3]:
        print('  {:.4f} {}'.format(score, sentences[idx]))
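The ranking loop above boils down to two steps: compute the cosine similarity between the input embedding and every stored embedding, then sort by score. Here is a minimal sketch of those two steps in plain Python, with tiny made-up vectors in place of real embeddings. The helper names `cosine_similarity` and `top_matches` are hypothetical; the script itself relies on `util.cos_sim`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query, corpus, k=3):
    """Return (index, score) pairs for the k corpus vectors most similar to query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings" (made-up values; real all-MiniLM-L6-v2
# embeddings have 384 dimensions).
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
query = [2.0, 0.0, 0.0]

for index, score in top_matches(query, corpus, k=2):
    print('{:.4f} corpus[{}]'.format(score, index))
```

This prints `1.0000 corpus[0]` and `0.7071 corpus[2]`. The query is twice as long as `corpus[0]` but points in the same direction, so it still scores 1.0: cosine similarity compares direction and ignores vector length.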

And the results:

> Have you seen my red shirt?
  0.2009 A visible or exposed haw indicates illness.
  0.1890 These little flakes of skin are dander.
  0.1780 The anus and vulva together form an inverted exclamation point.
> No way sister.
  0.2256 In housecats the odds are about twelve percent for death from this cause.
  0.2095 Neither you nor the cat will be happy in the long run.
  0.2006 Under no circumstances should a kitten be taken from its mother and littermates before it is six weeks old.
> Every once in a while we go shopping for cheap meat.
  0.2960 A corner of the kitchen is usually satisfactory.
  0.2396 Beware of the cat with tender feet.
  0.2136 We feel this to be too great a sacrifice to ask of anyone if there is any alternative at all.
> Carmaker fined for gas emissions cover-up.
  0.1645 Some shelters also ship excess animals to research laboratories.
  0.1529 Such people are not veterinarians and may not legally call themselves such.
  0.1327 Cleanliness is critical.
> This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
  0.1565 The anus and vulva together form an inverted exclamation point.
  0.1220 It really is rather simple.
  0.0933 A cardboard box with ample air holes can be used in an emergency.
> You should never jump into the swimming pool.
  0.3076 Always provide plenty of water.
  0.2973 There should be no hesitation or uncertainty in its movements even though the surface is irregular.
  0.2454 Neither you nor the cat will be happy in the long run.
> Welcome to Hotel California.
  0.2468 To be so selected is an honor.
  0.2449 Always provide plenty of water.
  0.2432 Cleanliness is critical.
> There is no spoon.
  0.2984 The tip is sometimes visible.
  0.2301 It should not be cleft.
  0.2156 Inexpensive hard plastic dishes such as those designed for babies are excellent.
> I am looking for new shoes.
  0.2407 Beware of the cat with tender feet.
  0.2169 Cats normally have five toes on each front foot and four on each rear.
  0.1960 All are suitable if large enough for the cat.
> The remote of my tv is not working.
  0.2256 Check the scratches again in six to eight hours.
  0.1441 Be certain the litterbox is sufficiently large for your cat.
  0.1323 Your cat should be able to enter the box and comfortably turn around in it.
> I cleaned my living room.
  0.3920 Cleanliness is critical.
  0.2711 Check the scratches again in six to eight hours.
  0.2536 Be certain the litterbox is sufficiently large for your cat.
> I think it is time to read a book.
  0.2390 Before taking the plunge there are a few things to take into account.
  0.2345 They should be well on the way to healing by then.
  0.2027 A lack of interest may indicate an ill or jaded animal.
> Do you think I am a robot?
  0.1979 Neither you nor the cat will be happy in the long run.
  0.1683 Be certain the litterbox is sufficiently large for your cat.
  0.1629 Usage is the only certainty.
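
Most of the scores above are low: the best one is 0.3920, and many top matches sit around 0.2. When the goal is to match an incoming sentence against a fixed list, it can be useful to reject weak matches instead of always taking the top one. A minimal sketch of that idea, assuming a hypothetical `best_match` helper and an arbitrary cut-off of 0.3 picked only for illustration; a real cut-off would have to be tuned on your own data.

```python
def best_match(scored_sentences, min_score=0.3):
    """Return the highest-scoring (score, sentence) pair, or None if it is too weak.

    min_score is a hypothetical cut-off chosen for illustration, not a recommended value.
    """
    if not scored_sentences:
        return None
    best = max(scored_sentences, key=lambda pair: pair[0])
    return best if best[0] >= min_score else None


# Scores and sentences copied from the results above.
cleaned_room = [
    (0.3920, 'Cleanliness is critical.'),
    (0.2711, 'Check the scratches again in six to eight hours.'),
    (0.2536, 'Be certain the litterbox is sufficiently large for your cat.'),
]
red_shirt = [
    (0.2009, 'A visible or exposed haw indicates illness.'),
    (0.1890, 'These little flakes of skin are dander.'),
    (0.1780, 'The anus and vulva together form an inverted exclamation point.'),
]

print(best_match(cleaned_room))
print(best_match(red_shirt))
```

With this cut-off, 'I cleaned my living room.' keeps its match 'Cleanliness is critical.', while 'Have you seen my red shirt?' returns `None`.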
