
Finding the best matching sentence in a list of sentences

Fun with sentence matching using Sentence-Transformers.

6 February 2022
Photo credit: https://www.pexels.com/nl-nl/@andrew

For a project I was looking for a way to match an incoming sentence against a list of fixed sentences. This is a complex topic, but while searching the internet I came across the amazing Sentence-Transformers project.

I modified one of the examples on that website to use a text file with sentences and typed in a few input sentences as a demo. Then I downloaded a text file from the internet, 'Choosing a Cat, by R. Roger Breton and Nancy J. Creek', and was amazed by the matches it found.

I will not say any more; I have simply included the code below in case you want to try this yourself. Feed it your favorite text and let the fun begin!

The code

Below is the code. The installation takes time and disk space... Sentence-Transformers is a deep learning application, which means that on the first start, and again after the text file has been changed, it encodes all sentences into embeddings. This can take several minutes if the text file is larger than 100 kB. NLTK is used to split the text file into sentences.
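
First, to illustrate the core idea in isolation: each sentence is encoded into a vector (an embedding), and matching is just cosine similarity between vectors. This is my own minimal sketch, not part of the original example, and the pip package names in the comment are an assumption on my part. The full script follows after it.

# install (assumed package names): pip install sentence-transformers nltk
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# two sentences become two vectors; their cosine similarity says how close they are in meaning
emb = model.encode(['I cleaned my living room.', 'Cleanliness is critical.'])
print(util.cos_sim(emb[0], emb[1]))  # a 1x1 tensor, higher means more similar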

# based on the example on this page:
# https://www.sbert.net/docs/quickstart.html
from sentence_transformers import SentenceTransformer, util
import nltk
from nltk.tokenize import sent_tokenize
import re

# make sure the NLTK sentence tokenizer data is available (downloads once)
nltk.download('punkt', quiet=True)

# for the list of sentences I used the following file; change it and have more fun:
# Choosing a Cat, by R. Roger Breton and Nancy J. Creek 
# http://www.textfiles.com/fun/choosing.cat
sentences_file = 'choosing.cat'

# get list of sentences from file
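# the regex below keeps only sentences made up of letters, digits, spaces and periods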
pc = re.compile(r'[a-zA-Z0-9 \.]*$')
sentences = []
with open(sentences_file) as fo:
    tokens = sent_tokenize(fo.read())
    for line in tokens:
        line = line.replace('\n', ' ')
        line = re.sub(r' \s+', ' ', line)
        if(pc.match(line)):
            sentences.append(line)
print('sentences =\n{}'.format(sentences))

model = SentenceTransformer('all-MiniLM-L6-v2')

# encode all sentences
embeddings = model.encode(sentences)

# match loaded sentences with these (add/change)
input_sentences = [
    'Have you seen my red shirt?',
    'No way sister.',
    'Every once in a while we go shopping for cheap meat.',
    'Carmaker fined for gas emissions cover-up.',
    'This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.',
    'You should never jump into the swimming pool.',    
    'Welcome to Hotel California.',
    'There is no spoon.',
    'I am looking for new shoes.',
    'The remote of my tv is not working.',
    'I cleaned my living room.',
    'I think it is time to read a book.',
    'Do you think I am a robot?',
]

print('\nResults:')
for sentence in input_sentences:
    emb = model.encode(sentence)
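    # cosine similarity of this input sentence against all stored embeddings
    # (one score per stored sentence)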
    cos_sim = util.cos_sim(embeddings, [emb])

    sentence_combinations = []
    for i in range(0, len(cos_sim)):
        sentence_combinations.append([i, cos_sim[i]])
    sorted_sentence_combinations = sorted(sentence_combinations, key=lambda x: x[1], reverse=True)
    # show
    print('> {}'.format(sentence))
    # print the three best matches with their cosine similarity scores
    for sc in sorted_sentence_combinations[0:3]:
        print('  {:.4f} {}'.format(sc[1][0], sentences[sc[0]]))
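
As a side note, the manual ranking loop above can also be written with util.semantic_search from the same library, which sorts the corpus by cosine similarity internally. A sketch under the same assumptions as the script above (same model, sentences and embeddings):

# alternative ranking using util.semantic_search
for sentence in input_sentences:
    emb = model.encode(sentence)
    # one result list per query; each hit is a dict with 'corpus_id' and 'score'
    hits = util.semantic_search(emb, embeddings, top_k=3)[0]
    print('> {}'.format(sentence))
    for hit in hits:
        print('  {:.4f} {}'.format(hit['score'], sentences[hit['corpus_id']]))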

And the results:

Results:
> Have you seen my red shirt?
  0.2009 A visible or exposed haw indicates illness.
  0.1890 These little flakes of skin are dander.
  0.1780 The anus and vulva together form an inverted exclamation point.
> No way sister.
  0.2256 In housecats the odds are about twelve percent for death from this cause.
  0.2095 Neither you nor the cat will be happy in the long run.
  0.2006 Under no circumstances should a kitten be taken from its mother and littermates before it is six weeks old.
> Every once in a while we go shopping for cheap meat.
  0.2960 A corner of the kitchen is usually satisfactory.
  0.2396 Beware of the cat with tender feet.
  0.2136 We feel this to be too great a sacrifice to ask of anyone if there is any alternative at all.
> Carmaker fined for gas emissions cover-up.
  0.1645 Some shelters also ship excess animals to research laboratories.
  0.1529 Such people are not veterinarians and may not legally call themselves such.
  0.1327 Cleanliness is critical.
> This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
  0.1565 The anus and vulva together form an inverted exclamation point.
  0.1220 It really is rather simple.
  0.0933 A cardboard box with ample air holes can be used in an emergency.
> You should never jump into the swimming pool.
  0.3076 Always provide plenty of water.
  0.2973 There should be no hesitation or uncertainty in its movements even though the surface is irregular.
  0.2454 Neither you nor the cat will be happy in the long run.
> Welcome to Hotel California.
  0.2468 To be so selected is an honor.
  0.2449 Always provide plenty of water.
  0.2432 Cleanliness is critical.
> There is no spoon.
  0.2984 The tip is sometimes visible.
  0.2301 It should not be cleft.
  0.2156 Inexpensive hard plastic dishes such as those designed for babies are excellent.
> I am looking for new shoes.
  0.2407 Beware of the cat with tender feet.
  0.2169 Cats normally have five toes on each front foot and four on each rear.
  0.1960 All are suitable if large enough for the cat.
> The remote of my tv is not working.
  0.2256 Check the scratches again in six to eight hours.
  0.1441 Be certain the litterbox is sufficiently large for your cat.
  0.1323 Your cat should be able to enter the box and comfortably turn around in it.
> I cleaned my living room.
  0.3920 Cleanliness is critical.
  0.2711 Check the scratches again in six to eight hours.
  0.2536 Be certain the litterbox is sufficiently large for your cat.
> I think it is time to read a book.
  0.2390 Before taking the plunge there are a few things to take into account.
  0.2345 They should be well on the way to healing by then.
  0.2027 A lack of interest may indicate an ill or jaded animal.
> Do you think I am a robot?
  0.1979 Neither you nor the cat will be happy in the long run.
  0.1683 Be certain the litterbox is sufficiently large for your cat.
  0.1629 Usage is the only certainty.

Links / Credits

Sentence-Transformers
https://www.sbert.net

textfiles.com
http://www.textfiles.com
