Use word2vec to determine which two words in a group of words are most similar


For example, first use combinations() from itertools to generate all pairs of your candidate words:

from itertools import combinations
candidate_words = ['architect', 'nurse', 'surgeon', 'grandmother', 'dad']
all_pairs = combinations(candidate_words, 2)

Then, decorate the pairs with their pairwise similarity:

scored_pairs = [(w2v_model.wv.similarity(p[0], p[1]), p)
                for p in all_pairs]

Finally, sort to put the most-similar pair first, and report that score & pair:

sorted_pairs = sorted(scored_pairs, reverse=True)
print(sorted_pairs[0])  # first item is the most-similar pair

Integrating @ryan-feldspar's suggestion about max(), and going for minimality, this should also work to report the best pair (but not its score):

print(max(combinations(candidate_words, 2),
          key=lambda p: w2v_model.wv.similarity(p[0], p[1])))
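To try the pairing logic without a trained model, here is a minimal sketch that swaps w2v_model.wv.similarity for a hand-written cosine similarity over made-up 2-D vectors. The toy_vectors values and the helper names (cosine_similarity, toy_similarity, most_similar_pair) are illustrative assumptions, not real embeddings or gensim functions.

```python
from itertools import combinations
from math import sqrt

# Made-up 2-D vectors for illustration only; real Word2Vec vectors
# come from a trained model and have many more dimensions.
toy_vectors = {
    'architect': (1.0, 0.1),
    'nurse': (0.2, 1.0),
    'surgeon': (0.3, 0.9),
    'dad': (-1.0, 0.2),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_similarity(w1, w2):
    """Stand-in for w2v_model.wv.similarity, backed by toy_vectors."""
    return cosine_similarity(toy_vectors[w1], toy_vectors[w2])

def most_similar_pair(words, similarity):
    """Return (score, pair) for the highest-scoring unordered pair."""
    return max((similarity(w1, w2), (w1, w2))
               for w1, w2 in combinations(words, 2))

best_score, best_pair = most_similar_pair(list(toy_vectors), toy_similarity)
print(best_pair, round(best_score, 3))
```

With a real model, passing w2v_model.wv.similarity as the second argument should work the same way, since the helper only needs a callable that scores two words.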

Load or train the model for your embeddings; then, given a list words of candidate words, you can run:

min_distance = float('inf')
min_pair = None
word2vec_model_wv = model.wv  # Unsure if this can be done in the loop, but just to be safe efficiency-wise
for candidate_word1 in words:
    for candidate_word2 in words:
        if candidate_word1 == candidate_word2:
            continue  # ignore when the two words are the same

        distance = word2vec_model_wv.distance(candidate_word1, candidate_word2)
        if distance < min_distance:
            min_pair = (candidate_word1, candidate_word2)
            min_distance = distance
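Two details are worth handling in practice: the nested loop above visits every pair twice, and a gensim lookup raises a KeyError for a word the model never saw. The sketch below illustrates both points with a plain dict standing in for model.wv. The vectors, the closest_pair helper and the out-of-vocabulary word 'zzyzx' are hypothetical. The distance used is one minus the cosine similarity, which is what gensim's distance reports.

```python
from itertools import combinations
from math import sqrt

# A plain dict stands in for model.wv here; the numbers are invented.
vectors = {
    'nurse': (0.2, 1.0),
    'surgeon': (0.3, 0.9),
    'grandmother': (-0.5, 0.8),
}

def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def closest_pair(words, vectors):
    """Return the closest pair of known words, or None if fewer than two are known."""
    known = [w for w in words if w in vectors]  # drop out-of-vocabulary words
    if len(known) < 2:
        return None
    # combinations() yields each unordered pair once and never pairs a word with itself
    return min(combinations(known, 2),
               key=lambda p: cosine_distance(vectors[p[0]], vectors[p[1]]))

words = ['nurse', 'zzyzx', 'surgeon', 'grandmother']
print(closest_pair(words, vectors))
```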

Suggestion : 2

The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell you who you are”.

Using this underlying assumption, you can use Word2Vec to compute similarity between two words and more!

Now you could even use Word2Vec to compute similarity between two words in the vocabulary by invoking the similarity(...) function and passing in the relevant words.
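To make "the company it keeps" concrete, here is a rough sketch of how (word, neighbor) pairs can be read off a sentence with a sliding window. The context_pairs helper and the sample sentence are made up for illustration; this mirrors only the idea behind the training data, not gensim's actual implementation.

```python
def context_pairs(tokens, window=2):
    """Pair each token with the tokens up to `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((target, tokens[j]))
    return pairs

sentence = 'the hotel lobby was very stylish'.split()
for target, neighbor in context_pairs(sentence, window=1):
    print(target, '->', neighbor)
```

Words that keep turning up with the same neighbors produce similar pairs, which is the signal the model learns from.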

First, we start with our imports and get logging established:

# imports needed and logging
import gzip
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Now, let’s take a closer look at the data by printing the first line.

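# input_file is a placeholder: set it to the path of the gzipped reviews file
with gzip.open(input_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break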

You should see the following:

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Beijing, then you will be ok.I chose to have some breakfast in the hotel, which was really tasty and there was a good selection of dishes. There are a couple of computers to use in the communal area, as well as a pool table. There is also a small swimming pool and a gym area.I would definitely stay in this hotel again, but only if I did not plan to travel to central Beijing, as it can take a long time. The location is ok if you plan to do a lot of shopping, as there is a big shopping centre just few minutes away from the hotel and there are plenty of eating options around, including restaurants that serve a dog meat!\t\r\n"

Training the model is fairly straightforward. You just instantiate Word2Vec and pass in the reviews that we read in the previous step. In effect, we are passing a list of lists, where each inner list contains the tokens from one user review. Word2Vec uses all these tokens to build a vocabulary internally; by vocabulary, I mean the set of unique words.
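As a rough picture of that input shape, the sketch below tokenizes two made-up reviews into a list of token lists and collects the unique words, dropping any that occur fewer than min_count times, which is how the min_count training parameter filters rare words. The regex tokenizer and the sample reviews are stand-ins chosen for illustration; gensim builds its vocabulary internally.

```python
import re
from collections import Counter

# Two invented reviews standing in for the real dataset.
reviews = [
    'Nice trendy hotel, location not too bad.',
    'The hotel lobby was stylish and the location was fine.',
]

# One list of lowercase tokens per review: a list of lists.
documents = [re.findall(r'[a-z]+', review.lower()) for review in reviews]

# Count every token across all reviews, then keep the frequent ones.
counts = Counter(token for doc in documents for token in doc)
min_count = 2
vocabulary = {word for word, n in counts.items() if n >= min_count}

print(documents[0])
print(sorted(vocabulary))
```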

# build vocabulary and train model
# (size and iter are the pre-4.0 Gensim names; Gensim 4.0+ calls them vector_size and epochs)
model = gensim.models.Word2Vec(
    documents,
    size=150,
    window=10,
    min_count=2,
    workers=10,
    iter=10)

To train the model earlier, we had to set some parameters. Now, let’s try to understand what some of them mean. For reference, this is the command that we used to train the model.

model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10, iter=10)

Suggestion : 3

Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand.

The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.

The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. It does so without human intervention.

Word2vec is similar to an autoencoder, encoding each word in a vector, but rather than training against the input words through reconstruction, as a restricted Boltzmann machine does, word2vec trains words against other words that neighbor them in the input corpus.
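One common way to query such relationships is analogy arithmetic: to answer "king is to queen as man is to ?", take queen - king + man and rank the remaining words by cosine similarity to the result. The sketch below uses hand-picked 2-D vectors purely for illustration; the toy table and the analogy helper are hypothetical, and real embeddings give much noisier answers.

```python
from math import sqrt

# Hand-picked vectors: first axis loosely "royalty", second loosely "gender".
toy = {
    'king': (1.0, 1.0),
    'queen': (1.0, -1.0),
    'man': (0.0, 1.0),
    'woman': (0.0, -1.0),
    'girl': (0.1, -0.8),
    'castle': (1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by ranking words near b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]  # skip the inputs
    return sorted(candidates, key=lambda w: cosine(vectors[w], target),
                  reverse=True)

print(analogy('king', 'queen', 'man', toy))
```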

king: queen::man: [woman, Attempted abduction, teenager, girl]
//Weird, but you can kind of see it

China: Taiwan::Russia: [Ukraine, Moscow, Moldova, Armenia]
//Two large countries and their small, estranged neighbors

house: roof::castle: [dome, bell_tower, spire, crenellations, turrets]

knee: leg::elbow: [forearm, arm, ulna_bone]

New York Times: Sulzberger::Fox: [Murdoch, Chernin, Bancroft, Ailes]
//The Sulzberger-Ochs family owns and runs the NYT.
//The Murdoch family owns News Corp., which owns Fox News.
//Peter Chernin was News Corp.'s COO for 13 yrs.
//Roger Ailes is president of Fox News.
//The Bancroft family sold the Wall St. Journal to News Corp.

love: indifference::fear: [apathy, callousness, timidity, helplessness, inaction]
//the poetry of this single array is simply amazing...

Donald Trump: Republican::Barack Obama: [Democratic, GOP, Democrats, McCain]
//It's interesting to note that, just as Obama and McCain were rivals,
//so too, Word2vec thinks Trump has a rivalry with the idea Republican.

monkey: human::dinosaur: [fossil, fossilized, Ice_Age_mammals, fossilization]
//Humans are fossilized monkeys? Humans are what's left
//over from monkeys? Humans are the species that beat monkeys
//just as Ice Age mammals beat dinosaurs? Plausible.

building: architect::software: [programmer, SecurityCenter, WinPcap]

WordVectors wordVectors = WordVectorSerializer.loadTxtVectors(new File("glove.6B.50d.txt"));