The New York Times’ Connections game: a fairly simple puzzle that has been rising in popularity. The objective? Find four groups of four within a grid of sixteen words, such that each group shares an overarching theme.
I thought this would be fairly easy to solve with some simple word embeddings and K-means clustering. After all, if a word embedding can figure out king – man + woman = queen, then surely it can figure out that four words are all sandwich ingredients. There are enough pre-trained embedding models out there that downloading one took under a minute, and I paired it with a simple K-means.
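That analogy really does fall out of the vectors; a quick sanity check with gensim (glove-wiki-gigaword-300 is the model used throughout this post):

import gensim.downloader

# The classic word-vector analogy: king - man + woman ≈ queen
model = gensim.downloader.load('glove-wiki-gigaword-300')
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# Should surface 'queen' (or something close) for this model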
However, I quickly ran into problems. The biggest: K-means doesn’t guarantee four groups of four. Seeing as this was the case, I switched to a constrained K-means algorithm that fixes every cluster at exactly four members. Another thing I noticed is that a plain word embedding can’t handle entries that use repetition (e.g. ‘tom tom’ rather than ‘tom’), since the doubled form isn’t in the vocabulary.
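To make that first failure concrete, here is a minimal sketch of the unconstrained version (a reconstruction, using the same model and word list as the full sample code at the end):

import gensim.downloader
import numpy as np
from sklearn.cluster import KMeans

model = gensim.downloader.load('glove-wiki-gigaword-300')
words = ['plastic', 'foil', 'cellophane', 'lumber', 'organism', 'harpoon',
         'trudge', 'limber', 'stomp', 'elastic', 'glove', 'bassinet',
         'mask', 'plod', 'jacket', 'supple']
word_vectors = [model[word] for word in words]

# Nothing here forces a 4/4/4/4 split, so with only sixteen points
# the cluster sizes routinely come out uneven.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(word_vectors)
print(np.bincount(labels))  # cluster sizes, rarely exactly [4 4 4 4]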
It’s worth wondering what a better approach would be, as the two hours or so I spent on this little question proved not especially fruitful, even on some relatively easy puzzles. Maybe a contextual embedding is needed, rather than just a static GloVe model.
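One direction I haven’t actually tried: swap the static vectors for a sentence-embedding model. A sketch, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model (my pick for illustration, not something tested in this post):

from k_means_constrained import KMeansConstrained
from sentence_transformers import SentenceTransformer

words = ['plastic', 'foil', 'cellophane', 'lumber', 'organism', 'harpoon',
         'trudge', 'limber', 'stomp', 'elastic', 'glove', 'bassinet',
         'mask', 'plod', 'jacket', 'supple']

# Sentence encoders embed whole strings, so entries like 'tom tom'
# don't need to exist in a fixed word vocabulary.
encoder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = encoder.encode(words)

clf = KMeansConstrained(n_clusters=4, size_min=4, size_max=4, random_state=0)
print(clf.fit_predict(embeddings))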
I also thought a more curated, greedy algorithm might work better than K-means: take the two most similar words and assume they must be in a group, average their vectors, then pull the closest word from the now-reduced list, repeating until the group has four members (a sketch follows below). I gave this a whack, but it didn’t turn out too well either…
… maybe this is a more difficult puzzle than I originally thought.
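For the curious, here is roughly what that greedy pass looked like (a reconstruction; greedy_groups is just my name for it, and it assumes the same GloVe model as the sample code below):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def greedy_groups(words, model, group_size=4):
    # Seed each group with the most similar remaining pair, then
    # repeatedly absorb the word closest to the group's mean vector.
    remaining = list(words)
    groups = []
    while len(remaining) >= group_size:
        vecs = np.array([model[w] for w in remaining])
        sims = cosine_similarity(vecs)
        np.fill_diagonal(sims, -1.0)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sims), sims.shape)
        group = [remaining[i], remaining[j]]
        remaining = [w for k, w in enumerate(remaining) if k not in (i, j)]
        while len(group) < group_size:
            centroid = np.mean([model[w] for w in group], axis=0)
            rest = np.array([model[w] for w in remaining])
            nearest = cosine_similarity(centroid.reshape(1, -1), rest)[0]
            group.append(remaining.pop(int(np.argmax(nearest))))
        groups.append(group)
    return groups

# e.g. print(greedy_groups(words, model)), with words and model as below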
Nevertheless, below is the full sample code for the constrained K-means approach:
import gensim.downloader
from sklearn.metrics.pairwise import cosine_similarity
from k_means_constrained import KMeansConstrained

words = [
    'plastic', 'foil', 'cellophane', 'lumber',
    'organism', 'harpoon', 'trudge', 'limber',
    'stomp', 'elastic', 'glove', 'bassinet',
    'mask', 'plod', 'jacket', 'supple'
]

# Load model
model = gensim.downloader.load('glove-wiki-gigaword-300')

# Generate similarity matrix
word_vectors = [
    model[word] for word in words  # We assume all words exist in corpus
]
sim_matrix = cosine_similarity(word_vectors)

# Cluster on each word's row of similarities (its similarity profile),
# constrained so every cluster has exactly four members
clf = KMeansConstrained(n_clusters=4, size_min=4, size_max=4, random_state=0)
clf.fit_predict(sim_matrix)

print([x for _, x in sorted(zip(clf.labels_, words))])  # words, cluster-mates adjacent
print(sorted(clf.labels_))  # the matching cluster labels