Is It Possible To Re-train A Word2vec Model (e.g. Googlenews-vectors-negative300.bin) From A Corpus Of Sentences In Python?

I am using the pre-trained Google News dataset to get word vectors with the Gensim library in Python: model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', bi

Solution 1:

This is how I technically solved the issue:

Preparing the data input with the sentence iterable from Radim Rehurek's tutorial: https://rare-technologies.com/word2vec-tutorial/

sentences = MySentences('newcorpus')  
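For reference, the sentence iterable from that tutorial can be sketched roughly as follows. This is a minimal version, assuming 'newcorpus' is a directory of plain-text files with one sentence per line and whitespace-separated tokens; the UTF-8 encoding, the sorted file order, and the skipping of blank lines are assumptions added here.

```python
import os


class MySentences:
    """Stream tokenized sentences from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # A fresh pass over the files on every iteration, so the corpus
        # can be read several times (vocabulary scan, then training epochs).
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens
```

Because __iter__ restarts from the first file each time, the object can be iterated repeatedly, which a one-shot generator cannot do.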

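Setting up the model (the vector size must be 300 to match the GoogleNews vectors; gensim's default is 100)

model = gensim.models.Word2Vec(sentences, vector_size=300)  # the argument is named size in gensim < 4.0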

Intersecting the vocabulary with the Google word vectors (in gensim 4.x this method is called on model.wv rather than on model)

model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)
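To make the intersect step concrete, here is a plain-Python sketch of its documented semantics: no words are added to the vocabulary, words present in both vocabularies adopt the pre-trained weights, and the remaining words are left alone. The function intersect_vectors and the toy dictionaries are hypothetical illustrations, not gensim's implementation. A lock factor of 1.0 marks an imported vector as trainable, while 0.0 would freeze it during later training.

```python
def intersect_vectors(model_vectors, pretrained_vectors, lockf=0.0):
    """Hypothetical illustration of vocabulary intersection (not gensim code).

    model_vectors / pretrained_vectors map word -> list of floats.
    Returns (merged_vectors, lock_factors).
    """
    merged = {}
    lock_factors = {}
    for word, vector in model_vectors.items():
        if word in pretrained_vectors:
            # Shared word: adopt the pre-trained weights and set its lock factor.
            merged[word] = list(pretrained_vectors[word])
            lock_factors[word] = lockf
        else:
            # Corpus-only word: keep its own vector, fully trainable.
            merged[word] = list(vector)
            lock_factors[word] = 1.0
    # Words found only in the pre-trained file are never added.
    return merged, lock_factors


corpus_model = {"king": [0.1, 0.2], "gensim": [0.3, 0.4]}
google_like = {"king": [0.9, 0.8], "queen": [0.7, 0.6]}
merged, locks = intersect_vectors(corpus_model, google_like, lockf=1.0)
print(sorted(merged))  # ['gensim', 'king']
```

One consequence is that words found in the GoogleNews file but absent from the new corpus are not part of the resulting model.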

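Finally, training the model to update the vectors (recent gensim versions require total_examples and epochs to be passed explicitly)

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)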

A note of warning: from a substantive point of view, it is highly debatable whether a new corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus...

Solution 2:

Some folks have been working on extending gensim to allow online training.

A couple GitHub pull requests you might want to watch for progress on that effort:

It looks like this improvement could allow updating the GoogleNews-vectors-negative300.bin model.

Solution 3:

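It is possible only if whoever built the model did not finalize its training. In Python, finalizing looks like this:

model.init_sims(replace=True)  # finalize the model: keep only the normalized vectors

If the model was not finalized, continuing its training is a good way to build on a model that was trained on a large dataset.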
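As a rough illustration of what finalizing does, the sketch below replaces each vector with its unit-length (L2-normalized) version and discards the original magnitudes. The helper normalize_in_place is hypothetical plain Python written for this example, not gensim's code.

```python
import math


def normalize_in_place(vectors):
    """Overwrite each vector with its L2-normalized form (hypothetical helper)."""
    for word, vector in vectors.items():
        norm = math.sqrt(sum(x * x for x in vector))
        if norm > 0:
            vectors[word] = [x / norm for x in vector]
    return vectors


vectors = {"king": [3.0, 4.0], "queen": [0.0, 2.0]}
normalize_in_place(vectors)
print(vectors["king"])  # [0.6, 0.8]
```

After this step the raw vectors are gone, which is why gensim documents a model finalized with replace=True as effectively read-only.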
