Is It Possible To Re-train A Word2vec Model (e.g. Googlenews-vectors-negative300.bin) From A Corpus Of Sentences In Python?
Solution 1:
This is how I technically solved the issue:
Preparing the input data with the sentence iterable from Radim Rehurek's word2vec tutorial: https://rare-technologies.com/word2vec-tutorial/
sentences = MySentences('newcorpus')
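The `MySentences` class is not defined in this answer; it comes from the linked tutorial. A minimal sketch along those lines, assuming `newcorpus` is a directory of plain-text files with one sentence per line and that whitespace tokenization is good enough for your corpus:

```python
import os


class MySentences:
    """Stream tokenized sentences from every file in a directory.

    Assumes one sentence per line and naive whitespace tokenization.
    """

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # sorted() only makes the iteration order deterministic
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens
```

Because this is a class with `__iter__` rather than a one-shot generator, it can be iterated several times, which matters here: the corpus is read once to build the vocabulary and again for each training pass.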
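Setting up the model with 300 dimensions to match the GoogleNews vectors (the intersect step fails on a size mismatch), then building the vocabulary from the new corpus
import gensim
model = gensim.models.Word2Vec(size=300)  # the parameter is named vector_size in gensim 4 and later
model.build_vocab(sentences)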
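Intersecting the vocabulary with the Google word vectors (lockf=1.0 keeps the imported vectors trainable; the default of 0.0 would freeze them)
model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)  # in gensim 4 and later this method lives on model.wv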
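Finally, training the model on the new corpus to update the vectors
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)  # recent gensim versions require both arguments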
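A note of warning: from a substantive point of view, it is of course highly debatable whether a corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus. Also keep in mind that the intersect step adds no new words: only words that occur in both your corpus and the Google file start from the pretrained vectors.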
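To make the intersect step in Solution 1 more concrete, here is a toy, pure-Python illustration of the idea. It is not gensim's implementation: the function names `intersect` and `apply_update` and the dictionary layout are made up for this sketch. It only mirrors the documented behaviour, namely that words already in the model's vocabulary adopt the pretrained vector, no words are added, and a per-word lock factor scales later training updates.

```python
def intersect(vocab_vectors, pretrained, lockf=0.0):
    """Copy pretrained vectors into an existing vocabulary (toy illustration).

    Returns a dict of per-word lock factors. Words that were not imported
    keep a factor of 1.0, i.e. they stay fully trainable.
    """
    lock_factors = {word: 1.0 for word in vocab_vectors}
    for word, vec in pretrained.items():
        if word not in vocab_vectors:
            continue  # the vocabulary is never extended
        if len(vec) != len(vocab_vectors[word]):
            raise ValueError("incompatible vector size")
        vocab_vectors[word] = list(vec)
        lock_factors[word] = lockf
    return lock_factors


def apply_update(vector, gradient, lock_factor):
    """Scale a training update by the lock factor (0.0 freezes the vector)."""
    return [v + lock_factor * g for v, g in zip(vector, gradient)]


# Hypothetical 2-dimensional vectors, just for illustration
model_vectors = {"cat": [0.1, 0.2], "dog": [0.3, 0.4], "zyzzyva": [0.5, 0.6]}
pretrained = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "paris": [0.7, 0.7]}
locks = intersect(model_vectors, pretrained, lockf=1.0)
```

After the call, "cat" and "dog" carry the pretrained values, "zyzzyva" keeps its own vector, and "paris" is still absent. With `lockf=0.0` the imported vectors would stay fixed during training while the remaining words continue to move.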
Solution 2:
Some folks have been working on extending gensim to allow online training.
A couple of GitHub pull requests on that effort are worth watching for progress.
It looks like this improvement could allow updating the GoogleNews-vectors-negative300.bin model.
Solution 3:
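It is possible if the model builder didn't finalize the model training. In Python, finalizing looks like this:
model.init_sims(replace=True)  # finalize the model: keeps only the normalized vectors
If the model was not finalized this way, continuing its training is a good way to get a model that builds on a large dataset.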
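Why finalizing gets in the way of further training: as far as I understand it, the finalizing call (`init_sims(replace=True)` in older gensim versions) overwrites each word vector with its unit-length version and forgets the originals. A toy sketch of that normalization in plain Python, with a made-up function name and made-up vectors, not gensim's actual code:

```python
import math


def l2_normalize_in_place(vectors):
    """Overwrite every vector with its unit-length version (toy illustration)."""
    for word, vec in vectors.items():
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0.0:
            vectors[word] = [x / norm for x in vec]


# Hypothetical 2-dimensional vectors with lengths 5.0 and 2.0
vectors = {"cat": [3.0, 4.0], "dog": [0.0, 2.0]}
l2_normalize_in_place(vectors)
# Every vector now has length 1; the original lengths cannot be recovered.
```

The normalized vectors are all you need for similarity queries, but the information that training had put into the vector lengths is gone, which is why a finalized model is best treated as read-only.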