Term Weighting For Original LDA In Gensim
I am using the gensim library to apply LDA to a set of documents. Using gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf... My question is, wha
Solution 1:
It should be a corpus represented as a "bag of words", that is, lists of term counts.
The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (the tutorials are really useful).
Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,
doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]
then your corpus (for use with LDA) should be an iterable object (such as a list) of lists of tuples of the form (dictKey, count), where dictKey refers to the dictionary key of a term, and count is the number of times it occurs in the document. This is done for you with
corpus = [dictionary.doc2bow(doc) for doc in docs]
The name of the doc2bow function stands for "document to bag of words".
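To make the (dictKey, count) format concrete, here is a minimal pure-Python sketch of the kind of output described above. It is not gensim's implementation: build_token2id and to_bow are hypothetical helpers written only for illustration, and the ids are assigned in order of first appearance, which may differ from the ids a real gensim dictionary would assign.

```python
from collections import Counter

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]


def build_token2id(documents):
    """Hypothetical helper: give each distinct token an integer id,
    numbered in order of first appearance across the documents."""
    token2id = {}
    for doc in documents:
        for token in doc:
            # len(token2id) is evaluated before the insert, so ids run 0, 1, 2, ...
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Hypothetical helper: return (token_id, count) pairs sorted by id.
    Tokens missing from the mapping are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


token2id = build_token2id(docs)
corpus = [to_bow(doc, token2id) for doc in docs]

# A document that repeats a term gets a count greater than 1 for that id.
repeated = to_bow(['data', 'data', 'big'], token2id)
```

Each entry of corpus is one document as a list of (id, count) tuples, which is the raw term-count representation the answer describes. In practice you would let gensim's dictionary and doc2bow build this for you, as in the line above.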