
Term Weighting for Original LDA in Gensim

I am using the gensim library to apply LDA to a set of documents. Using gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf... My question is, wha

Solution 1:

The corpus should be represented as a "bag of words". So yes, lists of term counts.

The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (the tutorials are really useful).

Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]

then your corpus (for use with LDA) should be an iterable object (such as a list) of lists of tuples of the form (dictKey, count), where dictKey is the dictionary key of a term and count is the number of times that term occurs in the document. This is done for you with

corpus = [dictionary.doc2bow(doc) for doc in docs]

The name doc2bow stands for "document to bag of words".
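To make the (dictKey, count) format concrete, here is a minimal pure-Python sketch of what such a corpus looks like. It does not use gensim: `build_token2id` and `to_bow` are hypothetical stand-ins for the dictionary and for doc2bow, and `doc3` is an extra document added only to show a count greater than 1. The integer ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance.

    Hypothetical helper standing in for a gensim dictionary.
    """
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) tuples for one tokenized document.

    Hypothetical helper illustrating the output shape of doc2bow.
    Tokens missing from token2id are ignored.
    """
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
doc3 = ['big', 'data', 'big', 'cash']  # not in the original; has a repeated term
docs = [doc1, doc2, doc3]

token2id = build_token2id(docs)
corpus = [to_bow(doc, token2id) for doc in docs]

for bow in corpus:
    print(bow)
```

In this sketch the second element of every tuple is a raw integer count, which is the "lists of term counts" form described above; 'big' appears twice in doc3, so its tuple carries a count of 2.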
