Why Is Perplexity For Padded Vocabulary Infinite For Nltk.lm Bigram?
I am testing the perplexity measure for a language model for a text:

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)
train_to
Solution 1:
The input to perplexity is text as n-grams, not a list of strings. You can verify this by running:
for x in test_text:
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])
You should see that the tokens (n-grams) are all wrong.
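For reference, model.perplexity (like model.score) expects an iterable of n-gram tuples, such as those produced by padded_everygram_pipeline or nltk.util.ngrams, not raw strings. A minimal sketch of the expected shape (model here stands for the fitted model built below):

from nltk.util import ngrams

tokens = ['<s>', 'an', 'apple', '</s>']
bigrams = list(ngrams(tokens, 2))
# [('<s>', 'an'), ('an', 'apple'), ('apple', '</s>')]
# model.perplexity(bigrams)  # each tuple is split into (context, word) for scoring
# Iterating over the raw string 'an apple' instead yields single characters,
# which is why the scores printed above look wrong.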
You will still get inf perplexity if words in your test data are out of the vocabulary of the training data: they get mapped to <UNK>, which has zero count and therefore zero probability under MLE.
import nltk

# with a real corpus you would sentence-tokenise it first
# (word_tokenize/sent_tokenize need the punkt data: nltk.download('punkt')):
# train_sentences = nltk.sent_tokenize(train_text)
# test_sentences = nltk.sent_tokenize(test_text)

# toy corpus
train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

n = 1  # order of the model (the question's bigram case would be n = 2)
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on the padded vocab so the model knows the tokens added by padding (<s>, </s>, <UNK>, etc.)
model.fit(train_data, padded_vocab)

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out-of-vocabulary test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))