Why Is Perplexity For Padded Vocabulary Infinite For Nltk.lm Bigram?
I am testing the perplexity measure for a language model for a text:

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)
train_to
Solution 1:
The input to perplexity is text as n-grams, not a list of strings. You can verify this by running:
for x in test_text:
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])
You should see that the tokens (n-grams) are all wrong.
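For reference, model.perplexity (like model.score) expects an iterable of n-gram tuples, such as those produced by padded_everygram_pipeline or nltk.util.ngrams, not raw strings. A minimal sketch of the expected shape (model here stands for the fitted model built below):

from nltk.util import ngrams

tokens = ['<s>', 'an', 'apple', '</s>']
bigrams = list(ngrams(tokens, 2))
# [('<s>', 'an'), ('an', 'apple'), ('apple', '</s>')]
# model.perplexity(bigrams)  # each tuple is split into (context, word) for scoring
# Iterating over the raw string 'an apple' instead yields single characters,
# which is why the scores printed above look wrong.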
You will still get inf perplexity if words in your test data are out of the vocabulary of the training data: they get mapped to <UNK>, which has zero count and therefore zero probability under MLE.
import nltk

# with a real corpus you would sentence-tokenise it first
# (word_tokenize/sent_tokenize need the punkt data: nltk.download('punkt')):
# train_sentences = nltk.sent_tokenize(train_text)
# test_sentences = nltk.sent_tokenize(test_text)

# toy corpus
train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace
from nltk.lm import Vocabulary

n = 1  # order of the model (the question's bigram case would be n = 2)
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on the padded vocab so the model knows the tokens added by padding (<s>, </s>, <UNK>, etc.)
model.fit(train_data, padded_vocab)

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out-of-vocabulary test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))