Jakub Vrána - Lingvistika 1

(Statistical Methods of Natural Language Processing - 12. 12. 2001)
Assign 1: Exploring Entropy and Language Modeling

1.1 The Shannon Game

Used book: burroughs_lostCont.tok

Due to a malfunction of the computer player, this game was more irritating than fun. I am not a native English speaker and I had not read the book, so I had trouble guessing letters - especially those at the beginnings of words. I was usually quite successful at guessing letters inside words, but at the beginning of a word I sometimes tried about 20-25 (!) letters before succeeding. I had this problem even with a history length of 20.

My results are a bit better with longer history lengths; a bigger difference can be seen especially in the shapes of the charts. With longer histories, the charts are sharper - this can be explained by the difficulty of beginning-of-word guessing. The differences are best visible in the included animated chart.

Charts: history = 2, history = 3, history = 4, history = 5, history = 20, animated
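The game described above can be simulated without a human player. The sketch below (an assumption about the setup, not the assignment's actual program) trains character statistics on the text itself and, at each position, guesses letters in frequency order given the preceding `history` characters, recording how many guesses each letter took:

```python
from collections import Counter, defaultdict

def shannon_game(text, history=2):
    """Simulate the Shannon game: for each position, guess letters in
    order of their frequency after the preceding `history` characters
    (estimated from the text itself) and record the number of guesses
    needed. Falls back to overall letter frequency for unseen histories."""
    follows = defaultdict(Counter)
    overall = Counter(text)
    for i in range(len(text) - history):
        follows[text[i:i + history]][text[i + history]] += 1
    guesses = []
    for i in range(history, len(text)):
        ctx = text[i - history:i]
        ranking = [c for c, _ in (follows[ctx] or overall).most_common()]
        # append any letters missing from the context-specific ranking
        ranking += [c for c, _ in overall.most_common() if c not in ranking]
        guesses.append(ranking.index(text[i]) + 1)
    return guesses
```

On a strictly periodic text every letter is guessed on the first try; on real text the average number of guesses grows at word beginnings, exactly as observed above.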

Final results

History length    Entropy

At a history length of 20, the computer would probably guess letters badly due to the data-sparseness problem.

1.2 Entropy of a Text

Basic statistics and comparison of English and Czech files

                                            English    Czech
Number of words in file                     221098     222412
Number of characters in file (without \n)   972917     1030631
Average length of word                      4.40       4.63
Number of unique words                      9607       42826
Number of unique characters                 74         117
Total frequency of 10 most frequent words   29.76%     23.17%
Number of words with frequency equal to 1   3811       26315
Conditional entropy H(J|I)                  5.28745    4.74784
Perplexity PX(J|I)                          39.055     26.869

Both files contain a similar number of words and a similar number of characters, so the average word length is about 4.5 characters in both.

The biggest difference is in the number of unique words and in the number of unique characters. This is of course due to Czech diacritics and the high inflection of the Czech language. The higher number of unique words in Czech is the reason for the much higher number of words with frequency equal to 1 (about one half in Czech versus about one third in English). In my opinion, this is also the reason for the higher total frequency of the ten most frequent words in English.
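The two key quantities in the tables, the conditional entropy H(J|I) = -Σ P(i,j) log2 P(j|i) over word bigrams and the perplexity PX(J|I) = 2^H(J|I), can be computed with a short sketch like this (maximum-likelihood estimates from one token list, as assumed throughout this section):

```python
import math
from collections import Counter

def conditional_entropy(words):
    """Bigram conditional entropy H(J|I) = -sum P(i,j) log2 P(j|i),
    where J is the current word and I is the preceding one."""
    pairs = Counter(zip(words, words[1:]))   # joint counts c(i, j)
    left = Counter(words[:-1])               # context counts c(i)
    n = len(words) - 1                       # number of bigram tokens
    h = 0.0
    for (i, j), c in pairs.items():
        p_ij = c / n                         # P(i, j)
        p_j_given_i = c / left[i]            # P(j | i)
        h -= p_ij * math.log2(p_j_given_i)
    return h

def perplexity(words):
    return 2 ** conditional_entropy(words)
```

A fully predictable text such as "a b a b a b" has entropy 0 and perplexity 1; the 26.9 reported for TEXTCZ1 means the one-word context leaves roughly 27 equally likely choices on average.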

Statistics of experiments


TEXTEN1 characters                          original   0.001%    0.010%    0.100%    1.000%    5.000%    10.000%
Number of unique words                      9607       9612      9709      10493     17560     41967     64300
Number of unique characters                 74         74        74        74        74        74        74
Frequency of 10 most frequent words         65799      65799     65777     65666     64516     59674     53793
Number of words with frequency equal to 1   3811       3816      3913      4701      11433     34209     55151
Average frequency of word                   23.01      23.00     22.77     21.07     12.59     5.27      3.44
Conditional entropy H(J|I): avg             5.28745    5.28742   5.28710   5.28393   5.25100   5.05642   4.73124
Conditional entropy H(J|I): min             -          5.28734   5.28697   5.28290   5.24954   5.05196   4.72791
Conditional entropy H(J|I): max             -          5.28747   5.28729   5.28483   5.25235   5.06092   4.73737
Perplexity PX(J|I): avg                     39.055     39.055    39.046    38.960    38.081    33.276    26.561
Perplexity PX(J|I): min                     -          39.052    39.042    38.932    38.043    33.174    26.500
Perplexity PX(J|I): max                     -          39.056    39.051    38.985    38.117    33.380    26.674

Conditional entropy for TEXTEN1 under character mess-ups

TEXTCZ1 characters                          original   0.001%    0.010%    0.100%    1.000%    5.000%    10.000%
Number of unique words                      42826      42833     42907     43581     50087     73600     94438
Number of unique characters                 117        117       117       117       117       117       117
Frequency of 10 most frequent words         51533      51533     51511     51466     50999     48842     46173
Number of words with frequency equal to 1   26315      26322     26403     27158     34434     60769     84080
Average frequency of word                   5.
Conditional entropy H(J|I): avg             4.74784    4.74775   4.74682   4.73853   4.65816   4.33877   4.00646
Conditional entropy H(J|I): min             -          4.74769   4.74645   4.73759   4.65615   4.33546   3.99927
Conditional entropy H(J|I): max             -          4.74779   4.74709   4.73970   4.66073   4.34409   4.01098
Perplexity PX(J|I): avg                     26.869     26.867    26.849    26.696    25.249    20.235    16.072
Perplexity PX(J|I): min                     -          26.866    26.843    26.678    25.214    20.189    15.992
Perplexity PX(J|I): max                     -          26.868    26.855    26.717    25.294    20.310    16.122

Conditional entropy for TEXTCZ1 under character mess-ups

TEXTEN1 words                               original   0.001%    0.010%    0.100%    1.000%    5.000%    10.000%
Number of unique words                      9607       9607      9607      9602      9574      9524      9566
Number of unique characters                 74         74        74        74        74        74        74
Frequency of 10 most frequent words         65799      65799     65799     65732     65158     62460     59321
Number of words with frequency equal to 1   3811       3811      3804      3717      3029      1198      456
Average frequency of word                   23.01      23.01     23.01     23.03     23.09     23.21     23.11
Conditional entropy H(J|I): avg             5.28745    5.28747   5.28764   5.28948   5.30705   5.37983   5.45848
Conditional entropy H(J|I): min             -          5.28742   5.28749   5.28926   5.30560   5.37661   5.45529
Conditional entropy H(J|I): max             -          5.28752   5.28777   5.29005   5.30878   5.38540   5.46334
Perplexity PX(J|I): avg                     39.055     39.056    39.061    39.110    39.590    41.638    43.971
Perplexity PX(J|I): min                     -          39.055    39.057    39.104    39.550    41.545    43.874
Perplexity PX(J|I): max                     -          39.057    39.064    39.126    39.637    41.799    44.119

Conditional entropy for TEXTEN1 under word mess-ups

TEXTCZ1 words                               original   0.001%    0.010%    0.100%    1.000%    5.000%    10.000%
Number of unique words                      42826      42826     42825     42802     42549     41815     41282
Number of unique characters                 117        117       117       117       117       117       117
Frequency of 10 most frequent words         51533      51533     51511     51466     51021     48953     46395
Number of words with frequency equal to 1   26315      26314     26304     26179     24874     20029     15869
Average frequency of word                   5.
Conditional entropy H(J|I): avg             4.74784    4.74784   4.74779   4.74701   4.73905   4.69926   4.63712
Conditional entropy H(J|I): min             -          4.74780   4.74771   4.74670   4.73828   4.69761   4.63184
Conditional entropy H(J|I): max             -          4.74786   4.74791   4.74770   4.74089   4.70157   4.64195
Perplexity PX(J|I): avg                     26.869     26.868    26.867    26.853    26.705    25.979    24.884
Perplexity PX(J|I): min                     -          26.868    26.866    26.847    26.691    25.949    24.793
Perplexity PX(J|I): max                     -          26.869    26.870    26.866    26.739    26.020    24.967

Conditional entropy for TEXTCZ1 under word mess-ups

We can see that the differences between the minimal and maximal entropies are very small in each experiment. The reason is that I used a good pseudo-random number generator (the Mersenne Twister), and that the number of random values used to obtain the entropies is very high. So I will discuss only the average values from now on. The per-mess-up statistics are taken from a randomly chosen experiment (the tenth one).

The number of unique characters is the same for all mess-ups. In practice, it could change (decrease) only if there were a single word containing some unique character and this word were chosen to be replaced rather than to replace another word. Other possibilities are almost impossible, and as we can see, the described possibility is almost impossible too.

The number of words in the file cannot change by messing up. The number of characters in the file also cannot change under character messing-up, and I predicted that it would not change much under word messing-up either. So I did not track these values.
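The character mess-up procedure behind these experiments can be sketched as follows (the exact replacement rule used in the assignment is an assumption here: each character is replaced, with the given probability, by one drawn uniformly from the characters of the text, which keeps both the length and the character set fixed, as the paragraphs above note):

```python
import random

def mess_up_characters(text, prob, seed=0):
    """With probability `prob`, replace each character by one chosen
    uniformly from the characters appearing in the text. The length of
    the text and its character inventory are preserved by construction."""
    rng = random.Random(seed)  # seeded PRNG so a run is reproducible
    alphabet = sorted(set(text))
    return "".join(rng.choice(alphabet) if rng.random() < prob else c
                   for c in text)
```

Running this ten times with different seeds and averaging the resulting entropies mirrors the min/avg/max columns in the tables above.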

Some more interesting facts

The number of unique words rises rapidly with character messing-up. For a 10% messing-up factor, the number of unique words more than doubles for Czech and increases more than six times for English. So the ratio of the number of unique words in English to that in Czech moves from about one quarter to about two thirds. The number of unique words decreases slightly with word messing-up.

The number of words with frequency equal to one decreases under word messing-up and rises under character messing-up. For a 10% character messing-up factor, it rises three times for Czech and fourteen times for English. For a 10% word messing-up factor, it decreases 1.6 times for Czech and eight times for English. In my opinion, the number of words with frequency equal to one decreases more in English due to the smaller absolute number of unique words. A similar reason applies to the disproportion in its rise under character messing-up.

Final result of the experiments - entropy and perplexity

Generally, entropy is lower in Czech due to the higher number of unique words and the higher number of words with frequency equal to one (call them isolated words). We can sum this up by saying that the average frequency of words is lower in Czech than in English. So it is easier to predict words from a one-word context, because there is only one possibility after an isolated word. Generally, it is easier to predict words after low-frequency words. This fact is the reason for the drop in entropy when messing up characters.

The trend of entropy under word messing-up is strange at first glance: it rises for English but falls for Czech. I think this can be explained as follows. In English, the number of unique words remains almost the same (due to the small absolute number of these words), while the number of isolated words decreases; so it is harder to predict the next word. In Czech, the number of unique words decreases, so it is easier to predict the next word because there are fewer choices. But, in my opinion, decreasing the number of unique words (by 4%) should not matter as much as decreasing the number of isolated words (by 40%). So I cannot fully explain the drop in entropy in the Czech text.

Perplexity behaves the same as entropy (just more visibly, due to the exponential nature of the function). A perplexity of 27 roughly says that we need about 27 attempts to guess the next word from a one-word context.

Paper and pencil exercise

By appending a text with conditional entropy E to the end of another text with the same conditional entropy, the conditional entropy of the combined text remains almost the same, provided the two texts do not share any vocabulary items. Intuitively: it is equally hard to predict the next word from a one-word context, because the texts do not share any vocabulary items.

Mathematical proof: P(j|i) remains the same (due to the disjoint vocabularies), so the log part of the sum also remains the same. P(i,j) shrinks to P(i,j) * |T1| / (|T1| + |T2|) for words from the first text, so the sum over words from the first text shrinks to E1 * |T1| / (|T1| + |T2|). Similarly, the sum over words from the second text shrinks to E2 * |T2| / (|T1| + |T2|). Since the original entropies are equal, the final sum is E * (|T1| + |T2|) / (|T1| + |T2|) = E.
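Written out with the H(J|I) formula used in section 1.2, the argument is (a sketch; the boundary bigram between the two texts is ignored, as the next paragraph discusses):

```latex
H_{T}(J\mid I)
  = -\sum_{(i,j)\in T_1} \frac{c(i,j)}{|T_1|+|T_2|}\,\log_2 P(j\mid i)
    \;-\; \sum_{(i,j)\in T_2} \frac{c(i,j)}{|T_1|+|T_2|}\,\log_2 P(j\mid i)
  = \frac{|T_1|}{|T_1|+|T_2|}\,E_1 \;+\; \frac{|T_2|}{|T_1|+|T_2|}\,E_2
  = E \qquad (E_1 = E_2 = E),
```

where c(i,j) are the bigram counts, unchanged by the concatenation because the vocabularies are disjoint, so P(j|i) = c(i,j)/c(i) is also unchanged.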

I wrote that entropy remains almost the same. This relates to the boundary between the texts. If the last word of the first text is unique within that text, the final entropy decreases a bit (or remains the same if we include the start-of-text symbol); otherwise it increases a bit.

1.3 Cross-Entropy and Language Modeling


The coverage of an n-gram model is defined as the number of n-grams from the test data seen in the training data, divided by the size of the test data.
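This definition translates directly into code; a minimal sketch over token lists:

```python
def ngram_coverage(train, test, n):
    """Fraction of n-gram tokens in the test data that were also
    seen (as types) in the training data."""
    seen = set(zip(*(train[i:] for i in range(n))))
    test_ngrams = list(zip(*(test[i:] for i in range(n))))
    return sum(g in seen for g in test_ngrams) / len(test_ngrams)
```

For example, with training tokens "a b c a b" and test tokens "a b d", the bigram coverage is 0.5: of the two test bigrams, (a, b) was seen and (b, d) was not.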

Generally, the coverage is higher in English for unigrams, bigrams, and trigrams alike. So the cross entropy will probably be higher in Czech regardless of other factors.


The reliability of an n-gram model is defined as the average of the conditional probabilities of n-grams with a seen context. That is, for every word J with context I in the test data: if context I of word J was seen in the training data, add P(J|I) to the average.
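For the bigram case, this definition can be sketched as follows (seen-but-unmatched pairs contribute P(J|I) = 0 to the average, per the definition above):

```python
from collections import Counter

def bigram_reliability(train, test):
    """Average of P(j|i) over test bigrams whose context i was seen
    in the training data (the bigram case of the definition above)."""
    pair = Counter(zip(train, train[1:]))  # c(i, j) on training data
    ctx = Counter(train[:-1])              # c(i) on training data
    probs = [pair[(i, j)] / ctx[i]
             for i, j in zip(test, test[1:]) if i in ctx]
    return sum(probs) / len(probs) if probs else 0.0
```

Test bigrams whose context never occurred in training are simply skipped, so reliability measures how trustworthy the model is exactly where coverage says it applies.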

Unigram reliability is much higher in English due to the significantly higher average word frequency than in Czech. In contrast, trigram reliability is higher in Czech, mainly due to inflection (e.g. if an adjective is in the accusative, the following noun should also be in the accusative).

Coverage and reliability

Weights and cross entropy                   English    Czech
lambda0 (uniform weight)                    0.07       0.14
lambda1 (unigram weight)                    0.25       0.43
lambda2 (bigram weight)                     0.49       0.25
lambda3 (trigram weight)                    0.19       0.19
Cross entropy                               7.47       10.22

The bigram weight is higher in English. To compensate, the unigram and uniform weights are higher in Czech; these are also higher due to the small coverage of bigrams and trigrams in Czech. The trigram weight is the same for English and Czech, because the lower trigram coverage in Czech is compensated by its higher reliability (and also because uniform probability is used for unseen contexts). Cross entropy is lower in English, so English is easier to predict at this size of data.
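The weights above belong to a linearly interpolated model P'(w|h) = Σ_k λ_k p_k(w|h) with uniform, unigram, bigram, and trigram components, and the λ_k are re-estimated by EM on held-out data. A minimal sketch of the re-estimation step (the component probabilities are assumed precomputed per held-out position; a real implementation would derive them from n-gram counts):

```python
def em_weights(streams, iters=20):
    """EM re-estimation of interpolation weights. `streams[k]` lists
    the k-th component probability p_k at each held-out position
    (e.g. uniform, unigram, bigram, trigram). Returns weights that
    sum to 1 and maximize the held-out likelihood of the mixture."""
    k = len(streams)
    lam = [1.0 / k] * k                      # start from uniform weights
    for _ in range(iters):
        expected = [0.0] * k                 # expected component counts
        for probs in zip(*streams):
            total = sum(l * p for l, p in zip(lam, probs))
            for j in range(k):
                expected[j] += lam[j] * probs[j] / total
        s = sum(expected)
        lam = [e / s for e in expected]      # normalize to new weights
    return lam
```

When one component assigns consistently higher probability to the held-out data, EM shifts essentially all the weight onto it, which is why the weights track coverage and reliability in the tables.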

I used the EM algorithm to compute the weight parameters. For unseen contexts, this algorithm uses the uniform probability. I thought about this algorithm for a long time - about its correctness, its accuracy, and about the relation of the weights to coverage and reliability. I think the weights are somewhat distorted by the use of uniform probabilities for unseen contexts. I tried using zero probability for unseen contexts (even though the distribution then does not sum to 1), and the trigram weight came out as 0.14 for English and 0.06 for Czech. I also modified the EM algorithm to use the (n-1)-gram probability for unseen contexts instead of the uniform probability. Here are the results:

(n-1)-gram for unseen contexts              English    Czech
lambda0 (uniform weight)                    0.10       0.26
lambda1 (unigram weight)                    0.23       0.37
lambda2 (bigram weight)                     0.45       0.20
lambda3 (trigram weight)                    0.22       0.17
Cross entropy                               7.42       10.21

The uniform weight is higher than in the previous case because, there, part of it was "distributed" to the other weights. At first glance I was surprised that the trigram weight is smaller than the bigram weight: for an unseen trigram context, the bigram probability is used, and trigrams are more reliable than bigrams, so the trigram weight should be greater than the bigram weight. The reason is that the reliability of the bigrams competing with the trigrams is higher - 28.7% for both English and Czech.

Cross entropy is a bit lower with this algorithm. Anyway, I used the original EM algorithm for the tweaks, to provide comparable data.



Cross entropy for TEXTEN1 under tweaks


Cross entropy for TEXTCZ1 under tweaks

Cross entropy rises both when the trigram weight is increased and when it is decreased. This was to be expected, because the computed weights are supposed to be the best. Cross entropy rises much more by increasing the trigram weight than by decreasing it. This says that it is better to use the trigram weight less (or not at all) than to depend only on it. And, of course, the absolute value of the weight is less than one half, so increasing e.g. 0.2 to 0.6 (by a tweak factor of +50%) is more visible than decreasing it to 0.1 (by a tweak factor of -50%).

When the trigram weight is decreased, cross entropy rises only a little. This can be explained by the fact that on data of this (small) size the trigrams are not so important; nearly all the information is hidden in the bigrams.

Held-out data

If we computed the lambda values on the training data, we would get 100% for the trigram weight and 0% for the others. I used this fact to test the correctness of my program by running it on the training data. If we tried to compute the weights on the test data, the result would be similar to using held-out data - but then we would not be allowed to use that data for testing, because the weights would be adjusted to it. In my opinion, however, we can add the held-out data to the training data after the weights have been computed.

Source codes