**Description:**
In this exercise, we apply basic counts and some association
statistics to a small corpus. We will:

- Count unigrams
- Count bigrams
- Compute mutual information for bigrams
- Compute likelihood ratios for bigrams

**Prerequisites:** This exercise assumes basic
familiarity with typical Unix commands, and the ability to create text
files (e.g. using a text editor such as *vi* or
*emacs*). No programming is required.

**Notational Convention:** The symbols <== will be used to
identify a comment from the instructor, on lines where you're typing
something in. So, for example, in

% cp file1.txt file2.txt <== The "cp" is short for "copy"

what you're supposed to type at the prompt (identified by the percent sign, here) is

cp file1.txt file2.txt

followed by a carriage return.

Download the ngrams.tgz compressed tar file by clicking on this link. Then move the ngrams.tgz file to a suitable location, such as ~/src.

If you need to create ~/src, you can do so by typing:

% cd
% mkdir src

From a unix command line, navigate to where you have placed the ngrams.tgz file and type the following:

% tar xvzf ngrams.tgz <== Extract code from the file
% cd ngrams <== Move into the new ngrams subdirectory
% chmod u+x *.pl <== Make perl scripts executable
% cc -o lr_simple lr_simple.c -lm <== Compile the lr_simple program
% cc -o filter_stopwords filter_stopwords.c -lm <== Compile the filter_stopwords program

This will create an **ngrams** subdirectory containing the scripts, programs, and corpora used below.

- Take a look at file **corpora/GEN.EN**. You can do this as follows:

% more corpora/GEN.EN

(Type spacebar for more pages, and "q" to quit.) This file contains an annotated version of the book of Genesis, King James Version. It is a small corpus by current standards -- somewhere on the order of 40,000 words. What words (unigrams) would you expect to have high frequency in this corpus? What bigrams do you think might be frequent?

- Create a subdirectory called genesis to contain the files with statistics generated from this corpus:

% mkdir genesis

Then run the **Stats** program to analyze the corpus. The program requires an input file and a "prefix" to be used in creating output files. The input file will be **corpora/GEN.EN**, and the prefix will be **genesis/out**, so that output files are created in the genesis subdirectory. That is, you should execute the following:

% Stats corpora/GEN.EN genesis/out

The program will tell you what it's doing as it counts unigrams, counts bigrams, computes mutual information, and computes likelihood ratio statistics. Depending on the machine you're working on, this may take differing amounts of time to run, but it should be less than a minute (probably less than half that).

Note: The program will remove all the <*> annotations at the beginnings of the lines in the text before gathering statistics.
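The core counting that a program like **Stats** performs is simple to sketch. Here is a minimal Python illustration (not the actual Stats code, which also handles the annotation stripping and the association statistics), assuming whitespace tokenization:

```python
from collections import Counter

def count_ngrams(tokens):
    """Count unigrams and adjacent-pair bigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

tokens = "in the beginning god created the heaven and the earth".split()
unigrams, bigrams = count_ngrams(tokens)
print(unigrams["the"])              # 3
print(bigrams[("the", "heaven")])   # 1
```

Each bigram is just a pair of adjacent tokens, so a corpus of N words yields N-1 bigram tokens.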

- You should now have a subdirectory called **genesis** containing a bunch of files that begin with **out**.

- Go into directory **genesis**:

% cd genesis

- Look at file **out.unigrams**:

% more out.unigrams

Seeing the vocabulary in alphabetical order isn't very useful, so let's sort the file by unigram frequency, from highest to lowest:

% sort -nr out.unigrams > out.unigrams.sorted
% more out.unigrams.sorted

Now examine out.unigrams.sorted. Are the high-frequency words what you would expect?

- Analogously, look at the bigram counts in **out.bigrams**:

% sort -nr out.bigrams > out.bigrams.sorted
% more out.bigrams.sorted

Markup aside, again, are the high frequency bigrams what you would expect?

- Now let's look at mutual information. File **out.mi** contains bigrams sorted by mutual information value. Each line contains:

- I(wordX,wordY)
- freq(wordX)
- freq(wordY)
- freq(wordX,wordY)
- wordX
- wordY

Low-frequency bigrams (bigram count less than 5) were excluded.

As an exercise, compute mutual information by hand for the first bigram on the list, "savoury meat". Recall that

I(x,y) = log2 [p(x,y)/(p(x)p(y))]

and that the simplest estimates of the probabilities, the maximum likelihood estimates, are given by

p(x) = freq(x)/N
p(y) = freq(y)/N
p(x,y) = freq(x,y)/N

where N is the number of observed words in the corpus, 38516. (You can get this by counting the words in file **out.words**; it's also what you get by summing the frequencies in either **out.unigrams** or **out.bigrams**.) Using a standard scientific calculator on your system (one that has logarithms and scientific notation), here is a sequence you can use to do the calculation:

- Compute p(savoury) = freq(savoury)/N
- Compute p(meat) = freq(meat)/N
- Compute p(savoury meat) = freq(savoury,meat)/N
- Compute p(savoury)p(meat) = p(savoury) * p(meat)
- Divide p(savoury,meat) by this value
- Take the log of the result (which in xcalc is log to the base 10)
- Convert that result to log base 2 by dividing by 0.30103

This uses the fact that for all M, N: logM(x) = logN(x)/logN(M).
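The same calculation is easy to script. Here is a minimal Python sketch; the counts below are hypothetical placeholders for illustration, so substitute the real frequencies for "savoury meat" from your **out.mi** file:

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """I(x,y) = log2( p(x,y) / (p(x) p(y)) ), with MLE probability estimates."""
    p_x, p_y, p_xy = f_x / n, f_y / n, f_xy / n
    return math.log2(p_xy / (p_x * p_y))

N = 38516  # corpus size from out.words
# Hypothetical counts -- replace with the values on the "savoury meat" line.
print(mutual_information(f_xy=6, f_x=10, f_y=30, n=N))
```

Note that math.log2 gives base-2 logs directly, so no base conversion by 0.30103 is needed.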

At some point, the calculator may give you scientific notation for a number. If you need to *enter* a number in scientific notation, use the EE key, which is for entering exponential numbers. For example, to get "-2.3E-4" you'd enter "2 . 3 +/- EE 4 +/-".

The number you come up with should be close to the mutual information reported in **out.mi**. It may be slightly different because your calculation used different precision than the program's.

- As you've just seen, probabilities can be very low numbers. All
the more so when using n-grams for n=3 or above! Underflow can be a
problem in these sorts of calculations: when the probabilities are too
low, they fall below the smallest value the computer can represent. For
this reason it's very common to do such calculations using the logs of
the probability values (often called "log probabilities" or
"logprobs"), using the following handy identities:
log(a * b) = log(a) + log(b)
log(a / b) = log(a) - log(b)

Try converting the formula for mutual information using these identities so that probabilities are never multiplied or divided, before reading further.

**Solution:** log[p(x,y)/(p(x)p(y))] = log p(x,y) - log p(x) - log p(y)

To really get a feel for things, first rewrite this in terms of frequencies and

*then* convert to using log probabilities, i.e.

log[ (freq(x,y)/N) / ((freq(x)/N)(freq(y)/N)) ]
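Working through that conversion, the N's partially cancel and everything reduces to sums and differences of logs of whole counts, which never underflow. A minimal Python sketch of the result:

```python
import math

def mi_logprob(f_xy, f_x, f_y, n):
    """Mutual information computed entirely in log space:
    I(x,y) = log2 f(x,y) + log2 N - log2 f(x) - log2 f(y)."""
    return math.log2(f_xy) + math.log2(n) - math.log2(f_x) - math.log2(f_y)

# Agrees with the direct probability-ratio formula, without ever forming
# a tiny product of probabilities.
print(mi_logprob(6, 10, 30, 38516))
```

The same trick is standard whenever chains of probabilities are multiplied, e.g. in n-gram language models.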

- Look at **out.mi** and the bigrams selected by mutual information as being strongly associated. What do you think of them? Notice how very many of them are low-frequency bigrams: it's well known that mutual information has overly high values for bigrams of low frequency, i.e. it reports word pairs as associated when they probably are not really that strongly associated after all.

- Compare this to **out.lr**, where the leftmost column is the likelihood ratio. There are a lot of common words of English in there, so try filtering those out using the **filter_stopwords** program. First, access the program so it's easy to run in this directory:

% ln -s ../filter_stopwords <== Creates a symbolic link
% ln -s ../stop.wrd <== Creates a symbolic link

Then run it:

% filter_stopwords stop.wrd < out.lr > out.lr.filtered

How does **out.lr.filtered** look as a file containing bigrams that are characteristic of this corpus?
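The filtering step itself is straightforward. Here is a minimal Python sketch of the assumed behavior (the real **filter_stopwords** is a C program; the line layout below, with the bigram in the last two fields, follows the out.mi/out.lr format described above):

```python
def filter_stopwords(lines, stopwords):
    """Keep only lines whose bigram (the last two fields) contains no stopword."""
    kept = []
    for line in lines:
        fields = line.split()
        w1, w2 = fields[-2], fields[-1]
        if w1 not in stopwords and w2 not in stopwords:
            kept.append(line)
    return kept

stop = {"the", "of", "and", "a"}
lines = ["52.1 10 12 6 savoury meat", "48.0 900 700 300 of the"]
print(filter_stopwords(lines, stop))  # only the "savoury meat" line survives
```

Filtering on either word of the pair, rather than both, is what removes bigrams like "of the" that dominate the raw likelihood-ratio list.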

- One thing you may have noticed is that there's more data
sparseness because uppercase and lowercase are distinct, e.g. "Door"
is treated as a different word from "door". In the corpora directory,
you can create an all-lowercase version of GEN.EN by doing this:

% cat GEN.EN | tr "A-Z" "a-z" > GEN.EN.lc

Try re-doing the exercise with this version. What, if anything, changes?

- OK, perhaps that last one wasn't exactly fun. But this probably
will be. Download one (or more) of these three Sherlock Holmes stories:

*A Study in Scarlet*

*The Hound of the Baskervilles*

*Adventures of Sherlock Holmes*

(The last of these contains 12 different stories.) Then place it (or them) in your **corpora** subdirectory.

Now get back into your

**ngrams** directory, create an output directory, say, **holmes1**, and run the **Stats** program for the file of interest, e.g.:

% cd ..
% mkdir holmes1
% Stats corpora/study.dyl holmes1/out
% cd holmes1

Or perhaps convert to lowercase before running **Stats**:

% cd corpora
% cat study.dyl | tr "A-Z" "a-z" > study.lc
% cat hound.dyl | tr "A-Z" "a-z" > hound.lc
% cat adventures.dyl | tr "A-Z" "a-z" > adventures.lc
% cd ..
% mkdir holmes1
% Stats corpora/study.lc holmes1/out
% cd holmes1

Look at **out.lr**, etc. for this corpus. Now go through the same process again, but creating a directory **holmes2** and using a different file. Same author, same main character, same genre... how do the high-association bigrams compare between the two cases? If you use filter_stopwords, how do the results look -- what kinds of bigrams are you getting? What natural language processing problems might this be useful for?
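One way to make the comparison concrete is to compute the overlap between the top-ranked bigrams of the two runs. A minimal Python sketch, assuming (as above) that each out.lr line has the bigram in its last two fields and is already sorted by score; the example lines here are hypothetical:

```python
def top_bigrams(lines, k=50):
    """Return the set of word pairs from the first k lines (bigram = last two fields)."""
    pairs = [tuple(line.split()[-2:]) for line in lines if len(line.split()) >= 2]
    return set(pairs[:k])

# Hypothetical sorted out.lr contents from two runs:
a = top_bigrams(["9.1 5 5 5 sherlock holmes", "8.0 4 4 4 baker street"])
b = top_bigrams(["7.5 3 3 3 sherlock holmes", "6.0 2 2 2 study scarlet"])
print(a & b)  # bigrams characteristic of both corpora
```

In practice you would read the two files (e.g. `open("holmes1/out.lr")`) and pass their lines in; the shared pairs suggest author- or character-specific collocations, while the differences reflect each story's own content.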