Charming Python #b18: The Natural Language Toolkit

Using Python in Computational Linguistics


David Mertz, Ph.D.
Linguistic Hack, Gnosis Software
June, 2004

The Natural Language Toolkit is a Python library for analyzing and otherwise processing collections of textual data, particularly in terms of the concepts and techniques developed in academic linguistics. Some of these techniques overlap with what goes by the name "text processing"--or with lexing and parsing as understood in computer science--but other capabilities for syntactic and even semantic analysis are specialized to the more subtle texts and grammars of natural languages.

What NLTK Includes

Your humble writer knows a little bit about a lot of things; but despite writing a fair amount about text processing (a book, for example), linguistic processing is a relatively novel area for me. Forgive me if I stumble through my explanations of the quite remarkable Natural Language Toolkit (NLTK), a wonderful tool for teaching, and working in, computational linguistics using Python.

It is natural to think of NLTK as a stacked series of layers that build on each other. Readers who are familiar with lexing and parsing of artificial languages (like, say, Python) will not have too much of a leap to understand the similar--but deeper--layers involved in natural language modeling. While NLTK comes with a number of corpora that have been pre-processed (often manually) to various degrees, conceptually each layer relies on the processing in the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, like noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); finally sentences or other grammatical units can be classified. Along the way, NLTK gives you the ability to generate statistics about occurrences of various elements, and to draw graphs that represent either the processing itself or statistical aggregates in the results.

For this first article, I will present some comparatively fleshed out examples from the lower-level capabilities, but simply describe abstractly most of the higher level capabilities. If I have the opportunity to return to NLTK in a later installment, I will give more detailed descriptions of parsing and graphing; for now, let us take the first steps past text processing, narrowly construed.

Tokenization

Much of what you can do with NLTK, particularly at its lower levels, is not that much different from what you can do with Python's basic data structures. But NLTK provides a set of regularized interfaces that are relied on and utilized at higher levels, as well as simply providing convenient classes to hold tokenized and tagged texts.

In particular, the class nltk.tokenizer.Token is used very widely to store annotated segments of text; these annotations can mark a number of different features, including: parts-of-speech; subtoken structures; offsets of a token within a larger text; morphological stems; grammatical sentence components; and so on. In fact, a Token is a special kind of dictionary--and is accessed in the fashion of a dictionary--so it can contain whatever keys you like. A few special keys are used in NLTK, different ones by the various subpackages.

Let us look briefly at creating a token and breaking it into subtokens:

A first look at the nltk.tokenizer.Token class

>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True) # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar',
SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
<class 'nltk.token.SafeToken'>

Probability

One fairly simple thing you are likely to do with linguistic corpora is analyze frequencies of various events within them, and make probability predictions based on these known frequencies. NLTK supports a variety of techniques for projecting probabilities based on raw frequency data; I will not cover those here (see the Probability Tutorial mentioned in Resources), but suffice it to say that what you are warranted to expect has a slightly fuzzy relationship to what you already know (beyond the obvious scaling/normalization).

In essence, there are two types of frequency supported by NLTK: histograms and conditional frequencies. The class nltk.probability.FreqDist is used to create histograms; for example, a word histogram can be created with:

Basic histogram creation with nltk.probability.FreqDist

>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...     freq.inc(word['TEXT'])
>>> freq.B()
1194
>>> freq.count('Python')
12

The tutorial discusses the creation of histograms for more complex features, like "the length of words following words ending in vowels." The class nltk.draw.plot.Plot is useful for visualization of histograms. Of course, you can equally analyze frequencies of higher-level grammatical features, or even of data sets unrelated to NLTK.
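
To give a rough feel for that kind of compound feature, the sketch below builds such a histogram by hand, reusing the tokenized article and the FreqDist calls from the listing above. The names vlen and prev are my own, and no output is shown because the numbers depend entirely on the text being analyzed:

Sketching a histogram of a compound feature

>>> # Length of each word that follows a word ending in a vowel
>>> vlen = FreqDist()
>>> prev = None
>>> for word in article['SUBTOKENS']:
...     text = word['TEXT']
...     if prev and prev[-1].lower() in 'aeiou':
...         vlen.inc(len(text))
...     prev = text
...
>>> vlen.freq(3)   # proportion of such words exactly three letters long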

Conditional frequencies are perhaps more interesting than plain histograms. A conditional frequency is a kind of two-dimensional histogram--it gives you one histogram per initial condition, or "context." For example, the tutorial suggests the question of what the distribution of word lengths is for each first letter. We can analyze that with:

Conditional frequencies: word length per initial letter

>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...     cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...     print "Init %s:" % c,
...     for length in range(1,6):
...         print "len %d/%.2f," % (length,cf[c].freq(length)),
...     print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,

A nice linguistic use of conditional frequencies is to analyze the syntagmatic distributions in corpora--for example, given the occurrence of a particular word, what words are most likely to come next. Grammars provide some constraints here, of course; but the study of selection among syntactic options falls within the fields of semantics, pragmatics, and register.
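
A crude sketch of the idea, again reusing the tokenized article and the classes shown above; the names nextword and toks are mine, and output is omitted since the counts depend entirely on the text:

Conditional frequencies: what word comes next

>>> nextword = ConditionalFreqDist()
>>> toks = [w['TEXT'].lower() for w in article['SUBTOKENS']]
>>> for w1, w2 in zip(toks[:-1], toks[1:]):
...     nextword[w1].inc(w2)
...
>>> nextword['the'].count('first')   # how often "first" follows "the" here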

Stemming

The class nltk.stemmer.porter.PorterStemmer is a wonderfully handy tool to derive grammatical (prefix) stems from English words. This capability struck a particular chord for me, having previously created a public domain full-text indexed search tool/library in Python (covered in this column, see Resources; and used by a moderately large number of other projects).

While it is quite useful to be able to search a large collection of documents almost instantly for a joint occurrence of a collection of exact words (what gnosis.indexer does), for many searching purposes a little fuzziness would help. Perhaps you are not quite sure whether the old email you are looking for used the word "complicated," "complications," "complicating," or "complicates," but you remember that it was one of the general concepts involved (probably along with a few other terms to perform a useful search).

NLTK includes an excellent algorithm for word stemming, and lets you customize stemming algorithms further to your liking. E.g.:

Stemming words for morphological roots

>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'

Exactly how you might utilize stemming within gnosis.indexer, a tool derived from it, or a wholly different indexing tool, depends on your usage scenarios. Fortunately, gnosis.indexer has an open interface that is easy to specialize. Do you want an index composed entirely of stems? Or do you want to include both full words and stems in the index? Are matches on exact words to be ranked higher than matches on stems? Do you want to separate stem matches from exact matches in your results? I will include some sort of stemming capability in future versions of gnosis.indexer, but end users might still want to customize differently.

In general, however, adding stemming is very simple: First, derive stems from a document by specializing gnosis.indexer.TextSplitter; second, when you perform a search (optionally) stem the search terms before using them for index lookup, probably by customizing your MyIndexer.find() method.
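
As a very rough sketch of that first step, something like the class below might do. To be clear, this is hypothetical: it assumes that gnosis.indexer.TextSplitter exposes a text_splitter(text) method returning a list of words (the same call used in a later listing), and the subclass name StemSplitter is my own invention:

A hypothetical stemming specialization of TextSplitter

>>> from gnosis.indexer import TextSplitter
>>> from nltk.stemmer.porter import PorterStemmer
>>> class StemSplitter(TextSplitter):
...     "Index stems rather than exact words (text_splitter() signature assumed)"
...     def text_splitter(self, text):
...         words = TextSplitter.text_splitter(self, text)
...         stem = PorterStemmer().stem_word
...         return [stem(word.lower()) for word in words]
...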

I made a discovery while playing with the PorterStemmer: the class nltk.tokenizer.WSTokenizer really is as bad as NLTK's tutorial warns. It is fine for filling a conceptual role, but for real-life texts you can do a lot better at identifying what counts as a "word." Fortunately, gnosis.indexer.TextSplitter is one such more robust tokenizer. For example:

Stemming based on poor NLTK tokenization

>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...     stemmer.stem(word)
...     stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial',
'"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
'"state', '"understand', '"weightless', '"whatev', '#', '#-----',
'#----------', '#-------------', '#---------------', '#b17:']

Looking at a few stems, the collection does not look all that useful for indexing. Many are not really words at all, others are compound phrases with dashes, and extraneous punctuation makes it in with the words. Let us try it with better tokenization:

Stemming using clever heuristics in tokenization

>>> from gnosis.indexer import TextSplitter as TS  # the more robust tokenizer
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
...     stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
'certain', 'certainli', 'chang', 'charm']

Here you can see several stems that correspond to multiple possible word expansions, and everything in the collection looks like a word or at least a morpheme. Tokenization matters a lot for random text collections; in fairness to NLTK, its bundled corpora have been packaged for easy and accurate tokenization with WSTokenizer(). For a robust real-world indexer, use robust tokenization.

Tagging, Chunking And Parsing

The largest part of NLTK consists of various parsers, of varying levels of sophistication. For the most part, this introduction will not explain their details, but I would like to give a first brush at what they hope to achieve.

As background, remember that tokens are special dictionaries--in particular, ones that can contain a key TAG to indicate the grammatical role of a word. NLTK corpora documents often come pre-tagged for parts-of-speech; but you can certainly add your own tags to untagged documents.
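
Since a Token is just a special dictionary, attaching a tag by hand amounts to setting a key, much as the 'foo' key was set in the first listing. A minimal sketch (the value 'NN' for a singular noun follows the convention visible in the chunking listing below; real tagging would use one of NLTK's tagger classes):

Attaching a TAG key by hand

>>> from nltk.tokenizer import Token
>>> cat = Token(TEXT='cat')
>>> cat['TAG'] = 'NN'   # mark it as a singular noun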

"Chunking" is something like "parsing lite." That is, chunking is based either on existing markup of grammatical components, or is something you add manually--or semi-automatically using regular expressions and program logic. But it is not really parsing, properly speaking (no production rules as such). For example:

Chunk parsing/tagging: words and bigger bits

>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
>>> print chunk_structure
(S:
  (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD>
  <on/IN>
  (NP: <the/DT> <mat/NN>))

Chunking, as mentioned, can also be done with the class nltk.parser.chunk.RegexpChunkParser, using pseudo-regular-expressions to describe series of tags making up a grammatical element. For example (from the tutorial):

Chunking with regular expressions on tags

>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>',
...               'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD> <on/IN>
  (NP: <the/DT> <mat/NN>))

True parsing gets us into a lot of theoretical areas. For example, top-down parsers are guaranteed to find every possible production, but can be extremely slow because of frequent (exponential-order) backtracking. Shift-reduce parsing is much more efficient, but can miss some productions. In either case, a grammar is declared in a manner similar to those created to parse artificial languages. This column has looked at some of those: SimpleParse, mx.TextTools, Spark, gnosis.xml.validity.

Beyond top-down and shift-reduce parsers, NLTK also offers "chart parsers," which create partial hypotheses about how a given sequence might be continued to fulfill a rule. This approach can be both efficient and complete. A quick (toy) example illustrates:

Defining basic productions for a context-free grammar

>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
...    S -> NP VP
...    VP -> V NP | VP PP
...    V -> "saw" | "ate"
...    NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...    Det -> "a" | "an" | "the" | "my"
...    N -> "dog" | "cat" | "cookie"
...    PP -> P NP
...    P -> "on" | "by" | "with"
...    ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S:
  (NP: <John>)
  (VP:
    (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
    (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S:
  (NP: <John>)
  (VP:
    (V: <saw>)
    (NP:
      (NP: (Det: <a>) (N: <cat>))
      (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))

A probabilistic context-free grammar (or PCFG) is a context-free grammar that associates a probability with each of its productions. Parsers for probabilistic grammars are bundled with NLTK as well.
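
Purely as a sketch of the notation, a PCFG version of the toy grammar above might be declared roughly as follows. The PCFG class, its parse() constructor, and the bracketed probabilities (which must sum to one for each left-hand side) are my assumptions by analogy with CFG.parse() above, so check the NLTK documentation for the exact names in your version:

A sketch of declaring a probabilistic context-free grammar

>>> # Assumed API: PCFG.parse() by analogy with CFG.parse() above
>>> pgrammar = PCFG.parse('''
...     S   -> NP VP                             [1.0]
...     VP  -> V NP   [0.7] | VP PP   [0.3]
...     NP  -> "John" [0.4] | Det N   [0.4] | NP PP [0.2]
...     PP  -> P NP                              [1.0]
...     V   -> "saw"  [0.6] | "ate"   [0.4]
...     Det -> "a"    [0.5] | "my"    [0.5]
...     N   -> "cat"  [0.5] | "cookie" [0.5]
...     P   -> "with"                            [1.0]
...     ''')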

Conclusion

There are important features of NLTK that this brief introduction could not get to. For example, NLTK has a whole framework for text classification using statistical techniques like "naive Bayesian" and "maximum entropy" models--heady stuff that I cannot yet explain, and could not here even if I had the space. But I think even NLTK's lower layers make it look like a useful framework for both pedagogical and practical applications.

Resources

The Natural Language Toolkit is hosted on SourceForge, where its home page, associated documentation, downloads, and various other resources can be found. The home page itself is at:

http://nltk.sourceforge.net/index.html

The documentation for NLTK is rooted at:

http://nltk.sourceforge.net/index.html

From there you can find API reference guides to several versions of the library. At the time I wrote this, 1.3 was stable and 1.4 was in alpha; by the time you read this, later versions will most likely exist.

Of particular use to the new user of NLTK--including me, as I wrote this article--are the series of tutorials at:

http://nltk.sourceforge.net/tutorial/index.html

Nine tutorials (as of this writing) generally cover respective subpackages of NLTK; three supplementary tutorials introduce Python more generally for linguistics students who may not already know the language (or for other folks). These tutorials are helpful and well written, but occasional details seem not to match the most current API version.

An earlier Charming Python column was about "Developing a full-text indexer in Python", and presented the tool gnosis.indexer:

http://www-106.ibm.com/developerworks/xml/library/l-pyind.html

Several earlier installments have looked at parsers for artificial languages:

SimpleParse: http://www-106.ibm.com/developerworks/linux/library/l-simple.html
Spark: http://www-106.ibm.com/developerworks/linux/library/l-spark.html
gnosis.xml.validity: http://www-106.ibm.com/developerworks/linux/library/l-cpdec.html
mx.TextTools: http://gnosis.cx/publish/programming/charming_python_14.html

About The Author

David Mertz had no idea he was writing prose this whole time. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Check out David's book Text Processing in Python (http://gnosis.cx/TPiP/).