David Mertz, Ph.D.
Linguistic Hack, Gnosis Software
June, 2004
The Natural Language Toolkit is a Python library for analyzing and otherwise processing collections of textual data, particularly in terms of the concepts and techniques developed in academic linguistics. Some of these techniques overlap with what goes by the name "text processing"--or perhaps with lexing/parsing in computer science--but other capabilities for syntactic and even semantic analysis are specialized to the more subtle texts and grammars of natural languages.
Your humble writer knows a little bit about a lot of things; but despite writing a fair amount about text processing (a book, for example), linguistic processing is a relatively novel area for me. Forgive me if I stumble through my explanations of the quite remarkable Natural Language Toolkit (NLTK), a wonderful tool for teaching, and working in, computational linguistics using Python.
It is natural to think of NLTK as a stacked series of layers that build on each other. Readers who are familiar with lexing and parsing of artificial languages (like, say, Python) will not have too much of a leap to understand the similar--but deeper--layers involved in natural language modelling. While NLTK comes with a number of corpora that have been pre-processed (often manually) to various degrees, conceptually each layer relies on the processing in the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, like noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); finally sentences or other grammatical units can be classified. Along the way, NLTK gives you the ability to generate statistics about occurrences of various elements, and to draw graphs that represent either the processing itself or statistical aggregates in results.
For this first article, I will present some comparatively fleshed out examples from the lower-level capabilities, but simply describe abstractly most of the higher level capabilities. If I have the opportunity to return to NLTK in a later installment, I will give more detailed descriptions of parsing and graphing; for now, let us take the first steps past text processing, narrowly construed.
Much of what you can do with NLTK, particularly at its lower levels, is not that much different from what you can do with Python's basic data structures. But NLTK provides a set of regularized interfaces that are relied on and utilized at higher levels, as well as simply providing convenient classes to hold tokenized and tagged texts.
In particular the class nltk.tokenizer.Token
is used very widely to
store annotated segments of text; these annotations can mark a number
of different features, including: parts-of-speech; subtoken structures;
offsets of a token within a larger text; morphological stems;
grammatical sentence component; and so on. In fact, a Token
is a
special kind of dictionary--and is accessed in the fashion of a
dictionary--so it can contain whatever keys you like. A few special
keys are used in NLTK, different ones by the various subpackages.
Let us look briefly at creating a token and breaking it into subtokens:
>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True)  # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c], <test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar', SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c], <test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
<class 'nltk.token.SafeToken'>
One fairly simple thing you are likely to do with linguistic corpora is analyze frequencies of various events within them, and make probability predictions based on these known frequencies. NLTK supports a variety of techniques for projecting probabilities based on raw frequency data; I will not cover those here (see the Probability Tutorial mentioned in Resources), but suffice it to say that what you are warranted to expect has a slightly fuzzy relationship to what you already know (beyond the obvious scaling/normalization).
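To make the scaling/normalization point concrete, here is a minimal sketch in plain Python--not NLTK's probability classes--contrasting a raw maximum-likelihood estimate with simple Laplace smoothing; the word counts and vocabulary size are invented purely for illustration:

# Invented word counts, for illustration only
counts = {'the': 50, 'cat': 3, 'sat': 2}
total = sum(counts.values())

# Maximum-likelihood estimate: simply normalize the raw frequencies
def mle(word):
    return float(counts.get(word, 0)) / total

# Laplace ("add-one") smoothing: reserve some probability mass for
# words that never occurred in the sample
vocab_size = 1000    # assumed size of the full vocabulary
def laplace(word):
    return (counts.get(word, 0) + 1.0) / (total + vocab_size)

print mle('cat'), laplace('cat')    # a seen word
print mle('dog'), laplace('dog')    # an unseen word still gets some probability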
In essence, there are two types of frequency supported by NLTK:
histograms and conditional frequencies. The class
nltk.probability.FreqDist
is used to create histograms; for example,
a word histogram can be created with:
>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...     freq.inc(word['TEXT'])
...
>>> freq.B()
1194
>>> freq.count('Python')
12
The tutorial discusses creation of histograms on more complex
features, like "the length of words following words ending in vowels."
The class nltk.draw.plot.Plot
is useful for visualization of
histograms. Of course, you can equally analyze frequencies of higher
level grammatical features, or even of data sets unrelated to NLTK.
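As a rough sketch of that sort of derived feature, here is how the vowel example might be tallied, reusing the article token and the FreqDist methods from the session above:

# Sketch only: histogram of lengths of words that follow a vowel-final
# word, reusing the 'article' token and FreqDist from the session above
vowel_freq = FreqDist()
subtoks = article['SUBTOKENS']
for prev, word in zip(subtoks, subtoks[1:]):
    if prev['TEXT'][-1].lower() in 'aeiou':
        vowel_freq.inc(len(word['TEXT']))
# vowel_freq.count(5), for example, now holds the number of 5-letter
# words that followed a vowel-final word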
Conditional frequencies are perhaps more interesting than plain histograms. A conditional frequency is a kind of two dimensional histogram--it gives you one histogram per initial condition, or "context." For example, the tutorial suggests the question of what the distribution of word lengths is for each first letter. We can analyze that with:
>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...     cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...     print "Init %s:" % c,
...     for length in range(1,6):
...         print "len %d/%.2f," % (length, cf[c].freq(length)),
...     print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,
A nice linguistic use of conditional frequencies is to analyze the syntagmatic distributions in corpora--for example, given the occurrence of a particular word, what words are most likely to come next. Grammars provide some constraints here, of course; but the study of selection among syntactic options falls within the fields of semantics, pragmatics, and register.
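A minimal sketch of that kind of "what follows what" tally, again assuming the tokenized article and using only the ConditionalFreqDist methods shown above:

# Sketch: condition on a word, and count the words that immediately
# follow it in the article tokenized above
follow = ConditionalFreqDist()
subtoks = article['SUBTOKENS']
for prev, word in zip(subtoks, subtoks[1:]):
    follow[prev['TEXT'].lower()].inc(word['TEXT'].lower())
# follow['the'].freq('python') is now the relative frequency with which
# "python" immediately follows "the" in this particular article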
The class nltk.stemmer.porter.PorterStemmer
is a wonderfully handy
tool to derive grammatical (prefix) stems from English words. This
capability struck a particular chord for me, having previously created
a public domain full-text indexed search tool/library in Python
(covered in this column, see Resources; and used by a moderately large
number of other projects).
While it is quite useful to be able to search a large collection of
documents almost instantly for a joint occurrence of a collection of
exact words (what gnosis.indexer
does), for many searching purposes
a little fuzziness would help. Perhaps you are not quite sure whether
the old email you are looking for used the word "complicated,"
"complications," "complicating," or "complicates," but you remember
that something like it expressed one of the general concepts involved
(probably combined with a few other terms to make a useful search).
NLTK includes an excellent algorithm for word stemming, and lets you customize stemming algorithms further to your liking. E.g.:
>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'
Exactly how you might utilize stemming within gnosis.indexer
, a tool
derived from it, or a wholly different indexing tool, depends on your
usage scenarios. Fortunately, gnosis.indexer
has an open interface
that is easy to specialize. Do you want an index composed entirely of
stems? Or do you want to include both full words and stems in the
index? Are matches on exact words to be ranked higher than matches on
stems? Do you want to separate stem matches from exact matches in your
results? I will include some sort of stemming capability in future
versions of gnosis.indexer
, but end users might still want to
customize differently.
In general, however, adding stemming is very simple: First, derive
stems from a document by specializing gnosis.indexer.TextSplitter
;
second, when you perform a search (optionally) stem the search terms
before using them for index lookup, probably by customizing your
MyIndexer.find()
method.
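To make those two steps concrete, here is a sketch of the idea only, with a plain dictionary standing in for the real index (gnosis.indexer's actual interface is not reproduced here, and the document name is invented):

from nltk.stemmer.porter import PorterStemmer

stemmer = PorterStemmer()

def add_document(index, doc_id, words):
    # index each document under the stems of its words
    for word in words:
        stem = stemmer.stem_word(word.lower())
        index.setdefault(stem, []).append(doc_id)

def find(index, query_word):
    # stem the search term the same way before looking it up
    return index.get(stemmer.stem_word(query_word.lower()), [])

index = {}
add_document(index, 'email42', ['That', 'got', 'complicated', 'quickly'])
print find(index, 'complications')   # both forms reduce to the stem 'complic'

The same principle carries over whether the index lives in a dictionary, in gnosis.indexer, or in some other indexing tool: stem at indexing time, and stem again at query time.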
I made a discovery, while playing with the PorterStemmer
: the class
nltk.tokenizer.WSTokenizer
really is as bad as NLTK's tutorial
warns. It is fine as a conceptual placeholder, but for real-life
texts, you can do a lot better at identifying what a "word" is.
Fortunately, gnosis.indexer.TextSplitter
is one such more robust
tokenizer. For example:
>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...     stemmer.stem(word)
...     stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial', '"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."', '"state', '"understand', '"weightless', '"whatev', '#', '#-----', '#----------', '#-------------', '#---------------', '#b17:']
Looking at a few stems, the collection does not look all that useful for indexing. Many are not really words at all, others are compound phrases with dashes, and extraneous punctuation makes it in with the words. Let us try it with better tokenization:
>>> from gnosis.indexer import TextSplitter as TS  # assumes gnosis.indexer is installed
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
...     stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte', 'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast', 'certain', 'certainli', 'chang', 'charm']
Here you can see stems that each cover several possible word expansions,
and every entry looks like a word or at least like a morpheme. Tokenization
matters a lot for random text collections; in fairness to NLTK, its
bundled corpora have been packaged for easy and accurate tokenization
with WSTokenizer()
. For a robust real-world indexer, use robust
tokenization.
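I do not reproduce TextSplitter's actual implementation here, but even a simple regular-expression splitter along the following lines does noticeably better than naive whitespace splitting at deciding what counts as a word:

import re

# Sketch of a slightly more careful tokenizer: keep runs of letters
# (allowing internal apostrophes and hyphens), drop everything else
word_re = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def simple_words(text):
    return word_re.findall(text)

print simple_words('The "generator-based" approach, it\'s said, is "lazily" evaluated.')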
The largest part of NLTK consists of various parsers, of varying levels of sophistication. For the most part, this introduction will not explain their details, but I would like to give a first brush at what they hope to achieve.
As background, remember that tokens are special dictionaries--in
particular, ones that can contain a key TAG
to indicate the
grammatical role of a word. NLTK corpora documents often come
pre-tagged for parts-of-speech; but you can certainly add your own
tags to untagged documents.
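As a toy illustration--relying only on the dictionary-like behavior of tokens described above, and on an invented lookup table rather than one of NLTK's real taggers:

# Toy illustration only: attach TAG annotations by brute dictionary
# lookup, rather than with one of NLTK's real taggers
from nltk.tokenizer import *

t = Token(TEXT='the cat sat on the mat')
WSTokenizer().tokenize(t)

my_tags = {'the': 'DT', 'cat': 'NN', 'sat': 'VBD', 'on': 'IN', 'mat': 'NN'}
for word in t['SUBTOKENS']:
    word['TAG'] = my_tags.get(word['TEXT'].lower(), 'NN')  # default guess: noun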
"Chunking" is something like "parsing lite." That is, chunking is based either on existing markup of grammatical components, or is something you add manually--or semi-automatically using regular expressions and program logic. But it is not really parsing, properly speaking (no production rules as such). For example:
>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
>>> print chunk_structure
(S: (NP: <the/DT> <little/JJ> <cat/NN>) <sat/VBD> <on/IN> (NP: <the/DT> <mat/NN>))
Chunking, as mentioned, can be done with the class
nltk.parser.chunk.RegexpChunkParser
using pseudo-regular-expressions to
describe series of tags making up a grammatical element, e.g. (from the
tutorial):
>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>',
...     'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>) <sat/VBD> <on/IN> (NP: <the/DT> <mat/NN>))
True parsing gets us into a lot of theoretical areas. For example,
top-down parsers are guaranteed to find every possible production, but
can be extremely slow because of frequent (exponential-order)
backtracking. Shift-reduce parsing is much more efficient, but can
miss some productions. In either case, a grammar is declared in a
manner similar to those created to parse artificial languages. This
column has looked at some of those: SimpleParse
, mx.TextTools
,
Spark
, gnosis.xml.validity
.
Beyond top-down and shift-reduce parsers, NLTK also offers "chart parsers" that create partial hypotheses that a given sequence can be continued to fulfill a rule. This approach can be both efficient and complete. A quick (toy) example illustrates:
>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
...    S -> NP VP
...    VP -> V NP | VP PP
...    V -> "saw" | "ate"
...    NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...    Det -> "a" | "an" | "the" | "my"
...    N -> "dog" | "cat" | "cookie"
...    PP -> P NP
...    P -> "on" | "by" | "with"
...    ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
...
(S: (NP: <John>) (VP: (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>))) (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S: (NP: <John>) (VP: (V: <saw>) (NP: (NP: (Det: <a>) (N: <cat>)) (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))
A probabilistic context free grammar (or PCFG) is a context free grammar that associates a probability with each of its productions. Again, parsers for probabilistic parsing are also bundled with NLTK.
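NLTK's own PCFG classes are not shown here, but the underlying idea is easy to sketch in plain Python: each production carries a probability, and the probability of a parse tree is the product of the probabilities of the productions used to build it (the numbers below are invented):

# Invented production probabilities, for illustration only
production_prob = {
    ('S',   ('NP', 'VP')):  1.0,
    ('VP',  ('V', 'NP')):   0.7,
    ('VP',  ('VP', 'PP')):  0.3,
    ('NP',  ('Det', 'N')):  0.5,
    ('NP',  ('NP', 'PP')):  0.2,
    ('NP',  ('John',)):     0.3,
    ('V',   ('saw',)):      0.5,
    ('Det', ('a',)):        0.4,
    ('N',   ('cat',)):      0.5,
}

def tree_prob(productions):
    # probability of one parse = product of its production probabilities
    p = 1.0
    for prod in productions:
        p *= production_prob[prod]
    return p

# One derivation of "John saw a cat"; a probabilistic parser prefers
# whichever candidate tree has the highest such product
print tree_prob([('S', ('NP', 'VP')), ('NP', ('John',)),
                 ('VP', ('V', 'NP')), ('V', ('saw',)),
                 ('NP', ('Det', 'N')), ('Det', ('a',)), ('N', ('cat',))])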
There are important features of NLTK this brief introduction could not get to. For example, NLTK has a whole framework for text classification using statistical techniques like "naive Bayesian" and "maximum entropy" models. Heady stuff that I cannot yet explain, even if I had space. But I think even NLTK's lower layers make it look like a useful framework for both pedagogical and practical applications.
The Natural Language Toolkit is hosted by SourceForge, and its home page, associated documentation, downloads, and various other resources can be found there. The home page itself is at:
http://nltk.sourceforge.net/index.html
The documentation for NLTK is rooted at:
http://nltk.sourceforge.net/index.html
From there you can find API reference guides to several versions of the library. At the time I wrote this, 1.3 was stable and 1.4 was in alpha; by the time you read this, later versions will most likely exist.
Of particular use to the new user of NLTK--including me, as I wrote this article--are the series of tutorials at:
http://nltk.sourceforge.net/tutorial/index.html
Nine tutorials (as of this writing) generally cover respective subpackages of NLTK; three supplementary tutorials introduce Python more generally for linguistics students who may not already know the language (or for other folks). These tutorials are helpful and well written, but occasional details seem not to match the most current API version.
An earlier Charming Python column was about "Developing a full-text
indexer in Python", and presented the tool gnosis.indexer
:
http://www-106.ibm.com/developerworks/xml/library/l-pyind.html
Several earlier installments have looked at parsers for artificial languages:
SimpleParse: http://www-106.ibm.com/developerworks/linux/library/l-simple.html
Spark: http://www-106.ibm.com/developerworks/linux/library/l-spark.html
gnosis.xml.validity: http://www-106.ibm.com/developerworks/linux/library/l-cpdec.html
mx.TextTools: http://gnosis.cx/publish/programming/charming_python_14.html
David Mertz had no idea he was writing prose this whole time. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Check out David's book Text Processing in Python (http://gnosis.cx/TPiP/).