Ebook Python Text Processing with NLTK 2.0 Cookbook

Publisher: Packt Publishing
Author: Jacob Perkins
Pages: 272
Publication date: 2010
File format: PDF

Chapter 1: Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Tokenizing sentences into words
Tokenizing sentences using regular expressions
Filtering stopwords in a tokenized sentence
Looking up synsets for a word in WordNet
Looking up lemmas and synonyms in WordNet
Calculating WordNet synset similarity
Discovering word collocations

Chapter 2: Replacing and Correcting Words
Introduction
Stemming words
Lemmatizing words with WordNet
Translating text with Babelfish
Replacing words matching regular expressions
Removing repeating characters
Spelling correction with Enchant
Replacing synonyms
Replacing negations with antonyms

Chapter 3: Creating Custom Corpora
Introduction
Setting up a custom corpus
Creating a word list corpus
Creating a part-of-speech tagged word corpus
Creating a chunked phrase corpus
Creating a categorized text corpus
Creating a categorized chunk corpus reader
Lazy corpus loading
Creating a custom corpus view
Creating a MongoDB backed corpus reader
Corpus editing with file locking

Chapter 4: Part-of-Speech Tagging
Introduction
Default tagging
Training a unigram part-of-speech tagger
Combining taggers with backoff tagging
Training and combining Ngram taggers
Creating a model of likely word tags
Tagging with regular expressions
Affix tagging
Training a Brill tagger
Training the TnT tagger
Using WordNet for tagging
Tagging proper names
Classifier based tagging

Chapter 5: Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Merging and splitting chunks with regular expressions
Expanding and removing chunks with regular expressions
Partial parsing with regular expressions
Training a tagger-based chunker
Classification-based chunking
Extracting named entities
Extracting proper noun chunks
Extracting location chunks
Training a named entity chunker

Chapter 6: Transforming Chunks and Trees
Introduction
Filtering insignificant words
Correcting verb forms
Swapping verb phrases
Swapping noun cardinals
Swapping infinitive phrases
Singularizing plural nouns
Chaining chunk transformations
Converting a chunk tree to text
Flattening a deep tree
Creating a shallow tree
Converting tree nodes

Chapter 7: Text Classification
Introduction
Bag of Words feature extraction
Training a naive Bayes classifier
Training a decision tree classifier
Training a maximum entropy classifier
Measuring precision and recall of a classifier
Calculating high information words
Combining classifiers with voting
Classifying with multiple binary classifiers

Chapter 8: Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Distributed chunking with execnet
Parallel list processing with execnet
Storing a frequency distribution in Redis
Storing a conditional frequency distribution in Redis
Storing an ordered dictionary in Redis
Distributed word scoring with Redis and execnet

Chapter 9: Parsing Specific Data
Introduction
Parsing dates and times with Dateutil
Time zone lookup and conversion
Tagging temporal expressions with Timex
Extracting URLs from HTML with lxml
Cleaning and stripping HTML
Converting HTML entities with BeautifulSoup
Detecting and converting character encodings