Natural Language Annotation for Machine Learning

James Pustejovsky and Amber Stubbs

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Revision History for the :
  2012-03-06  Early release revision 1
  2012-03-26  Early release revision 2

See /catalog/?isbn=9781449306663 for release details.

ISBN: 978-1-449-30666-3

Table of Contents

Preface

1. The Basics
   The Importance of Language Annotation
   The Layers of Linguistic Description
   What is Natural Language Processing?
   A Brief History of Corpus Linguistics
   What is a Corpus?
   Early Use of Corpora
   Corpora Today
   Kinds of Annotation
   Language Data and Machine Learning
   Classification
   Clustering
   Structured Pattern Induction
   The Annotation Development Cycle
   Model the phenomenon
   Annotate with the Specification
   Train and Test the algorithms over the corpus
   Evaluate the results
   Revise the Model and Algorithms
   Summary

2. Defining Your Goal and Dataset
   Defining a goal
   The Statement of Purpose
   Refining your Goal: Informativity versus Correctness
   Background research
   Language Resources
   Organizations and Conferences
   NLP Challenges
   Assembling your dataset
   Collecting data from the
   Eliciting data from people
   Preparing your data for annotation
   Metadata
   Pre-processed data
   The size of your corpus
   Existing Corpora
   Distributions within corpora
   Summary

3. Building Your Model and Specification