University of California at Berkeley
EECS Instructional Support Group
/share/b/pub/ldc.help
Jan 2, 2008

LDC Corpora
-----------

EECS has subscribed to the Linguistic Data Consortium (LDC), a source of
data and tools that support linguistic research (http://ldc.upenn.edu/).
This was requested by Profs Wilensky and Klein in Spring 2005 for classes
such as CS288, CS294-5 and CS298-13.

Documentation is available under http://inst.eecs.berkeley.edu/~inst/pub/
(requires an EECS Instructional UNIX account to log in). Each LDC package
has an "index.html" file that describes the package. Instructions for
reading the data are in http://www.ldc.upenn.edu/Using/

EECS Instruction funded the annual Not-for-Profit Standard membership for
years 2005 and 2006. This includes up to 16 free "corpora" (data sets)
each year from the catalog for the current year. Other corpora typically
cost an additional $100 to $3000 each (see
http://www.ldc.upenn.edu/Catalog/).

EECS researchers may copy our corpora to their servers, and they may fund
additional corpora. Please contact kevinm@eecs.berkeley.edu for help.

The corpora are on a disk that is accessible as

    /home/tmp/LDC      on Instructional UNIX and MacOSX computers
    \\ping\tmp\LDC     on Instructional Windows computers

We have installed these "corpora" (data sets):

----------     -------------------------------------------------------------
LDC2002L49     Buckwalter Arabic Morphological Analyzer Version 1.0 [2002]
LDC2005T01U01  Chinese Treebank 5.1 [2005]
LDC2005T05     Multiple-Translation Arabic (MTA) Part 2 [2005]
LDC2005T06     Chinese News Translation Text Part 1 [2005]
LDC2005T08     Discourse Graphbank [2005]
LDC2005T23     Chinese Proposition Bank 1.0 [2005]
LDC2005T33     BBN Pronoun Coreference and Entity Type Corpus [2005]
----------     -------------------------------------------------------------

This data may be removed from the Instructional server when classes do
not need it. We will retain the sources on CDs and DVDs.
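
To check which packages are currently present on the disk described
above, something like the following Python sketch may help. It assumes,
hypothetically, that each corpus sits in its own subdirectory of
/home/tmp/LDC named after its catalog ID; this file does not describe the
directory layout, so adjust as needed. The function name list_packages is
made up for this example. The only facts taken from this file are the
UNIX/MacOSX path and the per-package "index.html".

```python
from pathlib import Path


def list_packages(root):
    """Return (name, has_index) pairs for each subdirectory of root.

    Assumption: one subdirectory per corpus (for example LDC2005T08).
    The actual layout on the Instructional disk may differ.
    """
    root = Path(root)
    packages = []
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            # Each LDC package is documented as having an index.html file.
            packages.append((entry.name, (entry / "index.html").is_file()))
    return packages


if __name__ == "__main__":
    ldc_root = Path("/home/tmp/LDC")  # UNIX/MacOSX path given in this file
    if ldc_root.is_dir():
        for name, has_index in list_packages(ldc_root):
            note = "index.html present" if has_index else "no index.html found"
            print(f"{name:15} {note}")
    else:
        print(f"{ldc_root} is not accessible from this machine")
```

If the directory is missing (for example, after the data has been removed
from the server), the script reports that instead of failing.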

These are available but have not been installed:

----------     -------------------------------------------------------------
LDC2005T01     Chinese Treebank 5.0 [2005]
LDC2005T02     Arabic Treebank: Part 1 v 3.0 [2005]
LDC2005T09     ACE 2004 Multilingual Training Corpus [2005]
LDC2005T10     Chinese English News Magazine Parallel Text [2005]
LDC2005T12     English Gigaword Second Edition [2005]
LDC2005T13     CCGbank
LDC2005T14     Chinese Gigaword Second Edition [2005]
LDC2005T20     Arabic Treebank: Part 3 (full corpus) v2.0 (MPG+Syn Anal) [2005]
LDC2006S13     N4 NATO Native and Non-Native Speech *
LDC2006S29     Levantine Arabic QT Training Data Set 5, Speech
LDC2006S30     Speech Controlled Computing
LDC2006S31     NIST 2003 Language Recognition Evaluation
LDC2006S33     Middle East Technical University Turkish Microphone Speech V1.0
LDC2006S34     Russian through Switched Telephone Network (RuSTeN)
LDC2006S36     West Point Korean Speech (2 DVDs)
LDC2006S37     West Point Heroico Spanish Speech
LDC2006S42     Korean Broadcast News Speech
LDC2006S43     Gulf Arabic Conversational Telephone Speech
LDC2006S44     2004 NIST Speaker Recognition Evaluation
LDC2006S45     Iraqi Arabic Conversational Telephone Speech
LDC2006S46     Arabic Broadcast News Speech
LDC2006T02     Arabic Gigaword Second Edition
LDC2006T03     Korean Propbank *
LDC2006T04     Multiple Translation Chinese (MTC) Part 4
LDC2006T06     ACE 2005 Multilingual Training Corpus
LDC2006T07     Levantine Arabic QT Training Data Set 5, Transcripts
LDC2006T09     Korean Treebank Annotations Version 2.0 *
LDC2006T10     English-Arabic Treebank V1.0
LDC2006T12     Spanish Gigaword First Edition
LDC2006T14     Korean Broadcast News Transcripts
LDC2006T15     Gulf Arabic Conversational Telephone Speech, Transcripts
LDC2006T16     Iraqi Arabic Conversational Telephone Speech Transcripts
LDC2006T17     French Gigaword First Edition
LDC2006T18     TDT5 Multilingual Text
LDC2006T19     TDT5 Topics and Annotations
LDC2006T20     Arabic Broadcast News Transcripts
LDC2007S10     2003 NIST Rich Transcription Evaluation
LDC2007S11     2004 Spring NIST Rich Transcription (RT-04S) Development Data
LDC2007S12     2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
LDC2007T03     Tagged Chinese Gigaword
LDC2007T07     English Gigaword Third Edition (2 DVDs)
LDC2007T09     ISI Chinese-English Automatically Extracted Parallel Text
LDC2007T20     GALE Phase 1 Distillation Training
LDC2007T21     OntoNotes v 1.0
LDC2007T23     GALE Phase 1 Chinese Broadcast News Parallel Text Part 1
LDC2007T24     Arabic Broadcast News Parallel Text - Part 1
LDC2007T36     Chinese Treebank 6.0 (CTB6.0)
LDC2007T38     Chinese Gigaword Third Edition
LDC2007T40     Arabic Gigaword Third Edition
LDC2007V02     TRECVID 2003 Keyframes & Transcripts
----------     -------------------------------------------------------------

* Usage is restricted:

LDC2006S13 usage is restricted by this agreement:
    http://www.ldc.upenn.edu/Catalog/mem_agree/NATO_User_Agreement.html

LDC2006T03 usage is restricted by this agreement:
    http://www.ldc.upenn.edu/Catalog/mem_agree/Korean_Propbank_User_Agreement.html

LDC2006T09 usage is restricted by this agreement:
    http://www.ldc.upenn.edu/Catalog/mem_agree/Korean_Treebank_2_User_Agreement.html

EECS Instructional Support
378/386 Cory, 333 Soda
inst@eecs.berkeley.edu