Corpus Resources

http://140.109.18.114/blog/?p=1049

5. Sketch Engine multilingual corpora

www.sketchengine.co.uk

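Each email address can be registered once, with a free trial period of one month. When the trial expires, you can register again with a different email address. The Chinese corpus is a raw, unprocessed corpus and is of limited use. The main attraction is the English corpus, which is in fact the BNC, previously available only to paying users, and is well worth using.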

6. COCA: Corpus of Contemporary American English

http://www.americancorpus.org/

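Developed by Professor Mark Davies of Brigham Young University in the United States, this corpus of contemporary American English contains some 360 million words and is currently the largest balanced corpus of English in the world. Unlike other corpora, it is freely available online, which makes it a valuable resource for learners of English everywhere and an excellent window onto how American English is used and how it changes.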

(The above is from: http://blog.sina.com.cn/gjxyxkgy)

Overview of corpus construction in China and abroad

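The Jieyi (捷译) web-based bilingual corpus alignment tool (automatic + manual), from the Department of Language Information Engineering at Peking University, is open for registration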

It can be accessed at http://aligner.pkucat.com

Documentation: http://aligner.pkucat.net/doc/html/

Anyone who needs access can apply by writing to the instructor at yjs@pkucat.com, stating who they are and why they need it.

English-Chinese parallel corpora confirmed to be available (partial list)

–TEC

http://www.umist.ac.uk/ctis/research/research-overview.htm

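Among translation corpora, the best known is the Translational English Corpus (TEC), the world's first translation corpus, created in 1995 by the translation research centre at the University of Manchester Institute of Science and Technology (UMIST) in the UK. The corpus mainly collects texts translated into English from a range of languages. It currently holds on the order of ten million words (the target is 50 million) and is divided into four subcorpora: fiction (about 80%), biography, newspapers, and periodicals. It does not require the texts to be bilingually aligned.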

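The corpus is not only tagged but also annotated with a good deal of extralinguistic information: details of the translator (name, gender, nationality, occupation, translation direction, and so on), the mode of translation, the type of translation, the source language, details of the original book, the publisher, and so on.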
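As a rough illustration of the kind of header record such annotation implies, here is a minimal Python sketch. The class names, field names, and sample values are all hypothetical placeholders chosen to mirror the categories listed above. They are not TEC's actual header format, which is not shown here.

```python
from dataclasses import dataclass

@dataclass
class Translator:
    # Hypothetical fields mirroring the translator details listed above.
    name: str
    gender: str
    nationality: str
    occupation: str
    direction: str  # translation direction, e.g. "into mother tongue"

@dataclass
class TextHeader:
    # Hypothetical per-text metadata record; not the real TEC schema.
    title: str
    translator: Translator
    translation_mode: str
    translation_type: str
    source_language: str
    original_title: str
    publisher: str

def by_source_language(headers, language):
    """Return the headers whose source language matches `language`."""
    return [h for h in headers if h.source_language == language]

# Placeholder records, invented purely for illustration.
sample_headers = [
    TextHeader("Text A", Translator("Translator A", "female", "n/a", "n/a",
               "into mother tongue"), "full", "fiction", "French",
               "Original A", "Publisher A"),
    TextHeader("Text B", Translator("Translator B", "male", "n/a", "n/a",
               "into mother tongue"), "full", "biography", "German",
               "Original B", "Publisher B"),
]

french_texts = by_source_language(sample_headers, "French")
print([h.title for h in french_texts])
```

Keeping the metadata in a structured record like this is what makes it possible to select subsets of a translation corpus by translator or source language before running any linguistic query.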

–Peking University bilingual corpus

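The bilingual corpus of the Institute of Computational Linguistics at Peking University already contains more than 50,000 aligned English-Chinese sentence pairs, and the institute has developed an alignment tool and bilingual corpus management software to go with it. A Chinese-English phrase bank is being built on this basis and is expected to reach several hundred thousand entries.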

–Chinese-English Online (CEO), now open for testing

The URL is http://www.fleric.org.cn/ceo/

–Chinese-English parallel corpus of Hong Lou Meng (Dream of the Red Chamber)

http://score.crpp.nie.edu.sg/hlm/index.htm

–The Babel English-Chinese Parallel Corpus
http://www.lancs.ac.uk/fass/projects…abel/babel.htm
The Babel English-Chinese Parallel Corpus, which was created on our research project Contrasting English and Chinese (ESRC Award Reference RES-000-23-0553), consists of 327 English articles and their translations in Mandarin Chinese. Of these, 115 texts (121,493 English tokens plus 135,493 Chinese tokens) were collected from the World of English between October 2000 and February 2001, while the remaining 212 texts (132,140 English tokens plus 151,969 Chinese tokens) were collected from Time from September 2000 to January 2001. The corpus contains a total of 544,095 words (253,633 English words and 287,462 Chinese tokens).
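The counts quoted above can be cross-checked with a few lines of Python. The per-subcorpus figures do add up to the stated text count and to the stated English and Chinese totals, but those two totals sum to 541,095 rather than the stated 544,095. One of the quoted figures is therefore probably a typo, and the description alone does not show which one.

```python
# Figures exactly as quoted in the corpus description above.
subcorpora = {
    "World of English": {"texts": 115, "english": 121_493, "chinese": 135_493},
    "Time": {"texts": 212, "english": 132_140, "chinese": 151_969},
}
stated = {"texts": 327, "english": 253_633, "chinese": 287_462, "total": 544_095}

def summed(field):
    """Add up one field across both subcorpora."""
    return sum(part[field] for part in subcorpora.values())

computed = {field: summed(field) for field in ("texts", "english", "chinese")}
computed["total"] = computed["english"] + computed["chinese"]

for field, value in computed.items():
    gap = stated[field] - value
    status = "matches" if gap == 0 else f"stated figure is higher by {gap}"
    print(f"{field}: computed {value}, stated {stated[field]} ({status})")
```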

The corpus is tagged for part of speech and aligned at the sentence level. The English texts were tagged using the CLAWS C7 tagset, while the Chinese texts were tagged using the Peking University tagset. Sentence alignment was done automatically and corrected by hand. The corpus is also marked up for paragraphs and sentences, but different markup systems were adopted for the two subcorpora: for the World of English component, sentences were numbered consecutively throughout each text, whereas for Time, sentences were numbered within each paragraph.
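Anyone processing both subcorpora together will probably want a single sentence-numbering scheme. The sketch below maps per-paragraph numbering (as described for Time) onto consecutive numbering (as described for the World of English). It assumes the sentence identifiers have already been extracted into (paragraph number, sentence number) pairs in document order. That representation and the function name are assumptions for illustration, since the corpus's actual markup is not shown here.

```python
def consecutive_ids(sentences):
    """Map (paragraph_no, sentence_no_in_paragraph) pairs, given in
    document order, to consecutive sentence numbers starting at 1."""
    mapping = {}
    for global_id, key in enumerate(sentences, start=1):
        if key in mapping:
            raise ValueError(f"duplicate sentence identifier: {key}")
        mapping[key] = global_id
    return mapping

# Hypothetical text: paragraph 1 has two sentences, paragraph 2 has three.
time_style = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
renumbered = consecutive_ids(time_style)
print(renumbered[(2, 1)])  # first sentence of paragraph 2
```

With both subcorpora on the same consecutive scheme, aligned sentence pairs can be looked up by a single integer regardless of which source a text came from.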