Oslo Corpus of Bosnian Texts: a corpus of approximately 1.6 million
words, encoded with the IMS Corpus Workbench developed at the Institut
für Maschinelle Sprachverarbeitung at the University of Stuttgart.
English Section of the
Helsinki Corpus of English Texts,
allowing searches on lexical items and syntactic structure.
The Brown University Corpus: Approximately 1,000,000 words of American
written English dating from 1961. The genre categories are parallel to those of the LOB corpus.
The CobuildDirect service: an on-line service for accessing a corpus of modern
English language text, written and spoken. You may take out a six-month
or full-year subscription to CobuildDirect.
If you're interested in the English language -- especially if you are a teacher
or a learner of English -- then these Web pages are for you. The team
at Cobuild works with a huge "corpus" of modern English text held on
computer to analyse language usage.
Corpus of Spoken Professional
American-English: The corpus, which has been constructed from a selection
of existing transcripts of interactions in professional settings, contains
two main sub-corpora of a million words each. One sub-corpus consists
of academic discussions such as faculty council meetings and committee
meetings related to testing. The second sub-corpus contains transcripts
of White House press conferences, which are almost exclusively
question-and-answer sessions.
The Helsinki Corpus
(Diachronic Part): samples from texts from the
Old, Middle, and Early Modern English periods; 1,500,000 words in total.
Lancaster/Oslo-Bergen Corpus (LOB): Approximately 1,000,000 words of
British written English dating from 1961. The corpus is made up of 15
genre categories. Available as orthographic text and tagged with the CLAWS
part-of-speech tagging system. The Leeds-Lancaster Treebank and the Lancaster
Parsed Corpus are analyzed subsamples of the LOB corpus. (See also
The directory contains the transcriptions of the London-Lund corpus. The work
took place as part of the Beach/CoMoPro projects at the University of
Edinburgh's Centre for Cognitive Science.
The Survey of English Usage (SEU) is an English language research unit, based in the Department of
English Language and Literature at University College London.
Parsed Corpus of Old English Poetry: The York Poetry Corpus contains
71,490 words of Old English text; the samples from the longer texts are 4,000 to 17,000
words in length. The texts represent a range of dates of composition and
authors. The size of the corpus is approximately 2.5 megabytes.
A USENET corpus (2005-2007)
This corpus is a collection of public USENET postings. It was
collected between Oct 2005 and Jan 2007, and covers 47,860 English-language,
non-binary-file newsgroups.
CORGA (Reference Corpus of
Present-day Galician Language)
The TIGER project is pleased to announce the release of the first
version of the TIGER Corpus. This treebank consists of approximately 700,000
tokens (40,000 sentences) of German newspaper text. It was
semi-automatically annotated with part-of-speech tags and syntactic structure.
NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. The
NEGRA corpus consists of 20,602 sentences of German newspaper text
taken from the Frankfurter Rundschau.
corpus of contemporary written Italian
dell'italiano parlato), containing an online edition of the 500,000-word corpus.
The edition is enriched with POS tags and lemmata.
The ERROR CORPUS is a compilation of spelling errors produced by 333
users of the English language. The age of the subjects ranges from 14
to 59, and their occupations vary from junior-high school student to
The Malay Concordance Project provides a corpus of classical Malay texts (now
nearly 4 million words, including over 50,000 verses) which can be searched on-line.
European Parliament Proceedings 1996-2001 : This parallel corpus is extracted from the
proceedings of the European Parliament. It includes versions in 11
languages: Romance (French, Italian, Spanish, Portuguese), Germanic
(English, Dutch, German, Danish, Swedish), Greek and Finnish.
The JRC-Acquis Multilingual Parallel Corpus : The JRC-Acquis covers the 20 official EU languages plus Romanian. Norwegian
is thus not included, but several other Scandinavian languages are. The
corpus is paragraph-aligned for each of the 190 language pairs.
OPUS is an attempt to collect translated texts from the web, to convert
and align the entire collection, to add linguistic data, and to provide
the community with a publicly available parallel corpus.
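Such collections stand or fall on the alignment step. The sketch below is not OPUS's actual pipeline; the function, penalty value and toy paragraphs are all invented for illustration. It shows the core idea behind length-based alignment in the Gale-Church tradition: a dynamic program that matches paragraphs whose character lengths have a similar ratio, and pays a fixed penalty to leave a paragraph unmatched.

```python
import math

def align(src, tgt, skip_penalty=3.0):
    """Dynamic-programming paragraph aligner (toy version).
    1-1 matches cost the |log| ratio of paragraph lengths;
    leaving a paragraph unmatched (1-0 / 0-1) costs skip_penalty."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + abs(math.log((len(src[i]) + 1) / (len(tgt[j]) + 1)))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n and cost[i][j] + skip_penalty < cost[i + 1][j]:  # skip src
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + skip_penalty, (i, j)
            if j < m and cost[i][j] + skip_penalty < cost[i][j + 1]:  # skip tgt
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + skip_penalty, (i, j)
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:  # this step was a 1-1 match
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]

src = ["A short paragraph.", "A much longer paragraph about the parliament."]
tgt = ["Un paragraphe court.", "Note absent in source",
       "Un paragraphe beaucoup plus long sur le parlement."]
print(align(src, tgt))  # pairs of (src index, tgt index); tgt[1] is skipped
```

Real aligners replace the crude length ratio with a proper statistical length model and allow 2-1 and 1-2 merges, but the dynamic program has this same shape.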
travel stories of the 19th and early 20th century (Darwin, Loti,
Stendhal, Flaubert, Dickens, London, etc.), translated into English,
French, Italian and Spanish.
Linguistic Data Consortium
Linguistic Data Consortium is an open consortium of universities, companies
and government research laboratories. It creates, collects and distributes
speech and text databases, lexicons, and other resources for research and
development purposes. The University of Pennsylvania is the LDC's host institution.
ICAME (International Computer Archive of Modern and Medieval English)
ASSOCIATION: The overall goal of ELRA is to provide a centralized organization
for the validation, management, and distribution of speech, text, and terminology
resources and tools, and to promote their use within the European
Archive - Text Center
Alex: A Catalogue of Electronic Texts on the Internet : Alex allows users to find
and retrieve the full-text of documents on the Internet. It
indexes over 700 books and shorter texts by author and title,
texts from Project Gutenberg, Wiretap, the On-line Book Initiative, the
Eris system at Virginia Tech, the English Server at Carnegie Mellon
and the on-line portion of the Oxford Text Archive. For now it indexes
no serials. Alex does include an entry for itself.
American Memory: the online resource compiled by the Library of Congress National
Digital Library Program. With the participation of other libraries and archives,
the program provides a gateway to rich primary source materials relating
to the history and cultural developments of the United States. Over one
million items from our historical collections are currently available.
Berkeley Digital Library
Digital Library SunSITE builds digital collections and services while providing
information and support to others doing the same. We are sponsored by the
Library, UC Berkeley and Sun Microsystems, Inc.
CCAT (Center for Computer Analysis of Texts) at the University of Pennsylvania has one of the largest
archives of ready-to-use e-texts. [downloadable]
Center for Electronic Texts
serves all U.S. scholars, researchers and teachers involved with the creation
and use of electronic text applications in the humanities.
Center for Electronic Text
Law: CETL currently produces two text databases that can be accessed
from the Internet. The first is the University of Cincinnati's portion
of DIANA, a unique database of human rights materials. The second,
the Securities Lawyer's Deskbook, provides electronic access from the
Internet to the text of the Securities Act of 1933 and the Securities
Exchange Act of 1934, together with the rules and forms necessary for compliance
with these statutes.
Christian Classics Ethereal Library:
Classic Christian books in electronic format, selected for your edification.
There is enough good reading material here to last you a lifetime, if you
give each work the time it deserves! All of the books on this server are
believed to be in the public domain in the United States unless otherwise indicated.
The English Server: The English
Server is a cooperative which has been publishing humanities texts
since 1990. Today it offers over eighteen thousand works, covering a wide
range of interests. [downloadable]
is a global information network providing free, organized access to
resources in medieval studies through a World Wide Web server at
Online Book Initiative:
a project to make a large collection of freely redistributable texts
available in a common format for others to do with as they like.
The On-Line Books Page
The On-Line Books Page is a directory of books that can be freely read
right on the Internet. [downloadable]
Oxford Text Archive: The OTA has
been collecting electronic texts for some twenty years from a wide range
of sources, and its holdings reflect the diversity of this medium.
Project Gutenberg: The Project Gutenberg
Philosophy is to make information, books and other materials available
to the general public in forms a vast majority of the computers, programs
and people can easily read, use, quote, and search. (FTP
Barlow's concordance program. It's very fast, easy to use, and can
This link leads you to WinConcordancer, a concordancing programme for Windows
developed at the TH Darmstadt Dept. of Linguistics and Literature. The program
was developed by Zdenek Martinek from the University of West Bohemia, Czech
Republic, in close collaboration with Les Siegrist from the Technische
Universität Darmstadt, Germany.
interactive concordancer, developed by the Summer Institute of Linguistics
Browser and Editor for Windows 95 / 98 / NT
Concordance of Asian Newspaper English : This concordance has been produced
from a corpus of newspaper reports published in 18 Asian countries between
September and November 2000. Samples of around 6,000 words were taken
from the Internet versions of newspapers.
: another web concordance, requires Windows 95/98/ME & Internet Explorer
5.0 or greater
text-analysis and retrieval system for MS-DOS that permits inquiries on
text databases in European languages.
is a tool for generating KWIC concordances based on webpages. There are
two options for defining your corpus: let Google search the relevant pages
for you, or define the URLs that will be used yourself.
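The mechanics of a KWIC (key word in context) concordance are simple enough to sketch. The function below is a minimal illustration in Python, not the tool's own code, and the sample text is invented: each hit is printed with the keyword aligned in a centre column.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines: each occurrence of `keyword`
    with `width` characters of context on either side, keyword aligned."""
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append("%s %s %s" % (left.rjust(width), m.group(0), right.ljust(width)))
    return lines

sample = "The corpus was tagged. A corpus of newspaper text is a corpus indeed."
for line in kwic(sample, "corpus", width=15):
    print(line)
```

A web-based version would only differ in where `text` comes from: pages fetched from a search-engine hit list or a user-supplied URL list, stripped of HTML before concordancing.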
WebCorp : an on-line
tool which allows access to Web texts as linguistic data rather than documents.
An integrated suite of programs for looking at how words behave in texts.
Tagging: a very efficient statistical part-of-speech tagger
that is trainable on different languages and virtually any tagset. The
component for parameter generation trains on tagged corpora. The system
incorporates several methods of smoothing and of handling unknown words.
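At its simplest, such a trainable tagger is a matter of counting tag frequencies in a tagged corpus, which is why it works for virtually any tagset. The sketch below is a bare unigram model with an invented three-tag toy corpus; the most-frequent-tag fallback is only a crude stand-in for the smoothing and unknown-word handling a real statistical tagger uses.

```python
from collections import Counter, defaultdict

def train_tagger(tagged_sentences):
    """Build a unigram tagger from a tagged corpus: each word gets its most
    frequent tag; unknown words fall back to the corpus-wide most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    overall = Counter(t for c in counts.values() for t in c.elements())
    default = overall.most_common(1)[0][0]  # crude stand-in for real smoothing
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda words: [(w, model.get(w.lower(), default)) for w in words]

# Toy training data with an invented three-tag tagset (DET / N / V).
corpus = [
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("chases", "V"), ("the", "DET"), ("bird", "N")],
    [("fish", "N"), ("swim", "V")],
]
tag = train_tagger(corpus)
print(tag(["the", "dog", "sleeps"]))
```

Nothing here is English-specific: retraining on a German or Finnish corpus with a different tagset needs no code changes, which is the point the entry makes.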
This tagger takes as input English text, possibly containing SGML markup,
and produces tagged text, both in a multi-column format and in an SGML/TEI format.
The tagset used is basically the LOB tagset (about 130 tags),
although with a few very slight adjustments.
: EngCG-2 is a program that assigns morphological and syntactic
tags to words in English text.
- A Fully Automatic English Wordclass Analysis System : AUTASYS is
a menu-driven automatic tagging and lemmatising system that analyses
texts at word-class level with the Lancaster-Oslo-Bergen (LOB) tagset,
the International Corpus of English (ICE) tagset, and the “skeleton”
(SKELETON), which is the set of base tags from ICE without features.
morphology and part-of-speech tagging (Win 95/NT)
Conexor sells fast and accurate programs for tagging and parsing English
texts: Conexor Constraint Grammar of English (EngCG-2 tagger), Conexor
Syntax of English (EngLite parser), and Conexor Functional Dependency Grammar
of English (FDG parser).
The Link Grammar Parser
is a syntactic parser of English, based on link grammar, an original theory
of English syntax. Given a sentence, the system assigns to it a syntactic
structure, which consists of a set of labeled links connecting pairs of words.
The parser is a bottom-up probabilistic chart parser which finds the
tree with the best score by a best-first search algorithm. Its grammar (for
English) in the distribution is a semi-context-sensitive grammar whose rules
were automatically extracted from the Penn Treebank, a
tagged corpus made at the University of Pennsylvania.
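The flavour of treebank-derived probabilistic parsing can be sketched with a toy PCFG in Chomsky normal form and an exhaustive bottom-up CKY pass. This is a simplification of what such a parser does (the real system uses best-first search and a far larger grammar extracted from the Penn Treebank); the rules, probabilities and sentence below are invented.

```python
# Toy PCFG in Chomsky normal form: binary rules map a pair of child
# symbols to (parent, rule probability); the lexicon maps words to
# (symbol, probability). All numbers are illustrative only.
rules = {
    ("NP", "VP"): [("S", 1.0)],
    ("DET", "N"): [("NP", 0.7)],
    ("V", "NP"): [("VP", 1.0)],
}
lexicon = {
    "the": [("DET", 1.0)],
    "dog": [("N", 0.5), ("NP", 0.3)],
    "cat": [("N", 0.5)],
    "saw": [("V", 1.0)],
}

def cky(words):
    """Bottom-up probabilistic CKY: chart[i][j] holds, for each symbol,
    the best score over span words[i:j]; return the best score for S."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):  # seed length-1 spans from the lexicon
        for sym, p in lexicon.get(w, []):
            chart[i][i + 1][sym] = (p, w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # try every split point
                for b, (pb, _) in chart[i][k].items():
                    for c, (pc, _) in chart[k][j].items():
                        for a, pr in rules.get((b, c), []):
                            score = pr * pb * pc
                            if score > chart[i][j].get(a, (0, None))[0]:
                                chart[i][j][a] = (score, (b, k, c))
    return chart[0][n].get("S", (0, None))[0]

print(cky("the dog saw the cat".split()))
```

A best-first variant, as in the parser described above, would pop chart edges from a priority queue ordered by score instead of filling every span exhaustively; the scoring of each rule application is the same.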
Xaira : XML Aware Indexing
and Retrieval Architecture
PIE : PIE incorporates a database
derived from the second or World Edition of the BNC (2000), but is not
affiliated with the BNC Consortium. It aims to provide a simple yet
powerful interface for studying words and phrases up to six words long.
: Unitex is a
corpus processing system, based on automata-oriented technology.
The concept of this software was born at LADL (Laboratoire
d'Automatique Documentaire et Linguistique), under the direction of
Maurice Gross. With this tool, you can handle
electronic resources such as electronic dictionaries and grammars
and apply them. You can work at the levels of morphology, the
lexicon and syntax.
Tools and Corpora: Multext encompasses a series of projects whose goals
are to develop standards and specifications for the encoding and processing
of linguistic corpora, and to develop tools, corpora and linguistic resources
embodying these standards.
Technology Group makes available various software packages. For research
purposes, these are often available for free to academic research groups
and for a small fee to industrial R&D groups.
is a software environment that supports researchers in Natural Language
Processing (NLP) and Computational Linguistics (CL) and developers who
are producing and delivering Language Engineering (LE) systems.
a TEI Tag Set Selector: These pages will help you design your own
document type definition. You will be able to select the TEI tag sets you
need to make up your very own view of the TEI DTD, including your own
XED: An XML
editor: XED is a text editor for XML document instances. It is designed
to support hand-authoring of small-to-medium size XML documents, and is
optimised for keyboard input. It works very hard to ensure that you cannot
produce a non-well-formed document.
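Well-formedness, the property XED guards, is mechanical to check: every element must nest and close properly, with a single root. A quick programmatic check using Python's standard library (unrelated to XED itself):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<doc><p>hello</p></doc>"))  # True
print(is_well_formed("<doc><p>hello</doc>"))      # False: <p> never closed
```

Note this checks well-formedness only, not validity against a DTD or schema; that stricter check is what a validating parser adds.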
Xlex/www is a web interface
to a suite of Unix/Linux command line tools (tokenizer, indexer, POS tagger,
concordancer, statistical analysis, etc.)
(LTG) The LTG is a technology transfer group working in the area of
natural language engineering. Based in Edinburgh's Human Communication
Research Centre, it can draw on the skills and expertise of one of the
largest communities of natural language processing specialists in
: is an international project to develop guidelines for the encoding
and interchange of electronic texts for scholarly research, and to serve
a broad range of uses by the language industries more generally.
(Encoded Archival Description) The Library of Congress has been active
in developing and testing markup of archival finding aids using Encoded
Archival Description, a document type definition (DTD) of the Standard
Generalized Markup Language (SGML).
of the Concerted Research Action (ARC) of the ILEC group (Ingénierie de
la Langue--Linguistique--Informatique et Corpus écrits). It started
in 1995 in order to promote research in the field of multi-lingual
This first 2-year period (95-96) was dedicated to the achievement of
two main tasks: the production of a large, rich and standardised
(French-English) corpus suited for the alignment task; and the design
and phasing-in of a protocol in order to objectively evaluate
Association: ELRA was established in Luxembourg in February 1995, with
the goal of founding an organization to promote the creation, verification,
and distribution of language resources in Europe.
The aim was to produce a reasonably large text corpus of the major European
languages for the linguistic research community. The majority of the work
was done at the HCRC in Edinburgh and at ISSCO, University of Geneva. The
ECI/MCI corpus has now been published on CD-ROM, and contains almost 100
million words in 27 (mainly European) languages. It consists of 48 component
corpora marked up in SGML, with easy access to the source text without
markup. 12 of the component corpora are multilingual parallel corpora
comprising from two to nine sub-corpora.
(TELRI) is a European Commission-funded initiative which is building
a viable infrastructure between leading European language and language
technology centres in order to provide a platform for industry, research
institutes and universities, and to supply the NLP community with free
/ public domain monolingual and multilingual language resources.
TRACTOR: TELRI Research
Computational Tools and Resources
A European project (LE2-4002/10380) on the validation of terminology resources,
co-financed by the EC (DG XIII) within the framework of the Language Engineering
programme, which addresses the following issues: validation of
terminology resources, validation methodologies and software toolkit, World
Wide Web dissemination, CD-ROM data banks.
Page is a comprehensive online database containing reference information
and software pertaining to the Standard Generalized Markup Language
and its subset, the Extensible Markup Language (XML).
Resource Service (LETRS) serves as a focal point for members of the
Indiana U community interested in identifying, acquiring, and using
resources for humanities research and teaching.
to Digital Resources 1996-1998 : This is the fourth edition of the
Textual Studies Guide to Digital Resources. The Guide aims to give an overview
of digital resources which may have application for Higher Education teaching
and research in the disciplines supported by the Centre: Literary Studies
in all languages and periods, Literary Linguistics, Philosophy, Theology
& Religious Studies, Classics, Film and Media Studies, and Drama.
Papers: papers fall into two categories: (1) articles dealing with corpora
and computational linguistics and (2) corpus manuals.