DISCO - Description of the language data packets and download
Arabic
Packet name: ar-general-20120124
Packet size: 518 megabytes
Corpus size: 188 million tokens
Number of queriable words: 134,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with frequency lower than 50.
Parameters used for word space computation: Context window of ±3 words with exact position taken into account, 50,000 most frequent lemmata as features, significance measure from Kolb 2009 with threshold 0.1, similarity measure from Lin 1998.
Corpus:
- Arabic Wikipedia (XML dump 2012-01-14)
- Ajdir Corpora (online newspapers)
Download and installation:
- Download the archive ar-general-20120124.tar and unpack it.
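Any tar tool can unpack the archive. As a minimal sketch, assuming the archive has already been downloaded to the local disk, Python's standard tarfile module can do the job. The helper name unpack_packet is hypothetical and not part of DISCO.

```python
import tarfile
from pathlib import Path


def unpack_packet(archive_path, target_dir="."):
    """Extract a packet archive into target_dir.

    Returns the sorted top-level names contained in the archive, which
    should normally be the packet directory.
    """
    archive_path = Path(archive_path)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        names = tar.getnames()
        tar.extractall(path=target_dir)
    # Keep only the first path component of every archive member.
    return sorted({name.split("/")[0] for name in names})


# Example call (assumes the file exists in the current directory):
# unpack_packet("ar-general-20120124.tar")
```

The same helper should work for the other packets that are distributed as a single .tar archive.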
German
Packet name: de-general-20080727
Packet size: 3.6 gigabytes
Corpus size: 400 million tokens
Number of queriable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Encyclopedia (273 million tokens)
- Newspaper (48 million tokens)
- Periodicals (32 million tokens)
- Parliamentary debates (27 million tokens)
- Literature: Fiction and Non-fiction (20 million tokens)
Download and installation:
Please note that commercial use of this language data packet is not permitted. (More information here.)
- Create a new directory named de-general-20080727 on your hard disk.
- Download the following four files into the new directory:
_0.cfs (1903287123 bytes), _1.cfs (1744490465 bytes), segments_2 (70 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
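The size check described above can be automated. The following is a minimal sketch in Python. The function name check_packet and the dictionary layout are illustrative assumptions and not part of DISCO. The byte counts are the ones listed for de-general-20080727.

```python
from pathlib import Path

# Expected byte counts copied from the file list for de-general-20080727.
EXPECTED_SIZES = {
    "_0.cfs": 1903287123,
    "_1.cfs": 1744490465,
    "segments_2": 70,
    "segments.gen": 20,
}


def check_packet(directory, expected_sizes):
    """Compare the files in directory with the expected byte counts.

    Returns a dict that maps each problematic file name to a short
    description. An empty dict means every file is present and has
    the expected size.
    """
    problems = {}
    for name, expected in expected_sizes.items():
        path = Path(directory) / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        actual = path.stat().st_size
        if actual != expected:
            problems[name] = f"size {actual}, expected {expected}"
    return problems


# Example call (assumes the packet directory exists):
# print(check_packet("de-general-20080727", EXPECTED_SIZES))
```

For another packet, replace the dictionary entries with the file names and byte counts listed for that packet.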
English
Packet name: en-BNC-20080721
Packet size: 1.7 gigabytes
Corpus size: 119 million tokens
Number of queriable words: 122,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- the British National Corpus (BNC)
Download and installation:
- Create a new directory named en-BNC-20080721 on your hard disk.
- Download the following three files into the new directory:
_0.cfs (1815005661 bytes), segments_3 (45 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
Packet name: en-PubMedOA-20070903
Packet size: 864 megabytes
Corpus size: 181 million tokens
Number of queriable words: 60,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- approx. 100,000 medical articles from the PubMed Open Access database (July 2007).
Download and installation:
- Download the archive en-PubMedOA-20070903.tar and extract it.
Packet name: en-wikipedia-20080101
Packet size: 5.9 gigabytes
Corpus size: 267 million tokens
Number of queriable words: 220,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- approx. 300,000 articles from the English Wikipedia as of January 2008.
Download and installation:
- Create a new directory named en-wikipedia-20080101 on your hard disk.
- Download the following six files into the new directory:
_0.cfs (1506801606 bytes), _1.cfs (1694294790 bytes), _2.cfs (1726672861 bytes), _3.cfs (1327106259 bytes), segments_2 (120 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
French
Packet name: fr-wikipedia-20110201-lemma
Packet size: 513 megabytes
Corpus size: 458 million tokens
Number of queriable words: 154,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, lemmatization (using the Tree Tagger), deletion of the most frequent function words, deletion of all words with a frequency lower than 50.
Parameters used for word space computation: Context window of ±3 words with exact position taken into account, 30,000 most frequent lemmata as features, significance measure from Kolb 2009 with threshold 0.1, similarity measure from Lin 1998.
Corpus:
- French Wikipedia (XML dump from 1st February 2011)
Download and installation:
- Download the archive fr-wikipedia-20110201-lemma.tar and unpack it.
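The word space parameters listed for this packet mention a context window of three words on either side "regarding exact position". One plausible reading, stated here as an assumption rather than as the actual DISCO implementation, is that each feature is a pair of relative position and word, so that the same word at offset -1 and at offset +2 counts as two different features. The sketch below illustrates that reading in Python. The function name and the optional vocabulary argument (standing in for the list of most frequent lemmata) are hypothetical.

```python
def positional_features(tokens, index, window=3, vocabulary=None):
    """Return (offset, word) pairs for the words around tokens[index].

    Only positions within +-window are considered. If vocabulary is
    given, context words outside of it are skipped.
    """
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the target word itself is not a feature
        pos = index + offset
        if 0 <= pos < len(tokens):
            word = tokens[pos]
            if vocabulary is None or word in vocabulary:
                features.append((offset, word))
    return features


tokens = "le chat dort sur le tapis rouge".split()
print(positional_features(tokens, 3))
# [(-3, 'le'), (-2, 'chat'), (-1, 'dort'), (1, 'le'), (2, 'tapis'), (3, 'rouge')]
```

Note that 'le' appears twice in the output with different offsets, which is the effect of keeping the exact position.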
Packet name: fr-wikipedia-20080713
Packet size: 2.4 gigabytes
Corpus size: 105 million tokens
Number of queriable words: 188,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:
Download and installation:
- Create a new directory named fr-wikipedia-20080713 on your hard disk.
- Download the following four files into the new directory:
_0.cfs (1269708232 bytes), _1.cfs (1291676186 bytes), segments_2 (70 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
Italian
Packet name: it-general-20080815
Packet size: 2.3 gigabytes
Corpus size: 104 million tokens
Number of queriable words: 164,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 12.
Corpus:
- Encyclopedia (65 million tokens)
- Parliamentary debates (39 million tokens)
Download and installation:
- Create a new directory named it-general-20080815 on your hard disk.
- Download the following four files into the new directory:
_0.cfs (900290978 bytes), _1.cfs (1486761508 bytes), segments_2 (70 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
Dutch
Packet name: nl-general-20081004
Packet size: 4.0 gigabytes
Corpus size: 114 million tokens
Number of queriable words: 200,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with a frequency lower than 10.
Corpus:
- Encyclopedia (58.4 million tokens)
- Parliamentary debates (37 million tokens)
- Literature (13 million tokens)
- Newspaper, radio (5.7 million tokens)
Download and installation:
- Create a new directory named nl-general-20081004 on your hard disk.
- Download the following five files into the new directory:
_0.cfs (1582576570 bytes), _1.cfs (1189383476 bytes), _2.cfs (1505199527 bytes), segments_2 (95 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
Czech
Packet name: cz-general-20080115
Packet size: 5.6 gigabytes
Corpus size: 163 million tokens
Number of queriable words: 320,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Newspaper articles 1998-2008 (59.5 million tokens)
- EU documents (59.0 million tokens)
- Encyclopedia January 2008 (34.9 million tokens)
- Literature (fiction) 1850-2000 (10.4 million tokens)
- Subtitles of movies and TV series (5.0 million tokens)
Download and installation:
- Create a new directory named cz-general-20080115 on your hard disk.
- Download the following three files into the new directory:
_2.cfs (3766028482 bytes), segments_7 (45 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
Spanish
Packet name: es-general-20080720
Packet size: 5.0 gigabytes
Corpus size: 232 million tokens
Number of queriable words: 260,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words.
Corpus:
- Encyclopedia July 2008 (184.6 million tokens)
- Parliamentary debates (41.6 million tokens)
- Literature (fiction) 1830-1930 (5.8 million tokens)
Download and installation:
Please note that commercial use of this language data packet is not permitted. (More information here.)
- Create a new directory named es-general-20080720 on your hard disk.
- Download the following five files into the new directory:
_0.cfs (1766738706 bytes), _1.cfs (1666434302 bytes), _2.cfs (1842598324 bytes), segments_2 (95 bytes), segments.gen (20 bytes)
Do not rename the files. To check that the download is complete, compare the byte counts given in parentheses with the file sizes on your disk.
Russian
Packet name: ru-wikipedia-20110804
Packet size: 544 megabytes
Corpus size: 230 million tokens
Number of queriable words: 112,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, no lemmatization, deletion of the most frequent function words, deletion of all words with frequency lower than 100.
Parameters used for word space computation: Context window of ±3 words with exact position taken into account, 15,000 most frequent lemmata as features, significance measure from Kolb 2009 with threshold 0.1, similarity measure from Lin 1998.
Corpus:
- Russian Wikipedia (XML dump 2011-03-28)
Download and installation:
- Download the archive ru-wikipedia-20110804.tar and unpack it.
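Several packets above name a "similarity measure from Lin 1998" in their word space parameters. Assuming this refers to Lin's measure for distributional word similarity, it can be sketched as the summed weights of the features two words share, divided by the summed weights of all features of both words. The Python sketch below works on plain dictionaries that map features to weights. The weights in the example are made-up numbers. In the packets they would come from the Kolb 2009 significance measure with threshold 0.1, which is not reproduced here.

```python
def lin_similarity(vec_a, vec_b):
    """Lin-style similarity between two feature-weight dictionaries.

    numerator:   weights of shared features, taken from both vectors
    denominator: all weights of both vectors
    """
    shared = vec_a.keys() & vec_b.keys()
    numerator = sum(vec_a[f] + vec_b[f] for f in shared)
    denominator = sum(vec_a.values()) + sum(vec_b.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical feature weights keyed by (offset, word) pairs.
cat = {(-1, "the"): 1.0, (1, "sleeps"): 2.0}
dog = {(1, "sleeps"): 2.0, (1, "barks"): 1.0}
print(lin_similarity(cat, cat))  # 1.0
print(lin_similarity(cat, dog))  # 4.0 / 6.0
```

Two words with identical feature vectors get a similarity of 1.0, and words without any shared feature get 0.0.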
