DISCO - Download

The DISCO API is open source and licensed under the Apache License, version 2.0. Most of the language data packets are also freely available (see table below).

New: Version 1.2 of the DISCO API can load language data packets (word spaces) into main memory, provided that you have enough RAM. This greatly reduces computation time. See the Javadoc.
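Whether a packet fits into memory depends on the heap available to the JVM. The following is a minimal sketch of such a check, not part of the DISCO API: the class PacketMemoryCheck, its methods, and the safety factor of 1.5 are hypothetical, and it assumes that a packet needs roughly its on-disk size in heap and that "MB" and "GB" in the table are binary units. Only Runtime.getRuntime().maxMemory() is a standard Java call.

```java
import java.util.Locale;

/** Sketch: estimate whether a language data packet is likely to fit into the JVM heap. */
final class PacketMemoryCheck {

    /** Parses a size such as "518 MB" or "5.6 GB" (as written in the table) into bytes. */
    static long parseSizeToBytes(String size) {
        String[] parts = size.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<number> <unit>', got: " + size);
        }
        double value = Double.parseDouble(parts[0]);
        // Assumption: the table uses binary units (1 MB = 1024 * 1024 bytes).
        switch (parts[1].toUpperCase(Locale.ROOT)) {
            case "MB": return Math.round(value * 1024L * 1024L);
            case "GB": return Math.round(value * 1024L * 1024L * 1024L);
            default: throw new IllegalArgumentException("Unknown unit: " + parts[1]);
        }
    }

    /**
     * Returns true if maxHeapBytes covers the packet size times a safety factor.
     * The factor is an assumption, not a figure from the DISCO documentation.
     */
    static boolean fitsInHeap(long packetBytes, long maxHeapBytes, double safetyFactor) {
        return packetBytes * safetyFactor <= maxHeapBytes;
    }

    public static void main(String[] args) {
        long packetBytes = parseSizeToBytes("518 MB");   // e.g. ar-general-20120124
        long maxHeap = Runtime.getRuntime().maxMemory(); // upper bound, usually set by -Xmx
        boolean loadIntoRam = fitsInHeap(packetBytes, maxHeap, 1.5);
        System.out.println(loadIntoRam
                ? "Heap looks large enough to try loading the packet into RAM."
                : "Heap looks too small; keep the packet on disk or raise -Xmx.");
    }
}
```

If the check fails, the maximum heap can be raised with the JVM option -Xmx (for example -Xmx8g); the actual memory footprint of a loaded word space may differ from this estimate.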

You need the Java archive disco-1.2.jar and a language data packet from the table below. Click a link in the Packet Name column to see the packet description on the download page.

Other Downloads:

Old API version 1.1:

Language  Packet Name                  Corpus Size         Number of Words  Packet Size  License
Arabic    ar-general-20120124          188 million tokens  134,000          518 MB       no commercial usage!
Czech     cz-general-20080115          163 million tokens  300,000          5.6 GB       Apache 2.0
Dutch     nl-general-20081004          114 million tokens  200,000          4.0 GB       Apache 2.0
English   en-BNC-20080721              119 million tokens  122,000          1.7 GB       Apache 2.0
English   en-PubMedOA-20070501         181 million tokens  60,000           864 MB       Apache 2.0
English   en-wikipedia-20080101        267 million tokens  220,000          5.9 GB       Apache 2.0
French    fr-wikipedia-20110201-lemma  458 million tokens  154,000          513 MB       Apache 2.0
French    fr-wikipedia-20080713        105 million tokens  188,000          2.4 GB       Apache 2.0
German    de-general-20080727          400 million tokens  200,000          3.6 GB       no commercial usage!
Italian   it-general-20080815          104 million tokens  164,000          2.3 GB       Apache 2.0
Russian   ru-wikipedia-20110804        230 million tokens  112,000          544 MB       Apache 2.0
Spanish   es-general-20080720          232 million tokens  260,000          5.0 GB       no commercial usage!