DISCO - Download

The DISCO API is open source and licensed under the Apache License, version 2.0.

You need the Java archive disco-2.1.jar (the DISCO API) and a word space from the table below. Click a link in the Word Space Name column for a description of the word space and its download link. Note that older word spaces are not compatible with DISCO API version 2.x, and that the Protege plugin only works with word spaces for DISCO API version 1.x.

You can construct word spaces from your own corpora with DISCO Builder. It can also import vector files produced with other tools such as word2vec or GloVe.
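To illustrate what such vector files typically look like, here is a minimal Java sketch that parses the plain-text vector format: one word per line followed by its space-separated components, with an optional "<vocabulary size> <dimensions>" header line as written by word2vec (GloVe output usually has no header). The class VectorFileSketch and the method parseTextVectors are hypothetical names used only for illustration. They are not part of the DISCO API or of DISCO Builder, and this is not how DISCO Builder performs the import.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of DISCO or DISCO Builder. */
class VectorFileSketch {

    /**
     * Parses text-format word vectors: "word c1 c2 ... cn" per line.
     * A first line consisting of exactly two integers is treated as a
     * word2vec-style header and skipped. (Assumption: real vectors have
     * more than one dimension, so they are never mistaken for a header.)
     */
    static Map<String, float[]> parseTextVectors(List<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        for (int n = 0; n < lines.size(); n++) {
            String[] parts = lines.get(n).trim().split("\\s+");
            if (parts.length < 2) {
                continue; // blank or malformed line
            }
            boolean looksLikeHeader = n == 0 && parts.length == 2
                    && parts[0].matches("\\d+") && parts[1].matches("\\d+");
            if (looksLikeHeader) {
                continue;
            }
            float[] vector = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vector[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Tiny in-memory sample instead of a real vector file.
        List<String> sample = Arrays.asList(
                "2 3",
                "cat 0.1 0.2 0.3",
                "dog 0.4 0.5 0.6");
        Map<String, float[]> vectors = parseTextVectors(sample);
        // prints "[cat, dog] dims=3"
        System.out.println(vectors.keySet() + " dims=" + vectors.get("cat").length);
    }
}
```

For a real file, the lines could be read with java.nio.file.Files.readAllLines before being passed to the method. Very large vector files would need a streaming reader instead.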

Other Downloads:

Language | Word Space Name | Corpus Size | Number of Words | Packet Size | Word Space Type | API Version | License
English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | 2.3 GB | SIM | 2.0 |
English | enwiki-20130403-word2vec-lm-mwl-lc-sim | 1.9 billion tokens | 420,184 | 1.4 GB | SIM | 2.0 |
French | fr-general-20151126-lm-sim | 1.9 billion tokens | 276,967 | 2.1 GB | SIM | 2.0 |
French | fr-general-20151126-lm-word2vec-sim | 1.9 billion tokens | 281,484 | 1.7 GB | SIM | 2.0 |
German | de-general-20150421-lm-sim | 1.5 billion tokens | 470,788 | 3.5 GB | SIM | 2.0 |
German | de-general-20150503-lm-word2vec-sim | 1.5 billion tokens | 470,788 | 3.0 GB | SIM | 2.0 |
Russian | ru-ruwac-ruwiki-lm-sim | 2.2 billion tokens | 226,108 | 2.8 GB | SIM | 2.0 |
Russian | ru-ruwac-ruwiki-lm-word2vec-sim | 2.2 billion tokens | 226,108 | 2.6 GB | SIM | 2.0 |
Russian | ru-ruwac-ruwiki-lem-col | 2.2 billion tokens | 226,108 | 2.3 GB | COL | 2.0 |
Russian | ru-ruwac-ruwiki-col | 2.2 billion tokens | 508,350 | 4.9 GB | COL | 2.0 |

You can find word spaces for the old DISCO API version 1.x at the bottom of this page!
If you want to make commercial use of any of the above word spaces that carry a CC-BY-NC license, please contact peter.kolb@linguatools.org.

Version history

11. August 2015    DISCO API version 2.1
Minor update to the methods DISCO.semanticSimilarity, TextSimilarity.textSimilarity, and TextSimilarity.directedTextSimilarity. In API version 2.1, the desired similarity measure must be passed to these methods. This is necessary because the similarity measure SimilarityMeasure.KOLB, which was previously used by default, does not produce sensible results with word spaces imported from word2vec.
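The idea of passing the measure explicitly can be sketched in self-contained Java as follows. This does not reproduce the DISCO API: the class ExplicitMeasureSketch, its Measure enum, and the similarity method are hypothetical stand-ins, and only cosine similarity and the plain dot product on dense vectors are implemented. The KOLB measure is not implemented here. Consult the API documentation for the actual method signatures.

```java
/** Hypothetical illustration; DISCO's own classes are not used or reproduced. */
class ExplicitMeasureSketch {

    /** Stand-in enum; the measures offered by DISCO may differ. */
    enum Measure { COSINE, DOT_PRODUCT }

    /**
     * Computes the similarity of two dense vectors with the measure chosen
     * by the caller. There is deliberately no default measure.
     */
    static double similarity(float[] a, float[] b, Measure measure) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        switch (measure) {
            case DOT_PRODUCT:
                return dot;
            case COSINE:
                // Zero vectors have no direction; report similarity 0.
                return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
            default:
                throw new IllegalArgumentException("unsupported measure: " + measure);
        }
    }

    public static void main(String[] args) {
        // Dense vectors with negative components, as word2vec-style vectors have.
        float[] v1 = {1f, 2f, -1f};
        float[] v2 = {2f, 4f, -2f};
        System.out.println(similarity(v1, v2, Measure.COSINE));
        System.out.println(similarity(v1, v2, Measure.DOT_PRODUCT));
    }
}
```

Requiring the measure as an argument makes each call site state which measure it relies on, so a measure that does not suit a given kind of word space cannot be picked up silently as a default.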

20. May 2015    DISCO API version 2.0
With the release of DISCO Builder the structure of the word spaces has been changed. Therefore, the new API version 2.0 is not compatible with older word spaces. There are now two types of word spaces: SIM and COL.
Quite a number of methods have been added to the DISCO API, including methods for computing text similarity, textual entailment, and clustering of similar words. See the API documentation for more information.
The API now uses version 5.1.0 of Lucene.

28. February 2013    DISCO API version 1.4
The API contains a new class Compositionality with methods for the computation of the compositional similarity between multi-token words or phrases. Also, the method DISCO.commonContext was added to the API.

16. March 2012    DISCO API version 1.3
The API now uses Lucene version 3.5.
The command line option -wl prints the complete word frequency list of the language data packet to a text file. This option can also be used to check a downloaded language data packet for errors.

24. March 2011    DISCO API version 1.2
Version 1.2 of the DISCO API can load language data packets (word spaces) into main memory, provided that you have enough RAM. This greatly reduces computation time.

18. September 2008   DISCO API version 1.1
First version of the API available online.

Word spaces for old DISCO API version 1.x

These word spaces work with the Protege plugin.

Language | Word Space Name | Corpus Size | Number of Words | Packet Size | API Version | License
Arabic | ar-general-20120124 | 188 million tokens | 134,000 | 518 MB | 1.x |
Czech | cz-general-20080115 | 163 million tokens | 300,000 | 5.6 GB | 1.x | Apache License 2.0
Dutch | nl-general-20081004 | 114 million tokens | 200,000 | 4.0 GB | 1.x | Apache License 2.0
English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | 2.3 GB | 1.x | Apache License 2.0
English | en-BNC-20080721 | 119 million tokens | 122,000 | 1.7 GB | 1.x | Apache License 2.0
English | en-PubMedOA-20070501 | 181 million tokens | 60,000 | 864 MB | 1.x | Apache License 2.0
English | en-wikipedia-20080101 | 267 million tokens | 220,000 | 5.9 GB | 1.x | Apache License 2.0
French | fr-wikipedia-20110201-lemma | 458 million tokens | 154,000 | 513 MB | 1.x | Apache License 2.0
French | fr-wikipedia-20080713 | 105 million tokens | 188,000 | 2.4 GB | 1.x | Apache License 2.0
German | de-general-20131219-sim | 977 million tokens | 246,119 | 2.2 GB | 1.x | Apache License 2.0
German | de-general-20080727 | 400 million tokens | 200,000 | 3.6 GB | 1.x |
Italian | it-general-20080815 | 104 million tokens | 164,000 | 2.3 GB | 1.x | Apache License 2.0
Russian | ru-wikipedia-20110804 | 230 million tokens | 112,000 | 544 MB | 1.x | Apache License 2.0
Spanish | es-general-20080720 | 232 million tokens | 260,000 | 5.0 GB | 1.x |