DISCO - Download

The DISCO API is open source and licensed under the Apache License, version 2.0.

You need the Java archive disco-3.0.0-all.jar (the DISCO API) and a word space from the table below. Click on a link in the column Word Space Name for a word space description and the download link. Note that the Protege plugin only works with word spaces for DISCO API version 1!

You can construct word spaces from your own corpora with DISCO Builder. It can also import vector files produced with other tools such as fastText, word2vec, or GloVe.
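For orientation, the plain-text vector files written by these tools usually hold one word per line followed by its vector components, separated by whitespace; word2vec and fastText text files additionally start with a header line giving the vocabulary size and the dimension, while GloVe files have no header. The following is a minimal sketch of reading such lines. The class VectorFileSketch and its method are hypothetical and are not part of DISCO or DISCO Builder, which may handle these files differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of DISCO or DISCO Builder. */
final class VectorFileSketch {

    /** Parses "word v1 v2 ... vn" lines and skips a leading "count dim" header if present. */
    static Map<String, float[]> parse(List<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        boolean first = true;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            boolean isFirst = first;
            first = false;
            if (parts.length < 2) {
                continue; // blank or malformed line
            }
            // Assumption: a first line made of exactly two integers is a header, not a vector.
            if (isFirst && parts.length == 2
                    && parts[0].matches("\\d+") && parts[1].matches("\\d+")) {
                continue;
            }
            float[] vector = new float[parts.length - 1];
            for (int i = 0; i < vector.length; i++) {
                vector[i] = Float.parseFloat(parts[i + 1]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Made-up sample data in word2vec text layout.
        List<String> sample = Arrays.asList("2 3", "king 0.1 0.2 0.3", "queen 0.2 0.1 0.4");
        Map<String, float[]> vectors = parse(sample);
        System.out.println(vectors.size() + " vectors read");
    }
}
```

For a real file, the lines could be obtained with java.nio.file.Files.readAllLines; very large vector files are better read line by line instead of being loaded whole.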

Other Downloads:

| Language | Word Space Name | Corpus Size | Number of Words | Storage Type | Word Space Type | API Version | License |
|---|---|---|---|---|---|---|---|
| English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| English | enwiki-20130403-word2vec-lm-mwl-lc-sim | 1.9 billion tokens | 420,184 | DenseMatrix | SIM | 3.x | |
| English | enwiki-20130403-word2vec-lm-mwl-lc-sim | 1.9 billion tokens | 420,184 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| French | fr-general-20151126-lm-sim | 1.9 billion tokens | 276,967 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| French | fr-general-20151126-lm-word2vec-sim | 1.9 billion tokens | 281,484 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| German | de-general-20150421-lm-word2vec-sim | 1.5 billion tokens | 470,788 | DenseMatrix | SIM | 3.x | |
| German | de-general-20150421-lm-sim | 1.5 billion tokens | 470,788 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| German | de-general-20150503-lm-word2vec-sim | 1.5 billion tokens | 470,788 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| Russian | ru-ruwac-ruwiki-lm-sim | 2.2 billion tokens | 226,108 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| Russian | ru-ruwac-ruwiki-lm-word2vec-sim | 2.2 billion tokens | 226,108 | DISCOLuceneIndex | SIM | 2.x, 3.x | |
| Russian | ru-ruwac-ruwiki-lem-col | 2.2 billion tokens | 226,108 | DISCOLuceneIndex | COL | 2.x, 3.x | |
| Russian | ru-ruwac-ruwiki-col | 2.2 billion tokens | 508,350 | DISCOLuceneIndex | COL | 2.x, 3.x | |

You can find word spaces for the old DISCO API version 1.x at the bottom of this page!
If you want to make commercial use of any of the above word spaces that carry a CC-BY-NC license, please contact [email protected].

Version history

28. June 2018    DISCO API version 3.0
Version 3 introduces a new storage class DenseMatrix that is suited for low-dimensional word embeddings. Additionally, there are several new methods in class Compositionality, like computing the shortest path between two words in a word space of type SIM, or an approximate nearest neighbor search to find the most similar word for a given word or word embedding.
DISCO API now has Sux4J as an additional dependency.
Finally, DISCO API is now a Gradle project - see the GitHub repository.
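To illustrate what a shortest path between two words in a word space of type SIM can look like: if every word is linked to its most similar words, the word space can be viewed as a directed graph, and a shortest path is a shortest chain of "most similar" links from one word to the other. The sketch below uses plain breadth-first search over a made-up neighbor map. It is an assumption for illustration, not the implementation in class Compositionality, and the class WordPathSketch is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the DISCO implementation. */
final class WordPathSketch {

    /** Breadth-first search; returns the path from start to goal, or an empty list if none exists. */
    static List<String> shortestPath(Map<String, List<String>> neighbors, String start, String goal) {
        Map<String, String> parent = new HashMap<>(); // visited word -> word it was reached from
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String word = queue.remove();
            if (word.equals(goal)) {
                // Walk back through the parent links and reverse to get start -> goal.
                List<String> path = new ArrayList<>();
                for (String w = goal; w != null; w = parent.get(w)) {
                    path.add(w);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : neighbors.getOrDefault(word, Collections.<String>emptyList())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, word);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Made-up neighbor lists; in a real SIM word space these would come from the word space.
        Map<String, List<String>> neighbors = new HashMap<>();
        neighbors.put("car", Arrays.asList("vehicle", "truck"));
        neighbors.put("vehicle", Arrays.asList("boat"));
        neighbors.put("truck", Arrays.asList("lorry"));
        System.out.println(shortestPath(neighbors, "car", "boat"));
    }
}
```

Breadth-first search treats every link as having the same length; a variant that weights links by similarity value would need a weighted search such as Dijkstra's algorithm.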

11. August 2015    DISCO API version 2.1
Minor update concerning the methods DISCO.semanticSimilarity, TextSimilarity.textSimilarity, and TextSimilarity.directedTextSimilarity. In API version 2.1, the desired similarity measure must be passed to these methods. This is necessary because the similarity measure SimilarityMeasure.KOLB, which was used by default, does not produce sensible results with word spaces imported from word2vec.
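For word spaces imported from word2vec, the measure commonly used to compare two word vectors is cosine similarity, the dot product of the vectors divided by the product of their lengths. The following minimal sketch shows that computation on plain arrays. The class CosineSketch is hypothetical and is not DISCO's own code; returning 0 for a zero vector is a choice made here for the sketch.

```java
/** Hypothetical illustration only; not part of the DISCO API. */
final class CosineSketch {

    /** Cosine of the angle between two vectors of equal dimension; 0 if either vector is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption of this sketch: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up vectors pointing in the same direction.
        System.out.println(cosine(new float[] {1f, 2f}, new float[] {2f, 4f}));
    }
}
```

The result ranges from -1 (opposite direction) to 1 (same direction) and does not depend on vector length.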

20. May 2015    DISCO API version 2.0
With the release of DISCO Builder, the structure of the word spaces has changed. Therefore, the new API version 2.0 is not compatible with older word spaces. There are now two types of word spaces: SIM and COL.
Quite a number of methods have been added to the DISCO API, including methods for computing text similarity, textual entailment, and clustering of similar words. See the API documentation for more information.
The API now uses version 5.1.0 of Lucene.

28. February 2013    DISCO API version 1.4
The API contains a new class Compositionality with methods for the computation of the compositional similarity between multi-token words or phrases. Also, the method DISCO.commonContext was added to the API.

16. March 2012    DISCO API version 1.3
The API now uses the latest version 3.5 of Lucene.
The command line option -wl prints the complete word frequency list of the language data packet to a text file. This option can also be used to check a downloaded language data packet for errors.

24. March 2011    DISCO API version 1.2
Version 1.2 of the DISCO API can load language data packets (word spaces) into main memory, provided that you have enough RAM. This greatly reduces computation time.

18. September 2008   DISCO API version 1.1
First version of the API available online.

Word spaces for old DISCO API version 1.x

These word spaces work with the Protege plugin.

| Language | Word Space Name | Corpus Size | Number of Words | Packet Size | API Version | License |
|---|---|---|---|---|---|---|
| Arabic | ar-general-20120124 | 188 million tokens | 134,000 | 518 MB | 1.x | |
| Czech | cz-general-20080115 | 163 million tokens | 300,000 | 5.6 GB | 1.x | Apache License 2.0 |
| Dutch | nl-general-20081004 | 114 million tokens | 200,000 | 4.0 GB | 1.x | Apache License 2.0 |
| English | enwiki-20130403-sim-lemma-mwl-lc | 1.9 billion tokens | 420,184 | 2.3 GB | 1.x | Apache License 2.0 |
| English | en-BNC-20080721 | 119 million tokens | 122,000 | 1.7 GB | 1.x | Apache License 2.0 |
| English | en-PubMedOA-20070501 | 181 million tokens | 60,000 | 864 MB | 1.x | Apache License 2.0 |
| English | en-wikipedia-20080101 | 267 million tokens | 220,000 | 5.9 GB | 1.x | Apache License 2.0 |
| French | fr-wikipedia-20110201-lemma | 458 million tokens | 154,000 | 513 MB | 1.x | Apache License 2.0 |
| French | fr-wikipedia-20080713 | 105 million tokens | 188,000 | 2.4 GB | 1.x | Apache License 2.0 |
| German | de-general-20131219-sim | 977 million tokens | 246,119 | 2.2 GB | 1.x | Apache License 2.0 |
| German | de-general-20080727 | 400 million tokens | 200,000 | 3.6 GB | 1.x | |
| Italian | it-general-20080815 | 104 million tokens | 164,000 | 2.3 GB | 1.x | Apache License 2.0 |
| Russian | ru-wikipedia-20110804 | 230 million tokens | 112,000 | 544 MB | 1.x | Apache License 2.0 |
| Spanish | es-general-20080720 | 232 million tokens | 260,000 | 5.0 GB | 1.x | |