Visualisation of lexical fields
Compute semantic similarity between words
DISCO (extracting DIstributionally related words
using CO-occurrences) is a Java class for
retrieving the semantic similarity between arbitrary
words. The similarities are based on the statistical
analysis of very large text collections.
The tool runs on all popular operating systems,
including Windows, Linux, Solaris, and MacOS.
The Java API supplies the following methods:
- retrieve the semantically most similar words for an input word, e.g. shy → timid quiet introverted lonely cautious awkward clumsy soft-spoken gentle
- retrieve the value of the semantic similarity between two input words: sim(gasoline, oil) = 0.522; sim(gasoline, lemonade) = 0.159
- retrieve collocations for an input word: beer → keg brewed brewing lager Pilsener brewers brewery Budweiser mug ale
DISCO can also be queried from the command line.
DISCO is described in the following conference papers:
Peter Kolb. Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics - NODALIDA '09, Odense, Denmark, May 2009.
Peter Kolb. DISCO: A Multilingual Database of Distributionally Similar Words. In Proceedings of KONVENS-2008, Berlin, 2008.
DISCO's semantic similarities have a wide range of possible applications across all areas of natural language processing. The following list is not exhaustive:
- Translation: context-sensitive translation. Example: The bank closes the account. A dictionary lists two possible German translations for bank: Bank (financial institution) and Ufer (river bank). The dictionary also gives Konto as the German translation of account. DISCO delivers the similarity values sim(Bank, Konto) = 0.181 and sim(Ufer, Konto) = 0.022, so Bank can be chosen as the correct translation in the context of the sentence.
- Search engines: associative search using semantically similar words; automatic search term expansion.
- E-Learning: generation of (bilingual) word fields for a domain.
- Context-sensitive spelling correction: ranking of proposed corrections not only by string similarity, but also by semantic similarity with the context.
- Context-sensitive thesaurus: propose synonyms that fit the context.
- Ontology learning: DISCO supplies semantically similar words for an input word, which can then be classified by the type of similarity relation (synonym, hyponym, antonym, etc.). A DISCO plug-in is available for the well-known ontology editor Protégé.
- Text Tiling: automatically divide texts into coherent units.
- Speech recognition and OCR: construction of class-based language models.
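The translation example in the list above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name TranslationChooser, the method choose and the stub stubSim are hypothetical and not part of DISCO. The stub simply returns the two similarity values quoted in the example; in a real application the similarity function would be backed by a DISCO query.

```java
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of DISCO: picks the translation candidate
// that is semantically most similar to a word from the sentence context.
class TranslationChooser {

    // Returns the candidate with the highest similarity to contextWord,
    // or null if the candidate list is empty.
    static String choose(List<String> candidates, String contextWord,
                         ToDoubleBiFunction<String, String> sim) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            double score = sim.applyAsDouble(candidate, contextWord);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    // Stub similarity table holding only the two values from the example.
    // Assumption: a real implementation would ask DISCO for these values.
    static final Map<String, Double> EXAMPLE_SCORES = Map.of(
            "Bank|Konto", 0.181,
            "Ufer|Konto", 0.022);

    // Looks up a word pair in the stub table; unknown pairs score 0.0.
    static double stubSim(String a, String b) {
        return EXAMPLE_SCORES.getOrDefault(a + "|" + b, 0.0);
    }

    public static void main(String[] args) {
        String best = choose(List.of("Bank", "Ufer"), "Konto",
                TranslationChooser::stubSim);
        System.out.println(best); // prints "Bank"
    }
}
```

Passing the similarity function as a parameter keeps the selection logic independent of how the values are obtained, so the stub can be swapped for a real lookup without touching choose.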
Our online demo Wortsurfer shows that DISCO can also be used as a thesaurus.
The references page lists scientific publications by DISCO users.
DISCO requires an index with data for each
language. These data are built automatically from
very large electronic text collections (corpora).
The language data can be downloaded here.
At the moment, the following languages are available:
Arabic Czech Dutch English French German Italian Russian Spanish
- Step 1: download disco-1.4.jar.
- Step 2: download one of the language data packages.
- Step 3: now you can query DISCO from the command line:
java -jar disco-1.4.jar LANGUAGE-DATA-DIRECTORY -bn house 12
outputs the twelve semantically most similar words for house.
You need a Java Runtime Environment. If Java isn't installed on your system, you can download it from www.java.com.
DISCO can be queried from the command line. Just type:
java -jar disco-1.4.jar DATA-DIRECTORY OPTIONS
The possible options are:
-bn WORD N: outputs the N semantically most similar words for WORD. Example: -bn mouse 10 → mice joystick rat Mouse keyboard rats button USB cursor printer rabbit.
-bc WORD N: outputs the N most significant collocations for WORD. Example: -bc decision 7 → unanimous overturned reversed upheld rescinded controversial appealed.
-s WORD1 WORD2: outputs the value of the first-order semantic similarity between the two input words (the value is between 0 and 1).
-s2 WORD1 WORD2: outputs the value of the second-order semantic similarity between the two input words (the value is between 0 and 1).
-cc WORD1 WORD2: outputs the common context of the two input words.
-f WORD: outputs the corpus frequency for WORD.
-n: outputs the number of words that can be queried.
-wl FILE: prints the complete word frequency list to the file.
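If the command-line interface is to be driven from another Java program, the call shown above can be assembled and started as a separate process. The following is a minimal sketch, not DISCO code: the class DiscoCommand and its methods are hypothetical, and the jar name, data directory and options are the ones quoted on this page. It also assumes that DISCO writes its results to standard output in UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DISCO: builds and runs the
// "java -jar disco-1.4.jar DATA-DIRECTORY OPTIONS" command line.
class DiscoCommand {

    // Assembles the argument list for one query, e.g. option "-bn"
    // with the option arguments "house" and "12".
    static List<String> build(String jarPath, String dataDir,
                              String option, String... optionArgs) {
        List<String> cmd = new ArrayList<>(
                List.of("java", "-jar", jarPath, dataDir, option));
        cmd.addAll(List.of(optionArgs));
        return cmd;
    }

    // Starts the command and collects its output line by line.
    // Assumption: the output is UTF-8 text on standard output.
    static List<String> run(List<String> cmd)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) {
        // Only prints the command; call run(cmd) once the jar and a
        // language data package are actually in place.
        List<String> cmd = build("disco-1.4.jar", "LANGUAGE-DATA-DIRECTORY",
                "-bn", "house", "12");
        System.out.println(String.join(" ", cmd));
    }
}
```

Building the command as a list of separate arguments, rather than one string, avoids quoting problems when the data directory path contains spaces.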
DISCO can be integrated into your own applications using the Java API. The Java API supplies several methods to retrieve semantically similar words, the semantic similarity between words, collocations, corpus frequencies etc. You can find more information in the API documentation (javadoc).
The following Java class contains sample code that shows how to call DISCO using the Java API: UseDISCO.java
If you are interested in the commercial use of one of the language data packages that are not freely available, please contact Peter Kolb (firstname.lastname@example.org). On request, we can also build language data packages for the language, domain or text type of your choice.
DISCO uses the Lucene index.
The DISCO language data were partly compiled on the basis of the following freely available electronic text collections:
- the international versions of the Wikipedia
- the Europarl corpus
- the JRC-ACQUIS corpus
- the German and English Project Gutenberg.