DISCO (disco 3.0 API)

java.lang.Object
- de.linguatools.disco.DISCO

Direct Known Subclasses:

DenseMatrix, DISCOLuceneIndex
```
public abstract class DISCO
extends java.lang.Object
```
DISCO (Extracting DIStributionally Similar Words Using CO-occurrences) provides methods for computing the distributional (i.e. semantic) similarity between arbitrary words and text passages, and for retrieving a word's collocations or its corpus frequency. It also provides a method for retrieving the semantically most similar words for a given word.
It is important to keep in mind that there are two different types of word spaces:
- DISCO.WordspaceType.COL: this type stores a word vector for each word. A word vector is the list of the significant co-occurrences of the word together with the type of co-occurrence (if any) and a significance value. The significant co-occurring words of a word are also called its collocations. The type of co-occurrence can be a relative position in a context window, or a syntactic relation.
- DISCO.WordspaceType.SIM: this type stores the above word vectors, but also contains pre-computed lists of the most similar words for each word. These words can be queried using the method DISCO.similarWords(). There are several methods in the DISCO API that only work with word spaces of type DISCO.WordspaceType.SIM.
DISCO supports two methods for storing word spaces, implemented by the two subclasses of the abstract DISCO class:
- DISCOLuceneIndex: this class uses Lucene to store a word space. It is intended for very high-dimensional word spaces (like the classic distributional count vectors that work without any dimension reduction techniques) because it stores them as a sparse matrix.
- DenseMatrix: this class stores a word space as a two-dimensional array. It stores the full matrix and is therefore only feasible for predict vectors (word embeddings) like the ones that are produced by word2vec.
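The storage trade-off between the two subclasses can be illustrated with plain JDK collections. This is only a sketch of the idea, not how DISCOLuceneIndex or DenseMatrix are implemented; the class name `StorageSketch`, the vocabulary size, and the values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class StorageSketch {

    /** Sparse row: only features with a non-zero value are stored. */
    public static Map<String, Float> sparseRow() {
        Map<String, Float> row = new HashMap<>();
        row.put("guitar", 4.2f); // made-up significance values
        row.put("band", 3.1f);
        return row;              // all other features are implicit zeros
    }

    /** Number of cells a full (dense) matrix has to hold. */
    public static long denseCells(long vocabularySize, long dimensions) {
        return vocabularySize * dimensions;
    }

    public static void main(String[] args) {
        // Embedding-sized vectors: 100,000 words x 300 dimensions.
        System.out.println(denseCells(100_000, 300));
        // Count vectors with one dimension per feature word: 100,000 x 100,000.
        System.out.println(denseCells(100_000, 100_000));
        // The sparse row stores just the two features that actually occur.
        System.out.println(sparseRow().size());
    }
}
```

With a few hundred dimensions the full matrix stays manageable, while one dimension per feature word makes it grow quadratically with the vocabulary, which is why a sparse representation is the better fit for count vectors.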
Word spaces for several languages are available on the DISCO download page. You can also import word embeddings created with word2vec or fastText using DISCOBuilder's import functionality.

DISCO is described in the following conference papers:
- Peter Kolb: Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics - NODALIDA '09, Odense, Denmark, May 2009.
- Peter Kolb: DISCO: A Multilingual Database of Distributionally Similar Words. In Tagungsband der 9. KONVENS, Berlin, 2008.

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`DISCO.SimilarityMeasure` Available measures for vector comparison.
`static class`	`DISCO.WordspaceType` Available word space types (SIM = word space contains lists of pre-computed similar words for each word, COL = word space contains only word vectors).

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`RELATION_SEPARATOR` This string is used as separator between a feature word and its relation.

Constructor Summary

Constructors
Constructor and Description
`DISCO()`

Method Summary

Methods
Modifier and Type	Method and Description
`abstract ReturnDataCol[]`	`collocations(java.lang.String word)` Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first).
`abstract int`	`frequency(java.lang.String word)` Get corpus frequency of `word`.
`abstract int`	`getMaxFreq()` Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
`abstract int`	`getMinFreq()` Get minimum frequency of tokens in corpus.
`abstract java.util.Map<java.lang.String,java.lang.Float>`	`getSecondOrderWordvector(java.lang.String word)` The second order word vector contains the `nBest` most similar words for `word` as features (instead of the directly co-occurring words that you get with `getWordvector`).
`static DISCO.SimilarityMeasure`	`getSimilarityMeasure(java.lang.String simMeasure)` Get `SimilarityMeasure` object from its String name.
`abstract java.lang.String[]`	`getStopwords()` Gets list of stopwords from the `disco.config` file in the word space.
`abstract long`	`getTokenCount()` Size of the underlying corpus.
`static VectorSimilarity`	`getVectorSimilarity(DISCO.SimilarityMeasure simMeasure)` Get VectorSimilarity class for a SimilarityMeasure.
`abstract java.util.Iterator<java.lang.String>`	`getVocabularyIterator()` Returns an iterator that iterates over all words in the word space (the vocabulary).
`abstract java.lang.String`	`getWord(int id)` Returns the id-th word in the vocabulary.
`abstract DISCO.WordspaceType`	`getWordspaceType()` Get type of this word space.
`abstract java.util.Map<java.lang.String,java.lang.Float>`	`getWordvector(java.lang.String word)` Get word vector for `word` as map `feature - value`.
`static DISCO`	`load(java.lang.String discoFile)` Load DISCOLuceneIndex or DenseMatrix from file into memory.
`abstract int`	`numberOfFeatureWords()` For `DISCOLuceneIndex` this returns the number of words that were used as features.
`abstract int`	`numberOfSimilarWords()` Get the number of similar words that are stored in word spaces of type SIM for each word.
`abstract int`	`numberOfWords()` Get number of words stored in this word space.
`static DISCO`	`open(java.lang.String discoFile)` If `discoFile` is a `DISCOLuceneIndex` the index is opened for reading but not loaded into memory. If `discoFile` is a `DenseMatrix` it is loaded into memory.
`abstract float`	`secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity)` Computes the similarity between words `w1` and `w2` by comparing the set of the most similar words for `w1` with the set of the most similar words for `w2`. The size of the set of similar words stored for each word in the word space is given by the DISCOBuilder parameter `-nBest`.
`abstract float`	`semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity)` Computes the similarity between words `w1` and `w2` by comparing their word vectors using the `vectorSimilarity` measure of choice.
`abstract ReturnDataBN`	`similarWords(java.lang.String word)` Returns the list of the most similar words for `word` (according to `DISCO.semanticSimilarity`).
`abstract int`	`wordFrequencyList(java.lang.String outputFileName)` Writes word-frequency list to file.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - RELATION_SEPARATOR
```
public static final java.lang.String RELATION_SEPARATOR
```
    This string is used as separator between a feature word and its relation. It is a character from the Unicode private use area.
    
    See Also:
    
    Constant Field Values
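    A feature key that carries a relation can be taken apart at the separator. The sketch below assumes that the feature word comes first and the relation second, and it uses `\uE000` as a stand-in private-use character; the real value is whatever `DISCO.RELATION_SEPARATOR` holds. The class name `FeatureKeySketch` is hypothetical.

```java
public class FeatureKeySketch {

    // Assumption: stand-in private-use character, not the actual constant value.
    static final String SEPARATOR = "\uE000";

    /** Splits "word<SEP>relation" into {word, relation}; relation is "" if absent. */
    public static String[] split(String feature) {
        int pos = feature.indexOf(SEPARATOR);
        if (pos < 0) {
            return new String[] { feature, "" };
        }
        return new String[] {
            feature.substring(0, pos),
            feature.substring(pos + SEPARATOR.length())
        };
    }

    public static void main(String[] args) {
        // "-1" stands for a hypothetical positional relation (one word to the left).
        String[] parts = split("guitar" + SEPARATOR + "-1");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```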
- Constructor Detail
  - DISCO
```
public DISCO()
```
- Method Detail
  - getSimilarityMeasure
```
public static DISCO.SimilarityMeasure getSimilarityMeasure(java.lang.String simMeasure)
```
    Get SimilarityMeasure object from its String name.
    
    Parameters:
    
    simMeasure - name of the similarity measure.
    
    Returns:
    
    SimilarityMeasure or null.
  - getVectorSimilarity
```
public static VectorSimilarity getVectorSimilarity(DISCO.SimilarityMeasure simMeasure)
```
    Get VectorSimilarity class for a SimilarityMeasure.
    
    Parameters:
    
    simMeasure - the similarity measure.
    
    Returns:
    
    the VectorSimilarity for simMeasure.
  - getWordspaceType
```
public abstract DISCO.WordspaceType getWordspaceType()
```
    Get type of this word space.
    
    Returns:
    
    the type of this word space.
  - numberOfWords
```
public abstract int numberOfWords()
```
    Get number of words stored in this word space. In case of DenseMatrix this returns the value of config.vocabularySize, which has to be equal to the number of rows in the similarity matrix. For DISCOLuceneIndex this returns the number of documents in the Lucene index.
    
    Returns:
    
    number of words stored in this word space.
  - numberOfFeatureWords
```
public abstract int numberOfFeatureWords()
```
    For DISCOLuceneIndex this returns the number of words that were used as features. Note that this is only equal to the dimensionality of the word vectors if no positional or relational features were used. See options in disco.config under numberFeatureWords for more information.
    For DenseMatrix this is equal to the dimensionality of the word embedding (vector length).
    
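    Returns:
    
    number of feature words (DISCOLuceneIndex) or the dimensionality of the word embedding (DenseMatrix).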
  - numberOfSimilarWords
```
public abstract int numberOfSimilarWords()
```
    Get the number of similar words that are stored in word spaces of type SIM for each word.
    
    Returns:
    
    number of similar words that are stored in the word space. For word spaces of type COL this value is always 0.
  - frequency
```
public abstract int frequency(java.lang.String word)
                       throws java.io.IOException
```
    Get corpus frequency of word.
    
    Parameters:
    
    word - the word to look up.
    
    Returns:
    
    corpus frequency of word.
    
    Throws:
    
    java.io.IOException
  - similarWords
```
public abstract ReturnDataBN similarWords(java.lang.String word)
                                   throws java.io.IOException,
                                          WrongWordspaceTypeException
```
    Returns the list of the most similar words for word (according to DISCO.semanticSimilarity). Since the list of most similar words for each word is only stored in word spaces of type WordspaceType.SIM this does not work with word spaces of type WordspaceType.COL.
    
    Parameters:
    
    word - the word to look up.
    
    Returns:
    
    the most similar words for word.
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException - if called with a word space of type WordspaceType.COL
  - semanticSimilarity
```
public abstract float semanticSimilarity(java.lang.String w1,
                                         java.lang.String w2,
                                         VectorSimilarity vectorSimilarity)
                                  throws java.io.IOException
```
    Computes the similarity between words w1 and w2 by comparing their word vectors using the vectorSimilarity measure of choice.
    
    Parameters:
    
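    w1 - the first word.
    
    w2 - the second word.
    
    vectorSimilarity - the measure used to compare the two word vectors.
    
    Returns:
    
    similarity between w1 and w2.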
    
    Throws:
    
    java.io.IOException
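    Comparing two word vectors given as `feature - value` maps can be sketched as follows. Cosine is used here only as a familiar example of a vector similarity measure; the measures DISCO actually offers are those in `DISCO.SimilarityMeasure`, and the class name `CosineSketch` and the sample values are made up.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    /** Cosine similarity of two sparse vectors; missing features count as 0. */
    public static float cosine(Map<String, Float> v1, Map<String, Float> v2) {
        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (Map.Entry<String, Float> e : v1.entrySet()) {
            float x = e.getValue();
            norm1 += x * x;
            Float y = v2.get(e.getKey());
            if (y != null) {
                dot += x * y; // only shared features contribute to the dot product
            }
        }
        for (float y : v2.values()) {
            norm2 += y * y;
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0f; // convention chosen for this sketch: empty vector gives 0
        }
        return (float) (dot / (Math.sqrt(norm1) * Math.sqrt(norm2)));
    }

    public static void main(String[] args) {
        Map<String, Float> guitar = new HashMap<>();
        guitar.put("band", 3.0f);
        guitar.put("string", 2.0f);
        Map<String, Float> violin = new HashMap<>();
        violin.put("string", 2.5f);
        violin.put("orchestra", 4.0f);
        System.out.println(cosine(guitar, violin));
    }
}
```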
  - secondOrderSimilarity
```
public abstract float secondOrderSimilarity(java.lang.String w1,
                                            java.lang.String w2,
                                            VectorSimilarity vectorSimilarity)
                                     throws java.io.IOException,
                                            WrongWordspaceTypeException
```
    Computes the similarity between words w1 and w2 by comparing the set of the most similar words for w1 with the set of the most similar words for w2.
    The size of the set of similar words stored for each word in the word space is given by the DISCOBuilder parameter -nBest. Default size is 300.
    Only works with word spaces of type WordspaceType.SIM!
    
    Parameters:
    
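    w1 - the first word.
    
    w2 - the second word.
    
    vectorSimilarity - the measure used to compare the two sets of similar words.
    
    Returns:
    
    second order similarity between w1 and w2.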
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException - if called with a word space of type WordspaceType.COL
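    The idea of comparing two words through their sets of similar words can be illustrated with a plain set overlap. Note that this sketch swaps in the Jaccard coefficient for simplicity, whereas DISCO applies the `vectorSimilarity` passed by the caller; the class name `SetOverlapSketch` and the word lists are made up.

```java
import java.util.HashSet;
import java.util.Set;

public class SetOverlapSketch {

    /** Jaccard coefficient: size of the intersection divided by size of the union. */
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Hypothetical similar-word sets for two words.
        Set<String> similarToGuitar = Set.of("bass", "banjo", "violin");
        Set<String> similarToCello = Set.of("violin", "viola", "bass");
        System.out.println(jaccard(similarToGuitar, similarToCello));
    }
}
```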
  - getWordvector
```
public abstract java.util.Map<java.lang.String,java.lang.Float> getWordvector(java.lang.String word)
                                                                       throws java.io.IOException
```
    Get word vector for word as map feature - value. Features are either words or IDs.
    
    Parameters:
    
    word - the word to look up.
    
    Returns:
    
    the word vector as a map from feature to value.
    
    Throws:
    
    java.io.IOException
  - getSecondOrderWordvector
```
public abstract java.util.Map<java.lang.String,java.lang.Float> getSecondOrderWordvector(java.lang.String word)
                                                                                  throws java.io.IOException,
                                                                                         WrongWordspaceTypeException
```
    The second order word vector contains the nBest most similar words for word as features (instead of the directly co-occurring words that you get with getWordvector).
    
    Parameters:
    
    word - the word to look up.
    
    Returns:
    
    the second order word vector as a map from feature to value.
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException - when used with word space that is not of type SIM.
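    A second order vector can be pictured as the ranked list of similar words turned into a map. The sketch below assumes that the feature values are the similarity scores and that the list is already sorted best first; neither detail is stated above. `SecondOrderSketch` and its inputs are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SecondOrderSketch {

    /**
     * Builds a map from similar word to score out of two parallel arrays
     * (ranked best first), keeping at most nBest entries.
     */
    public static Map<String, Float> secondOrderVector(String[] words, float[] scores, int nBest) {
        Map<String, Float> vector = new LinkedHashMap<>(); // keeps the ranking order
        int limit = Math.min(nBest, Math.min(words.length, scores.length));
        for (int i = 0; i < limit; i++) {
            vector.put(words[i], scores[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] words = { "violin", "cello", "piano" };
        float[] scores = { 0.8f, 0.7f, 0.5f };
        // With nBest = 2 only the two best-ranked words become features.
        System.out.println(secondOrderVector(words, scores, 2).keySet());
    }
}
```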
  - collocations
```
public abstract ReturnDataCol[] collocations(java.lang.String word)
                                      throws java.io.IOException
```
    Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). If the search word is not found in the index, the return value is null.
    Unlike the method getWordvector() this method summarizes the words over the different relations.
    Important note: if used with a DenseMatrix or a DISCOLuceneIndex that has IDs as features, the return values will be IDs and not words.
    
    Parameters:
    
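    word - the word to look up.
    
    Returns:
    
    collocations of word ordered by descending significance, or null if word is not found in the index.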
    
    Throws:
    
    java.io.IOException
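    Summarizing a word vector over its relations can be sketched as below. The sketch assumes that values of the same feature word are summed across relations and that keys have the form word, separator, relation; how DISCO combines the values is not stated above. `CollocationSketch` and the stand-in separator `\uE000` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollocationSketch {

    // Assumption: stand-in for DISCO.RELATION_SEPARATOR.
    static final String SEPARATOR = "\uE000";

    /** Sums the values per feature word and sorts by value, highest first. */
    public static List<Map.Entry<String, Float>> summarize(Map<String, Float> wordVector) {
        Map<String, Float> totals = new HashMap<>();
        for (Map.Entry<String, Float> e : wordVector.entrySet()) {
            String key = e.getKey();
            int pos = key.indexOf(SEPARATOR);
            String word = pos < 0 ? key : key.substring(0, pos); // drop the relation part
            totals.merge(word, e.getValue(), Float::sum);
        }
        List<Map.Entry<String, Float>> result = new ArrayList<>(totals.entrySet());
        result.sort((x, y) -> Float.compare(y.getValue(), x.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Float> vector = new HashMap<>();
        vector.put("band" + SEPARATOR + "-1", 1.5f);
        vector.put("band" + SEPARATOR + "+1", 2.0f);
        vector.put("string" + SEPARATOR + "-1", 3.0f);
        for (Map.Entry<String, Float> e : summarize(vector)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```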
  - wordFrequencyList
```
public abstract int wordFrequencyList(java.lang.String outputFileName)
```
    Writes word-frequency list to file. Iterates over all words in the word space and outputs them together with their corpus frequency.
    
    Parameters:
    
    outputFileName - name of the file the list is written to.
    
    Returns:
  - getVocabularyIterator
```
public abstract java.util.Iterator<java.lang.String> getVocabularyIterator()
                                                                    throws java.io.IOException
```
    Returns an iterator that iterates over all words in the word space (the vocabulary). There is no special ordering of the words. The method remove is not supported.
    
    Returns:
    
    iterator over all words in the word space.
    
    Throws:
    
    java.io.IOException
  - getWord
```
public abstract java.lang.String getWord(int id)
                                  throws java.io.IOException
```
    Returns the id-th word in the vocabulary.
    
    Parameters:
    
    id - id has to be between 0 and DISCO.numberOfWords() - 1.
    
    Returns:
    
    word with given id or null if id is not in the range 0..DISCO.numberOfWords() - 1.
    
    Throws:
    
    java.io.IOException
  - getStopwords
```
public abstract java.lang.String[] getStopwords()
                                         throws java.io.FileNotFoundException,
                                                java.io.IOException,
                                                CorruptConfigFileException
```
    Gets list of stopwords from the disco.config file in the word space.
    
    Returns:
    
    stop words that were used in word space creation.
    
    Throws:
    
    java.io.FileNotFoundException
    
    java.io.IOException
    
    CorruptConfigFileException
  - getTokenCount
```
public abstract long getTokenCount()
```
    Size of the underlying corpus.
    
    Returns:
    
    number of tokens in the underlying corpus.
  - getMinFreq
```
public abstract int getMinFreq()
```
    Get minimum frequency of tokens in corpus.
    
    Returns:
  - getMaxFreq
```
public abstract int getMaxFreq()
```
    Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
    
    Returns:
    
    corpus frequency of the most frequent word in the word space.
  - load
```
public static DISCO load(java.lang.String discoFile)
                  throws org.apache.lucene.index.CorruptIndexException,
                         java.io.IOException,
                         java.io.FileNotFoundException,
                         CorruptConfigFileException
```
    Load DISCOLuceneIndex or DenseMatrix from file into memory.
    
    Parameters:
    
    discoFile - can be either Lucene index directory or serialized DenseMatrix file.
    
    Returns:
    
    the loaded word space.
    
    Throws:
    
    org.apache.lucene.index.CorruptIndexException
    
    java.io.FileNotFoundException
    
    CorruptConfigFileException
    
    java.io.IOException
  - open
```
public static DISCO open(java.lang.String discoFile)
                  throws org.apache.lucene.index.CorruptIndexException,
                         java.io.IOException,
                         java.io.FileNotFoundException,
                         CorruptConfigFileException
```
    If discoFile is a DISCOLuceneIndex the index is opened for reading but not loaded into memory.
    If discoFile is a DenseMatrix it is loaded into memory. In this case open behaves exactly as load.
    
    Parameters:
    
    discoFile - can be either Lucene index directory or serialized DenseMatrix file.
    
    Returns:
    
    the opened word space.
    
    Throws:
    
    org.apache.lucene.index.CorruptIndexException
    
    java.io.FileNotFoundException
    
    CorruptConfigFileException
    
    java.io.IOException
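    Because discoFile may be either a directory or a single file, a caller may want to check what a path points to before handing it to open or load. The following is a caller-side sketch using only `java.nio.file`; it does not describe how DISCO itself tells the two cases apart, and `StoreKindSketch` is a hypothetical name.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StoreKindSketch {

    public enum Kind { DIRECTORY, REGULAR_FILE, MISSING }

    /** Classifies a path: a directory (candidate Lucene index), a file, or neither. */
    public static Kind kindOf(Path discoFile) {
        if (Files.isDirectory(discoFile)) {
            return Kind.DIRECTORY;     // could be a Lucene index directory
        }
        if (Files.isRegularFile(discoFile)) {
            return Kind.REGULAR_FILE;  // could be a serialized DenseMatrix file
        }
        return Kind.MISSING;           // nothing usable at this path
    }

    public static void main(String[] args) {
        Path path = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(path + " -> " + kindOf(path));
    }
}
```

    A check like this only reports what kind of file system entry exists; whether its content is a valid word space is still decided by open or load, which signal problems through the exceptions listed above.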
