DISCO

java.lang.Object
- de.linguatools.disco.DISCO

```
public class DISCO
extends java.lang.Object
```
DISCO (Extracting DIStributionally Similar Words Using CO-occurrences) provides methods for computing the distributional (i.e. semantic) similarity between arbitrary words and text passages, and for retrieving a word's collocations or its corpus frequency. It also provides a method to retrieve the semantically most similar words for a given word.
The methods in this class work with word spaces (a.k.a. language data packets) stored in the form of Lucene indexes. Word spaces for several languages are available on the DISCO download page.
It is important to keep in mind that there are two different types of word spaces:
- DISCO.WordspaceType.COL: this type stores a word vector for each word. A word vector is the list of the significant co-occurrences of the word together with the type of co-occurrence (if any) and a significance value. The significant co-occurring words of a word are also called its collocations. The type of co-occurrence can be a relative position in a context window or a syntactic relation.
- DISCO.WordspaceType.SIM: this type stores the above word vectors, but also contains pre-computed lists of the most similar words for each word. These words can be queried using the method DISCO.similarWords(). There are several methods in the DISCO API that only work with word spaces of type DISCO.WordspaceType.SIM.
DISCO is described in the following conference papers:
- Peter Kolb: Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics - NODALIDA '09, Odense, Denmark, May 2009.
- Peter Kolb: DISCO: A Multilingual Database of Distributionally Similar Words. In Tagungsband der 9. KONVENS, Berlin, 2008.

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`DISCO.SimilarityMeasure` Available measures for vector comparison.
`static class`	`DISCO.WordspaceType` Available word space types (SIM = word space contains lists of pre-computed similar words for each word, COL = word space contains only word vectors).

Field Summary

Fields
Modifier and Type	Field and Description
`java.lang.String`	`indexDir` Name of the word space directory.
`org.apache.lucene.store.RAMDirectory`	`indexRAM` The word space loaded into RAM.
`static java.lang.String`	`relationSeparator` This string is used as separator between a feature word and its relation.
`DISCO.WordspaceType`	`wordspaceType` Type of this word space.

Constructor Summary

Constructors
Constructor and Description
`DISCO(java.lang.String idxName, boolean loadIntoRAM)` DISCO version 2.0 allows loading a complete word space into RAM to speed up similarity computations.

Method Summary

Methods
Modifier and Type	Method and Description
`float`	`collocationalValue(java.lang.String w1, java.lang.String w2)` Returns the collocational strength between words `w1` and `w2`, summed up over all relations.
`ReturnDataCol[]`	`collocations(java.lang.String word)` Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first).
`int`	`frequency(java.lang.String word)` Looks up the input word in the word space and returns its frequency.
`static DISCO.SimilarityMeasure`	`getSimilarityMeasure(java.lang.String simMeasure)` Get SimilarityMeasure from string.
`java.lang.String[]`	`getStopwords()` Get the stopwords for this word space instance.
`DISCO.WordspaceType`	`getWordspaceType()` Returns the type of the word space instance.
`java.util.HashMap<java.lang.String,java.lang.Float>`	`getWordvector(java.lang.String word)` Returns the word vector representing the distribution of the input word in the corpus. The word vector can be used with the methods in the class `Compositionality`.
`int`	`numberOfWords()` Returns the number of `Documents` (i.e. words) in the word space.
`org.apache.lucene.document.Document`	`searchIndex(java.lang.String word)` Searches for an input word in index field `word` and returns the first hit `Document` or `null`. DISCO uses the Lucene index.
`float`	`secondOrderSimilarity(java.lang.String w1, java.lang.String w2)` Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words. Important note: This method only works with word spaces of type `DISCO.WordspaceType.SIM`.
`float`	`semanticSimilarity(java.lang.String w1, java.lang.String w2)` Computes the semantic similarity (according to the vector similarity measure `SimilarityMeasures.KOLB` which is described in Kolb 2009) between the input words based on their collocation sets (i.e. word vectors).
`float`	`semanticSimilarity(java.lang.String w1, java.lang.String w2, DISCO.SimilarityMeasure similarityMeasure)` Computes the semantic similarity (according to the vector similarity measure `similarityMeasure`) between the two input words based on their collocation sets (i.e. word vectors).
`ReturnDataBN`	`similarWords(java.lang.String word)` Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values. If the search word isn't found in the word space, the return value is `null`. Important note: This method only works with word spaces of type `DISCO.WordspaceType.SIM`.
`int`	`wordFrequencyList(java.lang.String outputFileName)` Runs through all documents (i.e. queryable words) in the index and retrieves each word and its frequency.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - indexDir
```
public java.lang.String indexDir
```
    Name of the word space directory.
  - indexRAM
```
public org.apache.lucene.store.RAMDirectory indexRAM
```
    The word space loaded into RAM.
  - wordspaceType
```
public DISCO.WordspaceType wordspaceType
```
    Type of this word space.
  - relationSeparator
```
public static final java.lang.String relationSeparator
```
    This string is used as separator between a feature word and its relation. It is a character from the Unicode private use area.
    
- Constructor Detail
  - DISCO
```
public DISCO(java.lang.String idxName,
     boolean loadIntoRAM)
      throws java.io.FileNotFoundException,
             org.apache.lucene.index.CorruptIndexException,
             java.io.IOException,
             CorruptConfigFileException
```
    DISCO version 2.0 allows loading a complete word space into RAM to speed up similarity computations. Make sure that you have enough free memory, since word spaces can be very large. Also, remember that loading a huge word space into RAM will take some time.
    This constructor reads the word space type from the file disco.config in the word space directory. If the file is not found in the word space directory, a FileNotFoundException is thrown. If the word space type cannot be determined (due to a corrupted config file), a CorruptConfigFileException is thrown.
    
    Parameters:
    idxName - the name of the word space directory
    loadIntoRAM - if true the word space is loaded into RAM
    
    Throws:
    
    java.io.IOException
    
    java.io.FileNotFoundException - if the file "disco.config" cannot be found in the word space directory idxName.
    
    org.apache.lucene.index.CorruptIndexException
    
    CorruptConfigFileException - if the file "disco.config" is corrupt.
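    
    Since the constructor fails with a FileNotFoundException when disco.config is missing, it can be convenient to check the directory first. The following is a minimal sketch using only the Java standard library; the class WordSpaceCheck and its method hasConfig are hypothetical helpers and not part of the DISCO API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of the DISCO API. */
final class WordSpaceCheck {

    /** Returns true if the path is a directory containing a regular file named disco.config. */
    static boolean hasConfig(Path wordSpaceDir) {
        return Files.isDirectory(wordSpaceDir)
                && Files.isRegularFile(wordSpaceDir.resolve("disco.config"));
    }

    public static void main(String[] args) {
        // The directory is taken from the command line; "." is only a fallback for the demo.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(dir + " contains disco.config: " + hasConfig(dir));
    }
}
```

    This check only confirms that the file exists. It does not validate its contents, so a CorruptConfigFileException is still possible when the constructor runs.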
- Method Detail
  - getSimilarityMeasure
```
public static DISCO.SimilarityMeasure getSimilarityMeasure(java.lang.String simMeasure)
```
    Get SimilarityMeasure from string.
    
    Parameters:
    simMeasure - the name of the similarity measure as a string
    
    Returns:
    SimilarityMeasure or null.
  - getWordspaceType
```
public DISCO.WordspaceType getWordspaceType()
```
    Returns the type of the word space instance.
    
    Returns:
    word space type
  - searchIndex
```
public org.apache.lucene.document.Document searchIndex(java.lang.String word)
                                                throws java.io.IOException
```
    Searches for an input word in index field word and returns the first hit Document or null.
    DISCO uses the Lucene index. A word's data are stored in the index in an object of type Document. A Document has the following 6 fields:
    - word: contains a word, tokenized with WhitespaceAnalyzer. This is the only searchable field.
    - freq: the corpus frequency of the word. This field is only stored, but not indexed.
    - dsb: the distributionally similar words for the input word. They are stored in a single string, in which the words are separated by spaces. This field is not indexed, and therefore not searchable. The words are sorted by their similarity value, highest value first.
      For word spaces of type WordspaceType.COL, this field is empty!
    - dsbSim: contains a single string with the similarity values for the words in the field dsb, separated by spaces. The string in this field is parallel to the string in the field dsb, i.e., the n-th token of the string in dsbSim corresponds to the n-th token in dsb.
      Example: field dsb contains the string "apple banana cherry", field dsbSim contains the string "0.3241 0.1233 0.0788". This means that the similarity between the word in the field word and "cherry" is 0.0788.
      For word spaces of type WordspaceType.COL, this field is empty!
    - kol: contains the features from the input word's sparse word vector. "Sparse" means that only those features are stored that have a value greater than or equal to the threshold that was set in minWeight in the disco.config file.
      There are three forms features can have:
      - featureWord: the feature is a plain word.
      - featureWord<SEP>relation: the feature is composed of a word and a specific relation between the inputWord and the featureWord. The relation can be a window position or a syntactic dependency relation. featureWord and relation are separated by the character DISCO.relationSeparator.
      - ID: the feature is a number. This is the case for word spaces that have been imported from other tools like word2vec. Word spaces of type word x document also have IDs as features.
      
      The features in the field kol are separated by a space.
    - kolSig: contains the significance values for kol, in a string parallel to the string in kol.
    
    Parameters:
    word - input word to be looked up in index (must be a single token).
    
    Returns:
    index entry of the input word, or null if the input word cannot be found in the index.
    
    Throws:
    
    java.io.IOException
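    
    The parallel layout of the fields dsb/dsbSim (and likewise kol/kolSig) described above can be unpacked with plain string handling. The sketch below assumes the two field strings have already been read from the Document; the class ParallelFields and its method zip are hypothetical names, not part of DISCO.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the DISCO API. */
final class ParallelFields {

    /** Pairs the n-th token of words with the n-th token of values, keeping the stored order. */
    static Map<String, Float> zip(String words, String values) {
        Map<String, Float> result = new LinkedHashMap<>();
        if (words.trim().isEmpty()) {
            return result; // e.g. the dsb field of a COL word space
        }
        String[] w = words.trim().split("\\s+");
        String[] v = values.trim().split("\\s+");
        if (w.length != v.length) {
            throw new IllegalArgumentException(
                    "fields are not parallel: " + w.length + " vs " + v.length);
        }
        for (int i = 0; i < w.length; i++) {
            result.put(w[i], Float.parseFloat(v[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Example strings taken from the description of dsb and dsbSim.
        Map<String, Float> similar = zip("apple banana cherry", "0.3241 0.1233 0.0788");
        System.out.println(similar.get("cherry")); // prints 0.0788
    }
}
```

    A LinkedHashMap is used so that the stored ordering (highest similarity first) is preserved when iterating.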
  - numberOfWords
```
public int numberOfWords()
                  throws java.io.IOException
```
    Returns the number of Documents (i.e. words) in the word space.
    
    Returns:
    number of words in index
    
    Throws:
    
    java.io.IOException
  - frequency
```
public int frequency(java.lang.String word)
              throws java.io.IOException
```
    Looks up the input word in the word space and returns its frequency. If the word is not found the return value is zero.
    
    Parameters:
    word - word to be looked up (must be a single token).
    
    Returns:
    frequency of the input word in the text corpus from which the word space index was built
    
    Throws:
    
    java.io.IOException
  - similarWords
```
public ReturnDataBN similarWords(java.lang.String word)
                          throws java.io.IOException,
                                 WrongWordspaceTypeException
```
    Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values.
    If the search word isn't found in the word space, the return value is null.
    Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM.
    
    Parameters:
    word - word to be looked up (must be a single token).
    
    Returns:
    result data structure or null
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException - if the word space does not have the type DISCO.WordspaceType.SIM.
  - collocations
```
public ReturnDataCol[] collocations(java.lang.String word)
                             throws java.io.IOException
```
    Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). If the search word is not found in the index, the return value is null.
    The collocations are derived from the word's features. Because features can be not only plain words but also words combined with their relation to the input word, the relation is stripped from each feature and the significance values of identical words are summed up. (If you want to receive the full features instead of only the words, use the method getWordvector().)
    Features can also be IDs (for word spaces with document features or imported from other tools). In this case, the "collocations" will be a list of IDs.
    The significance measure that was used in word space construction by DISCOBuilder is stored in the file disco.config in the word space directory (look at the line weightingMethod). For more information on available significance measures consult DISCOBuilder's documentation.
    
    Parameters:
    word - the input word (must be a single token).
    
    Returns:
    the list of collocations with their significance values or null. The relation fields of the array elements are not set.
    
    Throws:
    
    java.io.IOException
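    
    The stripping and summing step described for collocations() can be illustrated on a plain map of features to significance values. This is a sketch of the idea only, not DISCO's implementation: the class CollocationSketch is hypothetical, and the separator value used here is a placeholder from the Unicode private use area, since the actual value of DISCO.relationSeparator is not stated on this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration, not DISCO's implementation. */
final class CollocationSketch {

    // Placeholder separator (assumption); use DISCO.relationSeparator in real code.
    static final String SEP = "\uF8FF";

    /** Strips the relation (if any) from a feature and returns the bare word. */
    static String wordPart(String feature) {
        int i = feature.indexOf(SEP);
        return i < 0 ? feature : feature.substring(0, i);
    }

    /** Sums significance values per bare word, then sorts by descending significance. */
    static List<Map.Entry<String, Float>> collocations(Map<String, Float> wordVector) {
        Map<String, Float> sums = new HashMap<>();
        for (Map.Entry<String, Float> e : wordVector.entrySet()) {
            sums.merge(wordPart(e.getKey()), e.getValue(), Float::sum);
        }
        List<Map.Entry<String, Float>> sorted = new ArrayList<>(sums.entrySet());
        sorted.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Float> vector = new HashMap<>();
        vector.put("drink" + SEP + "-1", 2.0f); // made-up feature with a window position
        vector.put("drink" + SEP + "+1", 1.5f);
        vector.put("cup", 3.0f);                // plain-word feature
        for (Map.Entry<String, Float> e : collocations(vector)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```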
  - semanticSimilarity
```
public float semanticSimilarity(java.lang.String w1,
                       java.lang.String w2)
                         throws java.io.IOException
```
    Computes the semantic similarity (according to the vector similarity measure SimilarityMeasures.KOLB which is described in Kolb 2009) between the input words based on their collocation sets (i.e. word vectors). If either of the two words isn't found in the index, the return value is -2.
    Note: To compute the similarity between multi-word expressions (e.g. "New York" or "nuclear power plant") use the methods in the class Compositionality.
    
    Parameters:
    w1 - input word #1 (must be a single token).
    w2 - input word #2 (must be a single token).
    
    Returns:
    similarity value between 0.0F and 1.0F or -2.0F.
    
    Throws:
    
    java.io.IOException
  - semanticSimilarity
```
public float semanticSimilarity(java.lang.String w1,
                       java.lang.String w2,
                       DISCO.SimilarityMeasure similarityMeasure)
                         throws java.io.IOException
```
    Computes the semantic similarity (according to the vector similarity measure similarityMeasure) between the two input words based on their collocation sets (i.e. word vectors).
    Note: To compute the similarity between multi-word expressions (e.g. "New York" or "nuclear power plant") use the methods in the class Compositionality.
    
    Parameters:
    w1 - input word #1 (must be a single token).
    w2 - input word #2 (must be a single token).
    similarityMeasure - One of the similarity measures enumerated in SimilarityMeasures.
    
    Returns:
    The similarity between the two input words; depending on the chosen similarity measure, a value between 0.0F and 1.0F, or between -1.0F and 1.0F. If either of the two words isn't found in the index, the return value is -2.0F. If the similarityMeasure is unknown, the return value is -3.0F.
    
    Throws:
    
    java.io.IOException
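    
    Because semanticSimilarity() signals failures through the return values -2.0F and -3.0F rather than through exceptions, callers may want to separate these sentinels from real similarity values before using the result. A minimal sketch follows; the class SimilarityResult and its Status enum are hypothetical and not part of DISCO.

```java
/** Hypothetical helper, not part of the DISCO API. */
final class SimilarityResult {

    enum Status { OK, WORD_NOT_FOUND, UNKNOWN_MEASURE }

    /** Maps the documented sentinel values -2.0F and -3.0F to a status. */
    static Status classify(float value) {
        if (value == -2.0f) {
            return Status.WORD_NOT_FOUND;
        }
        if (value == -3.0f) {
            return Status.UNKNOWN_MEASURE;
        }
        return Status.OK; // a real similarity value, possibly negative for some measures
    }

    public static void main(String[] args) {
        float[] samples = {0.42f, -0.3f, -2.0f, -3.0f}; // made-up return values
        for (float s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

    Exact comparison with == is safe here because -2.0 and -3.0 are exactly representable as floats and lie outside the documented range of valid similarity values.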
  - secondOrderSimilarity
```
public float secondOrderSimilarity(java.lang.String w1,
                          java.lang.String w2)
                            throws java.io.IOException,
                                   WrongWordspaceTypeException
```
    Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words.
    Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM.
    
    Parameters:
    w1 - input word #1 (must be a single token).
    w2 - input word #2 (must be a single token).
    
    Returns:
    similarity value between 0.0F and 1.0F. If either of the two words isn't found in the index, the return value is -2.0F.
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException
  - collocationalValue
```
public float collocationalValue(java.lang.String w1,
                       java.lang.String w2)
                         throws java.io.IOException
```
    Returns the collocational strength between words w1 and w2, summed up over all relations.
    
    Parameters:
    w1 - input word #1 (must be a single token).
    w2 - input word #2 (must be a single token).
    
    Returns:
    the sum of the significance values between word w1 and all its features that have w2 as their word part while ignoring the relation (if any). If w1 is not found the return value is 0.
    
    Throws:
    
    java.io.IOException
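    
    The return value described above can be reproduced on a word vector such as the one returned by getWordvector(). The sketch below is an illustration under assumptions, not DISCO's code: the class CollocationalValueSketch is hypothetical, and the separator character is a placeholder for DISCO.relationSeparator.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not DISCO's implementation. */
final class CollocationalValueSketch {

    // Placeholder separator (assumption); use DISCO.relationSeparator in real code.
    static final char SEP = '\uF8FF';

    /** Sums the values of all features of w1 whose word part equals w2, ignoring the relation. */
    static float collocationalValue(Map<String, Float> vectorOfW1, String w2) {
        if (vectorOfW1 == null) {
            return 0.0f; // mirrors the documented result when w1 is not found
        }
        float sum = 0.0f;
        for (Map.Entry<String, Float> e : vectorOfW1.entrySet()) {
            String feature = e.getKey();
            int i = feature.indexOf(SEP);
            String word = i < 0 ? feature : feature.substring(0, i);
            if (word.equals(w2)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> vector = new HashMap<>();
        vector.put("drink" + SEP + "-1", 2.0f); // made-up features and values
        vector.put("drink" + SEP + "+1", 1.5f);
        vector.put("cup", 3.0f);
        System.out.println(collocationalValue(vector, "drink")); // prints 3.5
    }
}
```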
  - getWordvector
```
public java.util.HashMap<java.lang.String,java.lang.Float> getWordvector(java.lang.String word)
                                                                  throws java.io.IOException
```
    Returns the word vector representing the distribution of the input word in the corpus.
    The word vector can be used with the methods in the class Compositionality.
    
    Parameters:
    word - input word (must be a single token).
    
    Returns:
    HashMap containing the word vector or null if word is not found. The features of the word vector are the keys of the resulting HashMap, the values are the significance values of the word vector. For more information on the values consult the documentation of the method searchIndex() (field kol).
    
    Throws:
    
    java.io.IOException
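    
    Two word vectors in the HashMap form returned by getWordvector() can be compared directly. As a generic illustration, the sketch below computes the standard cosine similarity over sparse maps. The class SparseCosine is hypothetical, and this is not claimed to be how DISCO or the class Compositionality computes similarity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of cosine similarity on sparse word vectors. */
final class SparseCosine {

    /** Returns the cosine of the angle between two sparse vectors, or 0 if either is all zeros. */
    static double cosine(Map<String, Float> a, Map<String, Float> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Float> e : a.entrySet()) {
            double x = e.getValue();
            normA += x * x;
            Float y = b.get(e.getKey()); // only shared features contribute to the dot product
            if (y != null) {
                dot += x * y;
            }
        }
        for (float y : b.values()) {
            normB += (double) y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Float> v1 = new HashMap<>();
        v1.put("cup", 3.0f);   // made-up features and values
        v1.put("drink", 4.0f);
        Map<String, Float> v2 = new HashMap<>();
        v2.put("cup", 3.0f);
        v2.put("leaf", 4.0f);
        System.out.println(cosine(v1, v2));
    }
}
```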
  - wordFrequencyList
```
public int wordFrequencyList(java.lang.String outputFileName)
```
    Runs through all documents (i.e. queryable words) in the index and retrieves each word and its frequency. Both are written to the text file named outputFileName. Note that the output is not sorted.
    This method can be used to check index integrity. If an error occurs while querying a word, a warning is written to standard output.
    
    Parameters:
    outputFileName - name of the output file.
    
    Returns:
    number of words written to the output file. In case of success the value is equal to the number of words in the index.
  - getStopwords
```
public java.lang.String[] getStopwords()
                                throws java.io.FileNotFoundException,
                                       java.io.IOException,
                                       CorruptConfigFileException
```
    Get the stopwords for this word space instance.
    
    Returns:
    Array with stopwords
    
    Throws:
    
    java.io.FileNotFoundException
    
    java.io.IOException
    
    CorruptConfigFileException
