DISCOLuceneIndex (disco 3.0 API)

java.lang.Object
- de.linguatools.disco.DISCO
- - de.linguatools.disco.DISCOLuceneIndex

```
public class DISCOLuceneIndex
extends DISCO
```
The methods in this class work with word spaces stored in the form of Lucene indexes. Word spaces for several languages are available on the DISCO download page.

Nested Class Summary
- Nested classes/interfaces inherited from class de.linguatools.disco.DISCO
  DISCO.SimilarityMeasure, DISCO.WordspaceType

Field Summary

- Fields inherited from class de.linguatools.disco.DISCO
  RELATION_SEPARATOR

Fields
Modifier and Type	Field and Description
`java.lang.String`	`indexDir` Name of the word space directory.
`org.apache.lucene.store.RAMDirectory`	`indexRAM` The word space loaded into RAM.

Constructor Summary

Constructors
Constructor and Description
`DISCOLuceneIndex(java.lang.String idxName, boolean loadIntoRAM)` DISCO version 2.0 allows loading a complete word space into RAM to speed up similarity computations.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`float`	`collocationalValue(java.lang.String w1, java.lang.String w2)` Returns the collocational strength between words `w1` and `w2`, summed up over all relations.
`ReturnDataCol[]`	`collocations(java.lang.String word)` Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first).
`int`	`frequency(java.lang.String word)` Looks up the input word in the word space and returns its frequency.
`int`	`getMaxFreq()` Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
`int`	`getMinFreq()` Get minimum frequency of tokens in corpus.
`java.util.Map<java.lang.String,java.lang.Float>`	`getSecondOrderWordvector(java.lang.String word)` The second order word vector contains the `nBest` most similar words for `word` as features (instead of the directly co-occurring words that you get with `getWordvector`).
`java.lang.String[]`	`getStopwords()` Get the stopwords for this word space instance.
`long`	`getTokenCount()` Size of the underlying corpus.
`java.util.Iterator<java.lang.String>`	`getVocabularyIterator()` Returns an iterator that iterates over all words in the word space (the vocabulary).
`java.lang.String`	`getWord(int id)` Returns the id-th word in the vocabulary.
`DISCO.WordspaceType`	`getWordspaceType()` Returns the type of the word space instance.
`java.util.Map<java.lang.String,java.lang.Float>`	`getWordvector(java.lang.String word)` Returns the word vector representing the distribution of the input word in the corpus. The word vector can be used with the methods in the class `Compositionality`.
`int`	`numberOfFeatureWords()` For `DISCOLuceneIndex` this returns the number of words that were used as features.
`int`	`numberOfSimilarWords()` Get the number of similar words that are stored in word spaces of type SIM for each word.
`int`	`numberOfWords()` Returns the number of `Documents` (i.e. words) in the word space.
`org.apache.lucene.document.Document`	`searchIndex(java.lang.String word)` Searches for an input word in index field `word` and returns the first hit `Document` or `null`. DISCOLuceneIndex uses the Lucene index.
`float`	`secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity)` Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words. Important note: This method only works with word spaces of type `WordspaceType.SIM`.
`float`	`semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity)` Computes the semantic similarity (according to the vector similarity measure `vectorSimilarity`) between the two input words based on their collocation sets (i.e. word vectors).
`ReturnDataBN`	`similarWords(java.lang.String word)` Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values. If the search word isn't found in the word space, the return value is `null`. The similarity values in the result can differ from the values you get with `DISCOLuceneIndex.semanticSimilarity` for the same word pair.
`int`	`wordFrequencyList(java.lang.String outputFileName)` Runs through all documents (i.e. queryable words) in the index and retrieves each word and its frequency.

Methods inherited from class de.linguatools.disco.DISCO
getSimilarityMeasure, getVectorSimilarity, load, open

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - indexDir
```
public java.lang.String indexDir
```
    Name of the word space directory.
  - indexRAM
```
public org.apache.lucene.store.RAMDirectory indexRAM
```
    The word space loaded into RAM.
- Constructor Detail
  - DISCOLuceneIndex
```
public DISCOLuceneIndex(java.lang.String idxName,
                        boolean loadIntoRAM)
                 throws java.io.FileNotFoundException,
                        org.apache.lucene.index.CorruptIndexException,
                        java.io.IOException,
                        CorruptConfigFileException
```
    DISCO version 2.0 allows loading a complete word space into RAM to speed up similarity computations. Make sure that you have enough free memory, since word spaces can be very large. Also, remember that loading a huge word space into RAM will take some time.
    This constructor reads the word space type from the file disco.config in the word space directory. If the file is not found in the word space directory, a FileNotFoundException is thrown. If the word space type cannot be determined (due to a corrupted config file), a CorruptConfigFileException is thrown.
    
    Parameters:
    
    idxName - the name of the word space directory
    
    loadIntoRAM - if true the word space is loaded into RAM
    
    Throws:
    
    java.io.IOException
    
    java.io.FileNotFoundException - if the file "disco.config" can not be found in the word space directory idxName.
    
    org.apache.lucene.index.CorruptIndexException
    
    CorruptConfigFileException - if the file "disco.config" is corrupt.
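
    The constructor fails when disco.config is missing from the word space directory. A minimal sketch of a pre-flight check that can run before a (possibly slow) load into RAM is shown here. The class WordSpacePreflight and its method requireConfig are hypothetical helpers and not part of DISCO; only the file name disco.config and the exception type are taken from the description above.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of DISCO: checks a word space directory before opening it. */
final class WordSpacePreflight {

    /** Name of the config file that the constructor reads, as documented above. */
    static final String CONFIG_FILE = "disco.config";

    /**
     * Returns the path of disco.config inside dir, or throws the exception
     * type that the constructor documents for a missing config file.
     */
    static Path requireConfig(Path dir) throws FileNotFoundException {
        Path config = dir.resolve(CONFIG_FILE);
        if (!Files.isRegularFile(config)) {
            throw new FileNotFoundException(CONFIG_FILE + " not found in " + dir);
        }
        return config;
    }
}
```

    This check only covers the missing-file case. A corrupt config file or a corrupt Lucene index can still only be detected by the constructor itself.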
- Method Detail
  - getWordspaceType
```
public DISCO.WordspaceType getWordspaceType()
```
    Returns the type of the word space instance.
    
    Specified by:
    
    getWordspaceType in class DISCO
    
    Returns:
    
    word space type
  - numberOfWords
```
public int numberOfWords()
```
    Returns the number of Documents (i.e. words) in the word space.
    
    Specified by:
    
    numberOfWords in class DISCO
    
    Returns:
    
    number of words in index
  - numberOfFeatureWords
```
public int numberOfFeatureWords()
```
    Description copied from class: DISCO
    
    For DISCOLuceneIndex this returns the number of words that were used as features. Note that this is only equal to the dimensionality of the word vectors if no positional or relational features were used. See options in disco.config under numberFeatureWords for more information.
    For DenseMatrix this is equal to the dimensionality of the word embedding (vector length).
    
    Specified by:
    
    numberOfFeatureWords in class DISCO
    
    Returns:
  - numberOfSimilarWords
```
public int numberOfSimilarWords()
```
    Description copied from class: DISCO
    
    Get the number of similar words that are stored in word spaces of type SIM for each word.
    
    Specified by:
    
    numberOfSimilarWords in class DISCO
    
    Returns:
    
    number of similar words that are stored in the word space. For word spaces of type COL this value is always 0.
  - frequency
```
public int frequency(java.lang.String word)
              throws java.io.IOException
```
    Looks up the input word in the word space and returns its frequency. If the word is not found the return value is zero.
    
    Specified by:
    
    frequency in class DISCO
    
    Parameters:
    
    word - word to be looked up (must be a single token).
    
    Returns:
    
    frequency of the input word in the text corpus from which the word space index was built
    
    Throws:
    
    java.io.IOException
  - similarWords
```
public ReturnDataBN similarWords(java.lang.String word)
                          throws java.io.IOException,
                                 WrongWordspaceTypeException
```
    Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values.
    If the search word isn't found in the word space, the return value is null.
    The similarity values in the result can differ from the values you get with DISCOLuceneIndex.semanticSimilarity for the same word pair. This is the case when another similarity measure was used in generating the word space. Consult the file disco.config in the word space directory to get the similarity measure that was used. If no measure is given there the default measure KOLB was used.
    Important note: This method only works with word spaces of type DISCOLuceneIndex.WordspaceType.SIM.
    
    Specified by:
    
    similarWords in class DISCO
    
    Parameters:
    
    word - word to be looked up (must be a single token).
    
    Returns:
    
    result data structure or null
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException - if the word space does not have the type DISCOLuceneIndex.WordspaceType.SIM.
  - semanticSimilarity
```
public float semanticSimilarity(java.lang.String w1,
                                java.lang.String w2,
                                VectorSimilarity vectorSimilarity)
                         throws java.io.IOException
```
    Computes the semantic similarity (according to the vector similarity measure vectorSimilarity) between the two input words based on their collocation sets (i.e. word vectors).
    Important: The measure SimilarityMeasure.KOLB should not be used with word spaces imported from word2vec!
    Note: To compute the similarity between multi-word expressions (e.g. "New York" or "nuclear power plant") use the methods in the class Compositionality.
    
    Specified by:
    
    semanticSimilarity in class DISCO
    
    Parameters:
    
    w1 - input word #1 (must be a single token).
    
    w2 - input word #2 (must be a single token).
    
    vectorSimilarity -
    
    Returns:
    
    The similarity between the two input words. Depending on the chosen similarity measure, this is a value between 0.0F and 1.0F (SimilarityMeasure.KOLB), or between -1.0F and 1.0F (SimilarityMeasure.COSINE). If either of the two words isn't found in the index, the return value is -2.0F. If the similarity measure is unknown, the return value is -3.0F.
    
    Throws:
    
    java.io.IOException
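
    Because errors are signalled through the special return values -2.0F and -3.0F rather than through exceptions, callers need to check the result before using it as a score. A minimal sketch of such a check follows. The class SimilarityResult and its method interpret are hypothetical and not part of DISCO; only the two sentinel values are taken from the description above, and turning the -3.0F case into an exception is a design choice of this sketch.

```java
import java.util.OptionalDouble;

/** Hypothetical helper, not part of DISCO: decodes the documented sentinel return values. */
final class SimilarityResult {

    static final float WORD_NOT_FOUND = -2.0F;  // documented: a word is missing from the index
    static final float UNKNOWN_MEASURE = -3.0F; // documented: the similarity measure is unknown

    /**
     * Maps a raw return value to an optional score.
     * Empty means that one of the words was not found. An unknown measure is
     * treated as a caller error and raised as an exception.
     */
    static OptionalDouble interpret(float raw) {
        if (raw == UNKNOWN_MEASURE) {
            throw new IllegalArgumentException("unknown similarity measure");
        }
        if (raw == WORD_NOT_FOUND) {
            return OptionalDouble.empty();
        }
        // Any other value is a real score, including negative cosine values.
        return OptionalDouble.of(raw);
    }
}
```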
  - secondOrderSimilarity
```
public float secondOrderSimilarity(java.lang.String w1,
                                   java.lang.String w2,
                                   VectorSimilarity vectorSimilarity)
                            throws java.io.IOException,
                                   WrongWordspaceTypeException
```
    Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words.
    Important note: This method only works with word spaces of type WordspaceType.SIM.
    
    Specified by:
    
    secondOrderSimilarity in class DISCO
    
    Parameters:
    
    w1 - input word #1 (must be a single token).
    
    w2 - input word #2 (must be a single token).
    
    vectorSimilarity -
    
    Returns:
    
    similarity value between 0.0F and 1.0F. If any of the two words isn't found in the index, the return value is -2.0F.
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException
  - getWordvector
```
public java.util.Map<java.lang.String,java.lang.Float> getWordvector(java.lang.String word)
                                                              throws java.io.IOException
```
    Returns the word vector representing the distribution of the input word in the corpus.
    The word vector can be used with the methods in the class Compositionality.
    
    Specified by:
    
    getWordvector in class DISCO
    
    Parameters:
    
    word - input word (must be a single token - to get a word vector for a phrase use Compositionality.composeWordVectors).
    
    Returns:
    
    HashMap containing the word vector or null if word is not found. The features of the word vector are the keys of the resulting HashMap, the values are the significance values of the word vector. For more information on the values consult the documentation of the method searchIndex() (field kol).
    
    Throws:
    
    java.io.IOException
    
    See Also:
    
    Compositionality
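
    The returned map is a sparse vector: feature strings as keys, significance values as values. As an illustration of working with that shape, here is a generic cosine similarity over two such maps. This is a textbook cosine and not necessarily how DISCO computes it; the class name SparseCosine is hypothetical, and the sketch assumes both maps are non-null (getWordvector returns null for unknown words, so check first).

```java
import java.util.Map;

/** Hypothetical helper, not part of DISCO: cosine similarity of two sparse feature maps. */
final class SparseCosine {

    /** Returns the cosine of the angle between a and b, or 0.0 if either vector has zero length. */
    static double cosine(Map<String, Float> a, Map<String, Float> b) {
        // Iterate over the smaller map and look features up in the larger one.
        Map<String, Float> small = a.size() <= b.size() ? a : b;
        Map<String, Float> large = small == a ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Float> e : small.entrySet()) {
            Float other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norm = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    private static double sumOfSquares(Map<String, Float> v) {
        double s = 0.0;
        for (float x : v.values()) {
            s += (double) x * x;
        }
        return s;
    }
}
```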
  - getSecondOrderWordvector
```
public java.util.Map<java.lang.String,java.lang.Float> getSecondOrderWordvector(java.lang.String word)
                                                                         throws WrongWordspaceTypeException,
                                                                                java.io.IOException
```
    Description copied from class: DISCO
    
    The second order word vector contains the nBest most similar words for word as features (instead of the directly co-occurring words that you get with getWordvector).
    
    Specified by:
    
    getSecondOrderWordvector in class DISCO
    
    Returns:
    
    Throws:
    
    WrongWordspaceTypeException - when used with word space that is not of type SIM.
    
    java.io.IOException
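
    To make the idea of a second order vector concrete, the following sketch builds one from a ranked list of similar words. It is an illustration only: it assumes that the feature values are the similarity values of the similar words, which the description above does not state, and the class SecondOrderVector, its method, and the nBest parameter are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of DISCO: keeps the nBest top-ranked similar words as features. */
final class SecondOrderVector {

    /**
     * words and values are parallel arrays, ordered by decreasing similarity.
     * Assumption of this sketch: the similarity value becomes the feature value.
     */
    static Map<String, Float> fromSimilarWords(String[] words, float[] values, int nBest) {
        if (words.length != values.length) {
            throw new IllegalArgumentException("words and values must have the same length");
        }
        int n = Math.min(nBest, words.length);
        Map<String, Float> vector = new LinkedHashMap<>(); // keeps the ranking order
        for (int i = 0; i < n; i++) {
            vector.put(words[i], values[i]);
        }
        return vector;
    }
}
```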
  - wordFrequencyList
```
public int wordFrequencyList(java.lang.String outputFileName)
```
    Runs through all documents (i.e. queryable words) in the index and retrieves each word and its frequency. Both are written to the text file named outputFileName. Note that the output is not sorted.
    This method can be used to check index integrity. If an error occurs while querying a word, a warning is written to standard output.
    
    Specified by:
    
    wordFrequencyList in class DISCO
    
    Parameters:
    
    outputFileName - name of the output file.
    
    Returns:
    
    number of words written to the output file. In case of success the value is equal to the number of words in the index.
  - getStopwords
```
public java.lang.String[] getStopwords()
                                throws java.io.FileNotFoundException,
                                       java.io.IOException,
                                       CorruptConfigFileException
```
    Get the stopwords for this word space instance.
    
    Specified by:
    
    getStopwords in class DISCO
    
    Returns:
    
    Array with stopwords
    
    Throws:
    
    java.io.FileNotFoundException
    
    java.io.IOException
    
    CorruptConfigFileException
  - getTokenCount
```
public long getTokenCount()
```
    Description copied from class: DISCO
    
    Size of the underlying corpus.
    
    Specified by:
    
    getTokenCount in class DISCO
    
    Returns:
  - getMinFreq
```
public int getMinFreq()
```
    Description copied from class: DISCO
    
    Get minimum frequency of tokens in corpus.
    
    Specified by:
    
    getMinFreq in class DISCO
    
    Returns:
  - getMaxFreq
```
public int getMaxFreq()
```
    Description copied from class: DISCO
    
    Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
    
    Specified by:
    
    getMaxFreq in class DISCO
    
    Returns:
  - getVocabularyIterator
```
public java.util.Iterator<java.lang.String> getVocabularyIterator()
                                                           throws java.io.IOException
```
    Description copied from class: DISCO
    
    Returns an iterator that iterates over all words in the word space (the vocabulary). There is no special ordering of the words. The method remove is not supported.
    
    Specified by:
    
    getVocabularyIterator in class DISCO
    
    Returns:
    
    Throws:
    
    java.io.IOException
  - getWord
```
public java.lang.String getWord(int id)
                         throws java.io.IOException
```
    Description copied from class: DISCO
    
    Returns the id-th word in the vocabulary.
    
    Specified by:
    
    getWord in class DISCO
    
    Parameters:
    
    id - id has to be between 0 and DISCO.numberOfWords() - 1.
    
    Returns:
    
    word with given id or null if id is not in the range 0..DISCO.numberOfWords() - 1.
    
    Throws:
    
    java.io.IOException
  - searchIndex
```
public org.apache.lucene.document.Document searchIndex(java.lang.String word)
                                                throws java.io.IOException
```
    Searches for an input word in index field word and returns the first hit Document or null.
    DISCOLuceneIndex uses the Lucene index. A word's data are stored in the index in an object of type Document. A Document has the following 6 fields:
    - word: contains a word, tokenized with WhitespaceAnalyzer. This is the only searchable field.
    - freq: the corpus frequency of the word. This field is only stored, but not indexed.
    - dsb: the distributionally similar words for the input word. They are stored in a single string, in which the words are separated by spaces. This field is not indexed, and therefore not searchable. The words are sorted by their similarity value, highest value first.
      For word spaces of type WordspaceType.COL, this field is empty!
    - dsbSim: contains a single string with the similarity values for the words in the field dsb, separated by spaces. The string in this field is parallel to the string in the field dsb, i.e., the n-th token of the string in dsbSim corresponds to the n-th token in dsb.
      Example: field dsb contains the string "apple banana cherry", field dsbSim contains the string "0.3241 0.1233 0.0788". This means that the similarity between the word in the field word and "cherry" is 0.0788.
      For word spaces of type WordspaceType.COL, this field is empty!
    - kol: contains the features from the input word's sparse word vector. "Sparse" means that only those features are stored that have a value greater than or equal to the threshold that was set in minWeight in the disco.config file.
      Features can take one of three forms:
      - featureWord: the feature is a plain word.
      - featureWord<SEP>relation: the feature is composed of a word and a specific relation between the inputWord and the featureWord. The relation can be a window position or a syntactic dependency relation. featureWord and relation are separated by the character DISCO.RELATION_SEPARATOR.
      - ID: the feature is a number. This is the case for word spaces that have been imported from other tools like word2vec. Word spaces of type word x document also have IDs as features.
      
      The features in the field kol are separated by a space.
    - kolSig: contains the significance values for kol, in a string parallel to the string in kol.
    
    Parameters:
    
    word - input word to be looked up in index (must be a single token).
    
    Returns:
    
    index entry of input word or null if the input word can not be found in the index.
    
    Throws:
    
    java.io.IOException
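
    The field pairs dsb/dsbSim and kol/kolSig are parallel, space-separated strings. A minimal sketch of pairing them up once the two strings have been read from a Document is given below. The class ParallelFields and its method zip are hypothetical and not part of DISCO; reading the strings from the Lucene Document is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of DISCO: pairs up two parallel, space-separated strings. */
final class ParallelFields {

    /**
     * Maps the n-th token of tokens to the n-th number in values,
     * preserving the stored order (highest similarity first for dsb).
     */
    static Map<String, Float> zip(String tokens, String values) {
        Map<String, Float> result = new LinkedHashMap<>();
        if (tokens.trim().isEmpty()) {
            return result; // e.g. dsb is empty for word spaces of type COL
        }
        String[] t = tokens.trim().split("\\s+");
        String[] v = values.trim().split("\\s+");
        if (t.length != v.length) {
            throw new IllegalArgumentException("fields are not parallel: "
                    + t.length + " tokens vs " + v.length + " values");
        }
        for (int i = 0; i < t.length; i++) {
            result.put(t[i], Float.parseFloat(v[i]));
        }
        return result;
    }
}
```

    With the example strings above, zip("apple banana cherry", "0.3241 0.1233 0.0788") maps "cherry" to 0.0788.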
  - collocations
```
public ReturnDataCol[] collocations(java.lang.String word)
                             throws java.io.IOException
```
    Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). If the search word is not found in the index, the return value is null.
    The collocations are derived from the word's features. Because a feature can be either a plain word or a word plus its relation to the input word, the relation is stripped from each feature and the significance values of identical words are summed up. (To get the full features instead of only the words, use the method getWordvector().)
    Features can also be IDs (for word spaces with document features or imported from other tools). In this case, the "collocations" will be a list of IDs.
    The significance measure that was used in word space construction by DISCOBuilder is stored in the file disco.config in the word space directory (look at the line weightingMethod). For more information on available significance measures consult DISCOBuilder's documentation.
    
    Specified by:
    
    collocations in class DISCO
    
    Parameters:
    
    word - the input word (must be a single token).
    
    Returns:
    
    the list of collocations with their significance values or null. The relation fields of the array elements are not set.
    
    Throws:
    
    java.io.IOException
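
    The aggregation described above (strip the relation, sum the values of identical words, sort by significance) can be sketched as follows. This is an illustration under assumptions and not DISCO's implementation: the class CollocationSketch is hypothetical, the input is a feature map as returned by getWordvector, and the separator character is passed in as a parameter because its actual value is not given here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of DISCO: collapses relational features into plain words. */
final class CollocationSketch {

    /**
     * Strips everything from the first occurrence of sep onward, sums the values of
     * identical words, and returns the entries ordered by decreasing summed value.
     */
    static List<Map.Entry<String, Float>> collocations(Map<String, Float> features, char sep) {
        Map<String, Float> sums = new HashMap<>();
        for (Map.Entry<String, Float> e : features.entrySet()) {
            String feature = e.getKey();
            int cut = feature.indexOf(sep);
            String word = cut < 0 ? feature : feature.substring(0, cut);
            sums.merge(word, e.getValue(), Float::sum);
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(sums.entrySet());
        ranked.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return ranked;
    }
}
```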
  - collocationalValue
```
public float collocationalValue(java.lang.String w1,
                                java.lang.String w2)
                         throws java.io.IOException
```
    Returns the collocational strength between words w1 and w2, summed up over all relations.
    
    Parameters:
    
    w1 - input word #1 (must be a single token).
    
    w2 - input word #2 (must be a single token).
    
    Returns:
    
    the greater of two sums: the sum of the significance values between word w1 and all its features that have w2 as their word part (ignoring the relation, if any), and the same sum for w2 with w1 as feature. If w1 or w2 is not found, the return value is 0.
    
    Throws:
    
    java.io.IOException
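
    The return value described above can be sketched over two feature maps as follows. This is an illustration and not DISCO's implementation: the class CollocationalValueSketch is hypothetical, the maps stand in for the feature vectors of w1 and w2 (null meaning "word not found"), and the separator character is a parameter because its actual value is not given here.

```java
import java.util.Map;

/** Hypothetical helper, not part of DISCO: collocational strength over two feature maps. */
final class CollocationalValueSketch {

    /**
     * Sums w1's features whose word part is w2, and w2's features whose word
     * part is w1, and returns the greater sum. Returns 0 if a map is null.
     */
    static float strength(Map<String, Float> featuresOfW1, String w1,
                          Map<String, Float> featuresOfW2, String w2, char sep) {
        if (featuresOfW1 == null || featuresOfW2 == null) {
            return 0.0F;
        }
        return Math.max(sumFor(featuresOfW1, w2, sep), sumFor(featuresOfW2, w1, sep));
    }

    /** Sums the values of all features whose word part (before sep) equals target. */
    private static float sumFor(Map<String, Float> features, String target, char sep) {
        float sum = 0.0F;
        for (Map.Entry<String, Float> e : features.entrySet()) {
            String feature = e.getKey();
            int cut = feature.indexOf(sep);
            String wordPart = cut < 0 ? feature : feature.substring(0, cut);
            if (wordPart.equals(target)) {
                sum += e.getValue();
            }
        }
        return sum;
    }
}
```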
