de.linguatools.disco
Class DISCO

java.lang.Object
  extended by de.linguatools.disco.DISCO

public class DISCO
extends java.lang.Object

DISCO (Extracting DIStributionally Similar Words Using CO-occurrences) provides methods for computing the distributional (i.e. semantic) similarity between arbitrary words and for retrieving a word's collocations or its corpus frequency. It also provides a method for retrieving the words that are semantically most similar to a given word.


Constructor Summary
DISCO()
          Deprecated.  
DISCO(java.lang.String idxName, boolean loadIntoRAM)
          As of DISCO version 1.2, a complete word space can be loaded into RAM to speed up similarity computations.
 
Method Summary
 ReturnDataCol[] collocations(java.lang.String word)
          Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first).
 ReturnDataCol[] collocations(java.lang.String idxName, java.lang.String word)
          Deprecated.  
 void destroy()
          This method closes the RAMDirectory where the word space is stored and sets all internal variables of the DISCO instance to null.
 float firstOrderSimilarity(java.lang.String w1, java.lang.String w2)
          Computes the first order similarity (according to Lin's vector similarity measure) between the input words based on their collocation sets.
 float firstOrderSimilarity(java.lang.String idxName, java.lang.String w1, java.lang.String w2)
          Deprecated.  
 int frequency(java.lang.String word)
          Looks up the input word in the index and returns its frequency.
 int frequency(java.lang.String idxName, java.lang.String word)
          Deprecated.  
 int numberOfWords()
          Returns the number of Documents (i.e. words) in the index.
 int numberOfWords(java.lang.String idxName)
          Deprecated.  
 org.apache.lucene.document.Document searchIndex(java.lang.String word)
          Searches for a word in index field "word" and returns the first hit Document or null.
 org.apache.lucene.document.Document searchIndex(java.lang.String idxName, java.lang.String word)
          Deprecated.  
 float secondOrderSimilarity(java.lang.String w1, java.lang.String w2)
          Computes the second order similarity (according to Lin's measure) between the input words based on the sets of their distributionally similar words.
 float secondOrderSimilarity(java.lang.String idxName, java.lang.String w1, java.lang.String w2)
          Deprecated.  
 ReturnDataBN similarWords(java.lang.String word)
          Looks up the input word in the index and returns its distributionally similar words ordered by decreasing similarity together with similarity values.
 ReturnDataBN similarWords(java.lang.String idxName, java.lang.String word)
          Deprecated.  
 ReturnDataCol[] wordvector(java.lang.String word)
          Returns the collocations with their exact positions and their significance values -- in other words the word vector representing the input word.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DISCO

public DISCO(java.lang.String idxName,
             boolean loadIntoRAM)
      throws java.io.IOException
As of DISCO version 1.2, a complete word space can be loaded into RAM to speed up similarity computations. Make sure that you have enough free memory, since word spaces can be very large. Also remember that loading a huge word space into RAM takes some time.

Parameters:
idxName - the word space directory
loadIntoRAM - if true the word space is loaded into RAM
Throws:
java.io.IOException
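
The advice about free memory can be turned into a rough pre-flight check before choosing the loadIntoRAM flag. The following is a minimal sketch, not part of DISCO: the class WordSpaceMemoryCheck, its two methods, and the headroom factor of 1.5 are all assumptions made for illustration, and the on-disk size of the word space directory is only a crude proxy for the heap it will occupy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of the DISCO API.
final class WordSpaceMemoryCheck {

    /** Sums the sizes (in bytes) of all regular files below the given directory. */
    static long directorySize(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    /**
     * Returns true if bytes * headroomFactor fits into the heap that the JVM
     * can still claim (maximum heap minus the part currently in use).
     */
    static boolean fitsInHeap(long bytes, double headroomFactor) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long available = rt.maxMemory() - used;
        return bytes * headroomFactor <= available;
    }

    public static void main(String[] args) throws IOException {
        // The word space directory is taken from the command line (assumed layout).
        Path idx = Paths.get(args.length > 0 ? args[0] : ".");
        // 1.5 is an arbitrary safety margin chosen for this sketch.
        boolean loadIntoRAM = fitsInHeap(directorySize(idx), 1.5);
        System.out.println("loadIntoRAM = " + loadIntoRAM);
        // The flag would then be passed on: new DISCO(idx.toString(), loadIntoRAM)
    }
}
```

If the check fails, passing false for loadIntoRAM keeps the word space on disk, at the cost of slower similarity computations.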

DISCO

public DISCO()
Deprecated. 

Constructor provided for compatibility with DISCO version 1.1.

Method Detail

searchIndex

public org.apache.lucene.document.Document searchIndex(java.lang.String word)
                                                throws java.io.IOException
Searches for a word in index field "word" and returns the first hit Document or null.
DISCO uses the Lucene index. A word's data are stored in the index in an object of type Document. A Document has the following 16 fields:

Parameters:
word - word to be looked up in index
Returns:
index entry of word or null
Throws:
java.io.IOException

searchIndex

public org.apache.lucene.document.Document searchIndex(java.lang.String idxName,
                                                       java.lang.String word)
                                                throws org.apache.lucene.index.CorruptIndexException,
                                                       java.io.IOException
Deprecated. 

Interface provided for compatibility with DISCO version 1.1.

Parameters:
idxName - name of index directory
word - word to be looked up in index
Returns:
index entry of word or null
Throws:
org.apache.lucene.index.CorruptIndexException
java.io.IOException

numberOfWords

public int numberOfWords()
                  throws java.io.IOException
Returns the number of Documents (i.e. words) in the index.

Returns:
number of words in index
Throws:
java.io.IOException

numberOfWords

public int numberOfWords(java.lang.String idxName)
                  throws org.apache.lucene.index.CorruptIndexException,
                         java.io.IOException
Deprecated. 

Interface provided for compatibility with DISCO version 1.1.

Parameters:
idxName - name of index directory
Returns:
number of words in index
Throws:
org.apache.lucene.index.CorruptIndexException
java.io.IOException

frequency

public int frequency(java.lang.String word)
              throws java.io.IOException
Looks up the input word in the index and returns its frequency. If the word is not found the return value is zero.

Parameters:
word - word to be looked up
Returns:
frequency of the input word (0 if word is not found)
Throws:
java.io.IOException

frequency

public int frequency(java.lang.String idxName,
                     java.lang.String word)
              throws org.apache.lucene.index.CorruptIndexException,
                     java.io.IOException
Deprecated. 

Interface provided for compatibility with DISCO version 1.1.

Parameters:
idxName - name of index directory
word - word to be looked up
Returns:
frequency of the input word (0 if word is not found)
Throws:
org.apache.lucene.index.CorruptIndexException
java.io.IOException

similarWords

public ReturnDataBN similarWords(java.lang.String word)
                          throws java.io.IOException
Looks up the input word in the index and returns its distributionally similar words ordered by decreasing similarity together with similarity values. If the search word isn't found in the index, the return value is null.
Lin's similarity measure was used to compute the similar words.

Parameters:
word - word to be looked up
Returns:
result data structure or null
Throws:
java.io.IOException

similarWords

public ReturnDataBN similarWords(java.lang.String idxName,
                                 java.lang.String word)
                          throws java.io.IOException
Deprecated. 

Interface provided for compatibility with DISCO version 1.1.

Parameters:
idxName - name of index directory
word - word to be looked up
Returns:
result data structure or null
Throws:
java.io.IOException

collocations

public ReturnDataCol[] collocations(java.lang.String word)
                             throws java.io.IOException
Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first).
Important note: The collocations are summed up over their different positions, and the variable relation in the returned data structure is not set.
If the search word isn't found in the index, the return value is null.
The significance values were computed using Lin's measure.

Parameters:
word - the input word
Returns:
the list of collocations with their significance values
Throws:
java.io.IOException
See Also:
wordvector(java.lang.String)
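
The relationship between wordvector and collocations described above (position-specific entries collapsed into one entry per collocate, then ordered by significance) can be sketched as follows. This is an illustration only: PositionedCollocation is a hypothetical stand-in and not DISCO's ReturnDataCol, and the assumption that the significance values themselves are added up is inferred from the phrase "summed up over their different positions".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of the DISCO API.
final class CollocationSumSketch {

    /** Stand-in for one word-vector entry: collocate, relative position, significance. */
    static final class PositionedCollocation {
        final String word;
        final int relation;   // position relative to the input word (assumed encoding)
        final float value;    // significance value

        PositionedCollocation(String word, int relation, float value) {
            this.word = word;
            this.relation = relation;
            this.value = value;
        }
    }

    /** Sums significance values per collocate and sorts by descending significance. */
    static List<Map.Entry<String, Float>> sumOverPositions(List<PositionedCollocation> vector) {
        Map<String, Float> sums = new LinkedHashMap<>();
        for (PositionedCollocation c : vector) {
            // The position is deliberately dropped here, which is why the
            // summed result carries no relation information.
            sums.merge(c.word, c.value, Float::sum);
        }
        List<Map.Entry<String, Float>> result = new ArrayList<>(sums.entrySet());
        result.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        List<PositionedCollocation> vector = new ArrayList<>();
        vector.add(new PositionedCollocation("hot", -1, 0.5f));
        vector.add(new PositionedCollocation("cup", -1, 0.5f));
        vector.add(new PositionedCollocation("cup", 1, 0.25f));
        for (Map.Entry<String, Float> e : sumOverPositions(vector)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Callers that need the position of each collocate should use wordvector instead, since the summed form cannot recover it.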

collocations

public ReturnDataCol[] collocations(java.lang.String idxName,
                                    java.lang.String word)
                             throws java.io.IOException
Deprecated. 

Interface provided for compatibility with DISCO version 1.1.

Parameters:
idxName - name of index directory
word - input word
Returns:
the list of collocations with their significance values
Throws:
java.io.IOException

wordvector

public ReturnDataCol[] wordvector(java.lang.String word)
                           throws java.io.IOException
Returns the collocations with their exact positions and their significance values -- in other words the word vector representing the input word.

Parameters:
word - input word
Returns:
data structure containing word vector or null
Throws:
java.io.IOException

firstOrderSimilarity

public float firstOrderSimilarity(java.lang.String w1,
                                  java.lang.String w2)
                           throws java.io.IOException
Computes the first order similarity (according to Lin's vector similarity measure) between the input words based on their collocation sets. If either of the two words is not found in the index, the return value is -1.

Parameters:
w1 - input word #1
w2 - input word #2
Returns:
similarity value between 0 and 1, or -1 if a word is not found
Throws:
java.io.IOException
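
To make the first order measure more concrete, the following sketch computes a Lin-style similarity over two collocation sets. It assumes the formulation from Lin (1998), in which the significance values of shared features are summed and divided by the sum of all feature values of both words; the class LinSimilaritySketch, the map-based representation, and the feature names are hypothetical, and DISCO's internal implementation may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a Lin-style measure, not DISCO's own code.
final class LinSimilaritySketch {

    /**
     * Each map goes from a feature (e.g. a collocate, possibly tagged with its
     * position) to its significance value. A null map stands for a word that
     * was not found and yields -1, mirroring the documented return convention.
     */
    static float lin(Map<String, Float> v1, Map<String, Float> v2) {
        if (v1 == null || v2 == null) {
            return -1f;
        }
        double shared = 0.0;
        double total = 0.0;
        for (Map.Entry<String, Float> e : v1.entrySet()) {
            total += e.getValue();
            Float other = v2.get(e.getKey());
            if (other != null) {
                shared += e.getValue() + other; // feature occurs with both words
            }
        }
        for (float value : v2.values()) {
            total += value;
        }
        return total == 0.0 ? 0f : (float) (shared / total);
    }

    public static void main(String[] args) {
        Map<String, Float> coffee = new HashMap<>();
        coffee.put("drink", 2f);
        coffee.put("cup", 1f);
        coffee.put("bean", 1f);
        Map<String, Float> tea = new HashMap<>();
        tea.put("drink", 2f);
        tea.put("cup", 1f);
        tea.put("leaf", 1f);
        // shared = (2+2) + (1+1) = 6, total = 4 + 4 = 8
        System.out.println(lin(coffee, tea)); // prints 0.75
    }
}
```

With non-negative significance values the result stays between 0 and 1, and a word compared with itself scores 1.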

firstOrderSimilarity

public float firstOrderSimilarity(java.lang.String idxName,
                                  java.lang.String w1,
                                  java.lang.String w2)
                           throws java.io.IOException
Deprecated. 

Interface provided for compatibility with DISCO version 1.1.

Parameters:
idxName - name of index directory
w1 - input word #1
w2 - input word #2
Returns:
similarity value between 0 and 1, or -1 if a word is not found
Throws:
java.io.IOException

secondOrderSimilarity

public float secondOrderSimilarity(java.lang.String w1,
                                   java.lang.String w2)
                            throws java.io.IOException
Computes the second order similarity (according to Lin's measure) between the input words based on the sets of their distributionally similar words. If either of the two words is not found in the index, the return value is -1.

Parameters:
w1 - input word #1
w2 - input word #2
Returns:
similarity value, or -1 if a word is not found
Throws:
java.io.IOException
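
Because both similarity methods signal a missing word with the value -1 rather than with an exception, callers may want to separate that sentinel from real scores before averaging or ranking results. A minimal sketch of one way to do so is shown below; SimilaritySentinel and toOptional are hypothetical names and not part of DISCO.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not part of the DISCO API.
final class SimilaritySentinel {

    /** Maps the documented "word not found" value -1 to an empty result. */
    static OptionalDouble toOptional(float rawSimilarity) {
        return rawSimilarity == -1f
                ? OptionalDouble.empty()
                : OptionalDouble.of(rawSimilarity);
    }

    public static void main(String[] args) {
        // In real use the raw value would come from a call such as
        // disco.secondOrderSimilarity(w1, w2); fixed values are used here.
        float[] rawValues = {0.25f, -1f};
        for (float raw : rawValues) {
            OptionalDouble sim = toOptional(raw);
            System.out.println(sim.isPresent()
                    ? "similarity = " + sim.getAsDouble()
                    : "at least one word is not in the index");
        }
    }
}
```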

secondOrderSimilarity

public float secondOrderSimilarity(java.lang.String idxName,
                                   java.lang.String w1,
                                   java.lang.String w2)
                            throws java.io.IOException
Deprecated. 

Interface provided for compatibility with DISCO version 1.1.

Parameters:
idxName - name of index directory
w1 - input word #1
w2 - input word #2
Returns:
similarity value
Throws:
java.io.IOException

destroy

public void destroy()
This method closes the RAMDirectory where the word space is stored and sets all internal variables of the DISCO instance to null. Its sole purpose is to release the memory associated with a word space loaded into RAM. Subsequent calls on the DISCO instance will throw NullPointerExceptions! In most cases a program does not need to call this method or destroy a DISCO instance after using it.