public class DISCO
extends java.lang.Object
DISCO.WordspaceType.COL: this type stores a word vector for each word. A word vector is the list of the significant co-occurrences of the word together with the type of co-occurrence (if any) and a significance value. The significant co-occurring words of a word are also called its collocations. The type of co-occurrence can be a relative position in a context window, or a syntactic relation.
DISCO.WordspaceType.SIM: this type stores the above word vectors, but also contains pre-computed lists of the most similar words for each word. These words can be queried using the method DISCO.similarWords().
There are several methods in the DISCO API that only work with word spaces of type DISCO.WordspaceType.SIM.
Modifier and Type | Class and Description |
---|---|
static class | DISCO.SimilarityMeasure: Available measures for vector comparison. |
static class | DISCO.WordspaceType: Available word space types (SIM = word space contains lists of pre-computed similar words for each word, COL = word space contains only word vectors). |
Modifier and Type | Field and Description |
---|---|
java.lang.String | indexDir: Name of the word space directory. |
org.apache.lucene.store.RAMDirectory | indexRAM: The word space loaded into RAM. |
static java.lang.String | relationSeparator: This string is used as separator between a feature word and its relation. |
DISCO.WordspaceType | wordspaceType: Type of this word space. |
Constructor and Description |
---|
DISCO(java.lang.String idxName, boolean loadIntoRAM): DISCO version 2.0 allows loading a complete word space into RAM to speed up similarity computations. |
Modifier and Type | Method and Description |
---|---|
float | collocationalValue(java.lang.String w1, java.lang.String w2): Returns the collocational strength between words w1 and w2, summed up over all relations. |
ReturnDataCol[] | collocations(java.lang.String word): Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). |
int | frequency(java.lang.String word): Looks up the input word in the word space and returns its frequency. |
static DISCO.SimilarityMeasure | getSimilarityMeasure(java.lang.String simMeasure): Get SimilarityMeasure from string. |
java.lang.String[] | getStopwords(): Get the stopwords for this word space instance. |
DISCO.WordspaceType | getWordspaceType(): Returns the type of the word space instance. |
java.util.HashMap<java.lang.String,java.lang.Float> | getWordvector(java.lang.String word): Returns the word vector representing the distribution of the input word in the corpus. The word vector can be used with the methods in the class Compositionality. |
int | numberOfWords(): Returns the number of Documents (i.e. words) in the word space. |
org.apache.lucene.document.Document | searchIndex(java.lang.String word): Searches for an input word in index field word and returns the first hit Document or null. DISCO uses the Lucene index. |
float | secondOrderSimilarity(java.lang.String w1, java.lang.String w2): Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words. Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM. |
float | semanticSimilarity(java.lang.String w1, java.lang.String w2): Computes the semantic similarity (according to the vector similarity measure SimilarityMeasures.KOLB which is described in Kolb 2009) between the input words based on their collocation sets (i.e. word vectors). |
float | semanticSimilarity(java.lang.String w1, java.lang.String w2, DISCO.SimilarityMeasure similarityMeasure): Computes the semantic similarity (according to the vector similarity measure similarityMeasure) between the two input words based on their collocation sets (i.e. word vectors). |
ReturnDataBN | similarWords(java.lang.String word): Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values. If the search word isn't found in the word space, the return value is null. Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM. |
int | wordFrequencyList(java.lang.String outputFileName): Run through all documents (i.e. |
public java.lang.String indexDir
public org.apache.lucene.store.RAMDirectory indexRAM
public DISCO.WordspaceType wordspaceType
public static final java.lang.String relationSeparator
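
Since relationSeparator separates a feature word from its relation, client code may want to split a feature back into these two parts. The following is a minimal sketch of that idea, not part of the DISCO API: the class FeatureSplitter and its nested Feature type are hypothetical, and the separator is passed in as a parameter because the actual value of DISCO.relationSeparator is not stated here.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the DISCO API. */
final class FeatureSplitter {

    /** A feature word plus its relation, if the feature carries one. */
    static final class Feature {
        final String word;
        final Optional<String> relation;

        Feature(String word, Optional<String> relation) {
            this.word = word;
            this.relation = relation;
        }
    }

    /**
     * Splits a feature at the first occurrence of the separator.
     * The separator is a parameter because the real value of
     * DISCO.relationSeparator is not assumed here.
     */
    static Feature split(String feature, String separator) {
        if (separator == null || separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        int pos = feature.indexOf(separator);
        if (pos < 0) {
            // No separator present: the feature has no relation part.
            return new Feature(feature, Optional.empty());
        }
        String word = feature.substring(0, pos);
        String relation = feature.substring(pos + separator.length());
        return new Feature(word, Optional.of(relation));
    }
}
```

With a real word space, the second argument would presumably be DISCO.relationSeparator itself, for example FeatureSplitter.split(feature, DISCO.relationSeparator).
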
public DISCO(java.lang.String idxName, boolean loadIntoRAM) throws java.io.FileNotFoundException, org.apache.lucene.index.CorruptIndexException, java.io.IOException, CorruptConfigFileException
DISCO version 2.0 allows loading a complete word space into RAM to speed up similarity computations. The constructor reads the file disco.config in the word space directory. If the file is not found in the word space directory, a FileNotFoundException is thrown. If the word space type cannot be determined (due to a corrupted config file), a CorruptConfigFileException is thrown.
Parameters:
idxName - the name of the word space directory
loadIntoRAM - if true the word space is loaded into RAM
Throws:
java.io.IOException
java.io.FileNotFoundException - if the file "disco.config" cannot be found in the word space directory idxName.
org.apache.lucene.index.CorruptIndexException
CorruptConfigFileException - if the file "disco.config" is corrupt.
public static DISCO.SimilarityMeasure getSimilarityMeasure(java.lang.String simMeasure)
Get SimilarityMeasure from string.
Parameters:
simMeasure -
public DISCO.WordspaceType getWordspaceType()
Returns the type of the word space instance.
public org.apache.lucene.document.Document searchIndex(java.lang.String word) throws java.io.IOException
Searches for an input word in index field word and returns the first hit Document or null.
DISCO uses the Lucene index, in which each word is stored as a Document. A Document has the following 6 fields:
word: contains a word, tokenized with WhitespaceAnalyzer. This is the only searchable field.
freq: the corpus frequency of the word. This field is only stored, but not indexed.
dsb: the distributionally similar words for the input word. They are stored in a single string, in which the words are separated by spaces. This field is not indexed, and therefore not searchable. The words are sorted by their similarity value, highest value first. In word spaces of type WordspaceType.COL, this field is empty!
dsbSim: contains a single string with the similarity values for the words in the field dsb, separated by spaces. The string in this field is parallel to the string in the field dsb, i.e., the n-th token of the string in dsbSim corresponds to the n-th token in dsb. Example: if field dsb contains the string "apple banana cherry", field dsbSim contains the string "0.3241 0.1233 0.0788". This means that the similarity between the word in the field word and "cherry" is 0.0788. In word spaces of type WordspaceType.COL, this field is empty!
kol: contains the features from the input word's sparse word vector. "Sparse" means that only those features are stored that have a value greater than or equal to the threshold that was set in minWeight in the disco.config file. A feature has one of the following forms:
  featureWord: the feature is a plain word.
  featureWord<SEP>relation: the feature is composed of a word and a specific relation between the inputWord and the featureWord. The relation can be a window position or a syntactic dependency relation. featureWord and relation are separated by the character DISCO.relationSeparator.
  ID: the feature is a number. This is the case for word spaces that have been imported from other tools like word2vec. Word spaces of type word x document also have IDs as features.
  The features in kol are separated by a space.
kolSig: contains the significance values for kol, in a string parallel to the string in kol.
Parameters:
word - input word to be looked up in index (must be a single token).
Returns:
the first hit Document, or null if the input word cannot be found in the index.
Throws:
java.io.IOException
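
The parallel layout of dsb and dsbSim (and likewise kol and kolSig) can be illustrated with a small sketch that pairs the n-th token of one string with the n-th token of the other. The class ParallelFields and its method pair are hypothetical helpers, not part of the DISCO API; the input strings are assumed to have been read from a Document beforehand, for example with Lucene's Document.get(String).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the DISCO API. */
final class ParallelFields {

    /**
     * Pairs the n-th whitespace-separated token of keys with the n-th
     * token of values. A LinkedHashMap keeps the stored order, so words
     * sorted by similarity stay sorted.
     */
    static Map<String, Float> pair(String keys, String values) {
        Map<String, Float> result = new LinkedHashMap<>();
        if (keys == null || values == null || keys.trim().isEmpty()) {
            // Empty fields (e.g. dsb in a COL word space) yield an empty map.
            return result;
        }
        String[] k = keys.trim().split("\\s+");
        String[] v = values.trim().split("\\s+");
        if (k.length != v.length) {
            throw new IllegalArgumentException(
                "fields are not parallel: " + k.length + " vs " + v.length + " tokens");
        }
        for (int i = 0; i < k.length; i++) {
            result.put(k[i], Float.parseFloat(v[i]));
        }
        return result;
    }
}
```

Called with the example strings above, ParallelFields.pair("apple banana cherry", "0.3241 0.1233 0.0788") maps "cherry" to the value parsed from "0.0788".
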
public int numberOfWords() throws java.io.IOException
Returns the number of Documents (i.e. words) in the word space.
Throws:
java.io.IOException
public int frequency(java.lang.String word) throws java.io.IOException
Looks up the input word in the word space and returns its frequency.
Parameters:
word - word to be looked up (must be a single token).
Throws:
java.io.IOException
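public ReturnDataBN similarWords(java.lang.String word) throws java.io.IOException, WrongWordspaceTypeException
Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values. If the search word isn't found in the word space, the return value is null.
Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM.
Parameters:
word - word to be looked up (must be a single token).
Returns:
null if the word is not found in the word space.
Throws:
java.io.IOException
WrongWordspaceTypeException - if the word space does not have the type DISCO.WordspaceType.SIM.
public ReturnDataCol[] collocations(java.lang.String word) throws java.io.IOException
Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). If the word is not found, the return value is null. (See also getWordvector().)
The significance measure in use is given in the file disco.config in the word space directory (look at the line weightingMethod). For more information on available significance measures consult DISCOBuilder's documentation.
Parameters:
word - the input word (must be a single token).
Returns:
the collocations, or null. The relation fields of the array elements are not set.
Throws:
java.io.IOException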
public float semanticSimilarity(java.lang.String w1, java.lang.String w2) throws java.io.IOException
Computes the semantic similarity (according to the vector similarity measure SimilarityMeasures.KOLB which is described in Kolb 2009) between the input words based on their collocation sets (i.e. word vectors). If either of the two words isn't found in the index, the return value is -2.
See also the class Compositionality.
Parameters:
w1 - input word #1 (must be a single token).
w2 - input word #2 (must be a single token).
Throws:
java.io.IOException
public float semanticSimilarity(java.lang.String w1, java.lang.String w2, DISCO.SimilarityMeasure similarityMeasure) throws java.io.IOException
Computes the semantic similarity (according to the vector similarity measure similarityMeasure) between the two input words based on their collocation sets (i.e. word vectors).
See also the class Compositionality.
Parameters:
w1 - input word #1 (must be a single token).
w2 - input word #2 (must be a single token).
similarityMeasure - One of the similarity measures enumerated in SimilarityMeasures.
Returns:
If the similarityMeasure is unknown, the return value is -3.0F.
Throws:
java.io.IOException
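
Because semanticSimilarity() signals problems through special return values rather than exceptions, callers may want to check for them before using the result. The sketch below is one possible way to do that; the class SimilarityResult, its constants and its method toOptional are hypothetical and not part of the DISCO API. It assumes that -2 (word not found) and -3.0F (unknown measure) are the only sentinel values, which is all that is documented above.

```java
import java.util.OptionalDouble;

/** Hypothetical helper for illustration; not part of the DISCO API. */
final class SimilarityResult {

    /** Documented return value when a word is not found in the index. */
    static final float WORD_NOT_FOUND = -2.0f;

    /** Documented return value when the similarity measure is unknown. */
    static final float UNKNOWN_MEASURE = -3.0f;

    /**
     * Converts a raw return value into an OptionalDouble that is empty
     * whenever the value is one of the two documented sentinels.
     */
    static OptionalDouble toOptional(float raw) {
        if (raw == WORD_NOT_FOUND || raw == UNKNOWN_MEASURE) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(raw);
    }
}
```

A caller could then write SimilarityResult.toOptional(disco.semanticSimilarity(w1, w2)) and handle the empty case explicitly, where disco is an already constructed DISCO instance.
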
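public float secondOrderSimilarity(java.lang.String w1, java.lang.String w2) throws java.io.IOException, WrongWordspaceTypeException
Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words.
Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM.
Parameters:
w1 - input word #1 (must be a single token).
w2 - input word #2 (must be a single token).
Throws:
java.io.IOException
WrongWordspaceTypeException
public float collocationalValue(java.lang.String w1, java.lang.String w2) throws java.io.IOException
Returns the collocational strength between words w1 and w2, summed up over all relations.
Parameters:
w1 - input word #1 (must be a single token).
w2 - input word #2 (must be a single token).
Throws:
java.io.IOException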
public java.util.HashMap<java.lang.String,java.lang.Float> getWordvector(java.lang.String word) throws java.io.IOException
Returns the word vector representing the distribution of the input word in the corpus. The word vector can be used with the methods in the class Compositionality.
Parameters:
word - input word (must be a single token).
Returns:
the word vector, or null if word is not found. The features of the word vector are the keys of the resulting HashMap, the values are the significance values of the word vector. For more information on the values consult the documentation of the method searchIndex() (field kol).
Throws:
java.io.IOException
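
To show how a word vector in this feature-to-significance map form can be processed, here is a minimal sketch that computes the plain cosine similarity of two sparse vectors. This is an illustration only: it is not the KOLB measure and not necessarily any measure implemented in DISCO or Compositionality, and the class SparseCosine is a hypothetical helper.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of the DISCO API. */
final class SparseCosine {

    /** Cosine similarity of two sparse vectors; 0.0 if either has zero length. */
    static double cosine(Map<String, Float> a, Map<String, Float> b) {
        // Iterate over the smaller map and look features up in the larger one.
        Map<String, Float> small = a.size() <= b.size() ? a : b;
        Map<String, Float> large = (small == a) ? b : a;

        double dot = 0.0;
        for (Map.Entry<String, Float> e : small.entrySet()) {
            Float other = large.get(e.getKey());
            if (other != null) {
                // Only features present in both vectors contribute.
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Float> v) {
        double sum = 0.0;
        for (float x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Both arguments could be the maps returned by getWordvector() for two different words, after checking that neither is null.
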
public int wordFrequencyList(java.lang.String outputFileName)
Runs through all documents (i.e. words) in the word space and writes a word frequency list to the file outputFileName. Note that the output is not sorted.
Parameters:
outputFileName - name of the output file.
public java.lang.String[] getStopwords() throws java.io.FileNotFoundException, java.io.IOException, CorruptConfigFileException
Get the stopwords for this word space instance.
Throws:
java.io.FileNotFoundException
java.io.IOException
CorruptConfigFileException