public class DISCOLuceneIndex extends DISCO
DISCO.SimilarityMeasure, DISCO.WordspaceType
Modifier and Type | Field and Description |
---|---|
java.lang.String | indexDir: Name of the word space directory. |
org.apache.lucene.store.RAMDirectory | indexRAM: The word space loaded into RAM. |
RELATION_SEPARATOR
Constructor and Description |
---|
DISCOLuceneIndex(java.lang.String idxName, boolean loadIntoRAM): As of DISCO version 2.0, a complete word space can be loaded into RAM to speed up similarity computations. |
Modifier and Type | Method and Description |
---|---|
float | collocationalValue(java.lang.String w1, java.lang.String w2): Returns the collocational strength between words w1 and w2, summed up over all relations. |
ReturnDataCol[] | collocations(java.lang.String word): Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). |
int | frequency(java.lang.String word): Looks up the input word in the word space and returns its frequency. |
int | getMaxFreq(): Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used). |
int | getMinFreq(): Get minimum frequency of tokens in corpus. |
java.util.Map<java.lang.String,java.lang.Float> | getSecondOrderWordvector(java.lang.String word): The second order word vector contains the nBest most similar words for word as features (instead of the directly co-occurring words that you get with getWordvector). |
java.lang.String[] | getStopwords(): Get the stopwords for this word space instance. |
long | getTokenCount(): Size of the underlying corpus. |
java.util.Iterator<java.lang.String> | getVocabularyIterator(): Returns an iterator that iterates over all words in the word space (the vocabulary). |
java.lang.String | getWord(int id): Returns the id-th word in the vocabulary. |
DISCO.WordspaceType | getWordspaceType(): Returns the type of the word space instance. |
java.util.Map<java.lang.String,java.lang.Float> | getWordvector(java.lang.String word): Returns the word vector representing the distribution of the input word in the corpus. The word vector can be used with the methods in the class Compositionality. |
int | numberOfFeatureWords(): For DISCOLuceneIndex this returns the number of words that were used as features. |
int | numberOfSimilarWords(): Get the number of similar words that are stored in word spaces of type SIM for each word. |
int | numberOfWords(): Returns the number of Documents (i.e. words) in the word space. |
org.apache.lucene.document.Document | searchIndex(java.lang.String word): Searches for an input word in index field word and returns the first hit Document or null. DISCOLuceneIndex uses the Lucene index. |
float | secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity): Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words. Important note: This method only works with word spaces of type WordspaceType.SIM. |
float | semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity): Computes the semantic similarity (according to the vector similarity measure similarityMeasure) between the two input words based on their collocation sets (i.e. word vectors). |
ReturnDataBN | similarWords(java.lang.String word): Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values. If the search word isn't found in the word space, the return value is null. The similarity values in the result can differ from the values you get with DISCOLuceneIndex.semanticSimilarity for the same word pair. |
int | wordFrequencyList(java.lang.String outputFileName): Run through all documents (i.e. |
getSimilarityMeasure, getVectorSimilarity, load, open
public java.lang.String indexDir
public org.apache.lucene.store.RAMDirectory indexRAM
public DISCOLuceneIndex(java.lang.String idxName, boolean loadIntoRAM) throws java.io.FileNotFoundException, org.apache.lucene.index.CorruptIndexException, java.io.IOException, CorruptConfigFileException
disco.config in the word space directory. If the file is not found in the word space directory, a FileNotFoundException is thrown. If the word space type cannot be determined (due to a corrupted config file), a CorruptConfigFileException is thrown.
idxName - the name of the word space directory
loadIntoRAM - if true, the word space is loaded into RAM
java.io.IOException
java.io.FileNotFoundException - if the file "disco.config" cannot be found in the word space directory idxName.
org.apache.lucene.index.CorruptIndexException
CorruptConfigFileException - if the file "disco.config" is corrupt.
public DISCO.WordspaceType getWordspaceType()
getWordspaceType in class DISCO
public int numberOfWords()
Returns the number of Documents (i.e. words) in the word space.
numberOfWords in class DISCO
public int numberOfFeatureWords()
Description copied from class: DISCO
For DISCOLuceneIndex this returns the number of words that were used as features. Note that this is only equal to the dimensionality of the word vectors if no positional or relational features were used. See options in disco.config under numberFeatureWords for more information.
For DenseMatrix this is equal to the dimensionality of the word embedding (vector length).
numberOfFeatureWords in class DISCO
public int numberOfSimilarWords()
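Description copied from class: DISCO
Get the number of similar words that are stored in word spaces of type SIM for each word.
numberOfSimilarWords in class DISCO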
public int frequency(java.lang.String word) throws java.io.IOException
public ReturnDataBN similarWords(java.lang.String word) throws java.io.IOException, WrongWordspaceTypeException
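Looks up the input word in the index and returns its semantically similar words ordered by decreasing similarity together with their similarity values.
If the search word isn't found in the word space, the return value is null.
The similarity values in the result can differ from the values you get with DISCOLuceneIndex.semanticSimilarity for the same word pair. This is the case when another similarity measure was used in generating the word space. Consult the file disco.config in the word space directory to get the similarity measure that was used. If no measure is given there, the default measure KOLB was used.
This method only works with word spaces of type DISCOLuceneIndex.WordspaceType.SIM.
similarWords in class DISCO
word - word to be looked up (must be a single token).
null if the search word is not found in the word space.
java.io.IOException
WrongWordspaceTypeException - if the word space does not have the type DISCOLuceneIndex.WordspaceType.SIM.
public float semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) throws java.io.IOException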
Computes the semantic similarity (according to the vector similarity measure similarityMeasure) between the two input words based on their collocation sets (i.e. word vectors).
SimilarityMeasure.KOLB should not be used with word spaces imported from word2vec!
Compositionality
semanticSimilarity in class DISCO
w1 - input word #1 (must be a single token).
w2 - input word #2 (must be a single token).
vectorSimilarity -
If similarityMeasure is unknown, the return value is -3.0F.
java.io.IOException
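To make "similarity between two word vectors" concrete, here is a minimal sketch that compares two sparse vectors of the shape returned by getWordvector (feature mapped to significance value). Cosine similarity is used purely as a familiar example of a vector similarity measure; this page does not say which measures VectorSimilarity offers or how DISCO computes them. The class name SparseCosineSketch and the sample features are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: cosine similarity over sparse word vectors (not DISCO's own code). */
final class SparseCosineSketch {

    /** Returns the cosine between two sparse vectors, or 0 if either has zero norm. */
    static float cosine(Map<String, Float> v1, Map<String, Float> v2) {
        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (Map.Entry<String, Float> e : v1.entrySet()) {
            double x = e.getValue();
            norm1 += x * x;
            // Only features present in both vectors contribute to the dot product.
            Float y = v2.get(e.getKey());
            if (y != null) {
                dot += x * y;
            }
        }
        for (float y : v2.values()) {
            norm2 += (double) y * y;
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0f;
        }
        return (float) (dot / (Math.sqrt(norm1) * Math.sqrt(norm2)));
    }

    public static void main(String[] args) {
        // Made-up feature weights for two words.
        Map<String, Float> a = new HashMap<>();
        a.put("fruit", 2.0f);
        a.put("tree", 1.0f);
        Map<String, Float> b = new HashMap<>();
        b.put("fruit", 1.0f);
        b.put("juice", 3.0f);
        System.out.println(cosine(a, b));
    }
}
```

Iterating over one map and looking up in the other keeps the work proportional to the number of stored features, which suits sparse vectors.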
public float secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) throws java.io.IOException, WrongWordspaceTypeException
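Computes the second order semantic similarity between the input words based on the sets of their distributionally similar words.
Important note: This method only works with word spaces of type WordspaceType.SIM.
secondOrderSimilarity in class DISCO
w1 - input word #1 (must be a single token).
w2 - input word #2 (must be a single token).
vectorSimilarity -
java.io.IOException
WrongWordspaceTypeException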
public java.util.Map<java.lang.String,java.lang.Float> getWordvector(java.lang.String word) throws java.io.IOException
Returns the word vector representing the distribution of the input word in the corpus.
The word vector can be used with the methods in the class Compositionality.
getWordvector in class DISCO
word - input word (must be a single token; to get a word vector for a phrase use Compositionality.composeWordVectors).
null if word is not found. The features of the word vector are the keys of the resulting HashMap, the values are the significance values of the word vector. For more information on the values consult the documentation of the method searchIndex() (field kol).
java.io.IOException
Compositionality
public java.util.Map<java.lang.String,java.lang.Float> getSecondOrderWordvector(java.lang.String word) throws WrongWordspaceTypeException, java.io.IOException
Description copied from class: DISCO
The second order word vector contains the nBest most similar words for word as features (instead of the directly co-occurring words that you get with getWordvector).
getSecondOrderWordvector in class DISCO
WrongWordspaceTypeException - when used with a word space that is not of type SIM.
java.io.IOException
public int wordFrequencyList(java.lang.String outputFileName)
outputFileName. Note that the output is not sorted.
wordFrequencyList in class DISCO
outputFileName - name of the output file.
public java.lang.String[] getStopwords() throws java.io.FileNotFoundException, java.io.IOException, CorruptConfigFileException
getStopwords in class DISCO
java.io.FileNotFoundException
java.io.IOException
CorruptConfigFileException
public long getTokenCount()
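Description copied from class: DISCO
Size of the underlying corpus.
getTokenCount in class DISCO
public int getMinFreq()
Description copied from class: DISCO
Get minimum frequency of tokens in corpus.
getMinFreq in class DISCO
public int getMaxFreq()
Description copied from class: DISCO
Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
getMaxFreq in class DISCO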
public java.util.Iterator<java.lang.String> getVocabularyIterator() throws java.io.IOException
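Description copied from class: DISCO
Returns an iterator that iterates over all words in the word space (the vocabulary). remove is not supported.
getVocabularyIterator in class DISCO
java.io.IOException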
public java.lang.String getWord(int id) throws java.io.IOException
Description copied from class: DISCO
Returns the id-th word in the vocabulary.
public org.apache.lucene.document.Document searchIndex(java.lang.String word) throws java.io.IOException
Searches for an input word in index field word and returns the first hit Document or null.
DISCOLuceneIndex uses the Lucene index. A Document has the following 6 fields:
word: contains a word, tokenized with WhitespaceAnalyzer. This is the only searchable field.
freq: the corpus frequency of the word. This field is only stored, but not indexed.
dsb: the distributionally similar words for the input word. They are stored in a single string, in which the words are separated by spaces. This field is not indexed, and therefore not searchable. The words are sorted by their similarity value, highest value first. For word spaces of type WordspaceType.COL, this field is empty!
dsbSim: contains a single string with the similarity values for the words in the field dsb, separated by spaces. The string in this field is parallel to the string in the field dsb, i.e., the n-th token of the string in dsbSim corresponds to the n-th token in dsb. Example: field dsb contains the string "apple banana cherry", field dsbSim contains the string "0.3241 0.1233 0.0788". This means that the similarity between the word in the field word and "cherry" is 0.0788. For word spaces of type WordspaceType.COL, this field is empty!
kol: contains the features from the input word's sparse word vector. "Sparse" means that only those features are stored that have a value greater than or equal to the threshold that was set in minWeight in the disco.config file. A feature has one of the following forms:
featureWord: the feature is a plain word.
featureWord<SEP>relation: the feature is composed of a word and a specific relation between the inputWord and the featureWord. The relation can be a window position or a syntactic dependency relation. featureWord and relation are separated by the character DISCOLuceneIndex.relationSeparator.
ID: the feature is a number. This is the case for word spaces that have been imported from other tools like word2vec. Word spaces of type word x document also have IDs as features.
The features in kol are separated by a space.
kolSig: contains the significance values for kol, in a string parallel to the string in kol.
word - input word to be looked up in index (must be a single token).
null if the input word cannot be found in the index.
java.io.IOException
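The parallel layout of dsb and dsbSim described for searchIndex can be illustrated with a small, library-free sketch that pairs the n-th word with the n-th similarity value. It works on plain strings and does not touch Lucene; the class ParallelFieldSketch and the helper type Scored are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: pairs the parallel, space-separated dsb and dsbSim strings. */
final class ParallelFieldSketch {

    /** One similar word with its similarity value (hypothetical helper type). */
    static final class Scored {
        final String word;
        final float value;

        Scored(String word, float value) {
            this.word = word;
            this.value = value;
        }
    }

    /** Pairs the n-th token of dsb with the n-th token of dsbSim. */
    static List<Scored> zip(String dsb, String dsbSim) {
        List<Scored> result = new ArrayList<>();
        // An empty dsb field (as documented for type COL) yields an empty list.
        if (dsb == null || dsb.trim().isEmpty()) {
            return result;
        }
        if (dsbSim == null) {
            throw new IllegalArgumentException("dsbSim is missing");
        }
        String[] words = dsb.trim().split("\\s+");
        String[] values = dsbSim.trim().split("\\s+");
        if (words.length != values.length) {
            throw new IllegalArgumentException("dsb and dsbSim are not parallel");
        }
        for (int i = 0; i < words.length; i++) {
            result.add(new Scored(words[i], Float.parseFloat(values[i])));
        }
        return result;
    }

    public static void main(String[] args) {
        // Example strings taken from the dsbSim field description.
        for (Scored s : zip("apple banana cherry", "0.3241 0.1233 0.0788")) {
            System.out.println(s.word + "\t" + s.value);
        }
    }
}
```

The same pairing would apply to kol and kolSig, which the description says are parallel in the same way.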
public ReturnDataCol[] collocations(java.lang.String word) throws java.io.IOException
Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first).
null.
getWordvector().)
disco.config in the word space directory (look at the line weightingMethod). For more information on available significance measures consult DISCOBuilder's documentation.
collocations in class DISCO
word - the input word (must be a single token).
null. The relation fields of the array elements are not set.
java.io.IOException
public float collocationalValue(java.lang.String w1, java.lang.String w2) throws java.io.IOException
Returns the collocational strength between words w1 and w2, summed up over all relations.
w1 - input word #1 (must be a single token).
w2 - input word #2 (must be a single token).
java.io.IOException
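As a rough illustration of what "summed up over all relations" could mean, the sketch below adds the kolSig values of every kol feature whose featureWord part equals the second word, regardless of the relation suffix. This is an assumption about the computation, not the library's documented algorithm. The separator character is passed in as a parameter because its actual value is not stated on this page; the '|' used in the demo and the class name RelationSumSketch are made up.

```java
/** Illustrative only: sums kolSig values for one feature word across all relations. */
final class RelationSumSketch {

    /**
     * @param kol       space-separated features, each "featureWord" or "featureWord" + separator + "relation"
     * @param kolSig    space-separated significance values, parallel to kol
     * @param target    the feature word whose values are summed
     * @param separator the relation separator (assumed; the real value is not given here)
     */
    static float sumOverRelations(String kol, String kolSig, String target, char separator) {
        if (kol == null || kolSig == null || kol.trim().isEmpty()) {
            return 0.0f;
        }
        String[] features = kol.trim().split("\\s+");
        String[] values = kolSig.trim().split("\\s+");
        if (features.length != values.length) {
            throw new IllegalArgumentException("kol and kolSig are not parallel");
        }
        float sum = 0.0f;
        for (int i = 0; i < features.length; i++) {
            // Strip the relation suffix, if any, to recover the plain feature word.
            int cut = features[i].indexOf(separator);
            String featureWord = cut < 0 ? features[i] : features[i].substring(0, cut);
            if (featureWord.equals(target)) {
                sum += Float.parseFloat(values[i]);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up data: "juice" occurs at two window positions, "tree" at one.
        String kol = "juice|-1 juice|1 tree|2";
        String kolSig = "1.5 0.5 2.0";
        System.out.println(sumOverRelations(kol, kolSig, "juice", '|'));
    }
}
```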