public abstract class DISCO
extends java.lang.Object
Word space types:

- DISCO.WordspaceType.COL: this type stores a word vector for each word. A
  word vector is the list of the significant co-occurrences of the word
  together with the type of co-occurrence (if any) and a significance value.
  The significant co-occurring words of a word are also called its
  collocations. The type of co-occurrence can be a relative position in a
  context window, or a syntactic relation.
- DISCO.WordspaceType.SIM: this type stores the above word vectors, but also
  contains pre-computed lists of the most similar words for each word. These
  words can be queried using the method DISCO.similarWords(). Several methods
  in the DISCO API only work with word spaces of type DISCO.WordspaceType.SIM.

Subclasses of the DISCO class:

- DISCOLuceneIndex: this class uses Lucene to store a word space. It is
  intended for very high-dimensional word spaces (like the classic
  distributional count vectors that work without any dimension reduction
  techniques) because it stores them as a sparse matrix.
- DenseMatrix: this class stores a word space as a two-dimensional array. It
  stores the full matrix and is therefore only feasible for predict vectors
  (word embeddings) like the ones that are produced by word2vec.

Modifier and Type | Class and Description |
---|---|
static class | DISCO.SimilarityMeasure: Available measures for vector comparison. |
static class | DISCO.WordspaceType: Available word space types (SIM = word space contains lists of pre-computed similar words for each word, COL = word space contains only word vectors). |

Modifier and Type | Field and Description |
---|---|
static java.lang.String | RELATION_SEPARATOR: This string is used as separator between a feature word and its relation. |

Constructor and Description |
---|
DISCO() |

Modifier and Type | Method and Description |
---|---|
abstract ReturnDataCol[] | collocations(java.lang.String word): Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). |
abstract int | frequency(java.lang.String word): Get corpus frequency of word. |
abstract int | getMaxFreq(): Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used). |
abstract int | getMinFreq(): Get minimum frequency of tokens in corpus. |
abstract java.util.Map<java.lang.String,java.lang.Float> | getSecondOrderWordvector(java.lang.String word): The second order word vector contains the nBest most similar words for word as features (instead of the directly co-occurring words that you get with getWordvector). |
static DISCO.SimilarityMeasure | getSimilarityMeasure(java.lang.String simMeasure): Get SimilarityMeasure object from its String name. |
abstract java.lang.String[] | getStopwords(): Gets list of stopwords from the disco.config file in the word space. |
abstract long | getTokenCount(): Size of the underlying corpus. |
static VectorSimilarity | getVectorSimilarity(DISCO.SimilarityMeasure simMeasure): Get VectorSimilarity class for a SimilarityMeasure. |
abstract java.util.Iterator<java.lang.String> | getVocabularyIterator(): Returns an iterator that iterates over all words in the word space (the vocabulary). |
abstract java.lang.String | getWord(int id): Returns the id-th word in the vocabulary. |
abstract DISCO.WordspaceType | getWordspaceType(): Get type of this word space. |
abstract java.util.Map<java.lang.String,java.lang.Float> | getWordvector(java.lang.String word): Get the word vector for word as a map from feature to value. |
static DISCO | load(java.lang.String discoFile): Load DISCOLuceneIndex or DenseMatrix from file into memory. |
abstract int | numberOfFeatureWords(): For DISCOLuceneIndex this returns the number of words that were used as features. |
abstract int | numberOfSimilarWords(): Get the number of similar words that are stored in word spaces of type SIM for each word. |
abstract int | numberOfWords(): Get number of words stored in this word space. |
static DISCO | open(java.lang.String discoFile): If discoFile is a DISCOLuceneIndex the index is opened for reading but not loaded into memory. If discoFile is a DenseMatrix it is loaded into memory. |
abstract float | secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity): Computes the similarity between words w1 and w2 by comparing the set of the most similar words for w1 with the set of the most similar words for w2. The size of the set of similar words stored for each word in the word space is given by the DISCOBuilder parameter -nBest. |
abstract float | semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity): Computes the similarity between words w1 and w2 by comparing their word vectors using the vectorSimilarity measure of choice. |
abstract ReturnDataBN | similarWords(java.lang.String word): Returns the list of the most similar words for word (according to DISCO.semanticSimilarity). |
abstract int | wordFrequencyList(java.lang.String outputFileName): Writes word-frequency list to file. |

public static final java.lang.String RELATION_SEPARATOR
This string is used as separator between a feature word and its relation.
public static DISCO.SimilarityMeasure getSimilarityMeasure(java.lang.String simMeasure)
Get SimilarityMeasure object from its String name.
Parameters: simMeasure

public static VectorSimilarity getVectorSimilarity(DISCO.SimilarityMeasure simMeasure)
Get VectorSimilarity class for a SimilarityMeasure.
Parameters: simMeasure

public abstract DISCO.WordspaceType getWordspaceType()
Get type of this word space.

public abstract int numberOfWords()
Get number of words stored in this word space. For DenseMatrix this returns
the value of config.vocabularySize, which has to be equal to the number of
rows in the similarity matrix. For DISCOLuceneIndex this returns the number of
documents in the Lucene index.

public abstract int numberOfFeatureWords()
For DISCOLuceneIndex this returns the number of words that were used as
features. Note that this is only equal to the dimensionality of the word
vectors if no positional or relational features were used. See the options in
disco.config under numberFeatureWords for more information. For DenseMatrix
this is equal to the dimensionality of the word embedding (vector length).

public abstract int numberOfSimilarWords()
Get the number of similar words that are stored in word spaces of type SIM
for each word.
public abstract int frequency(java.lang.String word) throws java.io.IOException
Get corpus frequency of word.
Parameters: word
Throws: java.io.IOException

public abstract ReturnDataBN similarWords(java.lang.String word) throws java.io.IOException, WrongWordspaceTypeException
Returns the list of the most similar words for word (according to
DISCO.semanticSimilarity). Since the list of most similar words for each word
is only stored in word spaces of type WordspaceType.SIM, this does not work
with word spaces of type WordspaceType.COL.
Parameters: word
Throws: java.io.IOException
Throws: WrongWordspaceTypeException - if called with a word space of type WordspaceType.COL

public abstract float semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) throws java.io.IOException
Computes the similarity between words w1 and w2 by comparing their word
vectors using the vectorSimilarity measure of choice.
Parameters: w1, w2, vectorSimilarity
Throws: java.io.IOException
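As an illustration of what comparing two word vectors can involve, the sketch below computes the cosine between two sparse vectors stored as feature-to-value maps, the same shape that getWordvector returns. It assumes cosine as the measure; the class name SparseCosineSketch and its method are hypothetical and not part of the DISCO API, and DISCO's own VectorSimilarity implementations may compute something different.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative helper only; not part of the DISCO API.
class SparseCosineSketch {

    // Cosine similarity of two sparse vectors stored as feature -> value maps.
    static float cosine(Map<String, Float> v1, Map<String, Float> v2) {
        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (Map.Entry<String, Float> e : v1.entrySet()) {
            double a = e.getValue();
            norm1 += a * a;
            Float b = v2.get(e.getKey());
            if (b != null) {
                dot += a * b; // only features present in both vectors contribute
            }
        }
        for (float b : v2.values()) {
            norm2 += (double) b * b;
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0f; // an empty or all-zero vector has no direction
        }
        return (float) (dot / (Math.sqrt(norm1) * Math.sqrt(norm2)));
    }

    public static void main(String[] args) {
        Map<String, Float> cat = new HashMap<>();
        cat.put("purr", 2.0f);
        cat.put("pet", 1.0f);
        Map<String, Float> dog = new HashMap<>();
        dog.put("bark", 2.0f);
        dog.put("pet", 1.0f);
        System.out.println(cosine(cat, dog));
    }
}
```

Iterating over only one map for the dot product keeps the cost proportional to the number of stored features, which is the point of a sparse representation.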
public abstract float secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) throws java.io.IOException, WrongWordspaceTypeException
Computes the similarity between words w1 and w2 by comparing the set of the
most similar words for w1 with the set of the most similar words for w2. The
size of the set of similar words stored for each word in the word space is
given by the DISCOBuilder parameter -nBest. Default size is 300. This method
only works with word spaces of type WordspaceType.SIM.
Parameters: w1, w2, vectorSimilarity
Throws: java.io.IOException
Throws: WrongWordspaceTypeException - if called with a word space of type WordspaceType.COL
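To make the idea of comparing two sets of most similar words concrete, here is a minimal sketch that scores the overlap of two neighbour sets with the Jaccard coefficient. Jaccard is a stand-in chosen for simplicity; in DISCO the comparison is done by the VectorSimilarity passed to the method, and the class name NeighbourOverlapSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative helper only; not part of the DISCO API.
class NeighbourOverlapSketch {

    // Jaccard coefficient: |intersection| / |union| of the two neighbour sets.
    static float jaccard(Set<String> neighbours1, Set<String> neighbours2) {
        if (neighbours1.isEmpty() && neighbours2.isEmpty()) {
            return 0.0f; // nothing to compare
        }
        Set<String> shared = new HashSet<>(neighbours1);
        shared.retainAll(neighbours2);
        int unionSize = neighbours1.size() + neighbours2.size() - shared.size();
        return (float) shared.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<String> forCar = new HashSet<>(Arrays.asList("vehicle", "truck", "bus"));
        Set<String> forVan = new HashSet<>(Arrays.asList("truck", "bus", "lorry"));
        System.out.println(jaccard(forCar, forVan));
    }
}
```

Two words can score high here even if they rarely share direct co-occurrences, as long as their lists of similar words overlap.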
public abstract java.util.Map<java.lang.String,java.lang.Float> getWordvector(java.lang.String word) throws java.io.IOException
Get the word vector for word as a map from feature to value. Features are
either words or IDs.
Parameters: word
Throws: java.io.IOException

public abstract java.util.Map<java.lang.String,java.lang.Float> getSecondOrderWordvector(java.lang.String word) throws java.io.IOException, WrongWordspaceTypeException
The second order word vector contains the nBest most similar words for word
as features (instead of the directly co-occurring words that you get with
getWordvector).
Parameters: word
Throws: java.io.IOException
Throws: WrongWordspaceTypeException - when used with a word space that is not of type SIM

public abstract ReturnDataCol[] collocations(java.lang.String word) throws java.io.IOException
Returns the collocations for the input word together with their significance
values, ordered by significance value (highest significance first). The
result can be null. Unlike getWordvector(), this method summarizes the words
over the different relations.
Parameters: word
Throws: java.io.IOException
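The ordering described for collocations (highest significance first) can be reproduced on any feature-to-value map. The sketch below sorts map entries by descending value; it uses plain Map.Entry objects as a stand-in because the fields of ReturnDataCol are not shown here, and the class name SignificanceOrderSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper only; not part of the DISCO API.
class SignificanceOrderSketch {

    // Returns the entries of the map ordered by value, highest value first.
    static List<Map.Entry<String, Float>> bySignificance(Map<String, Float> values) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(values.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Float> values = new LinkedHashMap<>();
        values.put("leash", 0.5f);
        values.put("bark", 2.0f);
        values.put("bone", 1.0f);
        for (Map.Entry<String, Float> e : bySignificance(values)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```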
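public abstract int wordFrequencyList(java.lang.String outputFileName)
Writes word-frequency list to file.
Parameters: outputFileName

public abstract java.util.Iterator<java.lang.String> getVocabularyIterator() throws java.io.IOException
Returns an iterator that iterates over all words in the word space (the
vocabulary). The iterator does not support remove.
Throws: java.io.IOException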
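A read-only iterator of this kind behaves like the one returned by an unmodifiable JDK collection: iteration works, but remove throws UnsupportedOperationException. The sketch below demonstrates that behaviour with a plain list standing in for the vocabulary. The class ReadOnlyIterationSketch is hypothetical, and whether DISCO's iterator throws this particular exception is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative helper only; not part of the DISCO API.
class ReadOnlyIterationSketch {

    // Wraps the list so that the returned iterator cannot modify it.
    static Iterator<String> vocabularyIterator(List<String> vocabulary) {
        return Collections.unmodifiableList(vocabulary).iterator();
    }

    // Advances the iterator by one element and reports whether remove() was rejected.
    static boolean removeIsRejected(Iterator<String> it) {
        it.next();
        try {
            it.remove();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Iterator<String> it = vocabularyIterator(Arrays.asList("cat", "dog"));
        System.out.println(removeIsRejected(it));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```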
public abstract java.lang.String getWord(int id) throws java.io.IOException
Returns the id-th word in the vocabulary.
Parameters: id - id has to be between 0 and DISCO.numberOfWords() - 1.
Returns: null if id is not in the range 0..DISCO.numberOfWords() - 1.
Throws: java.io.IOException
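The range contract of getWord (valid ids run from 0 to numberOfWords() - 1, anything else yields null) can be sketched with an array-backed vocabulary. The class VocabularyLookupSketch is a hypothetical stand-in and says nothing about how DISCOLuceneIndex or DenseMatrix store their words.

```java
// Illustrative helper only; not part of the DISCO API.
class VocabularyLookupSketch {

    private final String[] words;

    VocabularyLookupSketch(String... words) {
        this.words = words.clone(); // defensive copy
    }

    int numberOfWords() {
        return words.length;
    }

    // Returns the id-th word, or null if id is outside 0..numberOfWords() - 1.
    String getWord(int id) {
        if (id < 0 || id >= words.length) {
            return null;
        }
        return words[id];
    }

    public static void main(String[] args) {
        VocabularyLookupSketch vocabulary = new VocabularyLookupSketch("cat", "dog");
        for (int id = 0; id < vocabulary.numberOfWords(); id++) {
            System.out.println(id + "\t" + vocabulary.getWord(id));
        }
        System.out.println(vocabulary.getWord(vocabulary.numberOfWords()));
    }
}
```

Callers that loop past the last id get null rather than an exception, so a null check is needed before using the result.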
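public abstract java.lang.String[] getStopwords() throws java.io.FileNotFoundException, java.io.IOException, CorruptConfigFileException
Gets list of stopwords from the disco.config file in the word space.
Throws: java.io.FileNotFoundException
Throws: java.io.IOException
Throws: CorruptConfigFileException

public abstract long getTokenCount()
Size of the underlying corpus.

public abstract int getMinFreq()
Get minimum frequency of tokens in corpus.

public abstract int getMaxFreq()
Get corpus frequency of the most frequent word in the word space (that was
not filtered out by the stop word list that was used).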
public static DISCO load(java.lang.String discoFile) throws org.apache.lucene.index.CorruptIndexException, java.io.IOException, java.io.FileNotFoundException, CorruptConfigFileException
Load DISCOLuceneIndex or DenseMatrix from file into memory.
Parameters: discoFile - can be either Lucene index directory or serialized DenseMatrix file.
Throws: org.apache.lucene.index.CorruptIndexException
Throws: java.io.FileNotFoundException
Throws: CorruptConfigFileException
Throws: java.io.IOException

public static DISCO open(java.lang.String discoFile) throws org.apache.lucene.index.CorruptIndexException, java.io.IOException, java.io.FileNotFoundException, CorruptConfigFileException
If discoFile is a DISCOLuceneIndex the index is opened for reading but not
loaded into memory. If discoFile is a DenseMatrix it is loaded into memory. In
this case open behaves exactly as load.
Parameters: discoFile - can be either Lucene index directory or serialized DenseMatrix file.
Throws: org.apache.lucene.index.CorruptIndexException
Throws: java.io.FileNotFoundException
Throws: CorruptConfigFileException
Throws: java.io.IOException