public class DenseMatrix extends DISCO implements java.io.Serializable
Nested classes/interfaces inherited from class DISCO: DISCO.SimilarityMeasure, DISCO.WordspaceType
Modifier and Type | Field and Description |
---|---|
static java.nio.charset.Charset | UTF8 |
Fields inherited from class DISCO: RELATION_SEPARATOR
Constructor and Description |
---|
DenseMatrix(float[][] matrix, float[][] ngramMatrix, int[][] simMatrix, float[][] simValues, it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> word2indexMap, int[] wordIndex2id, int[] wordId2offset, int[] frequencies, byte[] offset2word, it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> ngram2indexMap, int[] ngramIndex2id, int[] ngramId2offset, byte[] offset2ngram, int minN, int maxN, ConfigFile config, DISCO.WordspaceType wordspaceType, int numberOfSimilarWords) Constructor to be used by DenseMatrixFactory only. |
Modifier and Type | Method and Description |
---|---|
ReturnDataCol[] | collocations(java.lang.String word) Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). |
static DenseMatrix | deserialize(java.io.File serializedDenseMatrixPath) Deserialize DenseMatrix object from file. |
int | frequency(java.lang.String word) Get corpus frequency of word. |
ConfigFile | getConfig() |
int | getMatrixRowNumber(java.lang.String word) |
int | getMaxFreq() Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used). |
int | getMaxN() |
int | getMinFreq() Get minimum frequency of tokens in corpus. |
int | getMinN() |
java.util.List<ReturnDataCol> | getMostSimilar(int wordId, int max) Compute the max most similar words for word with ID wordId. |
float[] | getNgramVector(java.lang.String ngram) |
int[] | getSecondOrderWordvector(int id) |
java.util.Map<java.lang.String,java.lang.Float> | getSecondOrderWordvector(java.lang.String word) The second order word vector contains words as keys, namely the most similar words for word. |
java.lang.String[] | getStopwords() Gets list of stopwords from the disco.config file in the word space. |
long | getTokenCount() Size of the underlying corpus. |
java.util.Iterator<java.lang.String> | getVocabularyIterator() Returns an iterator that iterates over all words in the word space (the vocabulary). |
java.lang.String | getWord(int id) Returns the id-th word in the vocabulary. |
float[] | getWordEmbedding(java.lang.String word) Get embedding vector for word. |
int | getWordId(java.lang.String word) |
DISCO.WordspaceType | getWordspaceType() Returns the type of the word space instance. |
float[] | getWordVector(int id) |
java.util.Map<java.lang.String,java.lang.Float> | getWordvector(java.lang.String word) Returns a word embedding converted to a sparse vector. |
int | numberOfFeatureWords() For DISCOLuceneIndex this returns the number of words that were used as features. |
int | numberOfSimilarWords() Get the number of similar words that are stored in word spaces of type SIM for each word. |
int | numberOfWords() Get number of words stored in this word space. |
void | printMostSimilar(java.lang.String w, int max) |
float | secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) Computes the similarity between words w1 and w2 by comparing the set of the most similar words for w1 with the set of the most similar words for w2. The size of the set of similar words stored for each word in the word space is given by the DISCOBuilder parameter -nBest. |
float | semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) Computes the similarity between words w1 and w2 by comparing their word vectors using the vectorSimilarity measure of choice. |
static void | serialize(DenseMatrix denseMatrix, java.lang.String outputPath) Serialize DenseMatrix object to file. |
void | setNumberOfSimilarWords(int n) |
void | setSimMatrix(int[][] simMatrix) |
void | setSimValues(float[][] simValues) |
void | setWordspaceType(DISCO.WordspaceType type) |
ReturnDataBN | similarWords(java.lang.String word) Only works with word spaces of type DISCO.WordspaceType.SIM. |
int | wordFrequencyList(java.lang.String outputFileName) Writes word-frequency list to file. |
Methods inherited from class DISCO: getSimilarityMeasure, getVectorSimilarity, load, open
public DenseMatrix(float[][] matrix, float[][] ngramMatrix, int[][] simMatrix, float[][] simValues, it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> word2indexMap, int[] wordIndex2id, int[] wordId2offset, int[] frequencies, byte[] offset2word, it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> ngram2indexMap, int[] ngramIndex2id, int[] ngramId2offset, byte[] offset2ngram, int minN, int maxN, ConfigFile config, DISCO.WordspaceType wordspaceType, int numberOfSimilarWords)
Constructor to be used by DenseMatrixFactory only. To create a DenseMatrix from a Lucene index word space use DenseMatrixFactory.create (or the command line interface). To load a serialized DenseMatrix use DenseMatrixFactory.load.
Parameters:
matrix -
ngramMatrix -
simMatrix -
simValues -
word2indexMap -
wordIndex2id -
wordId2offset -
frequencies -
ngram2indexMap -
ngramIndex2id -
offset2word -
ngramId2offset -
offset2ngram -
minN - minimum ngram size or -1
maxN - maximum ngram size or -1
config -
wordspaceType -
numberOfSimilarWords -
public DISCO.WordspaceType getWordspaceType()
Returns the type of the word space instance.
getWordspaceType in class DISCO
public int numberOfWords()
Description copied from class: DISCO
Get number of words stored in this word space. For DenseMatrix this returns the value of config.vocabularySize, which has to be equal to the number of rows in the similarity matrix. For DISCOLuceneIndex this returns the number of documents in the Lucene index.
numberOfWords in class DISCO
public int numberOfFeatureWords()
Description copied from class: DISCO
For DISCOLuceneIndex this returns the number of words that were used as features. Note that this is only equal to the dimensionality of the word vectors if no positional or relational features were used. See options in disco.config under numberFeatureWords for more information.
For DenseMatrix this is equal to the dimensionality of the word embedding (vector length).
numberOfFeatureWords in class DISCO
public int numberOfSimilarWords()
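Description copied from class: DISCO
Get the number of similar words that are stored in word spaces of type SIM for each word.
numberOfSimilarWords in class DISCO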
public int frequency(java.lang.String word) throws java.io.IOException
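Description copied from class: DISCO
Get corpus frequency of word.
public ReturnDataBN similarWords(java.lang.String word) throws java.io.IOException, WrongWordspaceTypeException
Only works with word spaces of type DISCO.WordspaceType.SIM.
similarWords in class DISCO
Parameters:
word -
Returns:
the similar words for word, if these are stored in the word space
Throws:
java.io.IOException
WrongWordspaceTypeException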
public float semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) throws java.io.IOException
Description copied from class: DISCO
Computes the similarity between words w1 and w2 by comparing their word vectors using the vectorSimilarity measure of choice.
semanticSimilarity in class DISCO
Throws:
java.io.IOException
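To illustrate what comparing two word vectors with a vector similarity measure can look like, the following minimal sketch computes cosine similarity over two dense float[] vectors. Cosine is only one possible choice; which measures VectorSimilarity actually offers is not stated on this page. CosineSketch and its method are hypothetical names and are not part of DISCO.

```java
/** Hypothetical helper for illustration; not part of the DISCO API. */
class CosineSketch {

    /**
     * Cosine of the angle between a and b.
     * Returns 0 if either vector has zero length (a convention chosen for this sketch).
     */
    static float cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimensionality");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0f;
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }
}
```

With real data, the two arrays would be the embeddings of w1 and w2, for example as returned by getWordEmbedding.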
public float secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity) throws java.io.IOException, WrongWordspaceTypeException
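Description copied from class: DISCO
Computes the similarity between words w1 and w2 by comparing the set of the most similar words for w1 with the set of the most similar words for w2. The size of the set of similar words stored for each word in the word space is given by the DISCOBuilder parameter -nBest. Default size is 300.
Only works with word spaces of type WordspaceType.SIM!
secondOrderSimilarity in class DISCO
Throws:
java.io.IOException
WrongWordspaceTypeException - if called with a word space of type WordspaceType.COL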
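The idea of comparing two sets of most similar words can be sketched as follows. Each word is represented by a map from its most similar words to their similarity values (the shape returned by getSecondOrderWordvector(String)), and the two maps are compared with cosine similarity over their keys. The choice of cosine and the name SecondOrderSketch are assumptions made for this illustration; they are not taken from DISCO.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of the DISCO API. */
class SecondOrderSketch {

    /** Cosine similarity of two sparse vectors keyed by word; 0 if either is empty or all-zero. */
    static float cosine(Map<String, Float> v1, Map<String, Float> v2) {
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
        for (Map.Entry<String, Float> e : v1.entrySet()) {
            float x = e.getValue();
            norm1 += (double) x * x;
            Float y = v2.get(e.getKey());
            if (y != null) {
                dot += (double) x * y; // only words present in both sets contribute
            }
        }
        for (float y : v2.values()) {
            norm2 += (double) y * y;
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0f;
        }
        return (float) (dot / Math.sqrt(norm1 * norm2));
    }
}
```

Two words score high here when their lists of similar words overlap heavily, even if their own vectors were never compared directly.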
public java.util.Map<java.lang.String,java.lang.Float> getWordvector(java.lang.String word) throws java.io.IOException
Returns a word embedding converted to a sparse vector. Use getWordEmbedding instead.
getWordvector in class DISCO
Parameters:
word -
Throws:
java.io.IOException
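As a rough picture of what converting a dense embedding into a sparse map involves, the sketch below turns a float[] into a Map<String, Float>. Two details are assumptions for the example only: the keys are the dimension indices rendered as strings, and zero entries are skipped. How DenseMatrix actually names the keys is not documented on this page, and SparseVectorSketch is a hypothetical class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the DISCO API. */
class SparseVectorSketch {

    /** Converts a dense vector into a map from dimension label to value, skipping zeros. */
    static Map<String, Float> toSparse(float[] dense) {
        Map<String, Float> sparse = new LinkedHashMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0f) {
                // Key format (index as string) is an assumption of this sketch.
                sparse.put(Integer.toString(i), dense[i]);
            }
        }
        return sparse;
    }
}
```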
public java.util.Map<java.lang.String,java.lang.Float> getSecondOrderWordvector(java.lang.String word) throws WrongWordspaceTypeException
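The second order word vector contains words as keys, namely the most similar words for word.
getSecondOrderWordvector in class DISCO
Parameters:
word -
Returns:
the second order word vector, or null if word is not found
Throws:
WrongWordspaceTypeException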
public int[] getSecondOrderWordvector(int id)
public ReturnDataCol[] collocations(java.lang.String word) throws java.io.IOException
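Description copied from class: DISCO
Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). The return value can be null. Unlike getWordvector(), this method summarizes the words over the different relations.
collocations in class DISCO
Throws:
java.io.IOException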
public int wordFrequencyList(java.lang.String outputFileName)
Description copied from class: DISCO
Writes word-frequency list to file.
wordFrequencyList in class DISCO
public java.lang.String[] getStopwords() throws java.io.FileNotFoundException, java.io.IOException, CorruptConfigFileException
Description copied from class: DISCO
Gets list of stopwords from the disco.config file in the word space.
getStopwords in class DISCO
Throws:
java.io.FileNotFoundException
java.io.IOException
CorruptConfigFileException
public long getTokenCount()
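Description copied from class: DISCO
Size of the underlying corpus.
getTokenCount in class DISCO
public int getMinFreq()
Description copied from class: DISCO
Get minimum frequency of tokens in corpus.
getMinFreq in class DISCO
public int getMaxFreq()
Description copied from class: DISCO
Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
getMaxFreq in class DISCO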
public java.util.Iterator<java.lang.String> getVocabularyIterator()
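Description copied from class: DISCO
Returns an iterator that iterates over all words in the word space (the vocabulary). The method remove is not supported.
getVocabularyIterator in class DISCO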
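For readers unfamiliar with iterators that do not support remove, the sketch below shows the usual Java pattern: hasNext and next work normally, while remove throws UnsupportedOperationException. ReadOnlyVocabularySketch is a hypothetical stand-in backed by a plain array; it does not reproduce how DenseMatrix stores its vocabulary.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical read-only vocabulary for illustration; not part of the DISCO API. */
class ReadOnlyVocabularySketch implements Iterable<String> {
    private final String[] words;

    ReadOnlyVocabularySketch(String... words) {
        this.words = words.clone();
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < words.length;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return words[position++];
            }

            @Override
            public void remove() {
                // Mirrors the documented behaviour: the vocabulary cannot be modified through the iterator.
                throw new UnsupportedOperationException("remove is not supported");
            }
        };
    }
}
```

Callers that need a modifiable collection would copy the words into their own list first.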
public java.lang.String getWord(int id) throws java.io.IOException
Description copied from class: DISCO
Returns the id-th word in the vocabulary.
public int getWordId(java.lang.String word)
Parameters:
word -
Returns:
the ID of word. If the DenseMatrix was created from a DISCOLuceneIndex then the returned ID can be larger than the number of rows in the matrix! For accessing a matrix row, always use getMatrixRowNumber(String) instead.
public int getMatrixRowNumber(java.lang.String word)
Parameters:
word -
Returns:
a row number in the range 0..config.vocabularySize-1, or -1 if word is not found in the word2indexMap or its ID is larger than config.vocabularySize-1. The ID is equal to the row index in matrix (the matrix row is the word's word vector).
See also: getWordVector(int)
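The difference between a word ID and a usable matrix row can be sketched with a small stand-in. Here a HashMap replaces the minimal perfect hash function, and the number of matrix rows plays the role of config.vocabularySize; both substitutions, and the class name RowLookupSketch, are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup structure for illustration; not part of the DISCO API. */
class RowLookupSketch {
    private final Map<String, Integer> word2id = new HashMap<>(); // stand-in for word2indexMap
    private final float[][] matrix;

    RowLookupSketch(float[][] matrix) {
        this.matrix = matrix;
    }

    void put(String word, int id) {
        word2id.put(word, id);
    }

    /** Row index for word, or -1 if the word is unknown or its ID has no matrix row. */
    int rowNumber(String word) {
        Integer id = word2id.get(word);
        if (id == null || id < 0 || id > matrix.length - 1) {
            return -1;
        }
        return id;
    }

    /** Matrix row for id, or null if id is out of range. */
    float[] vector(int id) {
        if (id < 0 || id > matrix.length - 1) {
            return null;
        }
        return matrix[id];
    }
}
```

Because rowNumber returns -1 for IDs without a row, passing its result straight to vector yields null rather than an ArrayIndexOutOfBoundsException.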
public float[] getWordVector(int id)
Parameters:
id -
Returns:
the word vector (matrix row id) or null if id < 0 or id > config.vocabularySize-1.
public float[] getWordEmbedding(java.lang.String word)
Get embedding vector for word. If word is not found (out of vocabulary) then null is returned unless this word space stores subword information (n-grams). In this case a word embedding is computed on the fly from word's character n-grams.
Parameters:
word -
Returns:
the embedding vector for word, or null
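One way an embedding could be computed on the fly from character n-grams is sketched below: collect all n-grams of length minN to maxN, look up their vectors, and average the ones that are found. Averaging, the absence of word boundary markers, the Map used in place of the n-gram matrix, and the name NgramEmbeddingSketch are all assumptions for this example; this page does not state how DenseMatrix combines n-gram vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the DISCO API. */
class NgramEmbeddingSketch {

    /** All character n-grams of word with minN <= n <= maxN; empty if minN < 1 (e.g. -1 for "no n-grams"). */
    static List<String> ngrams(String word, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        if (minN < 1) {
            return result;
        }
        for (int n = minN; n <= maxN; n++) {
            for (int start = 0; start + n <= word.length(); start++) {
                result.add(word.substring(start, start + n));
            }
        }
        return result;
    }

    /** Average of the vectors of all known n-grams of word, or null if none is known. */
    static float[] embed(String word, int minN, int maxN, Map<String, float[]> ngramVectors, int dim) {
        float[] sum = new float[dim];
        int found = 0;
        for (String ngram : ngrams(word, minN, maxN)) {
            float[] v = ngramVectors.get(ngram);
            if (v == null) {
                continue; // n-gram not stored
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
            found++;
        }
        if (found == 0) {
            return null;
        }
        for (int i = 0; i < dim; i++) {
            sum[i] /= found;
        }
        return sum;
    }
}
```

The null return for words without any known n-gram matches the documented behaviour for word spaces that store no subword information.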
public float[] getNgramVector(java.lang.String ngram)
Parameters:
ngram -
public int getMinN()
public int getMaxN()
public ConfigFile getConfig()
public void setWordspaceType(DISCO.WordspaceType type)
public void setNumberOfSimilarWords(int n)
public void setSimMatrix(int[][] simMatrix)
public void setSimValues(float[][] simValues)
public java.util.List<ReturnDataCol> getMostSimilar(int wordId, int max)
Compute the max most similar words for word with ID wordId. This is done by comparing the matrix row for wordId with all other matrix rows.
Parameters:
wordId -
max -
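Comparing one matrix row with all other rows can be sketched as a brute-force search: score every other row against the query row, sort by score, and keep the best max entries. The sketch below uses cosine similarity and returns row indices; the similarity measure, the return type, and the name MostSimilarSketch are assumptions, since the real method returns ReturnDataCol objects and its measure is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the DISCO API. */
class MostSimilarSketch {

    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Up to max row indices, most similar to row wordId first, excluding wordId itself. */
    static List<Integer> mostSimilar(float[][] matrix, int wordId, int max) {
        if (wordId < 0 || wordId >= matrix.length) {
            throw new IllegalArgumentException("wordId out of range: " + wordId);
        }
        final double[] sims = new double[matrix.length];
        List<Integer> candidates = new ArrayList<>();
        for (int row = 0; row < matrix.length; row++) {
            if (row == wordId) {
                continue;
            }
            sims[row] = cosine(matrix[wordId], matrix[row]);
            candidates.add(row);
        }
        // Sort by descending similarity; the sort is stable, so ties keep row order.
        candidates.sort((r1, r2) -> Double.compare(sims[r2], sims[r1]));
        int limit = Math.min(Math.max(max, 0), candidates.size());
        return new ArrayList<>(candidates.subList(0, limit));
    }
}
```

This costs one pass over the whole matrix per query, which is why precomputed similar-word lists (as in word spaces of type SIM) are attractive for large vocabularies.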
public void printMostSimilar(java.lang.String w, int max)
public static void serialize(DenseMatrix denseMatrix, java.lang.String outputPath)
Serialize DenseMatrix object to file.
Parameters:
denseMatrix -
outputPath -
public static DenseMatrix deserialize(java.io.File serializedDenseMatrixPath)
Deserialize DenseMatrix object from file.
Parameters:
serializedDenseMatrixPath -
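Since DenseMatrix implements java.io.Serializable, a serialize/deserialize pair with these signatures could be built on standard Java object serialization. The sketch below shows such a round trip for a small hypothetical class, TinyMatrix. Whether DenseMatrix uses exactly this mechanism or file format internally is an assumption; the page only gives the method signatures.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical round-trip example; not part of the DISCO API. */
class SerializationSketch {

    /** Minimal serializable stand-in for a matrix-backed word space. */
    static class TinyMatrix implements Serializable {
        private static final long serialVersionUID = 1L;
        final float[][] rows;

        TinyMatrix(float[][] rows) {
            this.rows = rows;
        }
    }

    /** Writes the object to outputPath using Java object serialization. */
    static void serialize(TinyMatrix m, String outputPath) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(outputPath)))) {
            out.writeObject(m);
        }
    }

    /** Reads an object previously written by serialize. */
    static TinyMatrix deserialize(File path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            return (TinyMatrix) in.readObject();
        }
    }
}
```

Java object serialization ties the file to the class definition, so files written this way should only be read back with a compatible version of the class.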