DenseMatrix (disco 3.0 API)

java.lang.Object
- de.linguatools.disco.DISCO
- - de.linguatools.disco.DenseMatrix

All Implemented Interfaces:

java.io.Serializable
```
public class DenseMatrix
extends DISCO
implements java.io.Serializable
```
This stores a word space in a dense matrix. Use for low-dimensional word embeddings only.

See Also:

Serialized Form

Nested Class Summary
- Nested classes/interfaces inherited from class de.linguatools.disco.DISCO
  DISCO.SimilarityMeasure, DISCO.WordspaceType

Field Summary

Fields
Modifier and Type	Field and Description
`static java.nio.charset.Charset`	`UTF8`
- Fields inherited from class de.linguatools.disco.DISCO
  RELATION_SEPARATOR

Constructor Summary

Constructors
Constructor and Description
DenseMatrix(float[][] matrix, float[][] ngramMatrix, int[][] simMatrix, float[][] simValues, it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> word2indexMap, int[] wordIndex2id, int[] wordId2offset, int[] frequencies, byte[] offset2word, it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> ngram2indexMap, int[] ngramIndex2id, int[] ngramId2offset, byte[] offset2ngram, int minN, int maxN, ConfigFile config, DISCO.WordspaceType wordspaceType, int numberOfSimilarWords) Constructor to be used by `DenseMatrixFactory` only.

Method Summary

Modifier and Type	Method and Description
`ReturnDataCol[]`	`collocations(java.lang.String word)` Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first).
`static DenseMatrix`	`deserialize(java.io.File serializedDenseMatrixPath)` Deserialize `DenseMatrix` object from file.
`int`	`frequency(java.lang.String word)` Get corpus frequency of `word`.
`ConfigFile`	`getConfig()`
`int`	`getMatrixRowNumber(java.lang.String word)`
`int`	`getMaxFreq()` Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
`int`	`getMaxN()`
`int`	`getMinFreq()` Get minimum frequency of tokens in corpus.
`int`	`getMinN()`
`java.util.List<ReturnDataCol>`	`getMostSimilar(int wordId, int max)` Compute the `max` most similar words for word with ID `wordId`.
`float[]`	`getNgramVector(java.lang.String ngram)`
`int[]`	`getSecondOrderWordvector(int id)`
`java.util.Map<java.lang.String,java.lang.Float>`	`getSecondOrderWordvector(java.lang.String word)` The second order word vector contains words as keys, namely the most similar words for `word`.
`java.lang.String[]`	`getStopwords()` Gets list of stopwords from the `disco.config` file in the word space.
`long`	`getTokenCount()` Size of the underlying corpus.
`java.util.Iterator<java.lang.String>`	`getVocabularyIterator()` Returns an iterator that iterates over all words in the word space (the vocabulary).
`java.lang.String`	`getWord(int id)` Returns the id-th word in the vocabulary.
`float[]`	`getWordEmbedding(java.lang.String word)` Get embedding vector for `word`.
`int`	`getWordId(java.lang.String word)`
`DISCO.WordspaceType`	`getWordspaceType()` Returns the type of the word space instance.
`float[]`	`getWordVector(int id)`
`java.util.Map<java.lang.String,java.lang.Float>`	`getWordvector(java.lang.String word)` Returns a word embedding converted to a sparse vector.
`int`	`numberOfFeatureWords()` For `DISCOLuceneIndex` this returns the number of words that were used as features.
`int`	`numberOfSimilarWords()` Get the number of similar words that are stored in word spaces of type SIM for each word.
`int`	`numberOfWords()` Get number of words stored in this word space.
`void`	`printMostSimilar(java.lang.String w, int max)`
`float`	`secondOrderSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity)` Computes the similarity between words `w1` and `w2` by comparing the set of the most similar words for `w1` with the set of the most similar words for `w2`. The size of the set of similar words stored for each word in the word space is given by the DISCOBuilder parameter `-nBest`.
`float`	`semanticSimilarity(java.lang.String w1, java.lang.String w2, VectorSimilarity vectorSimilarity)` Computes the similarity between words `w1` and `w2` by comparing their word vectors using the `vectorSimilarity` measure of choice.
`static void`	`serialize(DenseMatrix denseMatrix, java.lang.String outputPath)` Serialize `DenseMatrix` object to file.
`void`	`setNumberOfSimilarWords(int n)`
`void`	`setSimMatrix(int[][] simMatrix)`
`void`	`setSimValues(float[][] simValues)`
`void`	`setWordspaceType(DISCO.WordspaceType type)`
`ReturnDataBN`	`similarWords(java.lang.String word)` Only works with word spaces of type `DISCO.WordspaceType.SIM`.
`int`	`wordFrequencyList(java.lang.String outputFileName)` Writes word-frequency list to file.

Methods inherited from class de.linguatools.disco.DISCO
getSimilarityMeasure, getVectorSimilarity, load, open

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - UTF8
```
public static final java.nio.charset.Charset UTF8
```
- Constructor Detail
  - DenseMatrix
```
public DenseMatrix(float[][] matrix,
                   float[][] ngramMatrix,
                   int[][] simMatrix,
                   float[][] simValues,
                   it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> word2indexMap,
                   int[] wordIndex2id,
                   int[] wordId2offset,
                   int[] frequencies,
                   byte[] offset2word,
                   it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction<java.lang.CharSequence> ngram2indexMap,
                   int[] ngramIndex2id,
                   int[] ngramId2offset,
                   byte[] offset2ngram,
                   int minN,
                   int maxN,
                   ConfigFile config,
                   DISCO.WordspaceType wordspaceType,
                   int numberOfSimilarWords)
```
    Constructor to be used by DenseMatrixFactory only. To create a DenseMatrix from a Lucene index word space use DenseMatrixFactory.create (or the command line interface). To load a serialized DenseMatrix use DenseMatrixFactory.load.
    
    Parameters:
    
    matrix -
    
    ngramMatrix -
    
    simMatrix -
    
    simValues -
    
    word2indexMap -
    
    wordIndex2id -
    
    wordId2offset -
    
    frequencies -
    
    offset2word -
    
    ngram2indexMap -
    
    ngramIndex2id -
    
    ngramId2offset -
    
    offset2ngram -
    
    minN - minimum ngram size or -1
    
    maxN - maximum ngram size or -1
    
    config -
    
    wordspaceType -
    
    numberOfSimilarWords -
- Method Detail
  - getWordspaceType
```
public DISCO.WordspaceType getWordspaceType()
```
    Returns the type of the word space instance.
    
    Specified by:
    
    getWordspaceType in class DISCO
    
    Returns:
    
    word space type
  - numberOfWords
```
public int numberOfWords()
```
    Description copied from class: DISCO
    
    Get number of words stored in this word space. For DenseMatrix this returns the value of config.vocabularySize, which must equal the number of rows in the similarity matrix. For DISCOLuceneIndex this returns the number of documents in the Lucene index.
    
    Specified by:
    
    numberOfWords in class DISCO
    
    Returns:
    
    vocabulary size.
  - numberOfFeatureWords
```
public int numberOfFeatureWords()
```
    Description copied from class: DISCO
    
    For DISCOLuceneIndex this returns the number of words that were used as features. Note that this is only equal to the dimensionality of the word vectors if no positional or relational features were used. See options in disco.config under numberFeatureWords for more information.
    For DenseMatrix this is equal to the dimensionality of the word embedding (vector length).
    
    Specified by:
    
    numberOfFeatureWords in class DISCO
    
    Returns:
    
    dimensionality of the word space.
  - numberOfSimilarWords
```
public int numberOfSimilarWords()
```
    Description copied from class: DISCO
    
    Get the number of similar words that are stored in word spaces of type SIM for each word.
    
    Specified by:
    
    numberOfSimilarWords in class DISCO
    
    Returns:
    
    number of similar words that are stored in the word space. For word spaces of type COL this value is always 0.
  - frequency
```
public int frequency(java.lang.String word)
              throws java.io.IOException
```
    Description copied from class: DISCO
    
    Get corpus frequency of word.
    
    Specified by:
    
    frequency in class DISCO
    
    Parameters:
    
    word -
    
    Returns:
    
    frequency of word in corpus. 0 if word not found.
    
    Throws:
    
    java.io.IOException
  - similarWords
```
public ReturnDataBN similarWords(java.lang.String word)
                          throws java.io.IOException,
                                 WrongWordspaceTypeException
```
    Only works with word spaces of type DISCO.WordspaceType.SIM.
    
    Specified by:
    
    similarWords in class DISCO
    
    Parameters:
    
    word -
    
    Returns:
    
    list of similar words for word if these are stored in the word space.
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException
  - semanticSimilarity
```
public float semanticSimilarity(java.lang.String w1,
                                java.lang.String w2,
                                VectorSimilarity vectorSimilarity)
                         throws java.io.IOException
```
    Description copied from class: DISCO
    
    Computes the similarity between words w1 and w2 by comparing their word vectors using the vectorSimilarity measure of choice.
    
    Specified by:
    
    semanticSimilarity in class DISCO
    
    Returns:
    
    Throws:
    
    java.io.IOException
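    
    As a rough illustration of what a vector similarity measure does, the sketch below computes cosine similarity between two dense vectors such as those returned by getWordEmbedding. The class CosineSketch is hypothetical and not part of DISCO, and cosine is only assumed as the measure; the measures actually offered by VectorSimilarity are not listed on this page.
```java
// Hypothetical helper for illustration only; not part of the DISCO API.
final class CosineSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either vector has zero norm. */
    static float cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0f; // similarity is undefined for a zero vector; 0 is a choice made here
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    public static void main(String[] args) {
        float[] v1 = {1f, 0f, 1f};
        float[] v2 = {1f, 1f, 0f};
        System.out.println(cosine(v1, v2));
    }
}
```
    With a real word space, the two arrays would come from getWordEmbedding(w1) and getWordEmbedding(w2), each checked for null first.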
  - secondOrderSimilarity
```
public float secondOrderSimilarity(java.lang.String w1,
                                   java.lang.String w2,
                                   VectorSimilarity vectorSimilarity)
                            throws java.io.IOException,
                                   WrongWordspaceTypeException
```
    Description copied from class: DISCO
    
    Computes the similarity between words w1 and w2 by comparing the set of the most similar words for w1 with the set of the most similar words for w2.
    The size of the set of similar words stored for each word in the word space is given by the DISCOBuilder parameter -nBest. Default size is 300.
    Only works with word spaces of type WordspaceType.SIM!
    
    Specified by:
    
    secondOrderSimilarity in class DISCO
    
    Returns:
    
    Throws:
    
    java.io.IOException
    
    WrongWordspaceTypeException - if called with a word space of type WordspaceType.COL
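    
    The idea of comparing two sets of most similar words can be sketched as follows. Each word is represented by a map from its similar words to their similarity values (the shape returned by getSecondOrderWordvector(String)), and the two maps are compared with cosine similarity. SecondOrderSketch is a hypothetical name, and cosine is an assumption; the measure actually applied depends on the vectorSimilarity argument.
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the DISCO API.
final class SecondOrderSketch {

    /** Cosine similarity of two sparse vectors keyed by word; 0 if either is empty or all-zero. */
    static float cosine(Map<String, Float> a, Map<String, Float> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Float> e : a.entrySet()) {
            float va = e.getValue();
            normA += va * va;
            Float vb = b.get(e.getKey()); // only words present in both sets contribute to the dot product
            if (vb != null) {
                dot += va * vb;
            }
        }
        for (float vb : b.values()) {
            normB += vb * vb;
        }
        if (normA == 0 || normB == 0) {
            return 0f;
        }
        return (float) (dot / Math.sqrt(normA * normB));
    }

    public static void main(String[] args) {
        Map<String, Float> similarToW1 = new HashMap<>();
        similarToW1.put("car", 1f);
        similarToW1.put("truck", 1f);
        Map<String, Float> similarToW2 = new HashMap<>();
        similarToW2.put("truck", 1f);
        similarToW2.put("bus", 1f);
        System.out.println(cosine(similarToW1, similarToW2));
    }
}
```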
  - getWordvector
```
public java.util.Map<java.lang.String,java.lang.Float> getWordvector(java.lang.String word)
                                                              throws java.io.IOException
```
    Returns a word embedding converted to a sparse vector. Note that word vectors in a dense matrix only contain IDs as keys and not words.
    To get a dense vector (float array) use getWordEmbedding instead.
    
    Specified by:
    
    getWordvector in class DISCO
    
    Parameters:
    
    word -
    
    Returns:
    
    sparse vector.
    
    Throws:
    
    java.io.IOException
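    
    A minimal sketch of converting a dense embedding into a sparse map keyed by IDs is shown below. It assumes that the key is the dimension index written as a decimal string and that zero-valued dimensions are omitted; neither detail is confirmed by this page. SparseViewSketch is a hypothetical name.
```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the DISCO API.
final class SparseViewSketch {

    /** Maps each non-zero dimension index (as a string) to its value, in index order. */
    static Map<String, Float> toSparse(float[] dense) {
        Map<String, Float> sparse = new LinkedHashMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0f) {
                sparse.put(Integer.toString(i), dense[i]);
            }
        }
        return sparse;
    }

    public static void main(String[] args) {
        System.out.println(toSparse(new float[]{0.5f, 0f, -1f}));
    }
}
```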
  - getSecondOrderWordvector
```
public java.util.Map<java.lang.String,java.lang.Float> getSecondOrderWordvector(java.lang.String word)
                                                                         throws WrongWordspaceTypeException
```
    The second order word vector contains words as keys, namely the most similar words for word.
    
    Specified by:
    
    getSecondOrderWordvector in class DISCO
    
    Parameters:
    
    word -
    
    Returns:
    
    second order word vector or null if word not found.
    
    Throws:
    
    WrongWordspaceTypeException
  - getSecondOrderWordvector
```
public int[] getSecondOrderWordvector(int id)
```
  - collocations
```
public ReturnDataCol[] collocations(java.lang.String word)
                             throws java.io.IOException
```
    Description copied from class: DISCO
    
    Returns the collocations for the input word together with their significance values, ordered by significance value (highest significance first). If the search word is not found in the index, the return value is null.
    Unlike the method getWordvector() this method summarizes the words over the different relations.
    Important note: if used with a DenseMatrix or a DISCOLuceneIndex that has IDs as features, the return values will be IDs and not words.
    
    Specified by:
    
    collocations in class DISCO
    
    Returns:
    
    collocations ordered by significance (highest first), or null if word is not found in the index.
    
    Throws:
    
    java.io.IOException
  - wordFrequencyList
```
public int wordFrequencyList(java.lang.String outputFileName)
```
    Description copied from class: DISCO
    
    Writes word-frequency list to file. Iterates over all words in the word space and outputs them together with their corpus frequency.
    
    Specified by:
    
    wordFrequencyList in class DISCO
    
    Returns:
  - getStopwords
```
public java.lang.String[] getStopwords()
                                throws java.io.FileNotFoundException,
                                       java.io.IOException,
                                       CorruptConfigFileException
```
    Description copied from class: DISCO
    
    Gets list of stopwords from the disco.config file in the word space.
    
    Specified by:
    
    getStopwords in class DISCO
    
    Returns:
    
    stop words that were used in word space creation.
    
    Throws:
    
    java.io.FileNotFoundException
    
    java.io.IOException
    
    CorruptConfigFileException
  - getTokenCount
```
public long getTokenCount()
```
    Description copied from class: DISCO
    
    Size of the underlying corpus.
    
    Specified by:
    
    getTokenCount in class DISCO
    
    Returns:
  - getMinFreq
```
public int getMinFreq()
```
    Description copied from class: DISCO
    
    Get minimum frequency of tokens in corpus.
    
    Specified by:
    
    getMinFreq in class DISCO
    
    Returns:
  - getMaxFreq
```
public int getMaxFreq()
```
    Description copied from class: DISCO
    
    Get corpus frequency of the most frequent word in the word space (that was not filtered out by the stop word list that was used).
    
    Specified by:
    
    getMaxFreq in class DISCO
    
    Returns:
  - getVocabularyIterator
```
public java.util.Iterator<java.lang.String> getVocabularyIterator()
```
    Description copied from class: DISCO
    
    Returns an iterator that iterates over all words in the word space (the vocabulary). There is no special ordering of the words. The method remove is not supported.
    
    Specified by:
    
    getVocabularyIterator in class DISCO
    
    Returns:
    
    iterator over the vocabulary.
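    
    An iterator that does not support remove can be sketched as below. VocabularyIteratorSketch is a hypothetical class backed by a plain array, not DISCO's implementation. Since Java 8 the default Iterator.remove throws UnsupportedOperationException, so leaving it unimplemented is enough.
```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for illustration only; not part of the DISCO API.
final class VocabularyIteratorSketch implements Iterable<String> {
    private final String[] words;

    VocabularyIteratorSketch(String[] words) {
        this.words = words;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < words.length;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return words[pos++];
            }
            // remove() is deliberately not overridden: the default throws UnsupportedOperationException.
        };
    }

    public static void main(String[] args) {
        for (String w : new VocabularyIteratorSketch(new String[]{"apple", "pear"})) {
            System.out.println(w);
        }
    }
}
```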
  - getWord
```
public java.lang.String getWord(int id)
                         throws java.io.IOException
```
    Description copied from class: DISCO
    
    Returns the id-th word in the vocabulary.
    
    Specified by:
    
    getWord in class DISCO
    
    Parameters:
    
    id - id has to be between 0 and DISCO.numberOfWords() - 1.
    
    Returns:
    
    word with given id or null if id is not in the range 0..DISCO.numberOfWords() - 1.
    
    Throws:
    
    java.io.IOException
  - getWordId
```
public int getWordId(java.lang.String word)
```
    Parameters:
    
    word -
    
    Returns:
    
    word ID (equal to row index in matrix) or -1 if word not found. WARNING: if this DenseMatrix was created from a DISCOLuceneIndex then the returned ID can be larger than the number of rows in the matrix! For accessing a matrix row, always use getMatrixRowNumber(String) instead.
  - getMatrixRowNumber
```
public int getMatrixRowNumber(java.lang.String word)
```
    Parameters:
    
    word -
    
    Returns:
    
    word ID in the range 0..config.vocabularySize-1, or -1 if word is not found in the word2indexMap or its ID is larger than config.vocabularySize-1. The ID is equal to the row index in matrix (the matrix row is the word's word vector).
    The ID that you get with this method can be safely used to retrieve word vectors with getWordVector(int).
  - getWordVector
```
public float[] getWordVector(int id)
```
    Parameters:
    
    id -
    
    Returns:
    
    dense vector (the matrix row no. id) or null if id < 0 or id > config.vocabularySize-1.
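    
    The range checks described for getMatrixRowNumber(String) and getWordVector(int) can be sketched as follows. RowLookupSketch is a hypothetical class: a HashMap stands in for the word2indexMap, and the number of matrix rows stands in for config.vocabularySize.
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; not part of the DISCO API.
final class RowLookupSketch {
    private final Map<String, Integer> wordToId = new HashMap<>();
    private final float[][] matrix; // one row per word; matrix.length plays the role of vocabularySize

    RowLookupSketch(String[] words, int[] ids, float[][] matrix) {
        for (int i = 0; i < words.length; i++) {
            wordToId.put(words[i], ids[i]);
        }
        this.matrix = matrix;
    }

    /** Row index for word, or -1 if the word is unknown or its ID does not address a row. */
    int rowNumber(String word) {
        Integer id = wordToId.get(word);
        if (id == null || id < 0 || id > matrix.length - 1) {
            return -1;
        }
        return id;
    }

    /** Matrix row for id, or null if id is out of range. */
    float[] vector(int id) {
        if (id < 0 || id > matrix.length - 1) {
            return null;
        }
        return matrix[id];
    }

    public static void main(String[] args) {
        float[][] rows = {{0.1f, 0.2f}, {0.3f, 0.4f}};
        // "rare" has an ID beyond the last row, as can happen with IDs taken from another index.
        RowLookupSketch lookup = new RowLookupSketch(
                new String[]{"house", "tree", "rare"}, new int[]{0, 1, 7}, rows);
        System.out.println(lookup.rowNumber("tree"));
        System.out.println(lookup.rowNumber("rare"));
    }
}
```
    The "rare" entry mirrors the warning on getWordId: an ID can exist for a word without addressing any row, which is why row access goes through the checked lookup.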
  - getWordEmbedding
```
public float[] getWordEmbedding(java.lang.String word)
```
    Get embedding vector for word. If word is not found (out of vocabulary), null is returned unless this word space stores subword information (n-grams). In that case, a word embedding is computed on the fly from the character n-grams of word.
    
    Parameters:
    
    word -
    
    Returns:
    
    embedding vector or null.
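    
    One way to compute an embedding for an out-of-vocabulary word from its character n-grams is sketched below. It averages the vectors of all known n-grams with lengths from minN to maxN. The averaging, the absence of word boundary markers, and the class name NgramEmbeddingSketch are all assumptions; this page does not describe how DISCO combines n-gram vectors.
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the DISCO API.
final class NgramEmbeddingSketch {

    /**
     * Averages the vectors of all character n-grams of word that have a vector.
     * Returns null if n-grams are disabled (minN < 1) or no n-gram of word is known.
     */
    static float[] embed(String word, Map<String, float[]> ngramVectors,
                         int minN, int maxN, int dim) {
        if (minN < 1 || maxN < minN) {
            return null; // covers the -1 convention for "no n-grams stored"
        }
        float[] sum = new float[dim];
        int found = 0;
        for (int n = minN; n <= maxN; n++) {
            for (int start = 0; start + n <= word.length(); start++) {
                float[] v = ngramVectors.get(word.substring(start, start + n));
                if (v == null) {
                    continue;
                }
                for (int i = 0; i < dim; i++) {
                    sum[i] += v[i];
                }
                found++;
            }
        }
        if (found == 0) {
            return null;
        }
        for (int i = 0; i < dim; i++) {
            sum[i] /= found;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, float[]> ngrams = new HashMap<>();
        ngrams.put("ab", new float[]{1f, 0f});
        ngrams.put("bc", new float[]{0f, 1f});
        float[] e = embed("abc", ngrams, 2, 2, 2);
        System.out.println(e[0] + " " + e[1]);
    }
}
```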
  - getNgramVector
```
public float[] getNgramVector(java.lang.String ngram)
```
    Parameters:
    
    ngram -
    
    Returns:
    
    ngram vector, or null if the ngram is not found or this word space does not contain any ngrams.
  - getMinN
```
public int getMinN()
```
    Returns:
    
    minimum ngram size in characters or -1 if no ngrams are stored.
  - getMaxN
```
public int getMaxN()
```
    Returns:
    
    maximum ngram size in characters or -1 if no ngrams are stored.
  - getConfig
```
public ConfigFile getConfig()
```
  - setWordspaceType
```
public void setWordspaceType(DISCO.WordspaceType type)
```
  - setNumberOfSimilarWords
```
public void setNumberOfSimilarWords(int n)
```
  - setSimMatrix
```
public void setSimMatrix(int[][] simMatrix)
```
  - setSimValues
```
public void setSimValues(float[][] simValues)
```
  - getMostSimilar
```
public java.util.List<ReturnDataCol> getMostSimilar(int wordId,
                                                    int max)
```
    Compute the max most similar words for word with ID wordId. This is done by comparing the matrix row for wordId with all other matrix rows.
    
    Parameters:
    
    wordId -
    
    max -
    
    Returns:
    
    the max most similar words for the word with ID wordId.
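    
    Comparing one matrix row with all other rows amounts to a brute-force nearest-neighbour search, which can be sketched as below. MostSimilarSketch is a hypothetical class that returns row indices rather than ReturnDataCol objects, and cosine similarity is assumed as the comparison measure.
```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of the DISCO API.
final class MostSimilarSketch {

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Indices of the max rows most similar to row wordId, best first; the row itself is excluded. */
    static List<Integer> mostSimilar(float[][] matrix, int wordId, int max) {
        double[] score = new double[matrix.length];
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < matrix.length; i++) {
            if (i == wordId) {
                continue;
            }
            score[i] = cosine(matrix[wordId], matrix[i]);
            ids.add(i);
        }
        // Sort candidate rows by descending similarity to the query row.
        ids.sort(Comparator.comparingDouble((Integer i) -> score[i]).reversed());
        int n = Math.max(0, Math.min(max, ids.size()));
        return new ArrayList<>(ids.subList(0, n));
    }

    public static void main(String[] args) {
        float[][] matrix = {{1f, 0f}, {0.9f, 0.1f}, {0f, 1f}, {-1f, 0f}};
        System.out.println(mostSimilar(matrix, 0, 2));
    }
}
```
    The search costs one vector comparison per row for every query, which is one reason to precompute and store similar words in a word space of type SIM.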
  - printMostSimilar
```
public void printMostSimilar(java.lang.String w,
                             int max)
```
  - serialize
```
public static void serialize(DenseMatrix denseMatrix,
                             java.lang.String outputPath)
```
    Serialize DenseMatrix object to file.
    
    Parameters:
    
    denseMatrix -
    
    outputPath -
  - deserialize
```
public static DenseMatrix deserialize(java.io.File serializedDenseMatrixPath)
```
    Deserialize DenseMatrix object from file.
    
    Parameters:
    
    serializedDenseMatrixPath -
    
    Returns:
    
    the deserialized DenseMatrix.
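    
    Because DenseMatrix implements java.io.Serializable, writing it to a file and reading it back can plausibly be done with standard Java object serialization. The sketch below shows that round trip on a plain float[][] array. SerializationSketch is a hypothetical class, and whether DISCO uses exactly these streams internally is an assumption.
```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the DISCO API.
final class SerializationSketch {

    /** Writes any Serializable object to the given file. */
    static void serialize(Serializable obj, File out) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(out)))) {
            oos.writeObject(obj);
        }
    }

    /** Reads back an object written by serialize; the caller casts it to the expected type. */
    static Object deserialize(File in) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(in)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("matrix", ".ser");
        file.deleteOnExit();
        float[][] matrix = {{1f, 2f}, {3f, 4f}};
        serialize(matrix, file);
        float[][] restored = (float[][]) deserialize(file);
        System.out.println(Arrays.deepEquals(matrix, restored));
    }
}
```
    Java deserialization should only be applied to files from a trusted source, since reading a crafted stream can execute unexpected code.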
