public class Compositionality
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class |
Compositionality.VectorCompositionMethod
Implemented methods of vector composition.
|
Constructor and Description |
---|
Compositionality() |
Modifier and Type | Method and Description |
---|---|
static float[] |
composeVectorsByCombinedMultAdd(float[] wv1,
float[] wv2,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c)
Compose vectors wv1 and wv2 by a combination of addition and
multiplication:
p = a*wv1 + b*wv2 + c*wv1*wv2
The contribution of multiplication and addition, as well as the contribution of each of the two vectors, can be controlled by the three parameters a, b and c.
For instance, in Mitchell and Lapata 2008, where wv1 is a verb and wv2 is a noun, the parameters are set as follows: a = 0.95, b = 0, c = 0.05. If one of a, b, c is null, then these default values are used. |
static java.util.Map<java.lang.String,java.lang.Float> |
composeVectorsByCombinedMultAdd(java.util.Map<java.lang.String,java.lang.Float> wv1,
java.util.Map<java.lang.String,java.lang.Float> wv2,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c)
Compose vectors wv1 and wv2 by a combination of addition and
multiplication:
p = a*wv1 + b*wv2 + c*wv1*wv2
The contribution of multiplication and addition, as well as the contribution of each of the two vectors, can be controlled by the three parameters a, b and c.
For instance, in Mitchell and Lapata 2008, where wv1 is a verb and wv2 is a noun, the parameters are set as follows: a = 0.95, b = 0, c = 0.05. If one of a, b, c is null, then these default values are used. |
static float[] |
composeVectorsByDilation(float[] wv1,
float[] wv2,
java.lang.Float lambda)
The following formula is used:
(wv1*wv1)wv2 + (lambda-1)(wv1*wv2)wv1
The default value (if lambda is null) for lambda is 2.0.
This composition method only works with the SimilarityMeasures.COSINE similarity measure. |
static java.util.Map<java.lang.String,java.lang.Float> |
composeVectorsByDilation(java.util.Map<java.lang.String,java.lang.Float> wv1,
java.util.Map<java.lang.String,java.lang.Float> wv2,
java.lang.Float lambda)
The following formula is used:
(wv1*wv1)wv2 + (lambda-1)(wv1*wv2)wv1
The default value (if lambda is null) for lambda is 2.0.
This composition method only works with the SimilarityMeasures.COSINE similarity measure. |
static java.util.Map<java.lang.String,java.lang.Float> |
composeWordVectors(java.util.ArrayList<java.util.Map<java.lang.String,java.lang.Float>> wordvectorList,
Compositionality.VectorCompositionMethod compositionMethod,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c,
java.lang.Float lambda)
Compose two or more word vectors by the composition method given in compositionMethod. |
static float[] |
composeWordVectors(float[] wv1,
float[] wv2,
Compositionality.VectorCompositionMethod compositionMethod,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c,
java.lang.Float lambda)
Compose two word vectors by the composition method given in compositionMethod. |
static float[] |
composeWordVectors(java.util.List<float[]> wordvectorList,
Compositionality.VectorCompositionMethod compositionMethod,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c,
java.lang.Float lambda)
Compose two or more word vectors by the composition method given in compositionMethod. |
static java.util.Map<java.lang.String,java.lang.Float> |
composeWordVectors(java.util.Map<java.lang.String,java.lang.Float> wv1,
java.util.Map<java.lang.String,java.lang.Float> wv2,
Compositionality.VectorCompositionMethod compositionMethod,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c,
java.lang.Float lambda)
Compose two word vectors by the composition method given in compositionMethod. |
static float |
compositionalSemanticSimilarity(java.lang.String multiWords1,
java.lang.String multiWords2,
Compositionality.VectorCompositionMethod compositionMethod,
DISCO.SimilarityMeasure simMeasure,
DISCO disco,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c,
java.lang.Float lambda)
This method computes the semantic similarity between two multi-word terms,
phrases, sentences or paragraphs.
|
static float[] |
computeAvgDenseOffsetVector(java.util.List<java.lang.String[]> wordPairs,
DISCO disco)
Computes the average vector over all offset vectors in the wordPairs list. |
static java.util.Map<java.lang.String,java.lang.Float> |
computeWordVector(java.lang.String[] multi,
Compositionality.VectorCompositionMethod compositionMethod,
DISCO disco,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c,
java.lang.Float lambda)
Construct a word vector that represents the multi-word phrase. |
static float[] |
computeWordVector(java.lang.String[] multi,
DenseMatrix disco,
Compositionality.VectorCompositionMethod compositionMethod,
java.lang.Float a,
java.lang.Float b,
java.lang.Float c,
java.lang.Float lambda)
Construct a word embedding that represents the multi-word phrase. |
static java.util.List<java.lang.Integer> |
findShortestPath(int i1,
int i2,
DenseMatrix denseMatrix)
Exhaustive breadth-first search to find the shortest path between two input
words in the neighborhood graph.
|
static java.util.List<java.lang.String> |
findShortestPath(java.lang.String w1,
java.lang.String w2,
DenseMatrix denseMatrix)
Wrapper method.
|
static void |
printWordVector(java.util.Map<java.lang.String,java.lang.Float> wordvector)
Utility function.
|
static java.util.List<ReturnDataCol> |
similarWords(float[] wordEmbedding,
DenseMatrix disco,
DISCO.SimilarityMeasure simMeasure,
int maxN)
Find the most similar words in the DISCO word space for an input word
vector.
|
static java.util.List<ReturnDataCol> |
similarWords(java.util.Map<java.lang.String,java.lang.Float> wordvector,
DISCO disco,
DISCO.SimilarityMeasure simMeasure,
int maxN)
Find the most similar words in the DISCO word space for an input word
vector.
|
static java.util.List<ReturnDataCol> |
similarWordsGraphSearch(float[] wordEmbedding,
DenseMatrix disco,
DISCO.SimilarityMeasure simMeasure,
int nMax)
Approximate nearest neighbor search to find the most similar word in the vocabulary for an input wordvector. |
static java.util.List<ReturnDataCol> |
similarWordsGraphSearch(java.util.Map<java.lang.String,java.lang.Float> wordvector,
DISCO disco,
DISCO.SimilarityMeasure simMeasure,
int nMax)
Approximate nearest neighbor search to find the most similar word in the vocabulary for an input wordvector. |
static java.util.List<ReturnDataCol> |
solveAnalogy(java.lang.String b1,
java.lang.String a2,
java.lang.String b2,
DISCO disco)
This method solves the analogy "a1 is to b1 as a2 is to b2".
|
static java.util.List<ReturnDataCol> |
solveAnalogyApprox(java.lang.String b1,
java.lang.String a2,
java.lang.String b2,
DISCO disco)
Fast approximation of solveAnalogy. |
static java.util.List<ReturnDataCol> |
solveAnalogyAverageOffset(java.lang.String b1,
java.util.List<java.lang.String[]> wordPairs,
DISCO disco)
Solves the analogy a1 : b1 = a2 : b2 by returning the missing word a1. |
static float[] |
vectorRejection(float[] a,
float[] b)
Computes vector rejection of a on b.
|
static java.util.Map<java.lang.String,java.lang.Float> |
vectorRejection(java.util.Map<java.lang.String,java.lang.Float> a,
java.util.Map<java.lang.String,java.lang.Float> b)
Computes vector rejection of a on b.
|
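Vector rejection, as listed for `vectorRejection` in the summary above, is conventionally defined as the part of a that is orthogonal to b, i.e. a minus the projection of a onto b. The following is a minimal sketch of that standard definition for dense vectors. It is not taken from the DISCO sources; the class name `VectorRejectionSketch`, the helper `dot`, and the handling of a zero vector b are assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only; not the DISCO implementation.
final class VectorRejectionSketch {

    /** Dot product of two equal-length vectors. */
    static float dot(float[] x, float[] y) {
        float sum = 0f;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    /** Rejection of a on b: a minus the projection of a onto b. */
    static float[] rejection(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        float bb = dot(b, b);
        if (bb == 0f) {
            // Assumption: a zero vector b removes nothing from a.
            return a.clone();
        }
        float scale = dot(a, b) / bb;
        float[] result = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = a[i] - scale * b[i];
        }
        return result;
    }

    public static void main(String[] args) {
        float[] a = {3f, 4f};
        float[] b = {1f, 0f};
        // The result has no component left along b.
        System.out.println(Arrays.toString(rejection(a, b)));
    }
}
```

The result is orthogonal to b, which is what makes the operation useful for removing one sense from a word vector.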
public static java.util.Map<java.lang.String,java.lang.Float> composeVectorsByDilation(java.util.Map<java.lang.String,java.lang.Float> wv1, java.util.Map<java.lang.String,java.lang.Float> wv2, java.lang.Float lambda)
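The following formula is used:
(wv1*wv1)wv2 + (lambda-1)(wv1*wv2)wv1
The default value for lambda (used if lambda is null) is 2.0.
This composition method only works with the SimilarityMeasures.COSINE similarity measure.
Parameters:
wv1 - first word vector
wv2 - second word vector
lambda - defaults to 2.0 if null
public static float[] composeVectorsByDilation(float[] wv1, float[] wv2, java.lang.Float lambda)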
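The following formula is used:
(wv1*wv1)wv2 + (lambda-1)(wv1*wv2)wv1
The default value for lambda (used if lambda is null) is 2.0.
This composition method only works with the SimilarityMeasures.COSINE similarity measure.
Parameters:
wv1 - first word vector
wv2 - second word vector
lambda - defaults to 2.0 if null
public static java.util.Map<java.lang.String,java.lang.Float> composeVectorsByCombinedMultAdd(java.util.Map<java.lang.String,java.lang.Float> wv1, java.util.Map<java.lang.String,java.lang.Float> wv2, java.lang.Float a, java.lang.Float b, java.lang.Float c)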
Compose vectors wv1 and wv2 by a combination of addition and multiplication:
p = a*wv1 + b*wv2 + c*wv1*wv2
The contribution of multiplication and addition, as well as the contribution of each of the two vectors, can be controlled by the three parameters a, b and c.
For instance, in Mitchell and Lapata 2008, where wv1 is a verb and wv2 is a noun, the parameters are set as follows:
a = 0.95
b = 0
c = 0.05
If one of a, b, c is null, then these default values are used.
Parameters:
wv1 - first word vector
wv2 - second word vector
a - weight of additive contribution of first word vector
b - weight of additive contribution of second word vector
c - weight of multiplicative contribution of both word vectors
public static float[] composeVectorsByCombinedMultAdd(float[] wv1, float[] wv2, java.lang.Float a, java.lang.Float b, java.lang.Float c)
Compose vectors wv1 and wv2 by a combination of addition and multiplication:
p = a*wv1 + b*wv2 + c*wv1*wv2
The contribution of multiplication and addition, as well as the contribution of each of the two vectors, can be controlled by the three parameters a, b and c.
For instance, in Mitchell and Lapata 2008, where wv1 is a verb and wv2 is a noun, the parameters are set as follows:
a = 0.95
b = 0
c = 0.05
If one of a, b, c is null, then these default values are used.
Parameters:
wv1 - first word vector
wv2 - second word vector
a - weight of additive contribution of first word vector
b - weight of additive contribution of second word vector
c - weight of multiplicative contribution of both word vectors
public static java.util.Map<java.lang.String,java.lang.Float> vectorRejection(java.util.Map<java.lang.String,java.lang.Float> a, java.util.Map<java.lang.String,java.lang.Float> b)
Computes vector rejection of a on b. For example:
bank_without_finance = vectorRejection(bank, averageVector(deposit, account, cashier))
Parameters:
a
b
public static float[] vectorRejection(float[] a, float[] b)
Computes vector rejection of a on b. For example:
bank_without_finance = vectorRejection(bank, averageVector(deposit, account, cashier))
Parameters:
a
b
public static java.util.Map<java.lang.String,java.lang.Float> composeWordVectors(java.util.Map<java.lang.String,java.lang.Float> wv1, java.util.Map<java.lang.String,java.lang.Float> wv2, Compositionality.VectorCompositionMethod compositionMethod, java.lang.Float a, java.lang.Float b, java.lang.Float c, java.lang.Float lambda)
Compose two word vectors by the composition method given in compositionMethod.
Parameters:
wv1 - word vector #1
wv2 - word vector #2
compositionMethod - one of the methods in VectorCompositionMethod.
a - only needed for composition method COMBINED.
b - only needed for composition method COMBINED.
c - only needed for composition method COMBINED.
lambda - only needed for composition method DILATION.
Returns:
the composed word vector, or null.
public static float[] composeWordVectors(float[] wv1, float[] wv2, Compositionality.VectorCompositionMethod compositionMethod, java.lang.Float a, java.lang.Float b, java.lang.Float c, java.lang.Float lambda)
Compose two word vectors by the composition method given in compositionMethod.
Parameters:
wv1 - word vector #1
wv2 - word vector #2
compositionMethod - one of the methods in VectorCompositionMethod.
a - only needed for composition method COMBINED.
b - only needed for composition method COMBINED.
c - only needed for composition method COMBINED.
lambda - only needed for composition method DILATION.
Returns:
the composed word vector, or null.
public static java.util.Map<java.lang.String,java.lang.Float> composeWordVectors(java.util.ArrayList<java.util.Map<java.lang.String,java.lang.Float>> wordvectorList, Compositionality.VectorCompositionMethod compositionMethod, java.lang.Float a, java.lang.Float b, java.lang.Float c, java.lang.Float lambda)
Compose two or more word vectors by the composition method given in compositionMethod.
Parameters:
wordvectorList - a list of word vectors to be combined. The list must have at least two elements. The ordering of the list has no influence on the result.
compositionMethod - one of the methods in VectorCompositionMethod.
a - only needed for composition method COMBINED.
b - only needed for composition method COMBINED.
c - only needed for composition method COMBINED.
lambda - only needed for composition method DILATION.
Returns:
the composed word vector, or null.
public static float[] composeWordVectors(java.util.List<float[]> wordvectorList, Compositionality.VectorCompositionMethod compositionMethod, java.lang.Float a, java.lang.Float b, java.lang.Float c, java.lang.Float lambda)
Compose two or more word vectors by the composition method given in compositionMethod.
Parameters:
wordvectorList - a list of word vectors to be combined. The list must have at least two elements. The ordering of the list has no influence on the result.
compositionMethod - one of the methods in VectorCompositionMethod.
a - only needed for composition method COMBINED.
b - only needed for composition method COMBINED.
c - only needed for composition method COMBINED.
lambda - only needed for composition method DILATION.
Returns:
the composed word vector, or null.
public static void printWordVector(java.util.Map<java.lang.String,java.lang.Float> wordvector)
Utility function.
Parameters:
wordvector
public static float compositionalSemanticSimilarity(java.lang.String multiWords1, java.lang.String multiWords2, Compositionality.VectorCompositionMethod compositionMethod, DISCO.SimilarityMeasure simMeasure, DISCO disco, java.lang.Float a, java.lang.Float b, java.lang.Float c, java.lang.Float lambda) throws java.io.IOException
This method computes the semantic similarity between two multi-word terms, phrases, sentences or paragraphs. The word vectors of each input are composed using composeWordVectors(). The two resulting vectors are then compared using Compositionality.semanticSimilarity().
The methods in TextSimilarity might give more accurate results for short text similarity because they weight the words in the input strings by their frequency and try to align words in the input strings.
Parameters:
multiWords1 - a tokenized string containing a multi-word term, phrase, sentence or paragraph.
multiWords2 - a tokenized string containing a multi-word term, phrase, sentence or paragraph.
compositionMethod - a vector composition method.
simMeasure - a similarity measure.
disco - a DISCOLuceneIndex word space.
a - only needed for composition method COMBINED.
b - only needed for composition method COMBINED.
c - only needed for composition method COMBINED.
lambda - only needed for composition method DILATION.
Returns:
the similarity between multiWords1 and multiWords2.
Throws:
java.io.IOException
See Also:
TextSimilarity
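The parameters a, b, c and lambda feed the two formulas documented for `composeVectorsByCombinedMultAdd` and `composeVectorsByDilation`. The sketch below spells out both formulas for dense vectors and compares two composed vectors by cosine. It is an illustration, not DISCO code. The class and method names are hypothetical, and three readings are assumptions: `wv1*wv2` in the combined formula is taken as the component-wise product, `(wv1*wv1)` and `(wv1*wv2)` in the dilation formula are taken as dot products (as in Mitchell and Lapata's dilation model), and the three defaults are applied together as soon as any of a, b, c is null.

```java
// Illustrative sketch only; not the DISCO implementation.
final class CompositionSketch {

    /** Dot product of two equal-length vectors. */
    static float dot(float[] x, float[] y) {
        float sum = 0f;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    /** p = a*wv1 + b*wv2 + c*wv1*wv2 (component-wise product assumed). */
    static float[] combinedMultAdd(float[] wv1, float[] wv2, Float a, Float b, Float c) {
        // Assumption: if any weight is null, all three documented defaults apply.
        boolean useDefaults = (a == null || b == null || c == null);
        float fa = useDefaults ? 0.95f : a;
        float fb = useDefaults ? 0f : b;
        float fc = useDefaults ? 0.05f : c;
        float[] p = new float[wv1.length];
        for (int i = 0; i < wv1.length; i++) {
            p[i] = fa * wv1[i] + fb * wv2[i] + fc * wv1[i] * wv2[i];
        }
        return p;
    }

    /** (wv1.wv1) wv2 + (lambda-1)(wv1.wv2) wv1, with lambda defaulting to 2.0. */
    static float[] dilation(float[] wv1, float[] wv2, Float lambda) {
        float l = (lambda == null) ? 2.0f : lambda;
        float uu = dot(wv1, wv1);
        float uv = dot(wv1, wv2);
        float[] p = new float[wv1.length];
        for (int i = 0; i < wv1.length; i++) {
            p[i] = uu * wv2[i] + (l - 1f) * uv * wv1[i];
        }
        return p;
    }

    /** Cosine similarity; returns 0 if either vector has zero length. */
    static double cosine(float[] x, float[] y) {
        double norm = Math.sqrt(dot(x, x)) * Math.sqrt(dot(y, y));
        return norm == 0.0 ? 0.0 : dot(x, y) / norm;
    }

    public static void main(String[] args) {
        float[] verb = {1f, 2f};
        float[] noun = {3f, 4f};
        float[] combined = combinedMultAdd(verb, noun, null, null, null);
        float[] dilated = dilation(verb, noun, null);
        System.out.println(cosine(combined, dilated));
    }
}
```

The dilation result scales with the squared length of wv1, which is consistent with the note that dilation only works with a length-insensitive measure such as cosine.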
public static java.util.Map<java.lang.String,java.lang.Float> computeWordVector(java.lang.String[] multi, Compositionality.VectorCompositionMethod compositionMethod, DISCO disco, java.lang.Float a, java.lang.Float b, java.lang.Float c, java.lang.Float lambda) throws java.io.IOException
Construct a word vector that represents the multi-word phrase.
Parameters:
multi - a multi-token term, phrase or sentence (one token per array element).
compositionMethod
disco
a
b
c
lambda
Throws:
java.io.IOException
public static float[] computeWordVector(java.lang.String[] multi, DenseMatrix disco, Compositionality.VectorCompositionMethod compositionMethod, java.lang.Float a, java.lang.Float b, java.lang.Float c, java.lang.Float lambda) throws java.io.IOException
Construct a word embedding that represents the multi-word phrase.
Parameters:
multi - a multi-token term, phrase or sentence (one token per array element).
disco
compositionMethod
a
b
c
lambda
Throws:
java.io.IOException
public static java.util.List<ReturnDataCol> similarWords(java.util.Map<java.lang.String,java.lang.Float> wordvector, DISCO disco, DISCO.SimilarityMeasure simMeasure, int maxN) throws java.io.IOException
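Find the most similar words in the DISCO word space for an input word vector. Even if the input vector represents a multi-token phrase (for example, one produced by Compositionality.composeWordVectors()), the most similar words will only be single-token words from the index.
Parameters:
wordvector - input word vector
disco - DISCO word space
simMeasure
maxN - return only the maxN most similar words. If maxN < 1 all words are returned.
Returns:
the words whose similarity with wordvector is greater than zero, ordered by similarity value (highest value first).
Throws:
java.io.IOException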
public static java.util.List<ReturnDataCol> similarWords(float[] wordEmbedding, DenseMatrix disco, DISCO.SimilarityMeasure simMeasure, int maxN) throws java.io.IOException
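Find the most similar words in the DISCO word space for an input word vector. Even if the input vector represents a multi-token phrase (for example, one produced by Compositionality.composeWordVectors()), the most similar words will only be single-token words from the index.
Parameters:
wordEmbedding - input word vector
disco - DISCO word space
simMeasure
maxN - return only the maxN most similar words. If maxN < 1 all words are returned.
Returns:
the words whose similarity with wordEmbedding is greater than zero, ordered by similarity value (highest value first).
Throws:
java.io.IOException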
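The contract described for `similarWords` (compare the input vector with every word, keep similarities greater than zero, sort highest first, and treat `maxN < 1` as "return everything") can be sketched as a plain brute-force scan. This is an illustration under assumptions, not the DISCO implementation: the vocabulary is modelled as a hypothetical in-memory `Map<String, float[]>` instead of a DISCO word space, cosine stands in for the configurable similarity measure, and results are returned as map entries instead of `ReturnDataCol` objects.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the DISCO implementation.
final class BruteForceSimilarWordsSketch {

    /** Cosine similarity; returns 0 if either vector has zero length. */
    static double cosine(float[] x, float[] y) {
        double xy = 0, xx = 0, yy = 0;
        for (int i = 0; i < x.length; i++) {
            xy += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        double norm = Math.sqrt(xx) * Math.sqrt(yy);
        return norm == 0.0 ? 0.0 : xy / norm;
    }

    /** Words with similarity > 0, highest first; maxN < 1 returns all of them. */
    static List<Map.Entry<String, Double>> similarWords(
            float[] query, Map<String, float[]> vocab, int maxN) {
        List<Map.Entry<String, Double>> hits = new ArrayList<>();
        for (Map.Entry<String, float[]> e : vocab.entrySet()) {
            double sim = cosine(query, e.getValue());
            if (sim > 0) {
                hits.add(new AbstractMap.SimpleEntry<String, Double>(e.getKey(), sim));
            }
        }
        // Sort by similarity, highest value first.
        hits.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        if (maxN < 1 || maxN >= hits.size()) {
            return hits;
        }
        return new ArrayList<>(hits.subList(0, maxN));
    }

    public static void main(String[] args) {
        Map<String, float[]> vocab = new LinkedHashMap<>();
        vocab.put("a", new float[]{1f, 0f});
        vocab.put("b", new float[]{1f, 1f});
        vocab.put("c", new float[]{-1f, 0f});
        for (Map.Entry<String, Double> hit : similarWords(new float[]{1f, 0f}, vocab, 0)) {
            System.out.println(hit.getKey() + " " + hit.getValue());
        }
    }
}
```

The cost is one similarity computation per vocabulary word, which is the baseline that `similarWordsGraphSearch` is described as speeding up.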
public static java.util.List<ReturnDataCol> similarWordsGraphSearch(java.util.Map<java.lang.String,java.lang.Float> wordvector, DISCO disco, DISCO.SimilarityMeasure simMeasure, int nMax) throws java.io.IOException, WrongWordspaceTypeException
Approximate nearest neighbor search to find the most similar word in the vocabulary for an input wordvector. This is about 20 times faster than brute-force search with Compositionality.similarWords.
The true nearest neighbor is found in 80% of cases (depending on the number of similar words stored for each word).
Kohei Sugawara, Hayato Kobayashi, Masajiro Iwasaki. On Approximately Searching for Similar Word Embeddings. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2265–2275, Berlin, Germany, August 7-12, 2016.
The pre-computed most similar words that are stored for each word (in DISCO word spaces of type SIM) constitute a neighborhood graph. The basic idea is to use this graph as a search index. Instead of comparing the input word vector with the word vectors of all words in the vocabulary (brute-force search), we perform a best-first search. First, we pick a random word w and compute the similarity of all neighbors of w with the input word vector. The closest neighbor is then set as the new word w, and the process is repeated until no new w can be found that is closer to the input vector.
Mark Steyvers, Joshua B. Tenenbaum. The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth. Cognitive Science 29 (2005) 41–78.
Parameters:
wordvector
disco
simMeasure
nMax - return at most nMax words.
Throws:
java.io.IOException
WrongWordspaceTypeException
public static java.util.List<ReturnDataCol> similarWordsGraphSearch(float[] wordEmbedding, DenseMatrix disco, DISCO.SimilarityMeasure simMeasure, int nMax) throws java.io.IOException, WrongWordspaceTypeException
Approximate nearest neighbor search to find the most similar word in the vocabulary for an input wordEmbedding. This is about 20 times faster than brute-force search with Compositionality.similarWords.
The true nearest neighbor is found in 80% of cases (depending on the number of similar words stored for each word).
Kohei Sugawara, Hayato Kobayashi, Masajiro Iwasaki. On Approximately Searching for Similar Word Embeddings. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2265–2275, Berlin, Germany, August 7-12, 2016.
The pre-computed most similar words that are stored for each word (in DISCO word spaces of type SIM) constitute a neighborhood graph. The basic idea is to use this graph as a search index. Instead of comparing the input word vector with the word vectors of all words in the vocabulary (brute-force search), we perform a best-first search. First, we pick a random word w and compute the similarity of all neighbors of w with the input word vector. The closest neighbor is then set as the new word w, and the process is repeated until no new w can be found that is closer to the input vector.
Mark Steyvers, Joshua B. Tenenbaum. The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth. Cognitive Science 29 (2005) 41–78.
Parameters:
wordEmbedding
disco
simMeasure
nMax - return at most nMax words.
Returns:
the words most similar to wordEmbedding, sorted by similarity to wordEmbedding (most similar first).
Throws:
java.io.IOException
WrongWordspaceTypeException
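The best-first search described above (start at some word w, score all neighbors of w against the input vector, move to the closest neighbor, stop when no neighbor is closer) can be sketched as follows. This is a simplified illustration, not the DISCO implementation: the neighborhood graph is a hypothetical adjacency array `neighbors[i]` of word IDs, `vectors[i]` holds the dense vector of word i, cosine stands in for the similarity measure, and only the single best word is returned instead of up to nMax words.

```java
import java.util.Random;

// Illustrative sketch only; not the DISCO implementation.
final class GraphSearchSketch {

    /** Cosine similarity; returns 0 if either vector has zero length. */
    static double cosine(float[] x, float[] y) {
        double xy = 0, xx = 0, yy = 0;
        for (int i = 0; i < x.length; i++) {
            xy += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        double norm = Math.sqrt(xx) * Math.sqrt(yy);
        return norm == 0.0 ? 0.0 : xy / norm;
    }

    /** Greedy walk over the neighborhood graph, starting at word ID start. */
    static int greedySearch(float[] query, float[][] vectors, int[][] neighbors, int start) {
        int current = start;
        double best = cosine(query, vectors[current]);
        boolean improved = true;
        while (improved) {
            improved = false;
            int candidate = current;
            // Score every neighbor of the current word against the query.
            for (int n : neighbors[current]) {
                double sim = cosine(query, vectors[n]);
                if (sim > best) {
                    best = sim;
                    candidate = n;
                    improved = true;
                }
            }
            // Move to the closest neighbor, or stay if none is closer.
            current = candidate;
        }
        return current;
    }

    public static void main(String[] args) {
        float[][] vectors = {{1f, 0f}, {1f, 1f}, {0f, 1f}};
        int[][] neighbors = {{1}, {0, 2}, {1}};
        // As in the description, the walk starts from a randomly picked word.
        int start = new Random().nextInt(vectors.length);
        System.out.println(greedySearch(new float[]{0f, 1f}, vectors, neighbors, start));
    }
}
```

Because the walk only follows improving edges, it can stop at a local optimum, which matches the statement that the true nearest neighbor is not always found.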
public static java.util.List<java.lang.Integer> findShortestPath(int i1, int i2, DenseMatrix denseMatrix) throws WrongWordspaceTypeException
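Exhaustive breadth-first search to find the shortest path between two input words in the neighborhood graph. A path is typically found between any two words (for numberOfSimilarWords >= 50), showing that the neighborhood graph of word spaces is fully connected. For more information on the neighborhood graph see similarWordsGraphSearch.
Parameters:
i1 - ID of input word #1
i2 - ID of input word #2
denseMatrix
Returns:
the shortest path between i1 and i2. The resulting list contains the path in reverse order, i.e. the first list element is i2, the last element is i1.
Throws:
WrongWordspaceTypeException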
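A breadth-first search that returns the path in the reverse order described above (first element i2, last element i1) can be sketched as below. This is not the DISCO implementation: the neighborhood graph is modelled as a hypothetical adjacency array `neighbors[i]` of word IDs rather than a `DenseMatrix`, and returning an empty list when no path exists is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only; not the DISCO implementation.
final class ShortestPathSketch {

    /** Shortest path from i1 to i2, listed in reverse order: [i2, ..., i1]. */
    static List<Integer> findShortestPath(int i1, int i2, int[][] neighbors) {
        int[] parent = new int[neighbors.length];
        Arrays.fill(parent, -1);
        boolean[] visited = new boolean[neighbors.length];
        Deque<Integer> queue = new ArrayDeque<>();
        visited[i1] = true;
        queue.add(i1);
        while (!queue.isEmpty()) {
            int w = queue.poll();
            if (w == i2) {
                break;
            }
            for (int n : neighbors[w]) {
                if (!visited[n]) {
                    visited[n] = true;
                    parent[n] = w;   // remember how n was reached
                    queue.add(n);
                }
            }
        }
        List<Integer> path = new ArrayList<>();
        if (!visited[i2]) {
            return path;             // assumption: empty list if i2 is unreachable
        }
        // Walking the parent links from i2 back to i1 yields the reverse order.
        for (int w = i2; w != -1; w = parent[w]) {
            path.add(w);
        }
        return path;
    }

    public static void main(String[] args) {
        int[][] neighbors = {{1}, {2}, {3}, {}};
        System.out.println(findShortestPath(0, 3, neighbors));
    }
}
```

Following parent links from the target back to the source produces the reversed path without an extra reversal step.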
public static java.util.List<java.lang.String> findShortestPath(java.lang.String w1, java.lang.String w2, DenseMatrix denseMatrix) throws WrongWordspaceTypeException, java.io.IOException
Wrapper method.
Parameters:
w1
w2
denseMatrix
Throws:
WrongWordspaceTypeException
java.io.IOException
public static java.util.List<ReturnDataCol> solveAnalogy(java.lang.String b1, java.lang.String a2, java.lang.String b2, DISCO disco) throws java.io.IOException, WrongWordspaceTypeException
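This method solves the analogy "a1 is to b1 as a2 is to b2" by returning candidates for the missing word a1. For a faster approximation, use solveAnalogyApprox instead.
Parameters:
b1 - must be single token, e.g. "woman"
a2 - must be single token, e.g. "king"
b2 - must be single token, e.g. "man"
disco
Returns:
the list of candidate words, or null if one of the words b1, a2, or b2 was not found in the DISCO index. You may want to filter out b1, a2, and b2 from the resulting list.
Throws:
java.io.IOException
WrongWordspaceTypeException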
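The offset notation used elsewhere on this page (an offset v(a2) - v(b2) that is added to v(b1)) suggests that the target vector is v(b1) + v(a2) - v(b2). The sketch below computes that vector on toy data and picks the nearest word while filtering out the three input words, as recommended above. It is an assumption-laden illustration, not DISCO code: the vocabulary is a hypothetical in-memory map, the two-dimensional vectors are made up, and cosine stands in for the similarity measure.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not the DISCO implementation.
final class AnalogySketch {

    /** Target vector v(b1) + v(a2) - v(b2). */
    static float[] analogyTarget(float[] b1, float[] a2, float[] b2) {
        float[] target = new float[b1.length];
        for (int i = 0; i < b1.length; i++) {
            target[i] = b1[i] + a2[i] - b2[i];
        }
        return target;
    }

    /** Cosine similarity; returns 0 if either vector has zero length. */
    static double cosine(float[] x, float[] y) {
        double xy = 0, xx = 0, yy = 0;
        for (int i = 0; i < x.length; i++) {
            xy += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        double norm = Math.sqrt(xx) * Math.sqrt(yy);
        return norm == 0.0 ? 0.0 : xy / norm;
    }

    /** Word closest to target, skipping the excluded input words. */
    static String nearest(float[] target, Map<String, float[]> vocab, Set<String> exclude) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : vocab.entrySet()) {
            if (exclude.contains(e.getKey())) {
                continue;
            }
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, float[]> vocab = new LinkedHashMap<>();
        vocab.put("man", new float[]{1f, 0f});      // made-up toy vectors
        vocab.put("woman", new float[]{1f, 1f});
        vocab.put("king", new float[]{3f, 0f});
        vocab.put("queen", new float[]{3f, 1f});
        vocab.put("apple", new float[]{0f, 5f});
        float[] target = analogyTarget(vocab.get("woman"), vocab.get("king"), vocab.get("man"));
        Set<String> inputs = new HashSet<>(Arrays.asList("woman", "king", "man"));
        System.out.println(nearest(target, vocab, inputs));
    }
}
```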
public static java.util.List<ReturnDataCol> solveAnalogyApprox(java.lang.String b1, java.lang.String a2, java.lang.String b2, DISCO disco) throws java.io.IOException, WrongWordspaceTypeException
Fast approximation of solveAnalogy. This uses similarWordsGraphSearch instead of similarWords to find the nearest words for the composed word vector v(a1).
Parameters:
b1
a2
b2
disco
Returns:
the list of candidate words, or null if one of the words b1, a2, or b2 was not found in the DISCO index. You may want to filter out b1, a2, and b2 from the resulting list.
Throws:
java.io.IOException
WrongWordspaceTypeException
public static float[] computeAvgDenseOffsetVector(java.util.List<java.lang.String[]> wordPairs, DISCO disco)
Computes the average vector over all offset vectors in the wordPairs list. An offset vector is computed for each entry in wordPairs as v(a2) - v(b2) = v(wordPairs.get(i)[0]) - v(wordPairs.get(i)[1]), with v(x) being the word vector for the word x.
Parameters:
wordPairs - each String array in the list must store word pairs [a2, b2] (see solveAnalogy).
disco
Returns:
the average offset vector, or null if no word pair was found in disco.
public static java.util.List<ReturnDataCol> solveAnalogyAverageOffset(java.lang.String b1, java.util.List<java.lang.String[]> wordPairs, DISCO disco) throws java.io.IOException
Solves the analogy a1 : b1 = a2 : b2 by returning the missing word a1. In contrast to the method solveAnalogy, where you supply only a single pair a2, b2, this method computes the average offset vector over all pairs in wordPairs and uses this as the offset vector to get more robust results.
Parameters:
b1
wordPairs - list of pairs [a2, b2]
disco
Returns:
the words in disco most similar to the vector computed by v(b1) + computeAvgDenseOffsetVector(wordPairs), or null if b1 or none of wordPairs was found in disco.
Throws:
java.io.IOException
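The averaging step can be sketched directly from the formula given for `computeAvgDenseOffsetVector`: one offset v(pair[0]) - v(pair[1]) per word pair, averaged over the pairs that were found, and null if none was found. The code below is an illustration under assumptions, not DISCO code: word vectors come from a hypothetical in-memory map instead of a DISCO word space, and pairs with a missing word are simply skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the DISCO implementation.
final class AverageOffsetSketch {

    /** Average of v(pair[0]) - v(pair[1]) over all pairs found; null if none. */
    static float[] averageOffset(List<String[]> wordPairs, Map<String, float[]> vectors) {
        float[] sum = null;
        int found = 0;
        for (String[] pair : wordPairs) {
            float[] a2 = vectors.get(pair[0]);
            float[] b2 = vectors.get(pair[1]);
            if (a2 == null || b2 == null) {
                continue;            // assumption: skip pairs with unknown words
            }
            if (sum == null) {
                sum = new float[a2.length];
            }
            for (int i = 0; i < sum.length; i++) {
                sum[i] += a2[i] - b2[i];
            }
            found++;
        }
        if (found == 0) {
            return null;
        }
        for (int i = 0; i < sum.length; i++) {
            sum[i] /= found;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, float[]> vectors = new HashMap<>();
        vectors.put("king", new float[]{3f, 0f});   // made-up toy vectors
        vectors.put("man", new float[]{1f, 0f});
        vectors.put("prince", new float[]{4f, 0f});
        vectors.put("boy", new float[]{1f, 0f});
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[]{"king", "man"});
        pairs.add(new String[]{"prince", "boy"});
        System.out.println(Arrays.toString(averageOffset(pairs, vectors)));
    }
}
```

Adding the averaged offset to v(b1) then gives the query vector whose nearest words are returned.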