public class Cluster
extends java.lang.Object
Constructor and Description |
---|
Cluster() |
Modifier and Type | Method and Description |
---|---|
void | clutoClusterSimilarityGraph(DISCO disco, int n, float minSim, java.lang.String outputDir) Creates a sparse graph file that can be clustered with CLUTO's scluster program. Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM! |
void | clutoClusterVectors(DISCO disco, java.util.ArrayList<java.lang.String> wordList, java.lang.String outputDir) Creates a sparse matrix file for use with CLUTO's vcluster program. |
static ReturnDataBN | filterOutliers(DISCO disco, java.lang.String word, int n) This method takes the list of the n most similar words of the input word and filters out all words that do not appear in the similarity list of at least one of the other similar words of the input word. The resulting list of similar words will have size <= n. Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM. |
static java.lang.String[] | growSet(DISCO disco, java.lang.String[] inputSet) Retrieves the similar words for all the words in the input set and extends the input set by all words that appear in the similarity lists of all the input words. |
public static ReturnDataBN filterOutliers(DISCO disco, java.lang.String word, int n) throws java.io.IOException, WrongWordspaceTypeException
Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM.
Parameters:
disco - DISCO word space of type DISCO.WordspaceType.SIM.
word - input word (must be a single token).
n - look in the list of the n most similar words of the input word.
Throws:
java.io.IOException
WrongWordspaceTypeException - if the disco word space is not of type DISCO.WordspaceType.SIM
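The filtering rule described for filterOutliers can be illustrated without a DISCO index. The following is only a sketch of the idea, not DISCO's implementation: the class name `OutlierFilterSketch` and the `simLists` map (word to ordered list of its most similar words) are hypothetical stand-ins for lookups in a SIM word space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class OutlierFilterSketch {

    /**
     * Keeps a candidate only if it occurs in the similarity list of at least
     * one OTHER candidate. simLists is a hypothetical in-memory replacement
     * for similarity lookups in a word space of type SIM.
     */
    public static List<String> filterOutliers(Map<String, List<String>> simLists,
                                              String word, int n) {
        List<String> all = simLists.getOrDefault(word, List.of());
        // Restrict to the n most similar words (or fewer if the list is short).
        List<String> candidates = all.subList(0, Math.min(n, all.size()));
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            for (String other : candidates) {
                boolean supported = !other.equals(candidate)
                        && simLists.getOrDefault(other, List.of()).contains(candidate);
                if (supported) {
                    kept.add(candidate);
                    break; // one supporting neighbour is enough
                }
            }
        }
        return kept; // size is always <= n
    }

    public static void main(String[] args) {
        Map<String, List<String>> simLists = Map.of(
                "apple", List.of("pear", "plum", "laptop"),
                "pear", List.of("plum", "apple"),
                "plum", List.of("pear", "apple"),
                "laptop", List.of("computer", "apple"));
        // "laptop" is dropped: neither "pear" nor "plum" lists it as similar.
        System.out.println(filterOutliers(simLists, "apple", 3)); // [pear, plum]
    }
}
```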
public static java.lang.String[] growSet(DISCO disco, java.lang.String[] inputSet) throws java.io.IOException, WrongWordspaceTypeException
Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM!
Parameters:
disco - DISCO word space of type DISCO.WordspaceType.SIM.
inputSet - set of input words (must be single tokens).
Throws:
java.io.IOException
WrongWordspaceTypeException - if the disco word space is not of type DISCO.WordspaceType.SIM
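As a rough illustration of the set-growing idea, the sketch below extends an input set by the words shared by the similarity lists of every input word. Reading "of all the input words" as an intersection is an assumption, and `GrowSetSketch` and `simLists` are hypothetical names that replace real lookups in a SIM word space.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GrowSetSketch {

    /**
     * Adds to the input set every word that occurs in the similarity list of
     * each input word (assumed interpretation: intersection of the lists).
     */
    public static String[] growSet(Map<String, List<String>> simLists, String[] inputSet) {
        Set<String> shared = null;
        for (String word : inputSet) {
            List<String> similar = simLists.getOrDefault(word, List.of());
            if (shared == null) {
                shared = new LinkedHashSet<>(similar); // first list seeds the intersection
            } else {
                shared.retainAll(similar);             // keep only words common to all lists
            }
        }
        // Input words come first, then the shared similar words, without duplicates.
        Set<String> grown = new LinkedHashSet<>(Arrays.asList(inputSet));
        if (shared != null) {
            grown.addAll(shared);
        }
        return grown.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, List<String>> simLists = Map.of(
                "red", List.of("blue", "green", "rose"),
                "blue", List.of("red", "green", "sky"));
        String[] grown = growSet(simLists, new String[] {"red", "blue"});
        System.out.println(Arrays.toString(grown)); // [red, blue, green]
    }
}
```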
public void clutoClusterSimilarityGraph(DISCO disco, int n, float minSim, java.lang.String outputDir) throws org.apache.lucene.index.CorruptIndexException, java.io.IOException, WrongWordspaceTypeException
Creates a sparse graph file that can be clustered with CLUTO's scluster program.
Important note: This method only works with word spaces of type DISCO.WordspaceType.SIM!
Parameters:
disco - DISCO word space loaded into RAM. The word space has to be of type DISCO.WordspaceType.SIM.
n - cluster the first n words in the word space index.
minSim - create an edge between words that have a similarity value of at least minSim.
outputDir - output directory. Two files are created in the output directory outputDir: sparseGraph.dat and rowLabels.dat. Existing files with these names are overwritten.
Throws:
org.apache.lucene.index.CorruptIndexException
java.io.IOException
WrongWordspaceTypeException - if the disco word space is not of type DISCO.WordspaceType.SIM
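To make the role of minSim concrete, here is a minimal sketch that turns a small similarity matrix into text in a sparse graph layout. It assumes the layout understood from the CLUTO manual (a header line with the vertex count and the number of stored entries, then one line per vertex with 1-based "neighbour weight" pairs); check the manual before relying on it. The class `SparseGraphSketch`, the in-memory `sim` matrix, the exclusion of self-loops, and the four-decimal formatting are all assumptions, not DISCO's actual code.

```java
import java.util.Locale;

public class SparseGraphSketch {

    /**
     * Builds sparse graph text from a symmetric similarity matrix.
     * An edge i-j is stored only if sim[i][j] >= minSim; self-loops are
     * skipped (an assumption made for this sketch).
     */
    public static String toSparseGraph(float[][] sim, float minSim) {
        int n = sim.length;
        int stored = 0; // number of stored (non-zero) entries over all rows
        StringBuilder rows = new StringBuilder();
        for (int i = 0; i < n; i++) {
            StringBuilder row = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (i != j && sim[i][j] >= minSim) {
                    if (row.length() > 0) {
                        row.append(' ');
                    }
                    // Vertex numbers are written 1-based.
                    row.append(j + 1).append(' ')
                       .append(String.format(Locale.ROOT, "%.4f", sim[i][j]));
                    stored++;
                }
            }
            rows.append(row).append('\n');
        }
        return n + " " + stored + "\n" + rows;
    }

    public static void main(String[] args) {
        float[][] sim = {
            {1.0f, 0.8f, 0.1f},
            {0.8f, 1.0f, 0.3f},
            {0.1f, 0.3f, 1.0f}
        };
        // With minSim = 0.25 the weak 0.1 link between words 1 and 3 is dropped.
        System.out.print(toSparseGraph(sim, 0.25f));
    }
}
```

Raising minSim yields a sparser graph and a smaller file, at the cost of leaving weakly connected words isolated.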
public void clutoClusterVectors(DISCO disco, java.util.ArrayList<java.lang.String> wordList, java.lang.String outputDir) throws java.io.IOException
Creates a sparse matrix file for use with CLUTO's vcluster program. For every word in the word list its word vector is retrieved from the DISCO index and written to the sparse matrix file. A row label file is also created that maps the row numbers to the words.
Parameters:
disco - DISCO word space loaded into RAM. The word space may be of any type.
wordList - list of words to be clustered.
outputDir - output directory. Two files are created in the output directory outputDir: sparseMatrix.dat and rowLabels.dat. Existing files with these names are overwritten.
Throws:
java.io.IOException
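The pairing of a sparse matrix file with a row label file can be sketched as follows. The layout is assumed from the CLUTO manual (a header with row count, column count and number of stored entries, then one line per row with 1-based "column value" pairs, plus a label file with one word per line). `SparseMatrixSketch`, the map-based vectors, and the number formatting are hypothetical and do not reflect how DISCO stores or writes word vectors.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class SparseMatrixSketch {

    /** Each vector maps a 1-based column (feature) number to its value. */
    public static String toSparseMatrix(List<Map<Integer, Float>> vectors, int numColumns) {
        int stored = 0; // number of stored (non-zero) entries
        StringBuilder rows = new StringBuilder();
        for (Map<Integer, Float> vector : vectors) {
            StringBuilder row = new StringBuilder();
            // TreeMap sorts the entries by column number for stable output.
            for (Map.Entry<Integer, Float> entry : new TreeMap<>(vector).entrySet()) {
                if (row.length() > 0) {
                    row.append(' ');
                }
                row.append(entry.getKey()).append(' ')
                   .append(String.format(Locale.ROOT, "%.4f", entry.getValue()));
                stored++;
            }
            rows.append(row).append('\n');
        }
        return vectors.size() + " " + numColumns + " " + stored + "\n" + rows;
    }

    /** Row i of the matrix belongs to line i of the label text. */
    public static String toRowLabels(List<String> words) {
        return String.join("\n", words) + "\n";
    }

    public static void main(String[] args) {
        List<String> words = List.of("apple", "pear");
        List<Map<Integer, Float>> vectors = List.of(
                Map.of(1, 0.5f, 3, 2.0f),
                Map.of(2, 1.5f));
        System.out.print(toSparseMatrix(vectors, 3));
        System.out.print(toRowLabels(words));
    }
}
```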