DISCO Builder

  • Basics
  • Create a standard DISCO word space
  • Create a word space with documents as features
  • Create a word space from parsed text
  • Options in disco.config
  • Import word spaces from other tools
  • Evaluation of your word spaces
  • Share your word spaces

Create your own DISCO word spaces with DISCO Builder

Download and installation

Download DISCOBuilder-1.0.tar.gz and unpack it. You need a Java 7 Runtime Environment (JRE).

DISCO Builder is licensed under the Creative Commons Attribution-NonCommercial license. For commercial use, please contact peter.kolb@linguatools.org.


Getting started

Follow the three steps below to create a DISCO word space from the small test corpus supplied with the DISCO Builder distribution. If this works, go on to read the basics section. After that, you are ready to create word spaces from your own corpus as described in one of the sections Create a standard DISCO word space with words as features, Create a word space with documents as features, or Create a word space from parsed text. All options are explained thoroughly in the options section. If you want to import a word space from word2vec or GloVe, go to the import section.

1. Create an output directory.

2. Edit the file disco.config in the DISCO Builder directory. DISCO Builder is controlled via this configuration file. Set the parameter outputDir to point to the directory that you created in step 1, and set the parameter inputDir to point to the directory test-corpus-lemma in the DISCO Builder directory. Leave all other parameters unchanged.

3. Start DISCO Builder:

java -jar DISCOBuilder-1.0.jar disco.config

If the word space was created without error you can examine the contents of the output directory. There will be a lot of files and one directory called DISCO-idx. This directory contains your new word space. Never change the names of any of the files in this directory! Otherwise, the DISCO API will not work with this word space any more. Also, do not edit the configuration file disco.config in the DISCO-idx directory. Some API methods read data from this file, and if they can't find it, you will get a CorruptConfigFileException.
You can, however, change the name of the word space directory itself. Give it a more informative name than DISCO-idx. The word space directory is self-contained: you can copy it to any location, and the other files in the output directory are not needed for querying the word space with the DISCO API.

Basics: Corpus preprocessing and feature types

In distributional semantics the meaning of a word is learned from the contexts where the word occurs. Basically, context can be defined in two ways:

  • the other words around the target word. The plain words can be augmented with their relation to the target word (like window position or syntactic dependency) to build the features. Word spaces of this type are called word-by-word.
  • the document in which the target word occurs. A document can also be a sentence or a paragraph. Here, the document IDs act as features. Word spaces of this type are called word-by-document.

In order to use features like paragraphs or syntactic relations, they have to be annotated in the input corpus. DISCO can process the following input formats:

  • Tokenized text: In this case, the tokens (exactly as they appear in the text) act as words and as features at the same time. The tokens won't be changed by DISCO Builder in any way, i.e. there is no lowercasing or any other normalization. The tokenized text may contain boundary tags like <p>, but these tags have to stand on a line of their own. For an example of what a tokenized text file looks like, see the files in the test-corpus-tokenized directory of the DISCO Builder distribution.
  • Lemmatized text: This input format has three tab-separated columns per line: the token, a part-of-speech tag, and the base form (lemma) of the token. In other words, the raw text is augmented with two annotation layers, part-of-speech and lemma. Which layer will act as words or as features can be set in the configuration file with the options lemma and lemmaFeatures. However, the third column doesn't have to be the base form of the token; it may as well be the stem, the lowercased variant, or whatever you may want to try. In conclusion, the "lemmatized text" input file format makes it possible to use features that are different from the tokens themselves (and still create word vectors for the tokens). The lemmatized text may contain boundary tags like <p> on a line of their own. For an example of this file format see the files in the test-corpus-lemmatized directory of the DISCO Builder distribution.
  • Parsed text: This has to be the output of a dependency parser. DISCO Builder can read dependency relations in the CoNLL-U format. The features will be words with their respective dependency relation (see below).
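
The lemmatized input format described above can be read with a few lines of Python. This is only a sketch for inspecting your own corpus files, not part of DISCO Builder; the function name, the sample tokens and the part-of-speech tags are made up for illustration, and skipping empty lines is an assumption.

```python
from typing import Iterable, Iterator, NamedTuple, Union


class Token(NamedTuple):
    wordform: str
    pos: str
    lemma: str


def read_lemmatized(lines: Iterable[str]) -> Iterator[Union[Token, str]]:
    """Yield a Token for each three-column line and a plain string for each tag line."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # assumption: empty lines carry no information
        columns = line.split("\t")
        if len(columns) == 3:
            yield Token(*columns)
        elif len(columns) == 1 and line.startswith("<") and line.endswith(">"):
            yield line  # a boundary tag such as <p> on a line of its own
        else:
            raise ValueError(f"unexpected line: {line!r}")


# Hypothetical sample; the POS tags are placeholders, not a required tag set.
sample = [
    "<p>\n",
    "The\tDET\tthe\n",
    "houses\tNOUN\thouse\n",
    "burned\tVERB\tburn\n",
    "</p>\n",
]
parsed = list(read_lemmatized(sample))
```

A check like this can reveal lines with a wrong number of columns before you start a long build.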

Representation of the feature types in the final word space index

Depending on the chosen definition of context, the features will have one of the following three forms:

  • word: the features are words. If the input type is lemmatized text, the words can either be the plain tokens from the corpus, or the lemmata (or whatever your input file has in the third column).
  • word<SEP>relation: the features are words plus their specific relation to the target word. For tokenized text and lemmatized text, relation can be a window position. For parsed text, it is the dependency relation.
  • ID: a number identifying a latent dimension or a document context. This is the case for all word spaces that were imported from other tools (see below) or that were created with documents as features (word-by-document word spaces).

If you build a word space index with the idea of retrieving collocations (like hair → tousled combed permed cropped dyed frizzed greying), make sure to choose settings where the features are words (possibly augmented with a relation) because the collocations will be the features of the word space. If you want to retrieve collocations from a word space where the features are IDs, you will only get a list of numbers.

The two DISCO word space types

The option dontCompute2ndOrder in the disco.config configuration file determines which type of word space will be generated.

The relation between a word and its features is called a first-order relation; word and feature co-occur in contexts. For instance, hair and comb co-occur significantly often, therefore comb makes a good feature to describe hair. Other good features for describing hair are grey, black, blonde, curly, dye and so on. All the significant features of a word combined constitute the word vector of the word. If we compare words based on the sets of their first-order relations (in other words, based on their word vectors), we get words that are used in similar contexts (the distributionally similar words). These are called the second-order relations, like hair - fur, beard. Words that are related via a second-order relation may never have occurred together in a context.

If you set dontCompute2ndOrder=true then these second-order relations will not be computed by DISCO Builder, creating a word space of type COL where only first-order relations are stored (i.e. the word vectors). If you set dontCompute2ndOrder=false then the second order relations will be computed and stored in the word space for fast retrieval, generating a word space of type SIM. A word space of type SIM stores word vectors (first-order relations) and additionally the most similar words for each word (second-order relations).
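
The difference between the two relation types can be illustrated with a toy example. The following Python sketch is not DISCO Builder code: the feature weights are invented, and cosine similarity is used only because it is one of the values listed for the option similarityMeasure. Each dictionary is a word vector (first-order relations); ranking the other words by vector similarity yields the second-order relations.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as feature -> weight."""
    dot = sum(weight * b[feature] for feature, weight in a.items() if feature in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(word: str, vectors: Dict[str, Vector], n: int = 2) -> List[Tuple[str, float]]:
    """Rank all other words by similarity of their vectors to the vector of `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Invented first-order data: each word with its significant features.
vectors = {
    "hair": {"comb": 2.0, "grey": 1.0, "curly": 1.5},
    "fur": {"comb": 1.0, "grey": 1.0, "thick": 1.0},
    "beard": {"grey": 1.0, "curly": 1.0, "trim": 1.0},
    "car": {"drive": 2.0, "park": 1.0},
}
neighbours = most_similar("hair", vectors)
```

In this toy data, fur and beard come out as the nearest neighbours of hair although none of the three words is a feature of another. A SIM word space stores such neighbour lists in advance, whereas a COL word space stores only the vectors.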

The advantage of COL word spaces is that they are smaller and faster to build. However, there are some methods of the DISCO API that only work with word spaces of type SIM. These are the following:

Create a standard DISCO word space with words as features

You need a tokenized or lemmatized corpus (if you have a parsed corpus, see section Create a word space from parsed text). Put all your corpus files into one directory and set the option inputDir to point to this directory. Depending on the format of your corpus, set the option inputFileFormat to TOKENIZED (for tokenized text) or LEMMATIZED (for lemmatized text in three tab-separated columns per line). Create an output directory and set the option outputDir to point to this directory.

You should always use a stopword list. There are stopword lists for 15 languages in the directory stopword-lists of the DISCO Builder distribution. Set the option stopwordFile to point to your stopword list file.
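
Keep in mind that stopwords are matched against the corpus tokens exactly, without lowercasing (see the options section). The following Python sketch only illustrates that behaviour; the function name and the sample data are made up.

```python
from typing import Iterable, List, Set


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Drop every token that matches a stopword exactly (case-sensitive)."""
    return [token for token in tokens if token not in stopwords]


stopwords = {"and", "there", "was", "no"}
tokens = ["road", ".", "And", "there", "was", "no"]
kept = remove_stopwords(tokens, stopwords)
```

Here the capitalized And survives because only the lowercase and is in the list, so a stopword list for a corpus that is not lowercased should contain both variants.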

Define a context window. There are two ways to define a window context:

  • rightContext, leftContext (optionally position). The DISCO standard context window is rightContext=3, leftContext=3, position=true.
  • openingTag, closingTag. This is experimental.
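
To make the first variant more concrete, here is a minimal Python sketch of how window features with positions could be collected for one target token. It is an illustration, not DISCO Builder's implementation. The form barks_1 for a feature directly after the target is taken from the options section; the use of negative numbers for positions to the left of the target is an assumption, as are the function name and the sample sentence.

```python
from typing import AbstractSet, List, Sequence


def window_features(tokens: Sequence[str], index: int, left: int = 3, right: int = 3,
                    position: bool = True,
                    boundary_marks: AbstractSet[str] = frozenset()) -> List[str]:
    """Collect the features of tokens[index] from a window of `left` and `right` tokens."""
    features = []
    # Walk to the right first, then to the left; a boundary mark ends that side.
    for step, limit in ((1, right), (-1, left)):
        for distance in range(1, limit + 1):
            j = index + step * distance
            if j < 0 or j >= len(tokens) or tokens[j] in boundary_marks:
                break
            offset = step * distance  # assumption: left positions are negative
            features.append(f"{tokens[j]}_{offset}" if position else tokens[j])
    return features


tokens = ["the", "dog", "barks", "loudly", "</s>", "cats", "sleep"]
with_position = window_features(tokens, 1, boundary_marks={"</s>"})
without_position = window_features(tokens, 1, position=False, boundary_marks={"</s>"})
```

For the target dog, the window stops at the boundary mark </s>, so cats never becomes a feature even though it lies within three tokens.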

If you have lemmatized text (inputFileFormat=LEMMATIZED), then set lemmaFeatures=true to improve results. Note that this will still produce word vectors for all wordforms, unless you also set lemma=true (see next paragraph).
If you have tokenized text (inputFileFormat=TOKENIZED), you have to set lemmaFeatures=false.
If you comment the option out or leave it blank, the default value false will be used.

By setting the option lemma you can build word vectors for wordforms or for lemmata. If you set lemma=true, vectors for lemmata will be created. The results are generally improved if you do this, but remember that the resulting word space will only contain lemmata (you will get no result when querying inflected forms like houses).
The default is false.

Set the minimum word frequency minFreq. The optimal value depends on the size of your corpus (the larger the corpus, the larger the value for minFreq). For corpus sizes between 100 and 1000 million tokens a value between 20 and 200 is fine.
The default value is 100.

Set numberOfFeatureWords to some value between 10000 and 50000. The default value is 30000.

Option dontCompute2ndOrder determines the word space type. Default is false, i.e. build a SIM space.

After you have edited and saved the configuration file disco.config start DISCO Builder by typing

java -Xmx<N> -jar DISCOBuilder.jar -threads <T> disco.config

with <N> being enough memory to hold the word space. <T> is the number of threads to start when computing the most similar words for each word (this is only relevant if you are building a word space of type SIM).

In summary, you have to set the following options in disco.config to create a standard DISCO word space:

inputDir=path/to/your/corpus/directory
outputDir=path/to/your/output/directory
# if you have a lemmatized corpus:
inputFileFormat=LEMMATIZED
stopwordFile=/home/peter/DISCO/DISCOBuilder-1.0/stopword-lists/stopword-list_de_utf8.txt
# depends on your corpus' annotations:
boundaryMarks=<doc>,<p>,</p>,</article>,<s>,</s>

rightContext=3
leftContext=3
position=true

# if inputFileFormat=LEMMATIZED:
lemmaFeatures=true
# this creates word vectors for lemmata but not for inflected forms:
lemma=true
# if inputFileFormat=TOKENIZED:
#lemmaFeatures=false
#lemma=false

minFreq=100
numberOfFeatureWords=30000
weightingMethod=lin
minWeight=0.1
similarityMeasure=KOLB
# word space type SIM:
dontCompute2ndOrder=false
findMultiTokenWords=false
wordByDocument=false
minimumWordLength=2
maximumWordLength=31
# all Unicode letters (\p{L}) plus the characters listed here:
allowedCharactersWord=.-\'_
minimumFeatureLength=2
maximumFeatureLength=31
# all Unicode letters (\p{L}) plus the characters listed here:
allowedCharactersFeature=.-\'_

# leave these blank:
openingTag=
closingTag=
existingCoocFile=
existingWeightFile=
addInverseRelations=
stopwords=
maxFreq=
tokencount=
vocabularySize=
discoVersion=
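
Some of the constraints between options mentioned in this documentation can be checked before starting a long build. The Python sketch below is a hypothetical helper, not part of DISCO Builder. It assumes that disco.config consists of simple key=value lines with # comments, and it only covers rules stated in this document.

```python
from typing import Dict, List


def parse_config(text: str) -> Dict[str, str]:
    """Read key=value lines; blank lines and lines starting with # are skipped."""
    options: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split at the first "=" only
        options[key.strip()] = value.strip()
    return options


def check_config(options: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means none of the rules is violated."""

    def is_true(key: str) -> bool:
        return options.get(key, "").lower() == "true"

    problems: List[str] = []
    fmt = options.get("inputFileFormat", "")
    if fmt == "TOKENIZED" and is_true("lemma"):
        problems.append("lemma=true is not allowed with inputFileFormat=TOKENIZED")
    if fmt == "TOKENIZED" and is_true("lemmaFeatures"):
        problems.append("lemmaFeatures must be false with inputFileFormat=TOKENIZED")
    if fmt == "CONNL" and is_true("findMultiTokenWords"):
        problems.append("findMultiTokenWords cannot be used with inputFileFormat=CONNL")
    if is_true("wordByDocument") and not (options.get("openingTag") and options.get("closingTag")):
        problems.append("wordByDocument=true requires openingTag and closingTag")
    for key in ("tokencount", "vocabularySize", "maxFreq", "discoVersion"):
        if options.get(key):
            problems.append(f"{key} must be left blank")
    return problems


good = parse_config("inputFileFormat=LEMMATIZED\nlemma=true\nlemmaFeatures=true\nmaxFreq=\n")
bad = parse_config("# tokenized corpus\ninputFileFormat=TOKENIZED\nlemma=true\nwordByDocument=true\n")
print(check_config(good))
print(check_config(bad))
```

To check a real file, pass its contents to parse_config, for example with open("disco.config", encoding="utf-8").read(), where the encoding is also an assumption.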

Create a word space with documents as features

Set parameter wordByDocument=true. Define what text segment should be your "document" by setting the parameters openingTag and closingTag. E.g. if you have a corpus where paragraphs are annotated with <p> and </p>, then set openingTag=<p> and closingTag=</p> to define paragraphs as document features.

Parameter numberOfFeatureWords will be ignored; the number of features will be equal to the number of documents (as defined by openingTag and closingTag) in the corpus.

Of course you have to also set the parameters inputDir, outputDir, stopwordFile, and inputFileFormat. For other parameters like dontCompute2ndOrder, lemmaFeatures, lemma, and minFreq the same applies as is stated in the previous section.
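
The idea of documents as features can be sketched in a few lines of Python. This is an illustration under assumptions, not DISCO Builder's implementation: it assumes tokenized input with the opening and closing tags on lines of their own, numbers the documents from 0, and uses raw counts instead of a weighting method.

```python
from collections import defaultdict
from typing import Dict, Iterable


def word_by_document(lines: Iterable[str], opening_tag: str = "<p>",
                     closing_tag: str = "</p>") -> Dict[str, Dict[int, int]]:
    """Count, for every word, how often it occurs in each tagged document."""
    counts: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    doc_id = -1
    inside = False
    for raw in lines:
        line = raw.strip()
        if line == opening_tag:
            doc_id += 1  # every opening tag starts a new document, i.e. a new feature
            inside = True
        elif line == closing_tag:
            inside = False
        elif inside:
            for token in line.split():
                counts[token][doc_id] += 1
    return {word: dict(docs) for word, docs in counts.items()}


corpus = ["<p>", "the dog barks", "</p>", "<p>", "the cat sleeps", "the cat purrs", "</p>"]
matrix = word_by_document(corpus)
```

Each word vector is indexed by document IDs, which is why collocations cannot be read off a word space of this type.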

Create a word space from parsed text

If you have a parsed corpus in CoNLL-U format, set inputFileFormat=CONNL. You should also set addInverseRelations=true. You do not have to define a co-occurrence context because the context is given by the syntactic dependency relations that hold between the words in a sentence. Therefore, leave leftContext, position, openingTag etc. blank.
All other parameters are the same as in section Create a standard DISCO word space. However, you cannot use findMultiTokenWords.
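
The effect of addInverseRelations can be sketched as follows, using the horse SUBJ_OF gallop example from the options section. The Python code is an illustration only: it works on simplified (word, relation, other word) triples rather than on real CoNLL-U input, and the function name is made up.

```python
from typing import Dict, Iterable, List, Tuple

SEP = "<SEP>"


def dependency_features(triples: Iterable[Tuple[str, str, str]],
                        add_inverse_relations: bool = True) -> Dict[str, List[str]]:
    """Map each word to its features built from dependency triples."""
    features: Dict[str, List[str]] = {}
    for word, relation, other in triples:
        # The related word plus the relation describes the first word.
        features.setdefault(word, []).append(f"{other}{SEP}{relation}")
        if add_inverse_relations:
            # The inverse relation describes the second word.
            features.setdefault(other, []).append(f"{word}{SEP}{relation}_INV")
    return features


with_inverse = dependency_features([("horse", "SUBJ_OF", "gallop")])
without_inverse = dependency_features([("horse", "SUBJ_OF", "gallop")],
                                      add_inverse_relations=False)
```

Without the inverse relation, gallop receives no feature from this triple, which is why the option should normally be enabled.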

Options in disco.config explained

Options you need to set:

Option | possible values | description
inputDir | file path | The input directory must contain all your input files (corpus) that are to be processed by DISCO Builder. DISCO Builder will try to process all files in the input directory, regardless of their file name extension. DISCO Builder will not descend into subdirectories! Note that all files must be of the same input file format.
inputFileFormat | TOKENIZED, LEMMATIZED, CONNL | The format of the input files:
  • TOKENIZED: tokenized text
  • LEMMATIZED: three tab-separated columns per line (wordform, POS tag, lemma). Whitespace in any of the three columns will be replaced by underscore.
  • CONNL: syntactic dependency relations in CoNLL-U format.
boundaryMarks | comma-separated list of tags | A window context always stops at a boundary mark. E.g., if you have a window context of +-5 words, and have an end-of-sentence marker </s> set as boundary mark, then the last word of the sentence will have no right context. Boundary marks are ignored if the parameters openingTag and closingTag are set.
Tags in your corpus files will only be recognized as boundary marks if nothing else is on the same line. In the example

This is a tokenized sentence .
<s>
Another sentence .<s>
The next sentence .

only the first occurrence of <s> is recognized as boundary mark.
This option is not to be confused with the options openingTag and closingTag, which define a context for word-by-document spaces.
findMultiTokenWords | true, false | Automatically identify multi-token words in the corpus and merge them into single tokens by connecting the tokens with an underscore (e.g.: New_York_City). The algorithm described in Mikolov et al. 2013 is applied; n-grams of size 2 and 3 are considered. The multi-token words found in the corpus are written to the file disco.phraseFreqs together with their corpus frequencies.
This is not applicable with inputFileFormat=CONNL!
This step is quite time and memory consuming!
multiTokenWordDictionary | file path | Specify a dictionary with multi-token words to be used with findMultiTokenWords in addition to the phrases found automatically. The format of the dictionary is one multi-token word per line. The tokens can be separated by space or underscore.
outputDir | path | Directory where the word space directory DISCO-idx will be created and where other, temporary files are written to.
stopwordFile | path | Path to a stopword list that contains one stopword per line. Stopwords will be ignored. Note that the stopwords have to match the words in the input corpus exactly; there is no lowercasing. If you only supply the stopword and, but not And, then the occurrence in ... road. And there was no... will not be ignored. Moreover, if you use lemmata (lemmaFeatures=true or lemma=true), then your stopwords also have to include lemmata.
lemma | true, false | This parameter decides what you will be able to look up in the final word space: inflected word forms or lemmata (base forms). If you set this parameter to true, only lemmata will be stored, but no inflected word forms. That means you will be able to look up speak but not speaks. Word space quality will be higher for lemma=true, because the data for all the word forms of each lemma will be combined, which leads to more reliable statistics.
lemma=true is not allowed if inputFileFormat is TOKENIZED.
rightContext | 0..n | Size of the context word window to the right of the target word (in tokens). Default value is 3.
leftContext | 0..n | Size of the context word window to the left of the target word (in tokens). Default value is 3.
position | true, false | If true, the position of the feature word in the context word window is added to the feature word to create the feature. For instance, if the feature word barks occurs directly after the target word dog, the feature will be barks_1. The effect of this parameter is a stricter context that leads to tighter similarities (comparable to syntactic dependency relations). This parameter is only relevant in conjunction with rightContext and leftContext. The default value is true.
minFreq | 1..n | Here you can select the minimum number of times a token has to occur in the corpus in order to be indexed. Words with a smaller frequency will be completely ignored by DISCO Builder, i.e. they won't be present in the resulting word space and they will also not be used as feature words. Normally, the minimum frequency should be some number between 20 and 100. In order to minimize the size of the resulting word space and the computation time, larger values can be selected, for example 200 or even 500. The absolute minimum number should be 2 to at least filter out hapax legomena (words occurring only once), which effectively halves the number of word types that will have to be indexed.
lemmaFeatures | true, false | If your inputFileFormat is LEMMATIZED or CONNL, you should always set this parameter to true, in order to use the words' base forms (lemmata) as features. This increases word space quality.
If you have TOKENIZED input text, you have to set this parameter to false.
numberOfFeatureWords | 1..V | The maximum number of words to be used as features. If you set this to n then the n most frequent words will be used as feature words. Note that this is independent of additional relations you possibly have activated (like window position using position=true). For instance, if you have set numberOfFeatureWords=10000, position=true with a +-3 words context, then your word space will have (at most) 10,000 x 6 = 60,000 features. This is because a word like eat can occur in any of the 6 window positions relative to the target word, giving rise to 6 different features: eat_1, eat_2, eat_3, ... The same holds for syntactic dependency relations (eat_N:subjOf:V, eat_A:mod:V, ...). In general, the number of features is numberOfFeatureWords x numberOfRelations. However, this is the upper bound since not all feature words will occur with all relations.
Normally, the value of numberOfFeatureWords should be in the range 10,000 - 100,000. If you create a word space that is intended for the look-up of collocations only (with option dontCompute2ndOrder=true) then you should use a higher value to include all words from the vocabulary in the set of possible collocates.
weightingMethod | lin, loglikelihood, poisson, relative frequency | This determines which measure to use to compute the significance of a word's features. The standard measure for semantic similarity in DISCO is lin. If you want to retrieve collocations, loglikelihood is a good choice.
minWeight | float | The minimum significance value of a feature to be included in a word's word vector. A good choice for the lin measure is 0.1. For loglikelihood use a larger threshold, like 10.0.
similarityMeasure | cosine, kolb | The method to compute the similarity between two word vectors. This option only applies if you build a word space of type SIM (dontCompute2ndOrder=false).
dontCompute2ndOrder | true, false | Determines which word space type to build. See section on word space types.
addInverseRelations | true, false | Only relevant if inputFileFormat=CONNL. If true, the inverse of a dependency relation is added as feature. For instance, if the input file has the dependency relation horse SUBJ_OF gallop then gallop<SEP>SUBJ_OF is used as a feature to describe horse. If addInverseRelations=true then horse<SEP>SUBJ_OF_INV is added as a feature to describe gallop.
This option should be set to true, unless the input files already contain the inverse relations.
wordByDocument | true, false | If this parameter is true, then a word-by-document word space will be created. In this case, both parameters openingTag and closingTag have to be set to define the document context.
openingTag, closingTag | String | These two parameters have to be set when wordByDocument=true to define the document context. When wordByDocument=false these two parameters override rightContext and leftContext. Parameter position is always regarded as false when openingTag and closingTag are set.
existingCoocFile | file path | tbd...
existingWeightFile | file path | tbd...
allowedCharactersFeature | String | The characters that are allowed to occur in a word used as feature. The set of Unicode letters (\p{L}) is added to this automatically. If a word contains other characters it is ignored.
maximumFeatureLength | | The maximum length of a word (in characters) to be used as feature. All words longer than this are ignored and not used as features.
minimumFeatureLength | | The minimum length of a word (in characters) to be used as feature.
maximumWordLength | | The maximum length of a word (in characters) to be indexed. All words longer than this are ignored, no word vectors will be built for them and they cannot be looked up in the final word space.
allowedCharactersWord | | Set of allowed characters in a word. The set of all Unicode letters (\p{L}) is added to this by DISCO Builder automatically.
minimumWordLength | | Words shorter than this are ignored.

Don't touch these: the following options have to be left blank because they will be filled by DISCO Builder:
tokencount
vocabularySize
maxFreq
stopwordList
discoVersion

Import word spaces from other tools

Import word2vec or GloVe vectors

DISCO Builder can convert vector files produced with word2vec or GloVe into a DISCO word space index that can be queried with the DISCO API.

java -Xmx<N> -cp DISCOBuilder.jar de.linguatools.disco.builder.Import
-in <vectorFile>
-out <outputDir>
-wsType COL|SIM
[-threads <N>]
[-wlfreq <wordFrequencyList> | -corpus <corpusFile>]
[-nBest <N>]

where <vectorFile> is the vector file (text format) created by word2vec or GloVe and <outputDir> is the directory where the DISCO word space will be written.

Option -wsType specifies the DISCO word space type that will be created. The type COL only stores the word vectors, whereas the type SIM also computes the most similar words for each word and stores these lists of similar words, too.
With -wsType SIM you should specify the number of threads to run using the -threads option (-threads has no effect with -wsType COL).
With -wsType SIM you can also specify how many similar words to store for each word using the option -nBest. The default value is 300. This option is ignored with -wsType COL.

Some methods of the DISCO API need corpus information like the frequency of the words. Since this information is not contained in the vector files, you can supply a <wordFrequencyList> with the option -wlfreq (format: one word with its frequency per line, separated by white space). Alternatively, you can supply the corpus file itself using the option -corpus <corpusFile>.
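
If you do not have a word frequency list at hand, one can be produced from a tokenized corpus with a short script. The Python sketch below writes one word and its frequency per line, separated by a space, as described above. The function names are hypothetical, and both the descending sort order and the skipping of tag lines are assumptions rather than requirements stated here.

```python
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens, skipping lines that consist of a single tag."""
    counts: Counter = Counter()
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<") and stripped.endswith(">"):
            continue  # assumption: tag lines such as <p> are not words
        counts.update(stripped.split())
    return counts


def format_frequency_list(counts: Counter) -> str:
    """One word with its frequency per line, most frequent words first."""
    return "\n".join(f"{word} {freq}" for word, freq in counts.most_common())


corpus = ["<p>", "the dog barks", "the cat sleeps", "</p>"]
counts = count_words(corpus)
frequency_list = format_frequency_list(counts)
```

The resulting text can be written to a file and passed to the import with -wlfreq.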

Evaluation of your word spaces

DISCO Builder contains a method for evaluating a word space against word pairs with gold standard similarity values in a CSV file. The CSV file should have the format

word1,word2,similarity

The first line of the CSV file is regarded as header and is ignored. To start the evaluation type:

java -Xmx<N> -cp DISCOBuilder-1.0.jar de.linguatools.disco.builder.Evaluate <csvFile> <wordSpaceDir> <DISCO.SimilarityMeasure> <separator>

with <N> being enough memory to hold the word space, and separator the character used as separator in the csvFile.
The method will compute the Spearman rank correlation coefficient between the gold standard similarities and the similarities computed by DISCO. You can find evaluation data for several languages here:
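
For readers who want to see what the reported score measures, the Spearman rank correlation coefficient can be computed by hand as the Pearson correlation of the ranks. The following Python sketch is independent of DISCO Builder's own implementation, and the similarity values are invented.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks; undefined if one list is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


gold = [9.0, 7.5, 3.0, 1.0]        # invented gold standard similarities
predicted = [0.8, 0.6, 0.7, 0.1]   # invented similarities from a word space
rho = spearman(gold, predicted)
```

Only the ordering of the word pairs matters for this score, so the gold standard and the word space do not have to use the same similarity scale.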

Share your word spaces

If you have built a word space that you would like to share with others, drop us a note. We will be happy to link your word space on the DISCO download page.