DISCO - Wordspaces for DISCO API 2.0 and above

Arabic

ar-cc-fasttext-col

This Arabic word space was imported from fastText and contains word forms.
Word space type: COL
Word space size: 2.4 gigabytes
Corpus size: unknown
Number of queryable words: 2,000,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization with ICU Tokenizer.
Parameters used for word space computation: The pre-trained word vectors cc.ar.300.vec from fasttext.cc were imported with DISCO Builder. The vectors were trained using CBOW with position-weights, 300 dimensions, character n-grams of length 5, a window of size 5, and 10 negative samples.
Corpus: Common Crawl
License: Creative Commons Attribution-ShareAlike 3.0
Download and installation: Download the archive cc.ar.300-COL.denseMatrix.bz2 and unpack it.
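
For illustration, a pairwise similarity query against this word space could look as follows in Java. This is a minimal sketch, not the definitive API: the names DISCO.load(), semanticSimilarity(), getVectorSimilarity() and the SimilarityMeasure enum are assumptions based on the DISCO 2.x Javadoc, and how the similarity measure is passed varies between DISCO versions, so check the Javadoc of the version you use. The query words are arbitrary examples.

  import de.linguatools.disco.DISCO;

  public class SimilarityDemo {
      public static void main(String[] args) throws Exception {
          // load the unpacked word space; DISCO.load() is assumed to detect
          // the storage format (DenseMatrix) and type (COL) automatically
          DISCO disco = DISCO.load("cc.ar.300-COL.denseMatrix");
          // with a COL space, similarity is computed directly between two
          // word vectors; COSINE matches the imported fastText vectors
          float sim = disco.semanticSimilarity("كتاب", "مكتبة", // "book", "library"
                  DISCO.getVectorSimilarity(DISCO.SimilarityMeasure.COSINE));
          System.out.println("similarity = " + sim);
      }
  }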


ar-cc-fasttext-sim

This Arabic word space was imported from fastText and contains word forms.
Word space type: SIM
Word space size: 5.4 gigabytes
Corpus size: unknown
Number of queryable words: 2,000,000
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization with ICU Tokenizer.
Parameters used for word space computation: The pre-trained word vectors cc.ar.300.vec from fasttext.cc were imported with DISCO Builder. The word space stores the 200 most similar words for each word, computed with the vector similarity measure COSINE.
The vectors were trained with fastText using CBOW with position-weights, 300 dimensions, character n-grams of length 5, a window of size 5, and 10 negative samples.
Corpus: Common Crawl
License: Creative Commons Attribution-ShareAlike 3.0
Download and installation: Download the archive cc.ar.300-SIM.denseMatrix.bz2 and unpack it.
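
A SIM word space stores each word's most similar words precomputed, so the typical query is a direct lookup of that list. A minimal sketch under the same assumptions as above (similarWords() and the ReturnDataBN result type are taken from the DISCO Javadoc as recalled; verify before use):

  import de.linguatools.disco.DISCO;
  import de.linguatools.disco.ReturnDataBN;

  public class SimilarWordsDemo {
      public static void main(String[] args) throws Exception {
          DISCO disco = DISCO.load("cc.ar.300-SIM.denseMatrix");
          // look up the precomputed similarity list (up to 200 entries in
          // this word space); such lookups only work with SIM word spaces
          ReturnDataBN result = disco.similarWords("كتاب"); // "book"
          for (int i = 0; i < result.words.length; i++) {
              System.out.println(result.words[i] + "\t" + result.values[i]);
          }
      }
  }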

English

enwiki-20130403-sim-lemma-mwl-lc

This English word space contains lowercased lemmata.
Word space type: SIM
Word space size: 2.3 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, lemmatization, conversion of all words in the corpus to lower case, and identification of multi-word lexemes (written with an underscore instead of a space).
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus: English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-sim-lemma-mwl-lc.tar.bz2 and unpack it.
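
Queries against this word space must mirror the corpus preprocessing: query words have to be lowercased lemmata, and multi-word lexemes are written with an underscore. A short sketch (the query words are made-up examples; the same API assumptions as in the sketches above apply):

  import de.linguatools.disco.DISCO;
  import de.linguatools.disco.ReturnDataBN;

  public class LemmaQueryDemo {
      public static void main(String[] args) throws Exception {
          DISCO disco = DISCO.load("enwiki-20130403-sim-lemma-mwl-lc");
          // query with the lowercased lemma: "bicycle", not "Bicycles"
          ReturnDataBN r1 = disco.similarWords("bicycle");
          // multi-word lexemes carry an underscore instead of a space
          ReturnDataBN r2 = disco.similarWords("new_york");
          System.out.println(r1.words[0] + " / " + r2.words[0]);
      }
  }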


enwiki-20130403-word2vec-lm-mwl-lc-sim

This English word space contains lowercased lemmata. It was created using word2vec and then converted into a DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.4 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, lemmatization, conversion of all words in the corpus to lower case, and identification of multi-word lexemes (written with an underscore instead of a space).
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. Vector similarity measure used by DISCO Builder was COSINE.
Corpus: English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-word2vec-lm-mwl-lc-sim.tar.bz2 and unpack it.


enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix

This English word space contains lowercased lemmata. It was created using word2vec and then converted into a DenseMatrix DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.6 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, lemmatization, conversion of all words in the corpus to lower case, and identification of multi-word lexemes (written with an underscore instead of a space).
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. Vector similarity measure used by DISCO Builder was COSINE.
Corpus: English Wikipedia (dump from 3rd April 2013)
License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix.bz2 and unpack it.
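
Since all words with corpus frequency below 50 were removed, a query word may simply be absent from the word space. It can therefore be useful to check a word's frequency before querying. The sketch below assumes a frequency() lookup as listed in the DISCO Javadoc; treat the exact signature as an assumption:

  import de.linguatools.disco.DISCO;
  import de.linguatools.disco.ReturnDataBN;

  public class FrequencyCheckDemo {
      public static void main(String[] args) throws Exception {
          DISCO disco = DISCO.load("enwiki-20130403-word2vec-lm-mwl-lc-sim.denseMatrix");
          // frequency 0 means the lemma fell below the frequency-50 cutoff
          // (or never occurred) and cannot be queried
          int freq = disco.frequency("bicycle");
          if (freq > 0) {
              ReturnDataBN r = disco.similarWords("bicycle");
              System.out.println(freq + " occurrences; most similar: " + r.words[0]);
          }
      }
  }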


French

fr-general-20151126-lm-sim

This French word space contains lemmata.
Word space type: SIM
Word space size: 2.1 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 276,967
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:

  • 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
  • 598,392,935 tokens: News
  • 520,189,432 tokens: debates from the EU and UN
  • 185,987,928 tokens: subtitles
  •    2,093,280 tokens: books from Project Gutenberg

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-sim.tar.bz2 and unpack it.


fr-general-20151126-lm-word2vec-sim

This French word space contains lemmata. It was created using word2vec and then converted into a DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.7 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 281,484
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. Vector similarity measure used by DISCO Builder was COSINE.
Corpus:

  • 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
  • 598,392,935 tokens: News
  • 520,189,432 tokens: debates from the EU and UN
  • 185,987,928 tokens: subtitles
  •    2,093,280 tokens: books from Project Gutenberg

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-word2vec-sim.tar.bz2 and unpack it.
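
The COSINE measure named above is the standard cosine between two word vectors: dot(a, b) / (|a| * |b|). For readers who want to see exactly what is being computed, here is the formula worked out in plain Java on two made-up vectors (no DISCO API involved):

  public class CosineDemo {
      // cosine similarity = dot(a,b) / (|a| * |b|), ranges over [-1, 1]
      static double cosine(double[] a, double[] b) {
          double dot = 0, normA = 0, normB = 0;
          for (int i = 0; i < a.length; i++) {
              dot += a[i] * b[i];
              normA += a[i] * a[i];
              normB += b[i] * b[i];
          }
          return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }

      public static void main(String[] args) {
          // two invented 4-dimensional word vectors
          double[] maison = {0.2, 0.7, 0.1, 0.4};
          double[] batiment = {0.3, 0.6, 0.0, 0.5};
          System.out.println(cosine(maison, batiment)); // close to 1.0
      }
  }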


German

de-general-20150421-lm-sim

This German word space contains lemmata.
Word space type: SIM
Word space size: 3.5 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:

  • 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
  • 335,937,237 tokens: webcrawl (2014)
  • 317,677,649 tokens: newspapers and magazines (1949-2007)
  •   64,174,384 tokens: parliamentary debates
  •   25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
  •   13,553,836 tokens: TV & movie subtitles

License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-sim.tar.bz2 and unpack it.


de-general-20150421-lm-word2vec-sim

This German word space contains lemmata. It was created using word2vec and then converted into a DISCOLuceneIndex word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 3.0 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. Vector similarity measure used by DISCO Builder was COSINE.
Corpus:

  • 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
  • 335,937,237 tokens: webcrawl (2014)
  • 317,677,649 tokens: newspapers and magazines (1949-2007)
  •   64,174,384 tokens: parliamentary debates
  •   25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
  •   13,553,836 tokens: TV & movie subtitles

License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-word2vec-sim.tar.bz2 and unpack it.


de-general-20150421-lm-word2vec-sim.denseMatrix

This German word space contains lemmata. It was created using word2vec and then converted into a DenseMatrix DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.8 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. Vector similarity measure used by DISCO Builder was COSINE.
Corpus:

  • 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
  • 335,937,237 tokens: webcrawl (2014)
  • 317,677,649 tokens: newspapers and magazines (1949-2007)
  •   64,174,384 tokens: parliamentary debates
  •   25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
  •   13,553,836 tokens: TV & movie subtitles

License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-word2vec-sim.denseMatrix.bz2 and unpack it.
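
A DenseMatrix word space is loaded completely into main memory, so the JVM needs a maximum heap comfortably above the word space size (here 1.8 gigabytes; start Java with e.g. -Xmx4g). A small defensive sketch using the standard Runtime API, plus the assumed DISCO.load() from the sketches above:

  import de.linguatools.disco.DISCO;

  public class HeapCheckDemo {
      public static void main(String[] args) throws Exception {
          // check the maximum heap before loading ~1.8 GB into RAM
          long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
          if (maxHeapMb < 4096) {
              System.err.println("heap too small (" + maxHeapMb
                      + " MB); restart with e.g. java -Xmx4g HeapCheckDemo");
              return;
          }
          DISCO disco = DISCO.load("de-general-20150421-lm-word2vec-sim.denseMatrix");
          System.out.println("word space loaded: " + disco);
      }
  }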


Russian

ru-ruwac-ruwiki-lm-sim

This Russian word space contains lemmata.
Word space type: SIM
Word space size: 2.8 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.
Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-sim.tar.bz2 and unpack it.


ru-ruwac-ruwiki-lm-word2vec-sim

This Russian word space contains lemmata. It was created using word2vec and then converted into a DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 2.6 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 100, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 100, producing a CBOW word space. Vector similarity measure used by DISCO Builder was COSINE.
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-word2vec-sim.tar.bz2 and unpack it.


ru-ruwac-ruwiki-lem-col

This Russian word space contains lemmata.
Word space type: COL
Word space size: 2.3 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.
Parameters used for word space computation: Context window of ±5 words; the 200,000 most frequent lemmata as features; significance measure LOGLIKELIHOOD with threshold 7.0.
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lem-col.tar.bz2 and unpack it.


ru-ruwac-ruwiki-col

This Russian word space contains word forms.
Word space type: COL
Word space size: 4.9 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 508,350
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100.
Parameters used for word space computation: Context window of ±3 words; the 50,000 most frequent word forms as features; significance measure from Kolb 2009 with threshold 0.5.
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-col.tar.bz2 and unpack it.
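
A closing note on the practical difference between the two word space types: a COL space like this one stores significance-weighted co-occurrences, so it answers collocation queries (and pairwise similarity), while precomputed similar-word lists are only available in the SIM spaces above. A last sketch under the same API assumptions (collocations() and the ReturnDataCol result type as recalled from the DISCO Javadoc):

  import de.linguatools.disco.DISCO;
  import de.linguatools.disco.ReturnDataCol;

  public class CollocationDemo {
      public static void main(String[] args) throws Exception {
          DISCO disco = DISCO.load("ru-ruwac-ruwiki-col");
          // the most significant co-occurrences of a word form within the
          // ±3 window, scored with the Kolb 2009 significance measure;
          // collocation queries only work with COL word spaces
          ReturnDataCol[] cols = disco.collocations("дом"); // "house"
          for (ReturnDataCol c : cols) {
              System.out.println(c.word + "\t" + c.value);
          }
      }
  }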