DISCO - Word spaces for the DISCO API 2.0

English

enwiki-20130403-sim-lemma-mwl-lc

This English word space contains lowercased lemmata.
Word space type: SIM
Word space size: 2.3 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, lemmatization, conversion of all words in the corpus to lower case, and identification of multi-word lexemes (their parts are joined by an underscore instead of a space, e.g. new_york).

Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 30,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
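
For reference, Lin's 1998 similarity measure compares two words via the significance values of the context features they share. A sketch in LaTeX notation, where I(w,f) is the significance score of feature f for word w (here played by the Kolb 2009 measure) and T(w) is the set of features of w whose score passes the threshold:

    \mathrm{sim}(w_1, w_2) =
      \frac{\sum_{f \in T(w_1) \cap T(w_2)} \bigl( I(w_1, f) + I(w_2, f) \bigr)}
           {\sum_{f \in T(w_1)} I(w_1, f) \;+\; \sum_{f \in T(w_2)} I(w_2, f)}

The value is 1 when two words share all of their significant features and 0 when they share none.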
Corpus: the English Wikipedia (dump from 3rd April 2013)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-sim-lemma-mwl-lc.tar.bz2 and unpack it.
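
After unpacking the archive (e.g. with tar xjf), the resulting directory can be queried with the DISCO Java API. A minimal sketch, assuming the DISCO 1.4/2.x-style class de.linguatools.disco.DISCO with its similarWords and secondOrderSimilarity methods; the exact constructor and method signatures should be checked against the Javadoc of the DISCO version in use:

    import de.linguatools.disco.DISCO;
    import de.linguatools.disco.ReturnDataBN;

    public class WordSpaceDemo {
        public static void main(String[] args) throws Exception {
            // Directory created by unpacking the .tar.bz2 archive (adjust the path).
            String wordSpace = "/data/enwiki-20130403-sim-lemma-mwl-lc";

            // false = query the index from disk instead of loading it into RAM.
            DISCO disco = new DISCO(wordSpace, false);

            // Queries must be lowercased lemmata (see the preprocessing above);
            // multi-word lexemes are written with underscores, e.g. "new_york".
            // The returned object holds the most similar words and their
            // similarity values.
            ReturnDataBN similar = disco.similarWords("house");

            // Second-order similarity between two lemmata.
            float sim = disco.secondOrderSimilarity("house", "home");
            System.out.println("sim(house, home) = " + sim);
        }
    }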


enwiki-20130403-word2vec-lm-mwl-lc-sim

This English word space contains lowercased lemmata. It was created using word2vec and then converted into a DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.4 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 420,184
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, lemmatization, conversion of all words in the corpus to lower case, and identification of multi-word lexemes (their parts are joined by an underscore instead of a space, e.g. new_york).

Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
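
The COSINE measure used for the import is the standard cosine of the angle between two word vectors, which for the 400-dimensional vectors produced here is:

    \cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
               = \frac{\sum_{i=1}^{400} u_i v_i}
                      {\sqrt{\sum_{i=1}^{400} u_i^2} \, \sqrt{\sum_{i=1}^{400} v_i^2}}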
Corpus: the English Wikipedia (dump from 3rd April 2013)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive enwiki-20130403-word2vec-lm-mwl-lc-sim.tar.bz2 and unpack it.


French

fr-general-20151126-lm-sim

This French word space contains lemmata.
Word space type: SIM
Word space size: 2.1 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 276,967
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.

Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:

  • 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
  • 598,392,935 tokens: news texts
  • 520,189,432 tokens: debates from the EU and the UN
  • 185,987,928 tokens: subtitles
  •    2,093,280 tokens: books from Project Gutenberg

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-sim.tar.bz2 and unpack it.


fr-general-20151126-lm-word2vec-sim

This French word space contains lemmata. It was created using word2vec and then converted into a DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 1.7 gigabytes
Corpus size: 1.9 billion tokens
Number of queryable words: 281,484
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:

  • 692,319,667 tokens: the French Wikipedia (dump from 4th August 2014)
  • 598,392,935 tokens: news texts
  • 520,189,432 tokens: debates from the EU and the UN
  • 185,987,928 tokens: subtitles
  •    2,093,280 tokens: books from Project Gutenberg

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive fr-general-20151126-lm-word2vec-sim.tar.bz2 and unpack it.


German

de-general-20150421-lm-sim

This German word space contains lemmata.
Word space type: SIM
Word space size: 3.5 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 50, and lemmatization.

Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:

  • 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
  • 335,937,237 tokens: webcrawl (2014)
  • 317,677,649 tokens: newspapers and magazines (1949-2007)
  •   64,174,384 tokens: parliamentary debates
  •   25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
  •   13,553,836 tokens: TV and movie subtitles

License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-sim.tar.bz2 and unpack it.


de-general-20150421-lm-word2vec-sim

This German word space contains lemmata. It was created using word2vec and then converted into a DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 3.0 gigabytes
Corpus size: 1.5 billion tokens
Number of queryable words: 470,788
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 50, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 50, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:

  • 747,547,646 tokens: the German Wikipedia (dump from 25th July 2014)
  • 335,937,237 tokens: webcrawl (2014)
  • 317,677,649 tokens: newspapers and magazines (1949-2007)
  •   64,174,384 tokens: parliamentary debates
  •   25,195,504 tokens: books (mainly fiction from the period 1850-1920) from Project Gutenberg
  •   13,553,836 tokens: TV and movie subtitles

License: Creative Commons Attribution-NonCommercial 3.0 Unported
Download and installation: Download the archive de-general-20150421-lm-word2vec-sim.tar.bz2 and unpack it.


Russian

ru-ruwac-ruwiki-lm-sim

This Russian word space contains lemmata.
Word space type: SIM
Word space size: 2.8 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.

Parameters used for word space computation: Context window of ±3 words, taking the exact position into account; the 50,000 most frequent lemmata as features; significance measure from Kolb 2009 with threshold 0.1; similarity measure from Lin 1998.
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-sim.tar.bz2 and unpack it.


ru-ruwac-ruwiki-lm-word2vec-sim

This Russian word space contains lemmata. It was created using word2vec and then converted into a DISCO word space with the import functionality of DISCO Builder.
Word space type: SIM
Word space size: 2.6 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of all words with frequency < 100, and lemmatization.
Parameters used for word space computation: word2vec was run with the parameters -size 400 -negative 10 -min-count 100, producing a CBOW word space. The vector similarity measure used by DISCO Builder was COSINE.
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lm-word2vec-sim.tar.bz2 and unpack it.


ru-ruwac-ruwiki-lem-col

This Russian word space contains lemmata.
Word space type: COL
Word space size: 2.3 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 226,108
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100, and lemmatization.

Parameters used for word space computation: Context window of ±5 words; the 200,000 most frequent lemmata as features; significance measure LOGLIKELIHOOD with threshold 7.0.
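
LOGLIKELIHOOD presumably refers to the log-likelihood ratio statistic widely used for collocation extraction (Dunning 1993). With O_{ij} and E_{ij} the observed and expected counts in the 2x2 contingency table of a word and a context feature, the statistic is

    G^2 = 2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}

and the threshold of 7.0 lies slightly above 6.63, the chi-squared critical value for p < 0.01 with one degree of freedom.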
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-lem-col.tar.bz2 and unpack it.


ru-ruwac-ruwiki-col

This Russian word space contains word forms.
Word space type: COL
Word space size: 4.9 gigabytes
Corpus size: 2.2 billion tokens
Number of queryable words: 508,350
Character encoding: UTF-8 (Unicode)
Corpus preprocessing: Tokenization, removal of stop words, removal of all words with frequency < 100.

Parameters used for word space computation: Context window of ±3 words; the 50,000 most frequent word forms as features; significance measure from Kolb 2009 with threshold 0.5.
Corpus:

  • 2,006,578,670 tokens: RuWaC (webcrawl)
  •    223,358,656 tokens: ruwiki (Russian Wikipedia)

License: Creative Commons Attribution 3.0 Unported
Download and installation: Download the archive ru-ruwac-ruwiki-col.tar.bz2 and unpack it.
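

Unlike the SIM word spaces above, which return precomputed distributionally similar words, a COL word space stores a word's significant direct co-occurrences. A minimal sketch of querying them, again in the DISCO 1.4/2.x style; the collocations method and the ReturnDataCol fields used below follow the published Javadoc, but should be verified against the version in use:

    import de.linguatools.disco.DISCO;
    import de.linguatools.disco.ReturnDataCol;

    public class CollocationDemo {
        public static void main(String[] args) throws Exception {
            // Directory created by unpacking ru-ruwac-ruwiki-col.tar.bz2.
            DISCO disco = new DISCO("/data/ru-ruwac-ruwiki-col", false);

            // Significant co-occurrences of a Russian word form,
            // ordered by their significance values.
            ReturnDataCol[] cols = disco.collocations("дом");
            if (cols != null) {
                for (ReturnDataCol c : cols) {
                    System.out.println(c.word + "\t" + c.value);
                }
            }
        }
    }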