DIRE Dataset

Dataset from Boleda et al. IWCS 2017

Type

Corpus

Description

This page provides the dataset from Boleda et al. IWCS 2017. The dataset consists of the following files:

  • stimuli.train.gz, stimuli.valid.gz, stimuli.test.gz: The stimuli themselves, one sequence per line, for the training set
    (40K sequences), validation set (5K sequences), and test set (10K sequences). Total size: 4.5 MB.
  • image.dm.gz: The corresponding image vectors (from Lazaridou et al. NAACL 2015). Size: 167 MB.
  • word.dm.gz: The corresponding word embeddings (from Baroni et al. ACL 2014). Size: 2.5 MB.

The syntax of the stimulus files is as follows:

line      = query query_position || entities || stimuli
query     = category:modifier:modifier
entities  = 6(entity )
entity    = category_picindex
stimuli   = 12(modifier:entity )
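
The grammar above can be read with a few string splits. The following is a minimal sketch, not the authors' code: it assumes that the fields are separated by a literal "||", that items within a field are separated by whitespace, and that the picture index follows the last underscore of an entity. The names Stimulus, split_entity, parse_stimulus_line and iter_stimuli are hypothetical, and query_position is kept as a string because its type is not stated on this page.

```python
import gzip
from typing import NamedTuple


class Stimulus(NamedTuple):
    """One parsed line of a stimulus file (hypothetical container)."""
    category: str
    modifiers: tuple      # the two modifiers of the query
    query_position: str   # kept as text; its type is not documented here
    entities: list        # entity keys such as "category_picindex"
    exposures: list       # (modifier, entity) pairs


def split_entity(entity):
    """Split "category_picindex" at the last underscore (assumption)."""
    category, _, picindex = entity.rpartition("_")
    return category, picindex


def parse_stimulus_line(line):
    """Parse: query query_position || entities || stimuli."""
    head, entities_part, stimuli_part = (p.strip() for p in line.split("||"))
    query, query_position = head.split()
    category, mod1, mod2 = query.split(":")
    entities = entities_part.split()
    # Each stimulus token is "modifier:entity"; split at the first colon only.
    exposures = [tuple(tok.split(":", 1)) for tok in stimuli_part.split()]
    return Stimulus(category, (mod1, mod2), query_position, entities, exposures)


def iter_stimuli(path):
    """Yield one Stimulus per non-empty line of a gzipped stimulus file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_stimulus_line(line)
```

For example, iter_stimuli("stimuli.valid.gz") would yield one record per sequence. According to the grammar, each record should have 6 entries in entities and 12 in exposures.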

The values of "category" serve as keys in word.dm.gz, and the values of "entity" as keys in image.dm.gz.
Both files are simple line-based hash tables that map string keys onto vectors, with one "key value" entry per line.
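
A table of this kind can be loaded into a dictionary. The sketch below assumes that each line holds the key followed by the vector components as whitespace-separated numbers; the exact separator is not specified on this page, so adjust the split if the files differ. The function names parse_vector_line and load_vectors are hypothetical.

```python
import gzip


def parse_vector_line(line):
    """Split a "key value" line into the key and a list of floats."""
    key, *values = line.split()
    return key, [float(v) for v in values]


def load_vectors(path):
    """Read a gzipped "key value" table into a dict of key -> vector."""
    table = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip empty lines
                key, vector = parse_vector_line(line)
                table[key] = vector
    return table
```

With this, load_vectors("word.dm.gz") would be indexed by category and load_vectors("image.dm.gz") by entity. Note that the image file is 167 MB compressed, so loading it fully into Python lists may need a considerable amount of memory.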

The DIRE implementation is available on this page: TBC.

Reference

Living a discrete life in a continuous world: Reference in cross-modal entity tracking.
Proceedings of IWCS. Montpellier, France, 2017.
Gemma Boleda, Sebastian Padó, Nghia The Pham and Marco Baroni.

 
