DIRE Dataset

Dataset from Boleda et al. IWCS 2017

Type

Corpus

Description

This page provides the dataset from Boleda et al. IWCS 2017. The dataset consists of the following files:

  • stimuli.train.gz, stimuli.valid.gz, stimuli.test.gz: The stimuli themselves, one sequence per line, for the
    training set (40K sequences), the dev set (5K sequences), and the test set (10K sequences). Total size: 4.5 MB.
  • image.dm.gz: The corresponding image vectors (from Lazaridou et al. NAACL 2015). Size: 167 MB.
  • word.dm.gz: The corresponding word embeddings (from Baroni et al. ACL 2014). Size: 2.5 MB.

The syntax of the stimulus files is as follows:

line      = query query_position || entities || stimuli
query     = category:modifier:modifier
entities  = 6(entity )
entity    = category_picindex
stimuli   = 12(modifier:entity )
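
The grammar above could be parsed along the following lines. This is a minimal sketch, not the authors' code: the names Stimulus, parse_stimulus_line, split_entity and read_stimuli are hypothetical, and it assumes that fields are separated by "||", that tokens within a field are separated by whitespace, and that the picture index follows the last underscore of an entity. The query position is kept as raw text because its type is not stated here, and the counts of 6 entities and 12 stimuli are not enforced.

```python
import gzip
from typing import Iterator, List, NamedTuple, Tuple


class Stimulus(NamedTuple):
    query: Tuple[str, ...]            # (category, modifier, modifier)
    query_position: str               # raw text; type not specified in the description
    entities: List[str]               # entity keys of the form category_picindex
    stimuli: List[Tuple[str, ...]]    # (modifier, entity) pairs


def parse_stimulus_line(line: str) -> Stimulus:
    """Parse one line: query query_position || entities || stimuli."""
    # Assumption: exactly two "||" separators per line.
    head, entities_part, stimuli_part = (p.strip() for p in line.split("||"))
    query_text, query_position = head.split()
    query = tuple(query_text.split(":"))
    entities = entities_part.split()
    # Split each stimulus token on the first colon only.
    stimuli = [tuple(token.split(":", 1)) for token in stimuli_part.split()]
    return Stimulus(query, query_position, entities, stimuli)


def split_entity(entity: str) -> Tuple[str, str]:
    """Split category_picindex on the last underscore (an assumption)."""
    category, picindex = entity.rsplit("_", 1)
    return category, picindex


def read_stimuli(path: str) -> Iterator[Stimulus]:
    """Yield one Stimulus per non-empty line of a gzipped stimulus file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_stimulus_line(line)
```

With this sketch, iterating over read_stimuli("stimuli.train.gz") would yield one parsed record per sequence.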

The values of "category" serve as keys in word.dm.gz, and the values of "entity" as keys in image.dm.gz.
These two files are simple line-based lookup tables with the syntax "key value", mapping string keys onto vectors.
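
A "key value" file of this kind could be loaded as sketched below. The function name load_vectors is hypothetical, and the sketch assumes that each line holds a key followed by whitespace-separated numeric components, and that the files are UTF-8 text; neither detail is stated on this page.

```python
import gzip
from typing import Dict, List


def load_vectors(path: str, encoding: str = "utf-8") -> Dict[str, List[float]]:
    """Load a gzipped "key value" file into a dict of key -> vector."""
    vectors: Dict[str, List[float]] = {}
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Assumption: first field is the key, the rest are vector components.
            vectors[fields[0]] = [float(x) for x in fields[1:]]
    return vectors
```

Under these assumptions, load_vectors("word.dm.gz") would be indexed by category strings and load_vectors("image.dm.gz") by entity strings. Note that image.dm.gz is 167 MB compressed, so holding it in plain Python lists may need a fair amount of memory.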

The DIRE implementation is available on this page: TBC.

Reference

Living a discrete life in a continuous world: Reference in cross-modal entity tracking.
Proceedings of IWCS. Montpellier, France, 2017.
Gemma Boleda, Sebastian Padó, Nghia The Pham and Marco Baroni.

 
