DIRE Dataset

Dataset from Boleda et al. IWCS 2017

Type

Corpus

Description

This page provides the dataset from Boleda et al. (IWCS 2017). The dataset consists of the following files:

stimuli.train.gz, stimuli.valid.gz, stimuli.test.gz: The stimuli themselves, one sequence per line, for the train set
(40K sequences), the dev set (5K sequences), and the test set (10K sequences). Total size: 4.5 MB.
image.dm.gz: The corresponding image vectors (from Lazaridou et al. NAACL 2015). Size: 167 MB.
word.dm.gz: The corresponding word embeddings (from Baroni et al. ACL 2014). Size: 2.5 MB.

The syntax of the stimulus files is as follows:

line      = query query_position || entities || stimuli
query     = category:modifier:modifier
entities  = 6(entity )  
entity    = category_picindex
stimuli   = 12(modifier:entity )
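
A minimal Python sketch of a parser for this syntax follows. It rests on several assumptions that the grammar does not state: fields are separated by whitespace, "||" never occurs inside a field, the picture index is the part after the last underscore of an entity, and the files are UTF-8 text once decompressed. The names Stimulus, parse_line, and read_stimuli are hypothetical, and the line used in the usage note is made up for illustration, not taken from the dataset.

```python
import gzip
from typing import NamedTuple


class Stimulus(NamedTuple):
    query: tuple         # (category, modifier, modifier)
    query_position: str  # kept as raw text; its type is not specified above
    entities: list       # [(category, picindex), ...]
    stimuli: list        # [(modifier, entity), ...]


def parse_line(line: str) -> Stimulus:
    """Split one stimulus line into its three '||'-separated parts."""
    head, entities_part, stimuli_part = (p.strip() for p in line.split("||"))
    query_str, query_position = head.split()
    category, mod1, mod2 = query_str.split(":")
    # Assumption: the picture index follows the last underscore.
    entities = [tuple(e.rsplit("_", 1)) for e in entities_part.split()]
    # Each stimulus pairs a modifier with a full entity key.
    stimuli = [tuple(s.split(":", 1)) for s in stimuli_part.split()]
    return Stimulus((category, mod1, mod2), query_position, entities, stimuli)


def read_stimuli(path: str):
    """Yield one Stimulus per non-empty line of a gzipped stimulus file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

For example, parse_line("dog:big:brown 3 || dog_1 cat_2 || big:dog_1 small:cat_2") yields a Stimulus whose entities are [("dog", "1"), ("cat", "2")]. The sketch does not check that a line contains exactly 6 entities and 12 stimuli; such a check could be added if strict validation is wanted.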

The values of "category" serve as keys in word.dm.gz, and the values of "entity" as keys in image.dm.gz.
These two files are simple line-based hash tables that map string keys onto vectors, one "key value" entry per line.
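
Loading such a table could look like the Python sketch below. It assumes that the vector components are whitespace-separated decimal numbers following the key on the same line, and that the files are UTF-8 text once decompressed; neither detail is stated above. The function names parse_dm_line and load_dm are hypothetical.

```python
import gzip


def parse_dm_line(line: str):
    """Split a 'key value' line into the key and a list of floats."""
    key, *values = line.split()
    return key, [float(v) for v in values]


def load_dm(path: str) -> dict:
    """Read a gzipped .dm file into a dict mapping key -> vector."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                key, vector = parse_dm_line(line)
                vectors[key] = vector
    return vectors
```

With the tables loaded this way, load_dm("word.dm.gz") would be indexed by a "category" value and load_dm("image.dm.gz") by a full "entity" key such as category_picindex. Since image.dm.gz is 167 MB compressed, holding it as plain Python lists may use a lot of memory; converting each vector to an array type is one possible alternative.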

The DIRE implementation is available on this page: TBC.

Reference

Living a discrete life in a continuous world: Reference in cross-modal entity tracking.
Proceedings of IWCS. Montpellier, France, 2017.
Gemma Boleda, Sebastian Padó, Nghia The Pham and Marco Baroni.

General Contact IMS

Pfaffenwaldring 5 b, 70569 Stuttgart
