The basis of the task will be a dataset consisting of several thousand entries. This dataset will be made public in several stages (see start page), starting with the trial dataset, which is already available at data/trial.csv. The trial dataset gives participants an early opportunity to familiarize themselves with the structure of the data and possible ways of processing it. We also welcome any feedback and are happy to answer questions via e-mail.


We have started to upload the training data. It is available at data/train.csv and will be extended iteratively until the evaluation period starts. Be sure to check out the latest version to achieve the best results possible. Please note that we have included sentences annotated with 0 statements, which means that those sentences are incomplete or erroneous. In such cases, even a sentence that appears to contain multiple statements may be annotated with 0. We nonetheless left these sentences in the data to reflect real-world conditions.


Our data structure:

Column name: topic
Description: Topic, defined by the lower-case title of the Hurraki article.
Example value: abfalltrennung

Column name: phrase
Description: Original phrase.
Example value: Der Abfall kommt in Tonnen oder Säcke.

Column name: phrase_number
Description: Number of the phrase. The suffix _long indicates that the phrase belongs to the "Genaue Erklärung" (detailed explanation) part of the article.
Example value: 6_long

Column name: genre
Description: Genre(s), extracted from the Hurraki Kategorien (categories).
Example value: Entsorgung|Öffentliche_Verwaltung|Seiten_mit_defekten_Dateilinks

Column name: timestamp
Description: Time of the article's last modification.
Example value: 2016-07-28T07:55:27Z

Column name: phrase_tokenized
Description: Phrase split into tokens. Each token is given an index for referencing.
Example value: 0:=Der 1:=Abfall 2:=kommt 3:=in 4:=Tonnen 5:=oder 6:=Säcke.

Column name: num_statements
Description: Number of statements.
Example value: 2

Column name: statement_spans
Description: List of statements. Each statement is a span of words, represented by their token indices.
Example value: [ [4], [6] ]

Column name: author
Description: Pseudonymized (MD5 hash) author ID.
Example value: 76bf1508c054395f67a605468d76c22f

Column name: notes
Description: Optional notes by the annotators.
Example value: disjunctive coordinating conjunction used

Example:

Here is an example of how to process the data in Python using pandas:
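A minimal sketch of such processing follows. It assumes that the files are comma-separated with a header row matching the column names above, and that statement_spans is stored as a string holding a nested list; neither detail is confirmed here, so adjust the arguments of read_csv to the actual files. The helper names parse_tokens and parse_spans are hypothetical, and the inline sample row only stands in for data/trial.csv so that the snippet runs on its own.

```python
import ast
import io

import pandas as pd

# Inline stand-in for data/trial.csv, built from the example values above.
# Assumption: comma-separated file with a header row; only some columns shown.
SAMPLE_CSV = (
    "topic,phrase,phrase_number,phrase_tokenized,num_statements,statement_spans\n"
    "abfalltrennung,Der Abfall kommt in Tonnen oder Säcke.,6_long,"
    "0:=Der 1:=Abfall 2:=kommt 3:=in 4:=Tonnen 5:=oder 6:=Säcke.,2,"
    '"[ [4], [6] ]"\n'
)


def parse_tokens(tokenized):
    """Turn '0:=Der 1:=Abfall ...' into {0: 'Der', 1: 'Abfall', ...}."""
    tokens = {}
    for item in tokenized.split():
        index, _, token = item.partition(":=")
        tokens[int(index)] = token
    return tokens


def parse_spans(spans):
    """Turn the string '[ [4], [6] ]' into the nested list [[4], [6]]."""
    return ast.literal_eval(spans)


# With the real file, this would be: df = pd.read_csv("data/trial.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

df["tokens"] = df["phrase_tokenized"].apply(parse_tokens)
df["spans"] = df["statement_spans"].apply(parse_spans)

# Resolve each statement span to the words it refers to.
df["statement_words"] = pd.Series(
    [
        [[tokens[i] for i in span] for span in spans]
        for tokens, spans in zip(df["tokens"], df["spans"])
    ],
    index=df.index,
)

print(df.loc[0, "statement_words"])  # [['Tonnen'], ['Säcke.']]
```

In this sketch the number of parsed spans can be compared with num_statements as a quick consistency check; rows annotated with 0 statements are kept as they are, so any filtering is left to the participant.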

Content notice for sensitive content

This dataset contains descriptions of violence, death, abuse and discrimination. We include this content because we randomly selected articles from the source site as a whole without applying any content-aware filtering, in order to maintain a neutral perspective on the data and avoid cherry-picking. Since we are not the authors of the articles, we do not take responsibility for their content. The chances of success of participants' submissions are independent of the content of the sentences and are based solely on linguistic and statistical evaluations.

This notice follows the ideas described in this article from Stanford.