dataset | Statement Segmentation in German Easy Language (StaGE)

The basis of the task will be a dataset consisting of several thousand entries. This dataset will be made public in several stages (see start page). Starting with the trial dataset, which is already available at data/trial.csv. The trial dataset gives participants an early opportunity to familiarize themselves with the structure and possible ways of processing the data. We also welcome any feedback and are happy to answer questions via e-mail.

We have started to upload the training data. It is available at data/train.csv and will be uploaded iteratively until the evaluation period starts. Be sure to check out the latest version the achieve the best results possible. Please notice, that we have included sentences with 0 statements, which means that those sentences are incomplete or erroneous. In this case, even sentences with multiple statements can be annotated by 0. Nonetheless, we left them in the data to create a real world experiences.

Our data structure:

Column name	Description	Example value
topic	Topic. Defined by the lower-case title of the hurraki article title.	abfalltrennung
phrase	Original phrase.	Der Abfall kommt in Tonnen oder Säcke.
phrase_number	Number of the phrase. `_long` indicates, that the phrase belongs to `Genaue Erklärung`	6_long
genre	Genre(s) extracted based on the hurraki Kategorien (categories)	Entsorgung\|Öffentliche_Verwaltung\|Seiten_mit_defekten_Dateilinks
timestamp	Time of the article’s last modification.	2016-07-28T07:55:27Z
phrase_tokenized	Phrase separated in tokens. Each token is given an index for referencing.	0:=Der 1:=Abfall 2:=kommt 3:=in 4:=Tonnen 5:=oder 6:=Säcke.
num_statements	Number of the statements.	2
statement_spans	List of statements. Each statement is a span of word, represented by their indices.	`[ [4], [6] ]`
author	Pseudonymized (md5-hash) author id	`76bf1508c054395f67a605468d76c22f`
notes	Optional notes by the annotators.	disjunctive coordinating conjunction used

Example:

Here is an example of how to process the data in python using pandas:

Content notice for sensitive content

This dataset contains descriptions of violence, death, abuse and discrimination. We are including this content, because we randomly selected articles from the source site as a whole without applying any content-aware filtering, in order to maintain a neutral perspective on the data and avoid any cherry-picking. Since we are not the authors of the articles, we do not take responsibility for the content. The chances of success of submissions from participants are independent of the content of the sentences, but are based solely on linguistic and statistical evaluations.

This notice follows the ideas describe in this article from Stanford