Data Release: 27 March 2017

E2E NLG Challenge

Entry Submission Deadline: 31 October 2017

Motivation

Natural language generation plays a critical role for Conversational Agents as it has a significant impact on a user’s impression of the system. This shared task focuses on recent end-to-end (E2E), data-driven NLG methods, which jointly learn sentence planning and surface realisation from non-aligned data (e.g. Wen et al., 2015; Mei et al., 2016; Dusek and Jurcicek, 2016; Lampouras and Vlachos, 2016).

So far, E2E NLG approaches have been limited to small, de-lexicalised data sets, e.g. BAGEL, SF Hotels/Restaurants, or RoboCup. In this shared challenge, we will provide a new crowd-sourced data set of 50k instances in the restaurant domain, as described in (Novikova, Lemon and Rieser, 2016). Each instance consists of a dialogue act-based meaning representation (MR) and up to 5 references in natural language. In contrast to previously used data, our data set includes additional challenges, such as open vocabulary, complex syntactic structures and diverse discourse phenomena. For example:

MR:

name[The Eagle],
eatType[coffee shop],
food[French],
priceRange[moderate],
customerRating[3/5],
area[riverside],
kidsFriendly[yes],
near[Burger King]

NL:

“The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King.”
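The MR in the example above is a flat list of slot[value] pairs, so it can be turned into a dictionary with a single regular expression. The following is a minimal sketch, assuming the slots are joined on one line with commas as shown; the function name parse_mr is hypothetical and is not part of the released data package.

```python
import re

# Matches one slot[value] pair, e.g. "eatType[coffee shop]".
SLOT_PATTERN = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_mr(mr: str) -> dict[str, str]:
    """Turn a flat MR string into a slot -> value mapping (hypothetical helper)."""
    return {slot: value.strip() for slot, value in SLOT_PATTERN.findall(mr)}


example = (
    "name[The Eagle], eatType[coffee shop], food[French], "
    "priceRange[moderate], customerRating[3/5], area[riverside], "
    "kidsFriendly[yes], near[Burger King]"
)
mr = parse_mr(example)
print(mr["name"])  # The Eagle
print(len(mr))     # 8
```

Note that the reference text realises some slots indirectly ("three star" for customerRating[3/5], "families" for kidsFriendly[yes]), which is why simple value lookup is not enough to align MRs with references.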

The full data set can now be downloaded here. A detailed description of the data can be found in our SIGDIAL 2017 paper. A brief summary of the E2E NLG Challenge results is now available in our INLG 2018 paper.

This challenge follows on from previous successful shared tasks on generation, e.g. SemEval’17 task 9 on text generation from AMR, and Generation Challenges 2008-11. However, this is the first NLG task to concentrate on (1) generation from dialogue acts and (2) the use of semantically unaligned data.

The Task

The task is to generate an utterance from a given MR, which is a) similar to human generated reference texts, and b) highly rated by humans. Similarity will be assessed using standard metrics, such as BLEU and METEOR. Human ratings will be obtained using a mixture of crowd-sourcing and expert annotations. We will also test a suite of novel metrics to estimate the quality of a generated utterance.

The metrics used for automatic evaluation are available on Github.
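For intuition about what word-overlap metrics such as BLEU measure, the sketch below computes clipped n-gram precision of one output against several references. It is a simplified illustration only: it omits the brevity penalty, the combination of n-gram orders and corpus-level aggregation, it assumes whitespace tokenisation and lowercasing, and it is not the official evaluation script. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, references: list[str], n: int) -> float:
    """Clipped n-gram precision of a hypothesis against several references."""
    hyp_counts = ngrams(hypothesis.lower().split(), n)
    if not hyp_counts:
        return 0.0
    # The clip limit for each n-gram is its maximum count in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


hyp = "the eagle is a coffee shop near burger king"
refs = [
    "the eagle is a french coffee shop",
    "find the eagle near burger king",
]
print(clipped_precision(hyp, refs, 1))  # 1.0
print(clipped_precision(hyp, refs, 2))  # 0.75
```

Having several references per MR matters here: the hypothesis is fully covered at the unigram level only because the two references are pooled.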

Download Data

The full E2E dataset is now available for download here. The package includes a description of the data format. A paper with a detailed description of the dataset appeared at SIGDIAL 2017 and is also available on arXiv.

To cite the dataset, use:

@inproceedings{novikova2017e2e,
  title={The {E2E} Dataset: New Challenges for End-to-End Generation},
  author={Novikova, Jekaterina and Du{\v{s}}ek, Ond\v{r}ej and Rieser, Verena},
  booktitle={Proceedings of the 18th Annual Meeting 
             of the Special Interest Group on Discourse and Dialogue},
  address={Saarbr\"ucken, Germany},
  year={2017},
  note={arXiv:1706.09254},
  url={https://arxiv.org/abs/1706.09254},
}

A package with the outputs of all participating systems on the test set as well as raw human ratings used for the evaluation is now available for download here. The package includes a short description of the data formats.

See the Proceedings section below for citing the E2E NLG Challenge results.

Baseline System

We used TGen (Dusek and Jurcicek, 2016) as the baseline system for the challenge. It is a seq2seq model with attention (Bahdanau et al., 2015) with added beam search and a reranker penalizing outputs that stray away from the input MR. The baseline scores on the development set are as follows:
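The idea of reranking beam-search outputs by their faithfulness to the input MR can be sketched as follows. This is an illustration under stated assumptions, not TGen's implementation: it detects mismatches by naive verbatim string matching of slot values, and the penalty weight of 100.0 as well as the names slot_errors and rerank are hypothetical.

```python
def slot_errors(mr: dict[str, str], text: str) -> int:
    """Count MR values that do not appear verbatim in the text (naive check)."""
    lowered = text.lower()
    return sum(1 for value in mr.values() if value.lower() not in lowered)


def rerank(mr: dict[str, str], candidates: list[tuple[str, float]],
           penalty: float = 100.0) -> list[tuple[str, float]]:
    """Sort (text, log_prob) beam candidates by penalised score, best first."""
    return sorted(
        candidates,
        key=lambda cand: cand[1] - penalty * slot_errors(mr, cand[0]),
        reverse=True,
    )


mr = {"name": "The Eagle", "food": "French", "near": "Burger King"}
beam = [
    ("The Eagle serves French food.", -1.2),                   # likelier, drops a slot
    ("The Eagle serves French food near Burger King.", -2.5),  # covers every slot
]
best_text, _ = rerank(mr, beam)[0]
print(best_text)  # The Eagle serves French food near Burger King.
```

Verbatim matching breaks down for slots such as kidsFriendly[yes], whose value is rarely realised literally, so a practical reranker needs a more robust way to decide which slots an output expresses.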

Metric | Score
BLEU | 0.6925
NIST | 8.4781
METEOR | 0.4703
ROUGE-L | 0.7257
CIDEr | 2.3987

The full baseline system outputs can be downloaded here for both the development and test sets (one instance per line). If you want to run the baseline yourself, basic instructions are provided in the TGen Github repository.

The scripts used for evaluation are available on Github.

Important Dates

13 March 2017: Registration opens
27 March 2017: Training and development data are released (MRs + references)
27 June 2017: The baseline system is released
16 October 2017: Test data is released (MRs only)
31 October 2017: Entry submission deadline
15 November 2017: Evaluation results are released
15 December 2017: Participants submit a paper describing their systems
1 March 2018: Final versions of the description papers due
7 November 2018: Results presented at INLG

Evaluation Results

We are happy to announce that interest in the E2E NLG shared task has far exceeded our expectations. Heriot-Watt University ran this challenge for the first time this year, and we received a total of 62 submissions from 17 institutions, with about 1/3 of these submissions coming from industry. In comparison, the well-established Conference on Machine Translation WMT’17 (running since 2006) had 31 institutions submitting to a total of 8 tasks.

Participants map

A brief summary of the E2E NLG Challenge results is now available in our INLG 2018 paper; a more detailed analysis is in preparation.

Automatic Metrics

The automatic evaluation results were obtained using the metrics scripts provided with the baseline.

Submitter | Affiliation | System name | P? | BLEU | NIST | METEOR | ROUGE_L | CIDEr
BASELINE | Heriot-Watt University | Baseline |  | 0.6593 | 8.6094 | 0.4483 | 0.6850 | 2.2338
Biao Zhang | Xiamen University | bzhang_submit |  | 0.6545 | 8.1840 | 0.4392 | 0.7083 | 2.1012
Chen Shuang | Harbin Institute of Technology | Abstract-beam1 |  | 0.5854 | 5.4691 | 0.3977 | 0.6747 | 1.6391
Chen Shuang | Harbin Institute of Technology | Abstract-beam2 |  | 0.5916 | 5.9477 | 0.3974 | 0.6701 | 1.6513
Chen Shuang | Harbin Institute of Technology | Abstract-beam3 |  | 0.6150 | 6.8029 | 0.4068 | 0.6750 | 1.7870
Chen Shuang | Harbin Institute of Technology | Abstract-greedy |  | 0.6635 | 8.3977 | 0.4312 | 0.6909 | 2.0788
Chen Shuang | Harbin Institute of Technology | NonAbstract-beam2 |  | 0.5860 | 6.1602 | 0.3833 | 0.6619 | 1.6133
Chen Shuang | Harbin Institute of Technology | NonAbstract-beam3 |  | 0.6088 | 6.9790 | 0.3899 | 0.6628 | 1.7015
Chen Shuang | Harbin Institute of Technology | Primary_NonAbstract-beam1 |  | 0.5859 | 5.4383 | 0.3836 | 0.6714 | 1.5790
ZHAW | Zurich University of Applied Sciences | base |  | 0.6544 | 8.3391 | 0.4448 | 0.6783 | 2.1438
ZHAW | Zurich University of Applied Sciences | primary_1 |  | 0.5864 | 8.0212 | 0.4322 | 0.5998 | 1.8173
ZHAW | Zurich University of Applied Sciences | primary_2 |  | 0.6004 | 8.1394 | 0.4388 | 0.6119 | 1.9188
FORGe | Pompeu Fabra University | E2E_UPF_1 |  | 0.4207 | 6.5139 | 0.3685 | 0.5437 | 1.3106
FORGe | Pompeu Fabra University | E2E_UPF_2 |  | 0.4113 | 6.3293 | 0.3686 | 0.5593 | 1.2467
FORGe | Pompeu Fabra University | E2E_UPF_3 |  | 0.4599 | 7.1092 | 0.3858 | 0.5611 | 1.5586
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var1 |  | 0.6015 | 8.3075 | 0.4405 | 0.6778 | 2.1775
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var2 |  | 0.6233 | 8.1751 | 0.4378 | 0.6887 | 2.2840
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var3 |  | 0.5690 | 8.0382 | 0.4202 | 0.6348 | 2.0956
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var4 |  | 0.5799 | 7.9163 | 0.4310 | 0.6670 | 2.0691
Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var1 |  | 0.5436 | 5.7462 | 0.3561 | 0.6152 | 1.4130
Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var2 |  | 0.5356 | 7.8373 | 0.3831 | 0.5513 | 1.5825
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_1 |  | 0.6581 | 8.5719 | 0.4409 | 0.6893 | 2.1065
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_2 |  | 0.6618 | 8.6025 | 0.4571 | 0.7038 | 2.3371
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_3 |  | 0.6737 | 8.6061 | 0.4523 | 0.7084 | 2.3056
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | Primary_main_1 |  | 0.6496 | 8.5268 | 0.4386 | 0.6872 | 2.0850
Heng Gong | Harbin Institute of Technology | Primary_test_2 |  | 0.6422 | 8.3453 | 0.4469 | 0.6645 | 2.2721
Heng Gong | Harbin Institute of Technology | test_1 |  | 0.6396 | 8.3111 | 0.4466 | 0.6620 | 2.2272
Heng Gong | Harbin Institute of Technology | test_3 |  | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442
Heng Gong | Harbin Institute of Technology | test_4 |  | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442
Adapt | Adapt | primary_submission-temperature_1.1 |  | 0.5092 | 7.1954 | 0.4025 | 0.5872 | 1.5039
Adapt | Adapt | supporting_submission-temperature_0.9 |  | 0.5573 | 7.7013 | 0.4154 | 0.6130 | 1.8110
Adapt | Adapt | supporting_submission-temperature_1.0 |  | 0.5265 | 7.3991 | 0.4095 | 0.5992 | 1.6488
<anonymous 1> | <anonymous 1> | <anonymous 1 combined> |  | 0.2921 | 4.7690 | 0.2515 | 0.4361 | 0.6674
<anonymous 1> | <anonymous 1> | <anonymous 1 primary> |  | 0.4723 | 6.1938 | 0.3170 | 0.5616 | 1.2127
Shubham Agarwal | NLE | submission_primary |  | 0.6534 | 8.5300 | 0.4435 | 0.6829 | 2.1539
Shubham Agarwal | NLE | submission_second |  | 0.6669 | 8.5388 | 0.4484 | 0.6991 | 2.2239
Shubham Agarwal | NLE | submission_third |  | 0.6676 | 8.5416 | 0.4485 | 0.6991 | 2.2276
UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug |  | 0.6619 | 8.6130 | 0.4454 | 0.6772 | 2.2615
UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug-alt (late submission) |  | 0.6035 | 8.3954 | 0.4369 | 0.5991 | 2.1019
Thomson Reuters NLG | Thomson Reuters | NonPrimary_1_test_output_model_11_post |  | 0.6536 | 8.3293 | 0.4550 | 0.6805 | 2.1050
Thomson Reuters NLG | Thomson Reuters | NonPrimary_2_test_output_model_13_post |  | 0.6562 | 8.3942 | 0.4571 | 0.6876 | 2.1706
Thomson Reuters NLG | Thomson Reuters | NonPrimary_3_test_output_beam_5_model_11_post |  | 0.6805 | 8.7777 | 0.4462 | 0.6928 | 2.3195
Thomson Reuters NLG | Thomson Reuters | NonPrimary_4_test_output_beam_5_model_13_post |  | 0.6742 | 8.6590 | 0.4499 | 0.6983 | 2.3018
Thomson Reuters NLG | Thomson Reuters | NonPrimary_5_submission_6 |  | 0.6208 | 8.0632 | 0.4417 | 0.6692 | 2.1127
Thomson Reuters NLG | Thomson Reuters | NonPrimary_6_submission_4_beam |  | 0.6201 | 8.0938 | 0.4419 | 0.6740 | 2.1251
Thomson Reuters NLG | Thomson Reuters | NonPrimary_7_submission_4 |  | 0.6182 | 8.0616 | 0.4417 | 0.6729 | 2.0783
Thomson Reuters NLG | Thomson Reuters | NonPrimary_8_test_train_only |  | 0.4111 | 6.7541 | 0.3970 | 0.5435 | 1.4096
Thomson Reuters NLG | Thomson Reuters | Primary_1_submission_6_beam |  | 0.6336 | 8.1848 | 0.4322 | 0.6828 | 2.1425
Thomson Reuters NLG | Thomson Reuters | Primary_2_test_train_dev |  | 0.4202 | 6.7686 | 0.3968 | 0.5481 | 1.4389
UCSC-TNT-NLG | UC Santa Cruz | System 1/Primary-Sys1 |  | 0.6561 | 8.5105 | 0.4517 | 0.6839 | 2.2183
UCSC-TNT-NLG | UC Santa Cruz | System 1/Sys1-Model1 |  | 0.6476 | 8.4301 | 0.4508 | 0.6795 | 2.1233
UCSC-TNT-NLG | UC Santa Cruz | System 2/Primary-Sys2 |  | 0.6502 | 8.5211 | 0.4396 | 0.6853 | 2.1670
UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model1 |  | 0.6606 | 8.6223 | 0.4439 | 0.6772 | 2.1997
UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model2 |  | 0.6563 | 8.5482 | 0.4482 | 0.6835 | 2.1953
UCSC-TNT-NLG | UC Santa Cruz | System 2/Sys2-Model3 |  | 0.3681 | 6.6004 | 0.3846 | 0.5259 | 1.5205
UIT-DANGNT | VNU-HCM University of Information Technology | test_e2e_result_2 final_TSV |  | 0.5990 | 7.9277 | 0.4346 | 0.6634 | 2.0783
UKP-TUDA | Technische Universität Darmstadt | test_e2e-Puzikov |  | 0.5657 | 7.4544 | 0.4529 | 0.6614 | 1.8206
Note: “P?” denotes primary submissions.

Human Evaluation (updated results)

The human evaluation was conducted on the 20 primary systems and the baseline using the CrowdFlower platform. We used our newly developed RankME method (Novikova et al., 2018) to obtain the ratings. Crowd workers were presented with five randomly selected outputs of different systems corresponding to a single meaning representation, and were asked to rank these systems from best to worst, with ties permitted. A single human-authored reference was provided for comparison. We collected separate ranks for quality and naturalness.

Quality is defined as the overall quality of the utterance, in terms of its grammatical correctness, fluency, adequacy and other important factors. When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation.

Naturalness is defined as the extent to which the utterance could have been produced by a native speaker. When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation.

If used in a real-life NLG system, quality would be considered the primary measure.

The final evaluation results were produced using the TrueSkill algorithm (Sakaguchi et al., 2014). For naturalness, the algorithm performed 1890 pairwise comparisons per system (37800 comparisons in total); for quality, it performed 1260 comparisons per system (25200 comparisons in total). In the results tables, systems are ordered by their inferred TrueSkill scores and grouped into clusters. Systems within a cluster are considered tied. The clusters were created using bootstrap resampling at a significance level of p ≤ 0.05.
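TrueSkill works on pairwise outcomes, so each crowd-sourced ranking has to be expanded into comparisons between pairs of systems first. The sketch below shows one straightforward way to do this for a single ranking of five outputs with ties. Exactly how the organisers performed this expansion is an assumption here, and the system labels and the function name ranking_to_pairs are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranks: dict[str, int]) -> list[tuple[str, str, str]]:
    """Expand one ranking (lower rank = better, ties allowed) into pairwise outcomes.

    Each outcome is (system_a, system_b, result) with result in {"<", ">", "="},
    where "<" means system_a was ranked better than system_b.
    """
    pairs = []
    for a, b in combinations(sorted(ranks), 2):
        if ranks[a] < ranks[b]:
            result = "<"
        elif ranks[a] > ranks[b]:
            result = ">"
        else:
            result = "="
        pairs.append((a, b, result))
    return pairs


# One crowd worker's ranking of five system outputs for the same MR.
ranking = {"sys_A": 1, "sys_B": 2, "sys_C": 2, "sys_D": 4, "sys_E": 5}
pairs = ranking_to_pairs(ranking)
print(len(pairs))  # 10
```

A single ranking of five outputs therefore yields ten pairwise comparisons, which is how a moderate number of ranking tasks adds up to the large comparison counts reported above.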

Quality

Cluster | TrueSkill | Range | System name | Submitter
1 | 0.300 | (1.0, 1.0) | Slug2Slug | UCSC-Slug2Slug
2 | 0.228 | (2.0, 4.0) | ukp-tuda | UKP-TUDA
2 | 0.213 | (2.0, 5.0) | Primary_test_2 | Heng Gong
2 | 0.184 | (3.0, 5.0) | test_e2e_result_2_final_TSV | UIT-DANGNT
2 | 0.184 | (3.0, 6.0) | Baseline | BASELINE
2 | 0.136 | (5.0, 7.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug
2 | 0.117 | (6.0, 8.0) | primary_2 | ZHAW
2 | 0.084 | (7.0, 10.0) | System 1/Primary-Sys1 | UCSC-TNT-NLG
2 | 0.065 | (8.0, 10.0) | System 2/Primary-Sys2 | UCSC-TNT-NLG
2 | 0.048 | (8.0, 12.0) | submission_primary | NLE
2 | 0.018 | (10.0, 13.0) | primary_1 | ZHAW
2 | 0.014 | (10.0, 14.0) | E2E_UPF_1 | FORGe
2 | -0.012 | (11.0, 14.0) | sheffield_primarySystem1_var1 | Sheffield NLP
2 | -0.012 | (11.0, 14.0) | Primary_main_1 | HarvardNLP & Henry Elder
3 | -0.078 | (15.0, 16.0) | Primary_2_test_train_dev | Thomson Reuters NLG
3 | -0.083 | (15.0, 16.0) | E2E_UPF_3 | FORGe
4 | -0.152 | (17.0, 19.0) | primary_submission-temperature_1.1 | Adapt
4 | -0.185 | (17.0, 19.0) | Primary_1_submission_6_beam | Thomson Reuters NLG
4 | -0.186 | (17.0, 19.0) | bzhang_submit | Biao Zhang
5 | -0.426 | (20.0, 21.0) | Primary_NonAbstract-beam1 | Chen Shuang
5 | -0.457 | (20.0, 21.0) | sheffield_primarySystem2_var1 | Sheffield NLP

Naturalness

Cluster | TrueSkill | Range | System name | Submitter
1 | 0.211 | (1.0, 1.0) | sheffield_primarySystem2_var1 | Sheffield NLP
2 | 0.171 | (2.0, 3.0) | Slug2Slug | UCSC-Slug2Slug
2 | 0.154 | (2.0, 4.0) | Primary_NonAbstract-beam1 | Chen Shuang
2 | 0.126 | (3.0, 6.0) | Primary_main_1 | HarvardNLP & Henry Elder
2 | 0.105 | (4.0, 8.0) | submission_primary | NLE
2 | 0.101 | (4.0, 8.0) | Baseline | BASELINE
2 | 0.091 | (5.0, 8.0) | test_e2e_result_2 final_TSV | UIT-DANGNT
2 | 0.077 | (5.0, 10.0) | ukp-tuda | UKP-TUDA
2 | 0.060 | (7.0, 11.0) | System 2/Primary-Sys2 | UCSC-TNT-NLG
2 | 0.046 | (9.0, 12.0) | Primary_test_2 | Heng Gong
2 | 0.027 | (9.0, 12.0) | System 1/Primary-Sys1 | UCSC-TNT-NLG
2 | 0.027 | (10.0, 12.0) | bzhang_submit | Biao Zhang
3 | -0.053 | (13.0, 16.0) | Primary_1_submission_6_beam | Thomson Reuters NLG
3 | -0.073 | (13.0, 17.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug
3 | -0.077 | (13.0, 17.0) | sheffield_primarySystem1_var1 | Sheffield NLP
3 | -0.083 | (13.0, 17.0) | primary_2 | ZHAW
3 | -0.104 | (15.0, 17.0) | primary_1 | ZHAW
4 | -0.144 | (18.0, 19.0) | E2E_UPF_1 | FORGe
4 | -0.164 | (18.0, 19.0) | primary_submission-temperature_1.1 | Adapt
5 | -0.243 | (20.0, 21.0) | Primary_2_test_train_dev | Thomson Reuters NLG
5 | -0.255 | (20.0, 21.0) | E2E_UPF_3 | FORGe

Proceedings (Full System Descriptions)

A brief description of the challenge results was published at INLG. To cite the challenge, use:

@inproceedings{dusek2018findings,
  title={Findings of the {E2E} {NLG} {Challenge}},
  author={Du{\v{s}}ek, Ond\v{r}ej and Novikova, Jekaterina and Rieser, Verena},
  booktitle={Proceedings of the 11th International Conference 
             on Natural Language Generation},
  address={Tilburg, The Netherlands},
  year={2018},
  note={arXiv:1810.01170},
  url={https://arxiv.org/abs/1810.01170},
}

System outputs and human ratings can now be downloaded from here. Please use the same citation to refer to this data release.

All submitters participating in human evaluation provided a description of their primary systems as a technical paper. The papers are linked below:

System | Paper
Adapt | Henry Elder, Sebastian Gehrmann, Alexander O'Connor and Qun Liu: E2E NLG Challenge Submission: Towards Controllable Generation of Diverse Natural Language
Chen Shuang | Shuang Chen: A General Model for Neural Text Generation from Structured Data
FORGe (both systems) | Simon Mille and Stamatia Dasiopoulou: FORGe at E2E 2017
HarvardNLP & Henry Elder | Sebastian Gehrmann, Falcon Z. Dai, Henry Elder and Alexander M. Rush: End-to-End Content and Plan Selection for Natural Language Generation
Heng Gong | (final paper pending)
NLE | Shubham Agarwal, Marc Dymetman and Éric Gaussier: A char-based seq2seq submission to the E2E NLG Challenge
Sheffield NLP (both systems) | Mingjie Chen, Gerasimos Lampouras and Andreas Vlachos: Sheffield at E2E: structured prediction approaches to end-to-end language generation
UCSC-Slug2Slug | Juraj Juraska, Panagiotis Karagiannis, Kevin K. Bowden and Marilyn A. Walker: Slug2Slug: A Deep Ensemble Model with Slot Alignment for Sequence-to-Sequence Natural Language Generation
UCSC-TNT-NLG, System 1 | Shereen Oraby, Lena Reed, Shubhangi Tandon, Sharath T.S., Stephanie Lukin and Marilyn Walker: TNT-NLG, System 1: Using a Statistical NLG to Massively Augment Crowd-Sourced Data for Neural Generation
UCSC-TNT-NLG, System 2 | Shubhangi Tandon, Sharath T.S., Shereen Oraby, Lena Reed, Stephanie Lukin and Marilyn Walker: TNT-NLG, System 2: Data Repetition and Meaning Representation Manipulation to Improve Neural Generation
Thomson Reuters NLP, System 1 | Elnaz Davoodi, Charese Smiley, Dezhao Song and Frank Schilder: The E2E NLG Challenge: Training a Sequence-to-Sequence Approach for Meaning Representation to Natural Language Sentences
Thomson Reuters NLP, System 2 | Charese Smiley, Elnaz Davoodi, Dezhao Song and Frank Schilder: The E2E NLG Challenge: End-to-End Generation through Partial Template Mining
UIT-DANGNT | Dang Tuan Nguyen and Trung Tran: Structure-based Generation System for E2E NLG Challenge
UKP-TUDA | Yevgeniy Puzikov and Iryna Gurevych: E2E NLG Challenge: Neural Models vs. Templates
Biao Zhang | Biao Zhang, Jing Yang, Qian Lin and Jinsong Su: Attention Regularized Sequence-to-Sequence Learning for E2E NLG Challenge
ZHAW (both systems) | Jan Deriu and Mark Cieliebak: End-to-End Trainable System for Enhancing Diversity in Natural Language Generation

Other papers using the E2E dataset

Published versions of the systems participating in the Challenge:

Further works that use the E2E dataset but did not participate in the official E2E challenge:

Contacts

Organising Committee

Jekaterina Novikova
Ondrej Dusek
Verena Rieser

Heriot-Watt University, Edinburgh, UK.

Contact Details

e2e-nlg-challenge@googlegroups.com

Advisory Committee

Mohit Bansal, University of North Carolina at Chapel Hill
Ehud Reiter, University of Aberdeen
Amanda Stent, Bloomberg
Andreas Vlachos, University of Sheffield
Marilyn Walker, University of California Santa Cruz
Matthew Walter, Toyota Technological Institute at Chicago
Tsung-Hsien Wen, University of Cambridge
Luke Zettlemoyer, University of Washington