Workshop on
Cross-Framework and Cross-Domain
Parser Evaluation

Manchester, August 23, 2008

(In connection with the 22nd International Conference on Computational Linguistics)

Background and Motivation

Broad-coverage parsing has come to a point where distinct approaches can offer (seemingly) comparable performance: statistical parsers acquired from the PTB; data-driven dependency parsers; ‘deep’ parsers trained off enriched treebanks (in linguistic frameworks like CCG, HPSG, or LFG); and hybrid ‘deep’ parsers, employing hand-built grammars in, for example, HPSG, LFG, or LTAG. Evaluation against trees in the WSJ section of the Penn Treebank (PTB) has helped advance parsing research over the course of the past decade. Despite some scepticism, the crisp and, over time, stable task of maximizing ParsEval metrics over PTB trees has served as a dominating benchmark. However, modern treebank parsers still restrict themselves to only a subset of PTB annotation; there is reason to worry about the idiosyncrasies of this particular corpus; it remains unknown how much the ParsEval metric (or any intrinsic evaluation) can inform NLP application developers; and PTB-style analyses leave a lot to be desired in terms of linguistic information.

The Grammatical Relations (GR) scheme, inspired by Dependency Grammar, offers a level of abstraction over specific syntactic analyses. It aims to capture the ‘gist’ of grammatical relations in a fashion that avoids reference to any particular linguistic theory. GR has recently been applied successfully in a series of cross-framework parser evaluation studies. At the same time, rather little GR gold standard data is available, and the GR scheme has been questioned for some of its design decisions. More specifically, GR builds on a combination of syntactic and some, albeit very limited, semantic information. Existing studies suggest that the GR gold standard can be overly rich in some respects and overly shallow in others. Furthermore, the mapping of ‘native’ parser outputs into GR introduces noise, and it raises a number of theoretical and practical questions.

Gold standard representations at the level of propositional semantics have at times been proposed for cross-framework parser evaluation, specifically where the parsing task is broadly construed as a tool towards ‘text understanding’, i.e. where the parser is to provide all information that is grammaticalized and contributing to interpretation. PropBank would seem a candidate gold standard, but to date very few studies exist that report on the use of PropBank for parser evaluation. The reasons might be that (at least some) parser developers believe that PropBank goes too far beyond the grammatical level to serve for parser evaluation, and that starting from PTB structures may have led to some questionable annotation decisions.

Finally, a complementary topic to cross-framework evaluation is the increasing demand for cross-domain parser evaluation. At conferences in 2007, concerns were expressed about results that might rely on particular properties of the WSJ PTB, and about idiosyncrasies of this specific sample of natural language. For example, it remains a largely open question to what degree progress made in PTB parsing can carry over to other genres and domains; a related question concerns the fitness of some specific approach (when measured in parser evaluation metrics) for actual NLP applications. In summary, it may be necessary that the WSJ- and PTB-derived parser benchmarks be complemented by other gold standards, both in terms of the selection of texts and target representations. And to further the adaptation of parser evaluation to more languages, it will be important to carefully distill community experience from ParsEval and GR evaluations.

This workshop aims to bring together developers of broad-coverage parsers who are interested in questions of target representations and cross-framework and cross-domain evaluation and benchmarking. From informal discussions that the co-organizers had among themselves and with colleagues, it seems evident that there is comparatively broad awareness of current issues in parser evaluation, and a lively interest in detailed exchange of experience (and beliefs). Specifically, the organizers hope to attract representatives from diverse parsing approaches and frameworks, ranging from ‘traditional’ treebank parsing, through data-driven dependency parsing, to parsing in specific linguistic frameworks. For the latter class of parsers, in many frameworks there is a further sub-division into groups pursuing ‘classic’ grammar engineering vs. those that rely on grammar acquisition from annotated corpora.

Quite likely for the first time in the history of these approaches, there now exist large, broad-coverage parsing systems representing diverse traditions that can be applied to running text, often producing comparable representations. In our view, these recent developments present a new opportunity for re-energizing parser evaluation research.

Workshop Programme

Session 1 (Chair: Ann Copestake)
9:00–9:30 Workshop Motivation and Overview
Aoife Cahill, Yusuke Miyao, and Stephan Oepen
9:30–10:00 The Stanford Typed Dependencies Representation
Marie-Catherine de Marneffe and Christopher D. Manning
10:00–10:30 Exploring an Auxiliary Distribution Based Approach to Domain Adaptation of a Syntactic Disambiguation Model
Barbara Plank and Gertjan van Noord
10:30–11:00 Coffee Break

Session 2 (Chair: John Carroll)
11:00–11:30 Toward an Underspecifiable Corpus Annotation Scheme
Yuka Tateisi
11:30–12:00 Toward a Cross-Framework Parser Annotation Standard
Dan Flickinger
12:30–14:00 Lunch Break

Session 3 (Chair: Kenji Sagae)
14:00–14:30 Report from CoNLL 2008 Shared Task
Joakim Nivre
14:30–15:00 Parser Evaluation Across Frameworks without Format Conversion
Wai Lok Tam, Yo Sato, Yusuke Miyao and Junichi Tsujii
15:00–15:30 Large Scale Production of Syntactic Annotations to Move Forward
Anne Vilnat, Gil Francopoulo, Olivier Hamon, Sylvain Loiseau, Patrick Paroubek and Eric Villemonte de la Clergerie
15:30–16:00 Coffee Break

Session 4 (Chair: Tracy Holloway King)
16:00–16:30 Constructing a Parser Evaluation Scheme
Laura Rimell and Stephen Clark
16:30–17:00 ‘Deep’ Grammatical Relations for Semantic Interpretation
Mark McConville and Myroslava O. Dzikovska

Workshop Proceedings

It appears the proceedings volumes for COLING 2008 workshops did not (yet) make it into the ACL Anthology. Hence, we provide an electronic copy of the complete workshop proceedings.

Call for Papers

The workshop organizers invite papers on all aspects of parser evaluation, qualitative and quantitative, including but not limited to:

Given the general theme of this workshop, submissions that discuss aspects of cross-framework, cross-domain, or cross-linguistic parser evaluation are especially welcome.

All submissions must be formatted according to the standard templates for the main conference. Unlike in the main conference, however, submissions must not exceed a length of six pages, excluding bibliographic references (i.e. the bibliography is not counted against the page limit). For papers accepted for presentation at the workshop, we anticipate making available another few pages, so as to enable authors to accommodate reviewer suggestions, as appropriate. Submissions must be entered into the on-line START system on or before the submission deadline (May 5, see below), as a single PDF file. Submissions should be anonymous, i.e. they should not show author names and affiliations on the title page and should avoid self-references that would reveal the identity of authors. See the general guidelines for the main conference for further background.

All papers submitted on or before May 5 will be reviewed by at least three members of the programme committee. As part of the electronic submission procedure, you will be asked to report potential conflicts of interest for members of the programme committee. To ensure fair and blind reviewing, we want to avoid assigning a submission to a close colleague, active collaborator, family member, or personal friend of any of the co-authors on the submission. Reciprocally, all committee members will initially register conflicts of interest with any of the submissions received, and the conference management tool ensures that no reviewing information on such submissions is accessible to committee members with conflicts of interest.

Lightweight Shared Task

One of the workshop goals is to establish improved shared knowledge among participants of the strengths and weaknesses of extant annotation and evaluation schemes. To create a joint focus and enable in-depth discussion, there will be a ‘lightweight’ shared task. For a selection of 50 sentences (of which ten are considered obligatory, the rest optional) for which PTB, GR, and PropBank (and maybe other) annotations are available, we will invite contributors to scrutinize existing gold-standard representations contrastively, identify perceived deficiencies, and sketch what can be done to address these. As an optional component, participants in the shared task are welcome to include ‘native’, framework-specific output representations and actual results for a parsing system of their choice (be it their own or not) in the contrastive study. In either case, submissions to the shared task should aim to reflect on the nature of different representations, highlight which additional distinctions are made in either scheme, and argue why these are useful (for some task) or unmotivated (in general).

Details on the shared task and the relevant data will be made available on or before March 22 through a dedicated parser evaluation shared task web site. Authors submitting to the workshop can decide whether they want to participate in the shared task or submit a more general paper on parser evaluation (or both). Depending on the volume and distribution of accepted papers, we anticipate that the presentation and discussion of shared task results may account for about half of the available time at the workshop. Finally, where format conversions have been established successfully, we will establish a repository of existing tools (post-processing scripts, typically) and results that we hope may provide utility beyond the workshop itself.

Important Dates

Initial Call for Papers: March 1
Shared Task Release: March 22
Paper Submission Deadline: May 5
Notification of Acceptance: June 6
Camera-Ready Papers Deadline: July 1
One-Day Workshop: August 23

Workshop Organizers and Programme Committee

The workshop aims to appeal to a wide range of researchers across frameworks; hence, it has a relatively large and diverse group of organizers. The co-organizers will jointly make all decisions regarding the workshop form and programme, and it is expected that most of the co-organizers will participate in the actual workshop.

Stephan Oepen (oe@ifi.uio.no) serves as the administrative contact for the workshop. To direct questions or feedback to the full group of organizers, please feel free to email them. Please include the string PE08 in the subject line of all email sent to the organizers.

(revision: 6-mar-10; oe)