CORAL - Labelled spoken dialogue corpus

This page is available in Portuguese.

Consortium
Team
Summary of initial proposal
Summary of main achievements
Tasks:
- T1 - Corpus specification
- T2 - Collection of spoken corpus
- T3 - Orthographic labelling
- T4 - Phonetic labelling
- T5 - Prosodic labelling
- T6 - Syntactic labelling
- T7 - Semantic labelling
- T8 - Mapping between prosody and syntax / semantics
- T9 - CDROM structure
List of publications in the scope of the project:
- Spoken Language Corpora for Speech Recognition and Synthesis in European Portuguese, (in alphabetical order) C. Martins, I. Mascarenhas, H. Meinedo, J. Neto, L. Oliveira, C. Ribeiro, I. Trancoso, C. Viana, RECPAD'98 - Proc. 10th Portuguese Conference on Pattern Recognition, Lisbon, March 1998.
- Apresentação do Projecto CORAL - Corpus de Diálogo Etiquetado, C. Viana, I. Trancoso, I. Mascarenhas, I. Duarte, G. Matos, L. Oliveira, H. Campos, C. Correia (orally presented by I. Trancoso), 1º Workshop de Linguística Computacional, Lisbon, May 1998.
- La Négation en Linguistique - Quelques Configurations Spécifiques, H. Campos, Proceedings of Colóquio sobre Filosofia da Linguagem, Linguística e Operações Cognitivas, FCSH, June 1998, to be published in the special issue of Cadernos de Filosofia, Instituto de Filosofia da Linguagem, UNL.
- A Negação Polémica num Corpus de Diálogo, H. Campos, C. Correia, Proceedings of XIV Encontro da Associação Portuguesa de Linguística, Aveiro, September 1998.
- Mapeamento Sintáctico-Prosódico em PE (Evidência Fornecida por um Corpus de Fala Espontânea), I. Duarte, C. Viana, G. Matos, I. Trancoso, J. Costa, I. Mascarenhas, Summary of the oral presentation at XIV Encontro Nacional da Associação Portuguesa de Linguística, Aveiro, September 1998.
- Corpus de Diálogo CORAL, I. Trancoso, C. Viana, I. Duarte, G. Matos, PROPOR'98 - Proceedings of III Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, Porto Alegre, Brazil, November 1998.
Example of orthographic labelling (pilot dialogue; see the maps and general description in the oral presentation of the project mentioned above). Given their length, annotation files at other levels were not included on this page.

Consortium

INESC (Instituto de Engenharia de Sistemas e Computadores), Lisbon
CLUL (Centro de Linguística da Universidade de Lisboa)
FLUL (Faculdade de Letras da Universidade de Lisboa)
FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa)

Summary of initial proposal

The purpose of this project is the collection of a spoken dialogue corpus with several levels of labelling: orthographic, phonetic, prosodic, syntactic and semantic. The corpus should be sufficiently representative in terms of number of speakers, and it should focus on a selected theme in order to limit a priori the vocabulary used. This type of corpus is essential to research in the processing of spontaneous speech, which is characterised by a number of phenomena that seriously hinder its automatic understanding: hesitations, restarts, ill-formed sentences, etc. It is also essential for the study of dialogues, in particular of their structuring and integration with speech recognition. The project does not envisage studying all these problems, but rather creating a linguistic infrastructure which enables this study in future projects by interdisciplinary research teams, such as the one involved in its creation. It is therefore important that, besides including the transliteration of the entire corpus, with an indication of all the para-linguistic phenomena, the corpus also include labelling at other levels: phonetic, prosodic, syntactic and semantic. Although there are automatic tools for certain types of labelling, their robustness for spontaneous speech is considerably lower than for read speech, which means that most of this work is manual and therefore demands human resources well above the scope of this program. Hence, only a subset of the corpus will include all the types of labelling.

The project starts with a design phase in which the topic will be chosen, and the number of speakers and other parameters whose variability we wish to study will be specified. This phase will be followed by the collection phase and the successive labelling stages, with some overlap between them. The project ends with the preparation and packaging of the data files for CD-ROM pressing, in order to allow their wide dissemination among the community of Portuguese language researchers.

Summary of main achievements

The main achievement of the CORAL project was the production of a linguistic resource that did not exist for European Portuguese at the time of its proposal: a spoken dialogue corpus with several levels of labelling, which is sufficiently significant in terms of number of speakers (32, grouped into 8 quartets, amounting to 64 dialogues), and which is focused on a pre-selected theme (the well-known map task) in order to restrict a priori the scope of the vocabulary.

This type of corpus is, in fact, essential for the progress of research in processing spontaneous speech, which is characterised by several phenomena that seriously affect the task of automatic speech understanding. This type of corpus is also important for the study of dialogue, particularly of its structure and relationship with speech understanding in the scope of spoken human-machine interfaces. We think that such a linguistic resource will allow the study of the above-mentioned problems in projects to be defined at a later stage.

A systematic exploitation of this corpus, ranging from testing the adequacy of the segmentation/labelling criteria to a more detailed study of the mapping between several analysis levels, is clearly beyond the objectives of the proposal.

The corpus is presently available on 5 CDROMs, amounting to 1.6 GB if only signal files are counted, assuming a sampling frequency of 16 kHz. It can also be made available in wav format. All dialogues have been annotated orthographically. Only a relatively small subset has been annotated at different levels. The only multi-level annotated dialogue included in the CDROMs is the pilot dialogue. For further information about the corpus and its availability, please contact Isabel Trancoso.
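As a rough illustration, the size and sampling frequency quoted above can be turned into an approximate total recording time. The sketch below is only an estimate under assumptions that the text does not state: 16-bit linear PCM samples, a single channel per signal file, and the 1.6 figure read as 1.6 x 10^9 bytes. The function name `estimated_hours` is hypothetical and introduced here purely for illustration.

```python
# Rough estimate of total recording time from the figures quoted in the text.
# Assumptions (NOT stated in the source): 16-bit linear PCM, one channel,
# and the corpus size read as 1.6e9 bytes.

SAMPLE_RATE_HZ = 16_000   # from the text
CORPUS_BYTES = 1.6e9      # from the text, read as gigabytes
BYTES_PER_SAMPLE = 2      # assumption: 16-bit samples
CHANNELS = 1              # assumption: mono signal files


def estimated_hours(total_bytes, sample_rate_hz, bytes_per_sample, channels):
    """Return the audio duration in hours implied by a raw signal size."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


hours = estimated_hours(CORPUS_BYTES, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
print(f"Estimated duration: {hours:.1f} hours")
```

Under these assumptions the signal files correspond to roughly 14 hours of audio. If the signal files were stored with two channels (for instance one per speaker), the same size would correspond to about half that duration.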

Start: 30/12/96

End: 30/06/99 (extension of 6 months relative to the initial plan of 2 years)

Isabel Trancoso
03/11/99