MMT – Modern Machine Translation

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 645487.

MMT – Modern Machine Translation
Data Management Plan

Author(s): Eduard Barbu, Nicola Bertoldi, Achim Ruopp, Anna Samiotou
Dissemination Level: Public
Date: June 30th, 2015

Grant agreement no.: 645487
Project acronym: MMT
Project full title: MMT will deliver a language independent commercial online translation service based on a new open-source machine translation distributed architecture
Funding scheme: Collaborative project
Coordinator: Alessandro Cattelan (TRANSLATED)
Start date, duration: 1 January 2015, 36 months
Distribution: Public
Contractual date of delivery: June 30th, 2015
Actual date of delivery: July 1st, 2015
Deliverable number: 6.2
Deliverable title: Data Management Plan
Type: Report
Status and version: Final
Number of pages: 9
Contributing partners: TRANSLATED, FBK, TAUS
WP leader: TRANSLATED
Task leader: TRANSLATED
Authors: Eduard Barbu, Nicola Bertoldi, Achim Ruopp, Anna Samiotou
EC project officer: Saila Rinne

The partners in MMT are:
Translated S.r.l. (TRANSLATED), Italy
Fondazione Bruno Kessler (FBK), Italy
The University of Edinburgh (UEDIN), United Kingdom
TAUS B.V. (TAUS), The Netherlands

For copies of reports, updates on project activities and other MateCat-related information, contact:
TRANSLATED
MMT
Alessandro Cattelan
Via Nepal, 29
I-00144 Rome, Italy
alessandro@translated.net
Phone: (+39) 06 90 254 244
Fax: (+39) 06 233 200 102

© 2015, Eduard Barbu, Nicola Bertoldi, Achim Ruopp, Anna Samiotou
No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Table of Contents
1 Executive Summary
2 Introduction
3 Data Management
3.1 MMT-TM (Modern Machine Translation - Translation Memory)
3.2 MMT-PWI (Modern Machine Translation - Parallel Web Index)
3.3 MMT - Human Evaluation Data - A/B testing
3.4 MMT - Human Evaluation Data - Productivity

1 Executive Summary

This document reports on the current state of the data collected and generated by the MMT project and on how such data will be used and made available. The data management plan will be updated during the project and we expect to release updates in months 18 and 30.

As stated in the project proposal, the MMT consortium decided to opt out of the Open Research Data Pilot in Horizon 2020 since part of the data collected during the project may include confidential data owned by clients of the commercial partners (Translated, TAUS) and there may also be intellectual property rights (IPR) involved in part of the data collected from the web or third parties. We did not and do not expect to be able to obtain permission to publish such confidential or IPR protected information in all cases.
The consortium will address these concerns in a later update of this deliverable, which will also report on the published data and its availability beyond the end of the project. The consortium will work towards releasing part of the data for research purposes wherever possible.

2 Introduction

Data collection is a key part of the MMT project. The data collected comprises very large amounts of monolingual and bilingual data collected from the web for the language pairs covered by MMT (Task 2.2), and evaluation data collected from users in real-life tests of the technology that we are developing. As for the former, the full index of discovered parallel web pages will be made available. Subject to a positive outcome of a pending legal review, we will also publicly release the actual web data to the degree that current copyright law allows.

The consortium (Task 2.3) also plans to identify, contact and motivate third parties such as institutions or individuals to contribute data to the shared platform, in exchange for early access to the technology. Data collected from third parties will be made available to the degree that the copyright holders grant us permission to do so. We will work actively to obtain this permission, e.g. by offering more favorable bartering terms to parties that agree to make their data available publicly.

The consortium will not be able to release data that belonged to a member of the consortium prior to the beginning of the project. Such data, typically translation memories from TAUS Data or Translated MyMemory, will be kept private and will not be shared outside of the consortium due to confidentiality issues.

The index of discovered parallel web pages will be made available in XML format. The translation memories will be encoded with the Translation Memory Exchange format (TMX). The evaluation data will be presented as text files.
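To illustrate the TMX encoding mentioned above, the following is a minimal sketch of how sentence pairs could be written to and read back from a TMX 1.4 document using only the Python standard library. The function names, the tool name "mmt-sketch" and the sample sentences are hypothetical and are not part of the MMT project's tooling; the header attribute values are ordinary placeholder choices.

```python
import xml.etree.ElementTree as ET

# ElementTree represents the xml:lang attribute with its full namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Serialize (source, target) sentence pairs as a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX 1.4; the values are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "mmt-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


def read_tmx(tmx_text):
    """Return one {language: segment} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    return [
        {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        for tu in root.iter("tu")
    ]


if __name__ == "__main__":
    sample = [("Hello world.", "Ciao mondo."), ("Thank you.", "Grazie.")]
    document = build_tmx(sample, "en", "it")
    for unit in read_tmx(document):
        print(unit)
```

Keeping one file per language pair, as planned for the translation memories, means that each file needs only a single source language in its header and two language variants per translation unit.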
The translation memories from parallel web data (TMX files), the parallel web index files (XML) and the evaluation text files will be stored in a web-accessible repository and made available for download. The data will be stored in databases or text files. During the life cycle of the project, the data will be evaluated according to sets of data quality measures developed during the project, then cleaned and improved (Task 5.2).

In the upcoming updates of this deliverable (months 18 and 30), the consortium will be able to report on the volume, the availability (constrained mainly by privacy and IPR issues) and the accessibility of the datasets beyond the end of the project.

3 Data Management

This section describes the data that are being collected in the MMT project and how they are being managed.

3.1 MMT-TM (Modern Machine Translation - Translation Memory)

Data set reference and name
MMT-TM (Modern Machine Translation - Translation Memory)

Data set description
MMT-TM is a translation memory obtained by finding parallel documents on the web and aligning them at the sentence level.

Standards and metadata
MMT-TM will be available as a set of TMX files, one for each language pair covered by MMT.

Data sharing
MMT-TM will be uploaded to a repository and made available for download. We plan to make the web parallel corpora public, pending legal review.

Archiving and preservation
For the duration of the project, MMT-TM will be stored in a database. At the end of the project it will be uploaded to a web-accessible repository.

3.2 MMT-PWI (Modern Machine Translation - Parallel Web Index)

Data set reference and name
MMT-PWI (Modern Machine Translation - Parallel Web Index)

Data set description
MMT-PWI is an index of parallel web page URLs obtained by aligning documents from the web.

Standards and metadata
MMT-PWI will be available as a set of XML files, one for each language pair covered by MMT.

Data sharing
MMT-PWI will be uploaded to a repository and made available for public download.

Archiving and preservation
For the duration of the project, MMT-PWI will be stored in a database. At the end of the project it will be uploaded to a web-accessible repository.

3.3 MMT - Human Evaluation Data - A/B testing

Data set reference and name
MMT-HED-A/B (Modern Machine Translation - Human Evaluation Data - A/B testing)

Data set description
This data set will contain direct comparisons of MMT system output against the output of an established system ("A/B" preference testing). Evaluators will be presented with the output of different translation systems (in random order, so that they do not know which system produced which translation) and asked which one they prefer. This evaluation will use the Dynamic Quality Framework (DQF) tools available from TAUS.

Standards and metadata
CSV data files that contain the human evaluation of the translations.

Data sharing
MMT-HED-A/B will be uploaded to a repository and made available for public download.

Archiving and preservation
For the duration of the project, MMT-HED-A/B will be stored in text files. At the end of the project it will be uploaded to a web-accessible repository.

3.4 MMT - Human Evaluation Data - Productivity

Data set reference and name
MMT-HED-P (Modern Machine Translation - Human Evaluation Data - Productivity)

Data set description
This data set will contain the results of the productivity tests, in which we measure the effect on productivity of using the MMT system in a real post-editing project. In these tests, professional translators will use MateCat to post-edit raw machine translation output. We will make the statistical results public, but not the edited translations.

Standards and metadata
CSV data files that contain the anonymous statistical results.

Data sharing
MMT-HED-P will be uploaded to a repository and made available for public download.

Archiving and preservation
For the duration of the project, MMT-HED-P will be stored in text files. At the end of the project it will be uploaded to a web-accessible repository.
