COMPUTATIONAL METHODS IN ENGINEERING AND SCIENCE
EPMESC X, Aug. 21-23, 2006, Sanya, Hainan, China
©2006 Tsinghua University Press & Springer

Application of Translation Corresponding Tree (TCT) Annotation Schema for Chinese to Portuguese Machine Translation

C. W. Tang 1*, F. Wong 2, K. S. Leong 2, M. C. Dong 2, Y. P. Li 2
1 INESC Macau, 1/F, Block-III, University of Macau, Macau SAR, China
2 Faculty of Science and Technology, University of Macau, Macau SAR, China
Email: {kevin, derek, sejohnny, dmc},

Abstract
To resolve the weaknesses of the Structure Based EBMT system and the linguistic problems between Chinese and Portuguese, Tang and Wong proposed a Portuguese to Chinese machine translation method based on the Translation Corresponding Tree (TCT), an example based knowledge annotation method for Portuguese to Chinese translation. This paper adopts the TCT annotation schema and describes knowledge construction and translation for Chinese to Portuguese MT. It also proposes a conversion algorithm that reuses the existing translation knowledge of the Portuguese to Chinese MT system, which is represented as TCT trees: the knowledge trees for Portuguese to Chinese translation are converted into trees that can be used for Chinese to Portuguese translation. Based on these results, a prototype Chinese to Portuguese machine translation system was implemented, and the empirical results show that it achieves a translation accuracy of 80% in the domain of Macau law statements.

Key words: Structure Based EBMT, Translation Corresponding Tree, Chinese and Portuguese Translation

BACKGROUND
In Example Based Machine Translation (EBMT), there are three types of translation mechanism: the Surface Based approach, the Pattern Based approach, and the Structure Based approach.
The Structure Based approach, as in [1, 2], introduces syntactic structure and linguistic information into knowledge representation and analysis. Although the Structure Based approach is generally more accurate, it requires two syntactic parsers, one for the source language and one for the target language. Unfortunately, a robust syntactic parser is not always available. Moreover, for translation pairs that belong to different language families, such as Chinese and Portuguese, it is hard to establish the bilingual relationship between the source and target languages. To resolve this weakness of the Structure Based approach and the language barrier between Chinese and Portuguese, Tang and Wong [3] proposed a Portuguese to Chinese machine translation method based on the Translation Corresponding Tree (TCT), an example based knowledge annotation method for Portuguese to Chinese translation.

TRANSLATION CORRESPONDING TREE SCHEMA
The Translation Corresponding Tree (TCT) structure is an extension of the Structure String Tree Correspondence (SSTC) representation [2, 4]. It needs only a source language parser to build the syntactic tree. Within the tree structure, each node is explicitly associated with the corresponding string of the translation in the target language, which describes the correspondence between the source and target languages. The TCT representation uses a triple of sequence intervals [SNODE(n)/STREE(n)/STC(n)] to encode each node n of the tree, representing the correspondence between the structure of the source sentence and the substrings of both the source and target sentences. The correspondence in the TCT schema is composed of three parts:
(1) SNODE (String Node): A relation between the node and a substring of the source sentence. It indicates the head word in the source substring that corresponds to the node.
(2) STREE (String Tree): A relation between the subtree and a substring of the source sentence. It denotes the interval of the substring that is dominated by the subtree.
(3) STC (String Target Corresponding): A relation between the subtree of the source sentence and a substring of the target sentence. It indicates the interval of the target substring that corresponds to the subtree of the source sentence.
For further explanation of the TCT annotation, please refer to [3]. The following sections present the problems of TCT in Chinese to Portuguese annotation and translation, together with the conversion of translation knowledge from the Portuguese to Chinese direction to the Chinese to Portuguese direction.

TCT IN CHINESE TO PORTUGUESE TRANSLATION
Besides the problems common to MT systems that translate foreign languages into Chinese, there are some problems specific to Chinese to Portuguese annotation and translation.

The first issue to consider when translating from Chinese into a foreign language is Chinese word segmentation. A Chinese sentence is written continuously, without any delimiters between words, so the word boundaries must be determined before the sentence can be understood, whether by a human or by a machine. Although current segmentation technologies are mature and most of them achieve more than 96% segmentation accuracy, analysis errors cannot be avoided entirely, and even a small segmentation error may cause serious problems in the processes that depend on it.

The second issue is syntactic analysis. A Chinese sentence may be very long, and some authors are accustomed to writing a passage with a single full stop at the end of the paragraph, using commas to separate the sentences inside it. This writing style is not necessarily wrong, but such text is hard for a machine to analyze. If the input sentence is too long, the search space becomes very large, and the analysis consumes many resources and may still produce an unacceptable result. This makes Chinese analysis complex.
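To see why a small segmentation error can matter so much, consider forward maximum matching, a simple dictionary-based baseline. This is only a sketch for illustration: it is not the segmenter used in this work, and the function name, the four-word lexicon and the input string are hypothetical toy choices.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest string found in the lexicon, else a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real segmenter would use a far larger one.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
segmented = forward_max_match("研究生命起源", toy_lexicon)
print(" ".join(segmented))
```

The greedy choice picks 研究生 ("graduate student") first and returns 研究生 / 命 / 起源, whereas the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). A tagger or parser that receives the first segmentation starts from the wrong word boundaries, which is the kind of error propagation described above.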
To overcome these additional difficulties of Chinese to Portuguese translation, we adopt Leong's CSAT system [5] for segmentation and part-of-speech tagging. The tagged result is then parsed with the CSG [6] parser. Although CSG is a rule based parser, its grammar rules can be acquired automatically from the collection of constructed TCTs.

In constructing the translation knowledge for Chinese to Portuguese, the main difference is that word segmentation must be carried out before tagging and the other analysis processes start. To illustrate the construction process for a Chinese to Portuguese translation example, let us consider the example [...]. In the first phase of the TCT construction, the source sentence is analyzed with the help of the Chinese segmenter and becomes [...], where words are delimited with spaces. The segmented result is then analyzed with a POS tagger, in the same way as the Portuguese sentence is tagged, and the POS annotated sentence is [...]. Based on the POS tags, the structure of the Chinese sentence is analyzed with the syntax parser, and the syntax tree obtained for the sentence is shown in Fig. 1.

Figure 1: Source syntax tree for Chinese

For the Portuguese sentence, the morphological and POS processing follows basically the same analysis steps as described for the Tang & Wong system [3]. In this example, the analyzed Portuguese sentence is "Domicílio/N legal/ADJ de/PRP os/DET trabalhadores/N de/PRP a/DET Administração Pública/PROP de/PRP o/DET território/N de/PRP Macau/PROP".

In the second phase of the TCT construction, the syntax structure of the Chinese sentence and the morphologically annotated Portuguese sentence are combined to construct a preliminary TCT structure for the translation example. In the final phase, the constructed TCT structure is verified by a human and edited if any amendment is needed.
The validated representation structure is then saved to the system's example base as translation knowledge, as shown in Fig. 2.

Figure 2: Constructed TCT tree for Chinese to Portuguese Translation Example

CONVERSION ALGORITHM
Besides constructing a Chinese to Portuguese TCT tree from the original bilingual sentence pair, we may reuse the existing Portuguese to Chinese TCT trees by converting them into Chinese to Portuguese TCT trees, so that the existing translation knowledge is reused and less knowledge has to be constructed from scratch. Since Chinese and Portuguese are two different languages, their syntax trees naturally deviate from each other, so some trees cannot be converted directly into Chinese to Portuguese trees. In this conversion algorithm, only trees in which every word of the target language is referenced by the STC of a leaf node of the original TCT tree can be converted, because the TCT tree must represent the complete source sentence after the conversion. The following algorithm is proposed to convert an existing tree; the result of the conversion is shown in Fig. 3.

Figure 3: (a) Before conversion [P to C]. (b) After conversion [C to P]

1. Swap STREE and STC in all nodes
In the original TCT tree, STREE represents the corresponding substring of the source sentence, while STC represents that of the target sentence. Since, after the conversion, STREE must represent Chinese and STC must represent Portuguese, the first step is to swap the intervals STREE and STC in all nodes of the original TCT tree.

2. Drop unnecessary nodes
After the swap, some leaf nodes of the structure become "empty" nodes: their STREE attribute contains "∅" and so represents nothing. These unnecessary nodes are dropped to keep the tree tidy. After the empty leaf nodes are dropped, some of the upper nodes contain only a single branch, and such nodes are no longer necessary either. They are dropped as well to reduce the size of the tree.

3. Swap representing position
To present the tree in the proper format, the positions of the source and target sentences are interchanged. In the TCT annotation, the source sentence is presented at the bottom of the tree and the target sentence is placed at the right side of the structure, so the positions of the Chinese and Portuguese sentences are changed accordingly. The substrings of the Chinese sentence are allocated to the positions given by the STREE intervals of the leaf nodes.

4. Recalculate SNODE for all nodes
The next step is to update the values of SNODE after the substrings have been reallocated to the corresponding nodes of the structure. The calculation follows the algorithm used in the construction process. For each leaf node, the value of SNODE is the same as that of STREE. For the internal nodes, we use the same method as in the second phase of the construction process.

5. Reorder all branches
To deal with crossing dependencies, which are drawn as horizontal lines in the tree, the sub-branches of every node that carries a constraint must be reordered to fit the sequence of Chinese constituents in the sentence. To retain the crossing relationships between the Chinese and Portuguese sentences, the grammatical constraints encoded in the related nodes remain unchanged.

6. Reset the grammatical part
Although the structural transformation of the syntax tree is now complete, the grammatical category labels of the original tree were assigned based on the Portuguese sentence.
Hence these POS tags cannot be applied directly to the Chinese sentence and need to be revised. The original grammatical information is therefore replaced with the correct POS tags for the Chinese to Portuguese TCT tree, according to the tagged categories of the Chinese sentence.

7. Human validation
After the TCT structure has been converted automatically by the system, the TCT produced for the reverse direction is verified manually. A new Chinese to Portuguese TCT has then been successfully converted from the existing Portuguese to Chinese TCT tree. In the example illustrated here, the newly converted TCT tree has no errors and can be kept in the knowledge base as learnt knowledge for Chinese to Portuguese translation.

In our testing environment, we found that about 52% of the existing Portuguese to Chinese TCT annotated trees can be converted with this algorithm. Although this conversion rate is not very high, it is significant for reusing existing knowledge when migrating to the opposite translation direction.

EVALUATION AND CONCLUSION
In this section, we give the evaluation results of the translation system. We adopt the Van Slype method [7] to evaluate translation quality. The quality of each translation result is assigned to one of three categories: Good (G), Acceptable (A), and Failure (F). The grade "Good" means that the target translation can be understood by a human and does not require any post-editing. The grade "Acceptable" means that the translation result needs further editing and the correction ratio is lower than 20%. The grade "Failure" covers the remaining cases, in which the translation result cannot be understood.
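As a concrete illustration of the conversion algorithm described in the previous section, the following minimal sketch covers only steps 1 and 2 (swapping STREE and STC, then dropping unnecessary nodes). It is not the authors' implementation. The `TCTNode` class, the function names, and the toy tree with its labels and intervals are all hypothetical, and `None` stands in for the empty interval "∅". Promoting the child of a single-branch node unchanged is also an assumption, since the text does not say how the intervals of the surviving node are reconciled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# None plays the role of the empty interval "∅" (assumption of this sketch).
Interval = Optional[Tuple[int, int]]


@dataclass
class TCTNode:
    label: str        # grammatical category
    snode: Interval   # head-word interval in the source sentence
    stree: Interval   # source substring dominated by the subtree
    stc: Interval     # corresponding target substring
    children: List["TCTNode"] = field(default_factory=list)


def swap_intervals(node: TCTNode) -> None:
    """Step 1: exchange STREE and STC in every node."""
    node.stree, node.stc = node.stc, node.stree
    for child in node.children:
        swap_intervals(child)


def prune(node: TCTNode) -> Optional[TCTNode]:
    """Step 2: drop nodes whose STREE is empty, then replace any
    single-branch node by its only child (assumed promotion rule)."""
    was_leaf = not node.children
    kept = [c for c in (prune(child) for child in node.children) if c is not None]
    node.children = kept
    if node.stree is None and not kept:
        return None                # empty node: represents nothing
    if not was_leaf and len(kept) == 1:
        return kept[0]             # single branch: promote the child
    return node


# Hypothetical P-to-C tree: three source words, the first of which (a
# determiner) has no counterpart in the two-word target sentence.
tree = TCTNode("S", (2, 3), (0, 3), (0, 2), [
    TCTNode("NP", (1, 2), (0, 2), (0, 1), [
        TCTNode("DET", (0, 1), (0, 1), None),
        TCTNode("N", (1, 2), (1, 2), (0, 1)),
    ]),
    TCTNode("V", (2, 3), (2, 3), (1, 2)),
])

swap_intervals(tree)
converted = prune(tree)
print([child.label for child in converted.children])  # ['N', 'V']
```

After the swap, the DET leaf has an empty STREE and is dropped, which leaves NP with a single branch, so N takes its place. The SNODE values are deliberately left stale here; recomputing them corresponds to step 4, and steps 3 and 5 to 7 are outside the scope of this sketch.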
To evaluate the Chinese to Portuguese translation, we randomly selected Macau law statements from the "Código Civil" as the translation knowledge. During testing, we used 50 sentences from the "Código Civil" and asked the system to translate the Chinese sentences into Portuguese. The results are listed in Table 1.

Table 1 Evaluation result of Chinese to Portuguese translation

Translation Quality     Amount (ratio)
Good                    29 (58%)
Acceptable              11 (22%)
Good + Acceptable       40 (80%)
Failure                 10 (20%)
Total                   50 (100%)

From the evaluation results, we found that the translation accuracy, counting Good and Acceptable translations together, reaches around 80%. These results support three points about the TCT annotation schema: (1) the TCT representation structure can effectively denote the translation correspondence of bilingual text; (2) the translation model based on this representation tree can produce translations of promising quality; and (3) the TCT annotation schema can be applied to other pairs of languages for constructing translation knowledge.

Acknowledgements
The research work reported in this paper was supported by "Fundo para o Desenvolvimento das Ciências e da Tecnologia" under grant 041/2005/A.

REFERENCES
1. Meyers A, Yangarber R, Grishman R et al. Deriving transfer rules from dominance-preserving alignments. In Proceedings of the 17th International Conference on Computational Linguistics (COLING-98), vol. 2, Montreal, Quebec, Canada, ACL Press, 1998, pp. 843-847.
2. Al-Adhaileh MH, Tang EK. Example-based machine translation based on the synchronous SSTC annotation schema. In Proceedings of Machine Translation Summit VII, Singapore, 1999, pp. 244-249.
3. Tang CW, Wong F, Li YP. TCT schema in EBMT and its application. In Proceedings of the Symposium on Applied Science and Technology in Macau, 2004, pp. 19-27.
4. Boitet C, Zaharin Y. Representation trees and string-tree correspondences. In Proceedings of COLING-88, Budapest, 1988, pp. 59-64.
5. Leong KS, Wong F, Tang CW et al. CSAT: a Chinese segmentation and tagging module based on the interpolated probabilistic model. In Proceedings of EPMESC X, Sanya, Hainan, China, 2006.
6. Wong F, Hu DC, Mao YH et al. Machine translation by parsing constraint-based synchronous grammar. Tsinghua Science and Technology, 2005.
7. Van Slype G. Critical study of methods for evaluating the quality of machine translation. Commission of the European Communities, Directorate General Scientific and Technical Information and Information Management, 1979.