COMPUTATIONAL METHODS IN ENGINEERING AND SCIENCE
EPMESC X, Aug. 21-23, 2006, Sanya, Hainan, China
©2006 Tsinghua University Press & Springer
Application of Translation Corresponding Tree (TCT) Annotation
Schema for Chinese to Portuguese Machine Translation
C. W. Tang 1*, F. Wong 2, K. S. Leong 2, M. C. Dong 2, Y. P. Li 2
1 INESC Macau. 1/F, Block-III, University of Macau, Macau SAR, China
2 Faculty of Science and Technology, University of Macau, Macau SAR, China
Email: {kevin, derek, sejohnny, dmc}@inesc-macau.org.mo, ypli@umac.mo
Abstract To resolve the weaknesses of the Structure Based EBMT system and the linguistic problems between
Chinese and Portuguese, Tang and Wong proposed a Portuguese to Chinese machine translation method based on
the Translation Corresponding Tree (TCT), an example based knowledge annotation method for Portuguese to
Chinese translation. This paper adopts the TCT annotation scheme and introduces the knowledge base
construction and translation for Chinese to Portuguese MT. It also proposes a conversion algorithm to reuse
the existing translation knowledge of the Portuguese to Chinese MT system, which is represented in terms of
TCT trees. With this algorithm, the knowledge trees of Portuguese to Chinese translation are converted into
translation knowledge that can be used for Chinese to Portuguese translation. Based on these results, a
prototype Chinese to Portuguese machine translation system is implemented, and the empirical results show
that the system achieves a translation accuracy of about 80% in the domain of Macau law statements.
Key words: Structure Based EBMT, Translation Corresponding Tree, Chinese and Portuguese Translation
BACKGROUND
In Example Based Machine Translation (EBMT) systems, there are three types of translation mechanisms: the
Surface Based approach, the Pattern Based approach, and the Structure Based approach. The Structure Based
approach, as in [1, 2], introduces syntactic structure and linguistic information into knowledge
representation and analysis. Although the Structure Based approach is generally more accurate, it requires
two syntactic parsers, one for the source language and one for the target language. Unfortunately, a robust
syntactic parser is not always available. Moreover, for translation pairs that belong to different language
families, such as Chinese and Portuguese, it is hard to establish the bilingual relationship between the
source and target languages. To resolve these weaknesses of the Structure Based approach and the language
barrier between Chinese and Portuguese, Tang and Wong [3] proposed a Portuguese to Chinese machine
translation method based on the Translation Corresponding Tree (TCT), an example based knowledge annotation
method for Portuguese to Chinese translation.
TRANSLATION CORRESPONDING TREE SCHEMA
The Translation Corresponding Tree (TCT) structure is an extension of the Structure String Tree
Correspondence (SSTC) representation [2, 4]. It needs only a source language parser to build the syntactic
tree. Inside the tree structure, each node is explicitly associated with a string from the translation in
the target language, in order to describe the correspondence between the source and target languages.
The TCT representation uses a triple of sequence intervals [SNODE(n)/STREE(n)/STC(n)] to encode each node n
in the tree, representing the correspondence between the structure of the source sentence and the substrings
of both the source and target sentences. The correspondence in the TCT schema is composed of three parts:
(1) SNODE (String Node): A relation between the node and the substring of the source sentence. It indicates
the head word in the source substring that corresponds to the node.
(2) STREE (String Tree): A relation between the subtree and the substring of the source sentence. It denotes
the interval of the substring that is dominated by the subtree.
(3) STC (String Target Corresponding): A relation between the subtree of the source sentence and the
substring of the target sentence. It indicates the interval containing the substring in the target sentence
that corresponds to the subtree of the source sentence.
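The three intervals can be pictured as fields of a tree node. The following is a minimal sketch, not the authors' implementation: the class name `TCTNode`, the helper functions, the half-open word-position convention for intervals, and the placeholder target words `t0` and `t1` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumption: an interval is a half-open pair of word positions (start, end);
# None stands for the empty interval.
Interval = Tuple[int, int]


@dataclass
class TCTNode:
    label: str                  # grammatical category of the node
    snode: Optional[Interval]   # head word of the node in the source sentence
    stree: Optional[Interval]   # source substring dominated by the subtree
    stc: Optional[Interval]     # target substring corresponding to the subtree
    children: List["TCTNode"] = field(default_factory=list)


def source_span(source_words: List[str], node: TCTNode) -> str:
    """Return the source substring addressed by STREE."""
    if node.stree is None:
        return ""
    start, end = node.stree
    return " ".join(source_words[start:end])


def target_span(target_words: List[str], node: TCTNode) -> str:
    """Return the target substring addressed by STC."""
    if node.stc is None:
        return ""
    start, end = node.stc
    return " ".join(target_words[start:end])


# Hypothetical two-word example: the target order is the reverse of the source order.
source = ["Domicílio", "legal"]
target = ["t0", "t1"]  # placeholder target words, not real translations
noun = TCTNode("N", snode=(0, 1), stree=(0, 1), stc=(1, 2))
adj = TCTNode("ADJ", snode=(1, 2), stree=(1, 2), stc=(0, 1))
phrase = TCTNode("NP", snode=(0, 1), stree=(0, 2), stc=(0, 2), children=[noun, adj])

print(source_span(source, phrase))  # source substring of the whole phrase
print(target_span(target, adj))     # target substring aligned with the adjective
```

Only the source side carries tree structure here; the target side is addressed purely through STC intervals, which is why a single source parser suffices.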
For further explanation of the TCT annotation, please refer to [3]. The following sections present the
problems of TCT in Chinese to Portuguese annotation and translation, together with the conversion of
translation knowledge from the Portuguese to Chinese direction into the Chinese to Portuguese direction.
TCT IN CHINESE TO PORTUGUESE TRANSLATION
Besides the common problems of foreign language to Chinese MT systems, there are some problems specific to
Chinese to Portuguese annotation and translation. The first issue to consider in Chinese to foreign language
machine translation is Chinese word segmentation. A Chinese sentence is written continuously, without any
delimiters between words; therefore, in Chinese language processing, the word boundaries must be determined
before the sentence can be understood, whether by a human or by a machine. Although current segmentation
technologies are mature and most of them achieve more than 96% segmentation accuracy, analysis errors cannot
be avoided entirely. Even a small segmentation error may cause serious problems in the processes that depend
on it. The second issue is syntax analysis. A Chinese sentence may be very long in some cases: some authors
are accustomed to writing a passage with a single full stop at the end of the paragraph, using commas to
separate the sentences inside it. This writing style is not wrong, but such text is hard to analyze by
machine. If the input sentence is too long, the computational space becomes very large, and the analysis
consumes many resources only to produce an unacceptable result. This makes Chinese analysis complex.
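To make the word boundary problem concrete, the sketch below segments an unspaced string with forward maximum matching against a small lexicon. This is a deliberately simple stand-in and not the probabilistic model used by CSAT [5]; the function name, the lexicon, and the sample string are all hypothetical.

```python
from typing import List, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest lexicon word.

    A single character is accepted as a fallback so the scan always advances.
    """
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon and input, chosen only for illustration.
lexicon = {"澳门", "地区", "法定", "住所"}
segmented = forward_max_match("澳门地区法定住所", lexicon)
print(" ".join(segmented))
```

A greedy matcher like this commits to a boundary as soon as it finds a lexicon word, so one wrong choice shifts every later boundary, which illustrates why small segmentation errors propagate to tagging and parsing.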
To overcome these additional difficulties of Chinese to Portuguese translation, we adopt Leong’s CSAT system
[5] for segmentation and part-of-speech tagging. Based on the tagging result, we use the CSG [6] parser as
our system parser. Although CSG is a rule based parser, its grammar rules can be automatically acquired from
the collection of constructed TCTs.
To construct the translation knowledge for Chinese to Portuguese, the main difference is that word
segmentation is carried out before the tagging and other analysis processes start. To illustrate the
construction process for a Chinese to Portuguese translation example, let’s consider the example […]. In
the first phase of the TCT construction, the source sentence is analyzed with the help of the Chinese
segmenter; the source sentence then becomes […], where words are delimited with spaces. After that, the
segmented result is analyzed with a POS tagger, similar to the tagging of the Portuguese sentence, and the
POS annotated sentence is […]. Based on the POS tags, the structure of the Chinese sentence is then
analyzed with the syntax parser, and a syntax tree for the sentence is obtained as shown in Fig. 1.
Figure 1: Source syntax tree for Chinese
For the Portuguese sentence, basically the same analysis steps as described for the Tang & Wong system [3]
are applied for morphological and POS processing. In this example, the analyzed Portuguese sentence is
“Domicílio/N legal/ADJ de/PRP os/DET trabalhadores/N de/PRP a/DET Administração Pública/PROP de/PRP
o/DET território/N de/PRP Macau/PROP”. In the second phase of the TCT construction, a preliminary TCT
structure of the translation example is constructed from the syntax structure of the Chinese sentence and
the morphologically annotated Portuguese sentence. In the final phase, post-verification and post-editing
are carried out by a human in case any amendment to the constructed TCT structure is needed. The validated
representation structure is then saved to the system’s example base as translation knowledge, as shown in
Fig. 2.
Figure 2: Constructed TCT tree for Chinese to Portuguese Translation Example
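The word/TAG notation of the analyzed Portuguese sentence quoted above can be read into (word, tag) pairs before the TCT is assembled. The following is a minimal sketch under the assumption that a chunk without a slash belongs to the next tagged chunk, as in "Administração Pública/PROP"; the function name `parse_tagged` is hypothetical and not part of the described system.

```python
from typing import List, Tuple


def parse_tagged(sentence: str) -> List[Tuple[str, str]]:
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs.

    Chunks without a slash are buffered and joined to the next tagged chunk,
    so multi-word expressions keep their internal space.
    """
    pairs: List[Tuple[str, str]] = []
    pending: List[str] = []
    for chunk in sentence.split():
        if "/" in chunk:
            word, tag = chunk.rsplit("/", 1)
            pending.append(word)
            pairs.append((" ".join(pending), tag))
            pending = []
        else:
            pending.append(chunk)
    if pending:
        raise ValueError("trailing words without a tag: " + " ".join(pending))
    return pairs


tagged = ("Domicílio/N legal/ADJ de/PRP os/DET trabalhadores/N de/PRP a/DET "
          "Administração Pública/PROP de/PRP o/DET território/N de/PRP Macau/PROP")
pairs = parse_tagged(tagged)
for word, tag in pairs:
    print(tag, word)
```

The positions of the resulting pairs are what interval annotations such as STC would refer to on the Portuguese side.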
CONVERSION ALGORITHM
To build a Chinese to Portuguese TCT tree, besides constructing it from the original bilingual sentence
pair, we may reuse the existing Portuguese to Chinese TCT trees by converting them into Chinese to
Portuguese TCT trees. In this way, the existing translation knowledge can be easily reused, reducing the
work of constructing knowledge from scratch.
Since Chinese and Portuguese are two different languages, their syntax trees naturally deviate from each
other. Therefore, some trees cannot be directly converted into Chinese to Portuguese trees. In this
conversion algorithm, only trees in which every word of the target language is referenced by the STC of a
leaf node of the original TCT tree may be used for conversion, because the TCT tree must represent the
complete source sentence after the conversion. The following algorithm is proposed to convert an existing
tree; the result of the conversion is shown in Fig. 3.
Figure 3: (a) Before conversion [P to C]. (b) After conversion [C to P]
1. Swap STREE and STC in all nodes. In the original TCT tree, STREE represents the corresponding substring
of the source sentence, while STC is for the target sentence. Since, after the conversion, STREE represents
Chinese and STC represents Portuguese, the first step is to swap the intervals STREE and STC for all nodes
in the original TCT tree.
2. Drop unnecessary nodes. After the swapping process, some leaf nodes of the structure become “empty”
nodes, since such a node contains “∅” in the attribute STREE and thus represents nothing. These unnecessary
nodes are dropped to keep the tree tidy. After dropping the empty leaf nodes, some of the upper nodes only
contain a single branch and are no longer necessary either. These nodes are dropped as well to reduce the
tree size.
3. Swap representing position. To present the tree in the proper format, the positions of the source and
target sentences are interchanged. In the TCT annotation, the source sentence is presented at the bottom of
the tree and the target sentence is placed at the right side of the structure. The positions of the Chinese
and Portuguese sentences are changed accordingly. The substrings of the Chinese sentence are allocated
according to the positions stated by the STREE intervals of the leaf nodes.
4. Recalculate SNODE for all nodes. The next step is to update the values of SNODE after reallocating the
substrings to the corresponding nodes of the structure. The calculation follows the algorithm of the
construction process. For every leaf node, the value of SNODE is the same as that of STREE. For the
internal nodes, we use the same method as in the second phase of the construction process.
5. Reorder all branches. To deal with the crossing dependencies, which are represented graphically as
horizontal lines, every node that contains a constraint must reorder its sub-branches to fit the sequence
of the Chinese constituents in the sentence. To retain the crossing relationships between the Chinese and
Portuguese sentences, the grammatical constraints encoded within the related nodes remain unchanged.
6. Reset the grammatical part. Although the structural transformation of the syntax tree is done, the
grammatical category labels of the original tree were constructed from the Portuguese sentence. Hence these
POS labels cannot be applied directly to the Chinese sentence and need to be revised. The original
grammatical information is therefore replaced with the correct POS for the Chinese to Portuguese TCT tree,
according to the tagged categories of the Chinese sentence.
7. Human validation. After the TCT structure is converted automatically by the system, the produced TCT for
the reverse direction is verified manually. A new Chinese to Portuguese TCT is then successfully converted
from the existing Portuguese to Chinese TCT tree. As illustrated in this example, the newly converted TCT
tree has no errors and can be kept in the knowledge base as learnt knowledge for Chinese to Portuguese
translation.
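The purely structural part of the steps above (step 1, step 2, and the leaf case of step 4) can be sketched as follows. This is an illustrative reading of the text and not the authors' code: the `Node` class, the function names, the half-open interval convention, and the toy tree are assumptions. Steps 3 and 5 to 7, and SNODE for internal nodes, depend on information the sketch does not model and are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # assumed half-open word positions; None is the empty interval


@dataclass
class Node:
    label: str
    snode: Optional[Interval]
    stree: Optional[Interval]
    stc: Optional[Interval]
    children: List["Node"] = field(default_factory=list)


def swap_intervals(node: Node) -> None:
    """Step 1: exchange STREE and STC on every node, in place."""
    node.stree, node.stc = node.stc, node.stree
    for child in node.children:
        swap_intervals(child)


def prune(node: Node) -> Optional[Node]:
    """Step 2: drop leaves whose STREE is empty, then drop single-branch nodes."""
    if not node.children:
        return node if node.stree is not None else None
    kept = [k for k in (prune(c) for c in node.children) if k is not None]
    if not kept:
        return None      # nothing of the new source sentence is left below this node
    if len(kept) == 1:
        return kept[0]   # the upper node has one branch only, so the child replaces it
    node.children = kept
    return node


def reset_leaf_snode(node: Node) -> None:
    """Step 4, leaf case only: SNODE of a leaf equals its STREE."""
    if not node.children:
        node.snode = node.stree
    for child in node.children:
        reset_leaf_snode(child)


def convert(root: Node) -> Optional[Node]:
    swap_intervals(root)
    new_root = prune(root)
    if new_root is not None:
        reset_leaf_snode(new_root)
    return new_root


# Toy P-to-C tree: three Portuguese words, two Chinese words, DET has no Chinese counterpart.
det = Node("DET", (1, 2), (1, 2), None)
noun = Node("N", (2, 3), (2, 3), (0, 1))
inner = Node("B", (2, 3), (1, 3), (0, 1), [det, noun])
first = Node("A", (0, 1), (0, 1), (1, 2))
root = Node("ROOT", (0, 1), (0, 3), (0, 2), [first, inner])

converted = convert(root)
print([child.label for child in converted.children])
```

In the toy tree, the DET leaf disappears because its STREE is empty after the swap, and node B is then removed because it is left with a single branch.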
With this conversion algorithm, in our testing environment, we found that about 52% of the
existing Portuguese to Chinese TCT annotated trees can be converted. Although the successful
conversion rate is not very high, it is significant for reusing existing knowledge when
migrating the translation direction.
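Whether a tree qualifies for conversion follows from the condition stated earlier: every target word must be referenced by the STC of some leaf node. A minimal sketch of that check is given below, assuming half-open word-position intervals and `None` for an empty STC; the function name `is_convertible` is hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Interval = Tuple[int, int]  # assumed half-open word positions in the target sentence


def is_convertible(leaf_stcs: Iterable[Optional[Interval]], target_len: int) -> bool:
    """True if the leaf STC intervals together cover every target word position."""
    covered: Set[int] = set()
    for interval in leaf_stcs:
        if interval is None:   # leaf without a target counterpart
            continue
        start, end = interval
        covered.update(range(start, end))
    return covered >= set(range(target_len))


# Hypothetical leaves of a P-to-C tree whose Chinese side has two words.
print(is_convertible([(1, 2), None, (0, 1)], target_len=2))  # both words referenced
print(is_convertible([(0, 1), None], target_len=2))          # second word unreferenced
```

A tree that fails this check would leave part of the Chinese sentence without a leaf after the swap, so it could not represent the complete new source sentence.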
EVALUATION AND CONCLUSION
In this section, we give the evaluation results of the translation system. We adopt the Van
Slype method [7] to evaluate translation quality. The quality of the translation results is
divided into three categories: Good (G), Acceptable (A), and Failure (F). The grade “Good”
means that the target translation is understood by a human and does not require any
post-editing. The grade “Acceptable” means that the translation result needs further editing
and the correction ratio is lower than 20%. The grade “Failure” covers the remaining cases, in
which the translation result cannot be understood.
Table 1 Evaluation result of Chinese to Portuguese translation
Translation Quality     Amount (ratio)
Good                    29 (58%)
Acceptable              11 (22%)
Good + Acceptable       40 (80%)
Failure                 10 (20%)
Total                   50 (100%)
To evaluate the Chinese to Portuguese translation, we randomly selected Macau law statements
from the “Código Civil” as the translation knowledge. During testing, we took 50 sentences
from the “Código Civil” and asked the system to translate the Chinese sentences into
Portuguese. The translation accuracy is around 80%, as listed in Table 1.
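The 80% figure is the share of sentences graded Good or Acceptable. The short sketch below reproduces that arithmetic from the counts in Table 1; the variable names are illustrative only.

```python
# Counts per grade as reported in Table 1.
counts = {"Good": 29, "Acceptable": 11, "Failure": 10}

total = sum(counts.values())
ratios = {grade: 100 * n / total for grade, n in counts.items()}

# Accuracy as used in the text: translations graded Good or Acceptable.
accuracy = 100 * (counts["Good"] + counts["Acceptable"]) / total

for grade, ratio in ratios.items():
    print(f"{grade}: {counts[grade]} ({ratio:.0f}%)")
print(f"Accuracy: {accuracy:.0f}%")
```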
From the evaluation results, we found that the translation accuracy can reach around 80%. These
results support three points of our research work on the TCT annotation schema: (1) the TCT
representation structure can effectively denote the translation correspondence for bilingual
text; (2) the translation model based on this representation tree can produce translations of
promising quality; and (3) the TCT annotation schema can be applied to other language pairs for
constructing translation knowledge.
Acknowledgements
The research work reported in this paper was supported by "Fundo para o Desenvolvimento das Ciências e da
Tecnologia" under grant 041/2005/A.
REFERENCES
1. Meyers A, Yangarber R, Grishman R et al. Deriving transfer rules from dominance-preserving alignments.
In: Proceedings of the 17th International Conference on Computational Linguistics (COLING-98), Vol. 2,
Montreal, Quebec, Canada, ACL Press, 1998, pp. 843-847.
2. Al-Adhaileh MH, Tang EK. Example-based machine translation based on the synchronous SSTC
annotation schema. In: Proceedings of Machine Translation Summit VII, Singapore, 1999, pp. 244-249.
3. Tang CW, Wong F, Li YP. TCT schema in EBMT and its application. In: Proceedings of the Symposium on
Applied Science and Technology in Macau, 2004, pp. 19-27.
4. Boitet C, Zaharin Y. Representation trees and string-tree correspondences. In: Proceedings of COLING-88,
Budapest, 1988, pp. 59-64.
5. Leong KS, Wong F, Tang CW et al. CSAT: a Chinese segmentation and tagging module based on the
interpolated probabilistic model. In: Proceedings of EPMESC X, Sanya, Hainan, China, 2006.
6. Wong F, Hu DC, Mao YH et al. Machine translation by parsing constraint-based synchronous grammar.
Tsinghua Science and Technology, 2005.
7. Van Slype G. Critical study of methods for evaluating the quality of machine translation. Commission of
the European Communities, Directorate General Scientific and Technical Information and Information
Management, 1979.