Study on Supervised Learning of Vietnamese Word Sense Disambiguation Classifiers

It is said that Vietnamese is a language with highly ambiguous words. However, there has been no published Word Sense Disambiguation (WSD hereafter) research on this language. This research is the first attempt to study Vietnamese WSD. In particular, we explore effective features for training WSD classifiers and verify the applicability of the 'pseudoword' technique both to investigating the effectiveness of features and to training WSD classifiers. Three tasks were conducted using two corpora: one built manually from the Vietnamese Treebank, and one built automatically by applying the pseudoword technique. Experimental results showed that the Bag-Of-Words feature performs well for all three categories of words (verbs, nouns, and adjectives). However, combining it with POS, Collocation or Syntactic features does not significantly improve the performance of WSD classifiers. Moreover, the results confirmed that the pseudoword technique is suitable for exploring the effectiveness of features in the disambiguation of Vietnamese verbs and adjectives. Furthermore, we empirically evaluated the applicability of the pseudoword technique as an unsupervised learning method for real Vietnamese WSD.


Minh Hai Nguyen† and Kiyoaki Shirai††

Introduction
WSD plays an important role in natural language processing applications such as machine translation, information retrieval and speech processing. The problem has been studied for English, Japanese and many other languages for more than half a century, and many effective knowledge sources and disambiguation methods have been discovered. Vietnamese is said to be a language with many highly ambiguous words. For example, the word 'bien' in Vietnamese can have different meanings: the sea, a sign-board, a large group of people. Hence, WSD is also an important task in Vietnamese language processing. However, to the best of our knowledge, there is no research on Vietnamese WSD. Vietnamese is an isolating language with the following general characteristics:
• Words do not have morphological forms. Vietnamese has a number of tense markers to indicate the tense of a sentence. Therefore, grammatical relationships are expressed by word order and auxiliary words.
• Word boundaries are not clearly determined by blanks.
• There are many 'classifiers' which come before nouns, as in Chinese.
• Vietnamese has the same basic SVO word order as English.

† School of Information Science, Japan Advanced Institute of Science and Technology. nhminh@jaist.ac.jp
†† School of Information Science, Japan Advanced Institute of Science and Technology. kshirai@jaist.ac.jp
In this study, one of our goals is to make the first attempt to establish a WSD method for Vietnamese. Since approaches based on supervised machine learning have achieved great success in WSD, we focus on them. In particular, this paper discusses the following two issues:
• What are effective features in Vietnamese WSD? Various types of features for WSD were proposed in previous work. Our question here is, "What kinds of features are effective for disambiguation of word senses in Vietnamese?"
• Is the pseudoword technique applicable to Vietnamese WSD? For supervised learning of WSD classifiers, a sense-tagged corpus is required as training data. However, there is no Vietnamese sense-tagged corpus available to the public. The pseudoword technique is often used to evaluate supervised WSD methods when no training data is available. Two words $w_1$ and $w_2$ are regarded as an imaginary word (pseudoword) $p$, then machine learning methods are applied to train classifiers which predict whether the original word of $p$ in a text is $w_1$ or $w_2$. The performance of the trained classifiers can be evaluated without heavy human intervention. Our interest is whether the pseudoword technique is useful for Vietnamese WSD or not.
Considering the above issues, this paper has three goals. The first is to empirically explore effective features for Vietnamese WSD. Supervised WSD classifiers with several kinds of features are trained, then their performance is compared. The effectiveness of feature combinations is also considered. The second is to check the applicability of the pseudoword technique: we investigate whether the pseudoword technique can be used to find the most effective features. The last goal is to explore, as an alternative to unsupervised methods, a way of applying the pseudoword technique to train WSD classifiers when no sense-tagged corpus is available.
In the next section, we discuss work related to our research. Then, we describe the development of our system for Vietnamese WSD in Section 3. Section 4 introduces the three tasks conducted in this research. Section 5 shows results and some discussion. Finally, we summarize the research and indicate future work in Section 7.

Related work
The first experiment by Kaplan showed that just one or two words on either side of an ambiguous word can be evidence to disambiguate that word (Kaplan 1955). Later, more useful information from context was discovered by numerous works in WSD. Yarowsky introduced a simple set of features (the context around the ambiguous word) in an accent restoration task (Yarowsky 1994). This led to many other improved sets of features, such as syntactic dependencies (Martinez, Agirre, and Marquez 2002; Dang, Chia, Palmer, and Chiou 2002; Yarowsky and Florian 2002), or cross-language evidence (Gale, Church, and Yarowsky 1992a). Besides approaches utilizing the evidence provided by the surrounding context of the ambiguous word, many other studies take advantage of knowledge bases without using any corpus evidence, such as approaches using dictionaries, thesauri, and lexical knowledge bases (Lesk and Michael 1986; Agirre and Martinez 2001). These knowledge sources have been used in various ways to improve WSD systems in English. Numerous studies have also been devoted to WSD in languages other than English. However, Vietnamese WSD has not been studied so far. Vietnamese is a language with characteristics different from those of English. For example, words in Vietnamese are not separated by empty spaces, an adjective can be the subject of a sentence, etc. It is therefore necessary to investigate the effective features for Vietnamese WSD.
According to the knowledge sources used in sense disambiguation, methods in WSD are classified as knowledge-based, unsupervised corpus-based, supervised corpus-based, and combinations of these (Agirre and Edmonds 2006). Among these methods, supervised learning is a hot topic, since it has been one of the most successful approaches to WSD in the last fifteen years. However, the biggest problem of supervised learning methods is the knowledge acquisition bottleneck: they need sense-tagged training data, which is costly to build. For Vietnamese WSD, the problem is serious, since no sense-tagged corpus is available to the public. Dinh attempted to construct a sense-tagged corpus in Vietnamese by using an English semantically-tagged corpus and bilingual English-Vietnamese texts (Dinh 2002). However, he mainly annotated English texts, in order to disambiguate English words for an English-Vietnamese machine translation system. There was no evaluation of WSD based on his corpus, either.
Gale et al. introduced a technique called 'pseudowords' to overcome the obstacles of supervised methods (Gale, Church, and Yarowsky 1992b). However, the two words combined into a pseudoword in Gale's experiments are randomly chosen. Thus pseudowords may have different linguistic characteristics from real ambiguous words. Lu et al. presented 'equivalent' pseudowords (Lu, Wang, Yao, Liu, and Li 2006), which are built up based on real ambiguous words. However, they only performed evaluation on pseudowords, with no comparison between pseudowords and real ambiguous words. The task of classifying two different words may be easier than distinguishing two senses of the same word. Therefore, our research aims to empirically evaluate the validity of the 'pseudoword' method for Vietnamese WSD.

Our method
In this section, we describe our method to disambiguate word senses. SVM, the machine learning algorithm we use, is introduced in Subsection 3.1. The features used in the SVM classifiers are explained in Subsection 3.2.

Support Vector Machine as classifier for WSD
Support Vector Machine (SVM) (Corinna and Vladimir 1995) learns a linear discriminant hyperplane that separates two classes of data represented as high-dimensional vectors. In this research, the number of senses for an ambiguous word is limited to two, since it is rather difficult to prepare a large-scale corpus covering all senses of an ambiguous word.¹ The linear kernel is used for training WSD classifiers because, when the number of features is large and the data already lie in a high-dimensional space, we expect that mapping them to an even higher-dimensional space does not improve performance. Indeed, other kernels gave poorer results than the linear kernel in our preliminary experiment.

Feature set
For each target instance $w$, we encode its surrounding context as a feature vector. The feature set $F$ of $w$ is denoted as in (1), where $f_i$ represents a feature.

$$F = \{f_1, f_2, \ldots, f_n\} \qquad (1)$$

In our experiment, the feature vector is weighted according to the context of target instances in the training corpus (Eq. (2)), where $w_i$ is the weight of $f_i$. Methods for defining $f_i$ and $w_i$ will be described in detail for each type of feature.

$$(w_1, w_2, \ldots, w_n) \qquad (2)$$

Here $t_i^j$ denotes the frequency with which $f_i$ appears in the context of sense $s_j$ of $w$ in the training corpus.
In the test data, by contrast, $f_i$ is weighted as in Eq. (4), since the sense of $w$ is unknown.³

$$w_i = \begin{cases} (t_i^1 + t_i^2)/2 & \text{if } f_i \text{ appears in } l \\ 0 & \text{if } f_i \text{ does not appear in } l \end{cases} \qquad (4)$$
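The test-time weighting can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the per-sense frequencies of each feature are taken as already counted from the training corpus, a feature that does not appear in the test sentence is assumed to receive weight 0, and the function and variable names are hypothetical rather than taken from our implementation.

```python
from typing import Dict, List, Set, Tuple


def weights_for_test_instance(features: List[str],
                              sense_freq: Dict[str, Tuple[int, int]],
                              context: Set[str]) -> List[float]:
    """Weight each feature for a test sentence whose sense is unknown.

    sense_freq maps a feature to (t1, t2), its training frequencies in the
    contexts of sense 1 and sense 2. Because the sense of the test instance
    is unknown, the two frequencies are averaged. Features absent from the
    sentence get weight 0 (an assumption of this sketch).
    """
    weights = []
    for f in features:
        t1, t2 = sense_freq.get(f, (0, 0))
        weights.append((t1 + t2) / 2 if f in context else 0.0)
    return weights


# Toy data: three features with made-up training frequencies.
features = ["f1", "f2", "f3"]
sense_freq = {"f1": (4, 2), "f2": (0, 6), "f3": (1, 1)}
print(weights_for_test_instance(features, sense_freq, {"f1", "f3"}))
# [3.0, 0.0, 1.0]
```

In the toy call, f2 is frequent in training but absent from the test sentence, so it contributes nothing to the vector.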

POS
This feature encodes the part-of-speech of each word in a context window $c$ around the target instance $w$ as in Eq. (5), where $p_i$ is the position of the word and $P_i$ is its POS. $p_i$ is an integer in the range $[-c, c]$ indicating the distance between the target word and a word in the context.
If $p_i$ is positive, the context word appears after the target word. Similarly, $p_i$ is negative for words before the target word. If $p_i$ exceeds the sentence boundary, $P_i$ is denoted by the null symbol $\epsilon$. For the POS feature, $F$ is the set of all possible pairs of the position of a word in the context and its POS found in the training corpus. For each sentence in the corpus, $f_i$ is weighted by $w_i$ as in Eq. (6). Note that the POS categories used in our classifiers are coarse, such as A (Adjective), V (Verb), N (Noun) and E (Preposition).
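The extraction of (position, POS) pairs described above can be sketched as follows. This is a minimal sketch, not our actual implementation: the function name is hypothetical, the null symbol is written as the arbitrary string "<null>", and position 0 (the target word itself) is skipped, which is an assumption of the sketch.

```python
from typing import List, Tuple

NULL = "<null>"  # stand-in for the null symbol used beyond the sentence boundary


def pos_features(tags: List[str], target: int, c: int) -> List[Tuple[int, str]]:
    """Return (position, POS) pairs for offsets -c..c around the target word.

    tags   : coarse POS tags of the sentence, one per word
    target : index of the target word in the sentence
    c      : size of the context window
    """
    feats = []
    for p in range(-c, c + 1):
        if p == 0:  # assumption: the target word itself is not a feature
            continue
        j = target + p
        tag = tags[j] if 0 <= j < len(tags) else NULL
        feats.append((p, tag))
    return feats


# Toy sentence tagged N V N E N, target word at index 1, window c = 2.
print(pos_features(["N", "V", "N", "E", "N"], target=1, c=2))
# [(-2, '<null>'), (-1, 'N'), (1, 'N'), (2, 'E')]
```

Keeping the position in the pair means that the same POS tag before and after the target word yields two distinct features.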
Unlike the case of BOW, we do not remove punctuation symbols or numbers in the collocations.
For the COL feature, $F$ is the set of all possible collocation strings with $w$ in the training data. For each sentence $l$ containing the target word $w$ in the corpus, $f_i$ is weighted by $w_i$ as in Eq. (9).
Syntactic relations, such as subject-verb and verb-object, can be extracted from an annotated syntactic tree. In this paper, target words are assumed to be verbs, nouns or adjectives. For each category of target word, we used different features according to Vietnamese grammar. Since the characteristics of Vietnamese differ from those of English, the extracted features are not the same as in previous approaches based on syntactic relations of English. For example, an adjective can be the subject of a sentence in Vietnamese, while this is impossible in English. Table 1 shows the list of syntactic features (SYN features hereafter) used in our WSD classifiers. In Table 1, each type of syntactic feature is presented as 'R-P' (e.g. Subj-N), where R stands for the syntactic relation between the target word and the word used as a feature, and P stands for the POS of the feature word.
Table 1 List of syntactic features.

Syntactic features for verbs
  Subj-N   The word that is the subject of the target verb w.
  DOB-N    The direct object of w.
  IOB-N    The indirect object of w.
  Head-V   The verb that is modified by w.
  Mod-V    The verb that modifies w.
  Mod-A    The adjective that modifies w.
  Mod-P    The preposition that modifies w.

Syntactic features for nouns
  OB-V     The verb that is modified by the target noun w, where w is its object.
  Head-N   The noun that is the head of w.
  Head-P   The head preposition of the prepositional phrase including w.
  Mod-A    The adjective that modifies w.
  Mod-N    The noun that modifies w.
  Mod-P    The head preposition of the prepositional phrase that modifies w.
  Subj-V   The predicative verb of w, where w is a subject.

Syntactic features for adjectives
  Subj-N   The subject of the target adjective w, where w is a predicate.
  S-V      The predicative verb of w, where w is a subject.
  Head-V   The verb that is modified by w.
  Head-N   The noun that is modified by w.

The SYN feature vector is constructed in the same manner as for the POS and Collocation features.
Let $sl_i$ denote the syntactic relation (Subj-V, Mod-A, ...) and $t_i$ a word which has the syntactic relation $sl_i$ with the target word. Each syntactic feature is represented as in (10). For the Syntactic feature, $F$ is the set of all possible words that have some syntactic relation with the target word in the training corpus. For each sentence $l$ containing the target instance $w$ in the corpus, $f_i$ is weighted as in Eq. (11).
In addition to the 4 types of features, feature combinations are considered as in Table 2. In a feature combination, feature vectors for target instances are built by simply concatenating the vectors for the individual features.

This section describes three tasks which were conducted to explore the effective features for learning Vietnamese WSD classifiers, as well as to evaluate the pseudoword technique. Since there is no sense-tagged corpus for Vietnamese WSD, two kinds of sense-tagged corpora were built based on the Vietnamese Treebank (Nguyen, Vu, Nguyen, Nguyen, and Le 2009), a corpus which contains around 10,000 sentences manually annotated with syntactic trees. Details of these two corpora are explained in the succeeding sections.

Real Word task
We first conducted ordinary WSD experiments in order to investigate which features are effective for Vietnamese WSD classifiers. We call this task the Real Word task (RW task hereafter).
Since there is no sense-tagged corpus for Vietnamese WSD, a manually sense-tagged corpus named 'RW corpus' was built from the Vietnamese Treebank (Nguyen et al. 2009)⁴ in order to train SVM classifiers. The tagging process was conducted as follows. We first chose 9 verbs, 11 nouns and adjectives as target words. These words were chosen according to the following conditions: the word has high frequency in the Vietnamese Treebank, it is ambiguous, and both of its senses are expected to appear sufficiently often in the Treebank. For each target word, about 100 sentences were chosen for sense tagging, resulting in around 3,000 sentences for all verbs, nouns and adjectives. Two Vietnamese native speakers were invited to judge independently which sense a target word had in those sentences. The senses are those defined in the VDict Vietnamese dictionary.⁵ The average number of senses per target word in VDict is 3.1; however, only two coarse-grained senses are annotated for each target word. The inter-annotator agreement is 90.63%. For the sentences on which they disagreed, the two annotators discussed together and determined the final sense. The average numbers of sentences for verbs, nouns and adjectives are 92.3, 116.7 and 92.1, respectively. Full lists of the chosen target words and their senses are shown in Figure 1.

Now we can regard the original word $v_1$ or $v_2$ as a sense (we call it 'pseudo-sense' hereafter) of $v_1$-$v_2$. Note that the corpus in which $v_1$ and $v_2$ are replaced by $v_1$-$v_2$ can be regarded as a sense-tagged corpus. The Pseudoword task (PW task hereafter) is the task of determining the pseudo-sense ($v_1$ or $v_2$) of the pseudoword $v_1$-$v_2$ in a sentence. We call the obtained corpus 'PW corpus'.

Although this is not real WSD, a pseudo-sense tagged corpus can be easily created without any human intervention.
In many previous studies applying the pseudoword technique to evaluate WSD methods, the two words $v_1$ and $v_2$ are selected randomly. In this research, however, $v_1$ and $v_2$ are chosen by considering the meanings of a certain word, similarly to the 'equivalent pseudowords' proposed by Lu et al. (Lu et al. 2006).
Let us suppose $w$ is a target word. We use VDict to look up the meanings of $w$. Let $s_1$, $s_2$ be two meanings (or senses) of $w$. Then, we find two Vietnamese words $v_1$, $v_2$ that reflect the meanings $s_1$, $s_2$ respectively. $v_1$ and $v_2$ are supposed to be monosemous. Disambiguation of the pseudoword $v_1$-$v_2$ would simulate the disambiguation of the original target word $w$. For example, the Vietnamese verb 'mang' has two meanings: "to bring something" and "to contain some characteristic of something". Then 'dem' (bring) and 'chua' (contain) are selected as pseudo-senses of 'mang'. We chose 9 verbs, 9 nouns, and 5 adjectives as target words in PW task, which are a subset of the target words in RW task. Some target words in RW task are discarded in PW task because of the lack of data in our corpus. Figure 2 shows the target words of PW task.⁶

[Fig. 2 List of pseudowords and their pseudo-senses]

The PW corpus comprises 1,162 sentences for verbs, 1,483 sentences for nouns and 568 sentences for adjectives. The average numbers of samples for pseudo-verbs, pseudo-nouns and pseudo-adjectives are 129.1, 164.8 and 113.6, respectively. The number of adjective instances is smaller than those of verbs and nouns because the frequency of ambiguous adjectives in the corpus is low.
Also, since the adjectives have fine-grained senses, it is more difficult to disambiguate them.
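The construction of a pseudo-sense tagged corpus can be sketched in a few lines of Python. The sketch below only illustrates the idea under simple assumptions: sentences are already word-segmented into token lists, the function name is hypothetical, and the context tokens in the toy sentences are placeholders, not real Vietnamese text.

```python
from typing import List, Tuple


def build_pw_corpus(sentences: List[List[str]],
                    pseudo_senses: List[str]) -> List[Tuple[List[str], int, str]]:
    """Replace each occurrence of a pseudo-sense word by the pseudoword.

    Returns one instance per occurrence: (tokens with the pseudoword in
    place of the original word, index of the pseudoword, original word).
    The original word serves as the pseudo-sense label, so no manual
    annotation is needed.
    """
    pseudoword = "-".join(pseudo_senses)  # e.g. "dem-chua"
    corpus = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in pseudo_senses:
                replaced = tokens[:i] + [pseudoword] + tokens[i + 1:]
                corpus.append((replaced, i, tok))
    return corpus


# Pseudo-senses 'dem' and 'chua' for the target verb 'mang';
# x1..x7 are placeholder context tokens.
sentences = [["x1", "dem", "x2"], ["x3", "chua", "x4", "x5"], ["x6", "x7"]]
for tokens, index, label in build_pw_corpus(sentences, ["dem", "chua"]):
    print(tokens, index, label)
```

Sentences containing neither word, such as the third toy sentence, contribute no instance, which is why the size of the PW corpus depends directly on how frequent the chosen words are.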

Pseudoword and Real Word task
In this subsection, we present a method to train WSD classifiers without sense-tagged corpora. In the Pseudoword and Real Word task (PW-RW task hereafter), we use PW corpus for training WSD classifiers, then the classifiers are tested using RW corpus. This task is conducted in order to evaluate the effectiveness of the pseudoword technique applied to real WSD. Since the target words are shared by our PW and RW tasks, and a pseudo-sense ($v_1$ or $v_2$) in PW task corresponds to a sense ($s_1$ or $s_2$) in RW task, WSD classifiers trained from PW corpus could be applicable to RW task. The attractive advantage of this approach is that no sense-tagged corpus is required for supervised learning of WSD systems.

Evaluation
For each experiment, we first evaluate the effectiveness of each feature separately, then the feature combinations. In the following subsections, accuracies of trained WSD classifiers for individual target words are reported. Average accuracies for verbs, nouns, adjectives and all target words are also shown.
For the results of individual target words, only the first and second ranked feature combinations are shown.
First, we see that almost all WSD classifiers with single features, except POS and SYN for adjectives, are significantly better than the Baseline method. When only a single feature is used, BOW was better than the other three features for almost all words. This is reasonable because BOW captures the most contextual information of a target word. As a human usually does when facing an ambiguous word, BOW utilizes the context around the target word to find the key words that help disambiguate it. The POS feature only contains the grammatical information of several words around the target word, but not the 'meanings' of these words. So, the surrounding POS may not be clearly discriminative. The results of the POS feature are usually the lowest in comparison with the other features, in some cases even compared with the Baseline. The SYN feature is also not so effective for adjectives (only 1.9% higher than Baseline), since we only use 4 syntactic relations for an adjective. This may cause data sparseness when training SVM classifiers. However, the SYN feature works well on verbs and nouns (with accuracy 10.6% higher than Baseline for verbs and 17.3% for nouns). On average, when applying a single feature in Vietnamese WSD, BOW is the most effective feature, followed by COL, SYN and POS.

Results of Real Word task
In Table 3, WSD classifiers with combined feature sets achieved equal or higher results compared to individual features for some target words. In Table 4, the best feature combination outperforms the best single feature BOW for nouns and adjectives on average. However, BOW+SYN, which is the best feature combination for all words, is not better than BOW. Note that the differences between the best single and combined feature sets are insignificant (not marked by †), indicating that combining several features is not obviously better or worse than using only one type of feature. Increasing the number of feature types in a combination did not lead to an improvement in accuracy. The combination of all 4 feature types is better than the combinations of 2 or 3 features for only one verb (V7). Furthermore, the best feature combinations differ among individual target words, and the differences between the best and second best feature combinations are insignificant (not marked by †) because of the relatively small size of the training corpus.
Therefore, we cannot conclude from our results what the best feature combination for Vietnamese WSD is.

⁷ Tables 6, 8 and 9 are also denoted in the same format.

Results of Pseudoword task
Table 5 shows results of each pseudoword in PW task, and Table 6 shows the average accuracies. We can see that the results when only a single feature is used are similar to those of RW task, in that the BOW feature gave the best performance. As we discussed in Subsection 5.1, BOW contains the most lexical information around the target word. The results of the POS feature are not always the lowest in comparison with the others; however, in some cases they are lower than the Baseline (3 of 9 verbs, 1 of 9 nouns, 2 of 5 adjectives). The COL feature also gave relatively high results for all parts-of-speech. This is because the usages of the two target words in the two classes are different, so their collocations are very different. However, COL still could not perform better than BOW.
When two or more features are combined, WSD classifiers gave better results than single features for 8 of 9 verbs, 6 of 9 nouns, and all adjectives. Table 6 shows that the most effective feature combination is BOW+COL+SYN for verbs and adjectives, while BOW+COL is most effective for nouns. However, the differences among feature combinations including BOW are not so great. The combinations without BOW are worse, since they do not take advantage of the wide range of lexical information around the target word as BOW does. As in RW task, the best feature combinations in PW task vary among individual target words, as shown in Table 5. This might be because our training corpus is not large enough.

Comparison of Effective Features in RW and PW task
If the best feature set found in PW task is the same as the one in RW task, it indicates that, even when we do not have a word sense tagged corpus, we can apply the pseudoword technique to find the effective features for Vietnamese WSD. As shown in Table 6, on average, BOW is the most effective feature, followed by COL, SYN and POS in PW task. The order is the same as for RW task (in Table 4). Thus investigating effective features by pseudoword sense disambiguation is reasonable.
Looking more closely at the similarity between the results of PW task and RW task helps us to verify in more detail the applicability of the pseudoword technique for investigating effective features. Table 7 shows two numbers in the form a/b: a is the number of target words for which the best (or one of the best) feature sets is the same in PW and RW tasks, while b is the total number of target words shared by PW and RW tasks. The 'Single' column indicates the case in which the best single feature sets are the same, while the 'Combined' column indicates the case of combined feature sets.
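The a/b counts can be computed as in the following sketch. It is only an illustration: the function name is hypothetical, the best feature sets per target word are assumed to be given as sets of names (more than one name when there is a tie), and the toy values below are made up and are not the results of Table 7.

```python
from typing import Dict, Set


def best_feature_agreement(best_pw: Dict[str, Set[str]],
                           best_rw: Dict[str, Set[str]]) -> str:
    """Return 'a/b' for the target words shared by PW and RW tasks.

    a: number of shared target words for which at least one best feature
       set in PW task is also a best feature set in RW task
    b: number of target words shared by the two tasks
    """
    shared = sorted(set(best_pw) & set(best_rw))
    a = sum(1 for word in shared if best_pw[word] & best_rw[word])
    return f"{a}/{len(shared)}"


# Made-up example: N2 is a target word of RW task only, so it is not counted.
best_pw = {"V1": {"BOW"}, "V2": {"BOW", "COL"}, "N1": {"COL"}}
best_rw = {"V1": {"BOW"}, "V2": {"COL"}, "N1": {"BOW"}, "N2": {"BOW"}}
print(best_feature_agreement(best_pw, best_rw))
# 2/3
```

Representing ties as sets makes the "one of the best" condition a simple test for a non-empty intersection.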
As shown in the table, the pseudoword technique is appropriate for choosing the best single feature only when the target word is a verb or an adjective, since the best single feature of all target verbs and of 4 of 5 target adjectives in PW task agreed with that in RW task. It seems ineffective for choosing the best single feature for nouns, as well as the best feature combination for all categories.
The reason why so few target nouns share the best feature sets in PW and RW tasks might be that nouns are used in a wider range of domains than verbs and adjectives in the corpus. For example, the first sense of the ambiguous verb 'V4.chuyen' is 'to send'. This sense can only be used in text related to email, postcards or documents. Similarly, the second sense of the adjective 'A5.nang' is 'serious'. This sense can only be used in a context related to health and disease. The domains in which nouns are used, however, are very broad. For example, the second sense of the ambiguous noun 'N6.gio' is 'now'. This sense can be used in various topics, such as sports, news, literature, etc. Since the corpus is small, the pseudoword cannot cover all possible contexts in which the real word might appear.

Results of Pseudoword and Real Word task
In this task, we use two baselines. The first baseline, MFS-PW, is the system which always chooses the most frequent sense in PW corpus; the second one, MFS-RW, is the system which always chooses the most frequent sense in RW corpus. Comparison between these two baselines also enables us to verify how well pseudowords can simulate real word WSD. Table 8 shows results for each target word. Table 9 shows average results for verbs, nouns, adjectives and all target words.⁸
Comparing the results of RW task (Tables 3 and 4) and PW-RW task (Tables 8 and 9), we can see that the accuracies of WSD systems in PW-RW task are worse than those in RW task for all feature sets. It seems that WSD classifiers trained from PW corpus could not perform as well as those trained from RW corpus, although the two words used as pseudo-senses were not randomly chosen but related to the real senses. The first reason is that pseudowords are not actually real words, so there are certain differences between the features extracted from PW corpus and those from RW corpus. The second reason is that the most frequent sense of a pseudoword is in some cases totally different from the real most frequent sense. This can be empirically observed in the great gaps between MFS-PW and MFS-RW in Table 8. For example, MFS-PW of the verb 'mat' is 19.2% while its MFS-RW is 80.8%. In such a case, the classifier sees only a small amount of training data in PW corpus for a sense that is in fact the most frequent one in RW corpus, so it cannot learn the behavior of that sense. The worst case is adjectives, where disagreement of the most frequent sense is found in 4 of 5 adjectives. This is also the reason why the accuracies for adjectives are much lower than for verbs and nouns.
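The two baselines can be sketched as follows. This is a simplified illustration with hypothetical names and made-up label lists: pseudo-senses are assumed to be already mapped to the corresponding real senses, and MFS-RW is computed here on the same labels it is evaluated on, which ignores any train/test split.

```python
from collections import Counter
from typing import List


def mfs_accuracy(train_labels: List[str], test_labels: List[str]) -> float:
    """Accuracy obtained on test_labels by always predicting the most
    frequent label of train_labels."""
    most_frequent = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == most_frequent)
    return correct / len(test_labels)


# Made-up sense distributions for one target word.
pw_labels = ["s1"] * 8 + ["s2"] * 2   # s1 dominates in PW corpus
rw_labels = ["s1"] * 2 + ["s2"] * 8   # s2 dominates in RW corpus

print(mfs_accuracy(pw_labels, rw_labels))  # MFS-PW: 0.2
print(mfs_accuracy(rw_labels, rw_labels))  # MFS-RW: 0.8
```

With only two senses per target word, whenever the most frequent sense differs between the two corpora, the two baselines sum to 100%, as in the 19.2% versus 80.8% example above.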
As shown in Table 8, classifiers trained from PW corpus do not significantly outperform MFS-RW except for V1, N6 and N7 (marked by *). This might be because the training data (Vietnamese Treebank) used in our experiment is not very large. One way to enlarge the training data is to use automatically analyzed, rather than manually annotated, syntactic trees for SYN features. However, no public syntactic parser for Vietnamese is currently available.
On average, in Table 9, systems without the BOW feature achieved relatively better results. Although BOW works well in RW and PW tasks, here it performs worst compared to the other feature sets. One of the reasons might be the mismatch between the words appearing in the context of target words in PW corpus and in RW corpus. Many words in the test RW corpus might be 'unknown' in the training PW corpus, causing a decline in accuracy. Comparing BOW and POS, BOW would suffer more from the mismatch, since the variety of words (in other words, the feature space of BOW) is much broader than that of POS. This assumption is supported by the fact that POS is better than BOW in Table 9.

Discussion
In this section, we discuss three issues: comparison between SVM and Naive Bayes model in 6.1, differences of effective WSD features for different languages in 6.2, and the previous work on the pseudoword technique in 6.3.

Lee and Ng (2002) used features similar to BOW, POS, COL and SYN in this paper for English WSD, and reported that COL was the best feature type, followed by BOW, POS and SYN. When we implemented SVM classifiers with exactly the same BOW, POS and COL features proposed by (Lee and Ng 2002) and evaluated their performance for Vietnamese WSD, we found that COL was also the best (the average accuracy was 85.3 for all words), followed by SYN (83.4), POS (79.5) and BOW (79.3).⁹ On the other hand, when we used our own features described in Subsection 3.2, BOW was significantly better than COL for Vietnamese WSD, as shown in Table 4. Our features seem more appropriate for Vietnamese WSD than Lee's, since the accuracy of our method was much better.¹⁰ We may say that local collocations near the target word would be useful for English WSD, while words in a wide range of the context would be effective for Vietnamese.

Martinez et al. explored the contribution of syntactic features by training Decision List and AdaBoost on the SENSEVAL-2 English data (Martinez et al. 2002). The paper revealed that COL was more effective than SYN, although syntactic features contributed to a gain in WSD precision when they were combined with COL and BOW. Mohammad and Pedersen have also reported similar results (Mohammad and Pedersen 2004). They trained Decision Trees on the data of SENSEVAL-2, SENSEVAL-1 and others, and showed that (1) COL was a better feature than SYN, and (2) a simple ensemble of two classifiers using COL and SYN increased the accuracy. As shown in Table 4, SYN was also less effective than COL for Vietnamese WSD.
Looking at the results of the two-feature combinations with SYN (BOW+SYN, POS+SYN and COL+SYN), SYN contributed to a gain in accuracy when combined with POS and COL, but not with BOW, since the performance of BOW was much better than that of SYN.
Murata et al. carried out a comprehensive study of supervised machine learning for Japanese WSD (Murata, Utiyama, Uchimoto, Ma, and Isahara 2003). They evaluated several machine learning methods (SVM, Naive Bayes, Decision List and ensembles of them) with several feature sets (COL, POS, SYN, BOW as well as topics of documents) on the data of the SENSEVAL-2 Japanese dictionary task. The results of the Naive Bayes classifiers, which were the best system except for ensembles of multiple learning algorithms, showed that the most effective feature was COL, followed by BOW, SYN and POS. Our results showed that BOW would be the most effective for Vietnamese WSD, but it might be less useful than COL in Japanese, as in English.
Note that the above discussions are just rough comparisons between languages, since the feature sets used in previous work and ours are not exactly the same. Furthermore, the effectiveness of features might depend not only on the language but also on other factors, such as target words, sense definitions (fine or coarse grained), genres of texts and machine learning algorithms.¹¹ To explore the differences of effective features among languages more precisely, more sophisticated experimental designs would be required. That is, we should prepare parallel corpora with annotations of senses, use bilingual or multilingual lexicons to define the same set of target words and their senses, train WSD classifiers using the same machine learning algorithm, and use exactly the same feature set. Such an experiment is beyond the scope of this paper, since currently we do not have the necessary language resources.

Previous work on pseudoword
Gale et al. first introduced the 'pseudoword' technique for English (Gale et al. 1992b). They built a pseudo-ambiguous word by combining two or three randomly chosen unambiguous words and tried to disambiguate these two or three pseudo-senses. The unambiguous words came from definition sentences in a dictionary, and they were chosen so that the frequencies of pseudowords were equal. Although this is not a real WSD system, the idea of pseudowords helps to develop large amounts of training material. In the study of (Gaustad 2001), the author conducted experiments to compare the performance of a Naive Bayes classifier on real ambiguous words and pseudowords. Pseudowords were created by choosing words with the same frequency ratios as those of the real senses. The paper reported that the accuracies of pseudoword disambiguation differed from those of real WSD, indicating that the pseudoword technique would not be valid for the evaluation of WSD systems.
In most previous work, semantic properties of senses were not considered in the choice of pseudowords. Lu et al., however, proposed a method for Chinese WSD that automatically chooses unambiguous pseudowords similar to real senses using a thesaurus (Lu et al. 2006). Furthermore, as in our PW-RW task, pseudowords in an unannotated corpus were used to estimate the probabilities of a Naive Bayes model for real WSD. The trained NB classifier achieved good results, even higher than supervised classifiers trained on a relatively small sense-tagged corpus.
Our pseudoword technique is similar to that of (Lu et al. 2006) in that it considers semantic properties of pseudowords. One difference is that pseudowords were automatically chosen using a Chinese thesaurus in (Lu et al. 2006), while they were manually chosen in this paper. Lu's method seems preferable to ours, since manual choice of pseudowords might be arbitrary. Another difference is the size of the training corpus. As discussed in 5.3, the pseudoword technique did not work well in our PW-RW task experiment, while it worked well with a large amount of training data in (Lu et al. 2006). From another point of view, the lack of language resources and tools for Vietnamese, such as a thesaurus (for automatic selection of pseudowords) and a syntactic parser (to obtain a large training corpus with parse trees), might be an obstacle to applying the pseudoword technique to Vietnamese WSD.

Conclusion
In this research, we have developed a WSD system for the Vietnamese language on two corpora: the RW corpus (which was manually built) and the PW corpus (collected automatically). In the RW task, the best average accuracy for all words is 94.0%. We conducted three tasks to evaluate the effectiveness of each feature and of feature combinations, with and without a sense-tagged corpus. For the first goal, exploring effective features, we found that BOW is the most effective one. Combinations of BOW and other features enhance the performance of the WSD system in some cases, but not significantly. For the other goal, checking the applicability of the pseudoword technique, we found that it is useful for ranking feature types according to their effectiveness for WSD and for finding the best single feature for individual target verbs and adjectives. In addition, the pseudoword technique might be an alternative WSD approach when there is no training data.
However, this research has some limitations. For example, data sparseness is problematic for training classification models, and the assumption of two senses per target word may not be realistic. Therefore, it will be interesting to investigate the effective features for multi-class WSD classifiers while increasing the corpus size. Also, we could not clearly identify the best feature combination. A larger sense-tagged corpus would enable us to explore the best feature combination for Vietnamese WSD. The effectiveness of other types of features should also be investigated. For example, Cai et al. used features about the topics of documents (Cai, Lee, and Teh 2007), which are derived by Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003). They reported that topic features were effective for English, but it is not clear whether they are effective for Vietnamese.
Although the results of our experiments in the PW-RW task showed that the pseudoword technique did not work well as an unsupervised WSD method, it should be evaluated again with a larger corpus.
Another interesting direction is to compare the effective features of Vietnamese WSD with those of other languages in precise experiments, as discussed in Subsection 6.2.
$$f_i = (P_i, p_i), \qquad w_i = \begin{cases} 1 & \text{if the POS of the word at position } P_i \text{ is } p_i \\ 0 & \text{otherwise} \end{cases}$$

... feature combinations. LIBSVM (Chang and Lin 2001) is used for training SVM classifiers. Experiments in the RW task and the PW task are conducted by 10-fold cross validation. For the PW-RW task, the PW corpus is used as the training set and the RW corpus is used as the test set. The Baseline used in the experiments is the most frequent sense method. That is, all test instances of a target word are assigned the most frequent sense appearing in the training data. The evaluation criterion for WSD systems is the accuracy of sense classification, as defined in Eq. (...).

..., 15 feature sets are used for training WSD classifiers. The first four utilize one feature type, while the others utilize two, three, or four feature types (feature combination).

Function words,² proper nouns, numbers and punctuation marks are not used as features, since they would not be effective clues for WSD. For the BOW feature, $F$ is the set of all possible words appearing in the context of target instances in the training corpus. For each sentence $l$ containing a target instance $w$ in the training corpus, $f_i$ is weighted as in Eq. (3).

$$w_i = \begin{cases} \ldots & \text{if } f_i \text{ appears in } l \text{ and the sense of } w \text{ is } s_1 \\ \ldots & \text{if } f_i \text{ appears in } l \text{ and the sense of } w \text{ is } s_2 \end{cases} \tag{3}$$

Table 2
Combined feature sets.
2-feature-combination: BOW+POS, BOW+COL, BOW+SYN, POS+COL, POS+SYN, COL+SYN (example of feature vector: $F_{combine} = \{F_{BOW}, F_{COL}\}$)
3-feature-combination: BOW+POS+COL, BOW+POS+SYN, BOW+COL+SYN, POS+COL+SYN (example of feature vector: $F_{combine} = \{F_{BOW}, F_{COL}, F_{SYN}\}$)
4-feature-combination: BOW+POS+COL+SYN (example of feature vector: $F_{combine} = \{F_{BOW}, F_{POS}, F_{COL}, F_{SYN}\}$)
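The 15 feature sets and the most-frequent-sense Baseline can be illustrated with a minimal Python sketch. The function names (`enumerate_feature_sets`, `most_frequent_sense`, `accuracy`) and the toy sense labels are assumptions made for illustration only; they are not taken from the original experiments, and accuracy is assumed to be the usual ratio of correctly classified instances to all test instances.

```python
from collections import Counter
from itertools import combinations

# The four feature types named in the paper.
FEATURE_TYPES = ["BOW", "POS", "COL", "SYN"]


def enumerate_feature_sets(feature_types=FEATURE_TYPES):
    """Return every non-empty combination of feature types, smallest first."""
    return [
        "+".join(combo)
        for size in range(1, len(feature_types) + 1)
        for combo in combinations(feature_types, size)
    ]


def most_frequent_sense(train_senses):
    """Baseline: the sense label seen most often in the training data."""
    return Counter(train_senses).most_common(1)[0][0]


def accuracy(predicted, gold):
    """Fraction of test instances whose predicted sense equals the gold sense."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


feature_sets = enumerate_feature_sets()

# Toy data (hypothetical): sense labels of one target word.
train = ["s1", "s1", "s2", "s1"]
test_gold = ["s1", "s2", "s1", "s1"]

baseline_sense = most_frequent_sense(train)
baseline_predictions = [baseline_sense] * len(test_gold)
baseline_accuracy = accuracy(baseline_predictions, test_gold)
```

Enumerating the combinations this way yields 4 single-feature sets, 6 two-feature sets, 4 three-feature sets and 1 four-feature set, which matches the count of 15 stated in the text.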
Pseudoword task
Although using ordinary WSD classifiers can give us more reliable results, the problem is that a sense-tagged corpus is not easily built. Therefore, we applied the pseudoword technique to automatically develop a sense-tagged corpus, and trained WSD classifiers from it. We call this task the Pseudoword task (PW task). The main goal of this task is to evaluate the applicability of the pseudoword technique for exploring effective features of WSD by comparing results between the RW and PW tasks. Let us suppose V1 and V2 are two different words. The pseudoword V1-V2 is an imaginary word that stands for either V1 or V2. Occurrences of V1 or V2 in the corpus are then replaced with the pseudoword V1-V2.
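The replacement step can be sketched in Python as follows. This is only an illustration under stated assumptions: sentences are assumed to be already word-segmented into token lists, the function name `build_pseudoword_corpus` is hypothetical, and the word pair "ăn"/"uống" is a toy example, not one of the pseudowords used in the experiments.

```python
def build_pseudoword_corpus(sentences, v1, v2):
    """Replace v1 and v2 with the pseudoword 'v1-v2'.

    Each occurrence becomes one training instance whose pseudo-sense
    is the original word that was replaced.
    """
    pseudoword = f"{v1}-{v2}"
    instances = []
    for tokens in sentences:
        # Every occurrence of either word is hidden behind the pseudoword.
        replaced = [pseudoword if t in (v1, v2) else t for t in tokens]
        for position, token in enumerate(tokens):
            if token in (v1, v2):
                instances.append(
                    {"tokens": replaced, "target_index": position, "sense": token}
                )
    return instances


# Toy, already segmented sentences (hypothetical data).
sentences = [
    ["tôi", "ăn", "cơm"],
    ["tôi", "uống", "nước"],
    ["tôi", "đọc", "sách"],
]
instances = build_pseudoword_corpus(sentences, "ăn", "uống")
```

Sentences that contain neither word produce no instance, so the resulting corpus is sense-tagged without any manual annotation.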

Table 3
Accuracy in RW task for each target word.

Table 4
Average accuracy in RW task for verbs, nouns, adjectives and all target words.

Table 3 shows results for each target word, while Table 4 shows the average accuracies for verbs, nouns, adjectives and all target words in the RW task. Results of SVM classifiers are verified by McNemar's test (p < 0.05). * means that the system significantly outperforms the Baseline. The bold number indicates the best accuracy achieved when one feature type is used, or when two or more feature types are used. If † is attached, the system significantly outperforms the second best system among the single-feature or combined-feature groups. To clearly show the effectiveness of feature combination, ‡ is attached if the difference between the best single and combined feature is statistically significant.⁷
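As an illustration of how two classifiers can be compared on the same test instances, the following Python sketch computes McNemar's statistic with the common continuity correction and compares it with the chi-square critical value for one degree of freedom at p = 0.05 (3.841). Which variant of the test was used in the experiments is not stated, so the corrected chi-square form, the function names and the toy predictions are all assumptions.

```python
# Chi-square critical value for 1 degree of freedom at p = 0.05.
CHI2_CRITICAL_P05_DF1 = 3.841


def mcnemar_statistic(gold, pred_a, pred_b):
    """Continuity-corrected McNemar statistic for two systems on the same items."""
    # b: items system A gets right and system B gets wrong; c: the reverse.
    b = sum(a == g and o != g for g, a, o in zip(gold, pred_a, pred_b))
    c = sum(a != g and o == g for g, a, o in zip(gold, pred_a, pred_b))
    if b + c == 0:
        return 0.0  # the systems never disagree in correctness
    return (abs(b - c) - 1) ** 2 / (b + c)


def significantly_different(gold, pred_a, pred_b):
    """True if the two systems differ at p < 0.05 under the chi-square approximation."""
    return mcnemar_statistic(gold, pred_a, pred_b) > CHI2_CRITICAL_P05_DF1


# Toy predictions (hypothetical): A is right and B wrong on 10 items,
# B is right and A wrong on 1 item, and both are right on 9 items.
gold = ["s1"] * 20
pred_a = ["s1"] * 10 + ["s2"] + ["s1"] * 9
pred_b = ["s2"] * 10 + ["s1"] + ["s1"] * 9

statistic = mcnemar_statistic(gold, pred_a, pred_b)
is_significant = significantly_different(gold, pred_a, pred_b)
```

Only the instances on which the two systems disagree in correctness enter the statistic. When the number of such instances is small, an exact binomial version of the test is usually preferred over this approximation.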

Table 5
Accuracy in PW task of each pseudoword.

Table 6
Average accuracy in PW task for pseudo-verbs, pseudo-nouns, pseudo-adjectives and all pseudowords.

Table 7
The best feature comparison for each target word.

Table 8
Accuracy in PW-RW task for each target word.

Table 9
Average accuracies in PW-RW task for verbs, nouns, adjectives and all words.