Clause Splitting with Conditional Random Fields

In this paper, we present a Conditional Random Fields (CRFs) framework for the Clause Splitting problem. We adapt the CRFs model to this problem so that we can use very large sets of arbitrary, overlapping, and non-independent features. We also extend the model with N-best lists via Joint-CRFs (Shi and Wang 2007). In addition, we propose the use of rich linguistic information along with a new bottom-up dynamic algorithm for decoding a sentence into clauses. The experiments show that our results are competitive with the state-of-the-art results.

• The number of clause start positions must balance with the number of clause end positions in a sentence.
• Clauses can be embedded in outer clauses.
To overcome the drawbacks mentioned above, we use N-best lists by adapting Joint-CRFs (Shi and Wang 2007), and we simultaneously exploit rich linguistic information and propose a new bottom-up dynamic algorithm for decoding. The experiments show that our results are competitive with the previous results. In particular, the precision of our method is better than that of the previous methods. Additionally, in the decoding process, our system is also approximately 50 times faster than that of (Carreras and Marquez 2005), which is written in Perl.
The rest of this paper is structured as follows. Section 2 reviews related work. Section 3 formulates the Clause Splitting problem. Section 4 briefly introduces linear-chain CRFs and Joint-CRFs and how to apply them to Clause Splitting. Section 5 describes and discusses the experimental results. Finally, conclusions are given in Section 6.

Many supervised methods have been developed for Clause Splitting. (Carreras and Marquez 2005) used a discriminative model. They applied a global learning algorithm, FR-Perceptron (Collins 2002), to recognize the structure of clauses. They divided the problem into two layers of local subproblems: a filtering layer, which reduces the search space by identifying plausible clause candidates, and a ranking layer, which builds the optimal clause structure by discriminating among competing clauses. A recognition-based feedback rule reflects to each local function its committed error from a global point of view, allowing the functions to be trained together online as perceptrons. As a result, the learned function automatically behaves as a filter and ranker, rather than as a binary classifier. The FR-Perceptron method currently gives the best result for Clause Splitting. (Carreras, Marquez, Punyakanok, and Roth 2002) applied the AdaBoost algorithm (Carreras and Marquez 2001). They improved Clause Identification by using global inference on top of the outcome clauses hierarchically learned by local classifiers. Other approaches such as Maximum Entropy and Winnow have been applied to CS as well (Hachey 2002). A number of different supervised learning methods were used for the CoNLL-2001 shared task.

Clause Splitting Problem
At a deeper level of partial parsing is clause splitting. A clause is a sequence of words in a sentence; it is a grammatical unit that includes, at minimum, a predicate and an explicit or implied subject, and expresses a proposition. For example, given the input sentence:

Coach them in handling complaints so that they can resolve problems immediately

the problem is to split the sentence into clauses as follows:

(Coach them in (handling complaints) (so that (they can resolve problems immediately)))

The problem is more difficult than simply detecting non-recursive phrases in sentences. Clause Splitting is divided into three parts: identifying clause starts, identifying clause ends, and finding complete clauses (Sang and Dejean 2001).
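To make the representation concrete, the bracketing above can be written as (start, end) word spans. The 0-based indexing in the following sketch is our own illustrative convention, not part of the original formulation.

```python
# The example split as (start, end) word spans (0-based indexing, ours).
sentence = ("Coach them in handling complaints "
            "so that they can resolve problems immediately").split()

clauses = [
    (0, 11),  # (Coach ... immediately)                 -- whole sentence
    (3, 4),   # (handling complaints)
    (5, 11),  # (so that (they can resolve ... ))
    (7, 11),  # (they can resolve problems immediately)
]

assert all(0 <= s <= e < len(sentence) for s, e in clauses)
```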

Formulation
Let X be a sentence space, and Y be a clause space. We can consider a model for finding clauses as a function R : X → Y which, given a sentence x ∈ X, identifies the set of clauses y ⊂ Y of x. First, we assume a filter function F which, given a sentence x consisting of a sequence of n words (x_1, x_2, . . . , x_n), identifies a set of candidate clauses F(x) ⊆ P, where P is the set of all possible clauses. A candidate clause is represented as (s, e), where (s, e) is the sequence of consecutive words from word x_s to word x_e. Second, we assume a score function which, given a clause, produces a real-valued prediction for the clause. We identify a set of clauses for a sentence according to the following optimality criterion:

C(x) = argmax_{y ⊆ F(x)} Σ_{(s,e)_k ∈ y} score((s, e)_k)

in which C(x) is the set of clauses for a sentence x, and (s, e)_k is the k-th clause in y.
We will identify the clause starts (Task 1) and the clause ends (Task 2) to predict a set of candidate clauses for finding complete clauses (Task 3).

Applying CRFs to Clause Splitting
In this section, we show how to overcome the drawbacks, mentioned in the introduction, of applying CRFs and Joint-CRFs to CS. First, we present an overview of the CRFs and Joint-CRFs models; we then propose a decoding algorithm and exploit rich linguistic information to deal with the problems that arise when applying CRFs and Joint-CRFs to CS.

Conditional Random Fields
Conditional Random Fields (CRFs) (Lafferty et al. 2001) define the conditional probability of a state sequence s = (s_1, . . . , s_T) given an observation sequence o = (o_1, . . . , o_T) as

P_Λ(s|o) = (1/Z(o)) exp( Σ_{t=1}^{T} F(s, o, t) )    (2)

where Z(o) = Σ_{s'} exp( Σ_{t=1}^{T} F(s', o, t) ) is a normalization factor over all state sequences. We denote δ to be the Kronecker-δ. Let F(s, o, t) be the sum of CRFs features at time position t:

F(s, o, t) = Σ_i λ_i f_i(s_{t−1}, s_t, t)

where f_i(s_{t−1}, s_t, t) = δ(s_{t−1}, l')δ(s_t, l) is a transition feature function which represents sequential dependencies by combining the label l' of the previous state s_{t−1} and the label l of the current state s_t, such as the previous label l' = AV (adverb) and the current label l = JJ (adjective).
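As a minimal illustration of formula (2), the following sketch computes P(s|o) by brute force for a toy label set. The function names and the single example feature are our own, and real implementations use dynamic programming rather than enumerating all label sequences.

```python
import itertools
import math

def f_score(s, o, features, weights):
    """Sum of weighted features over all time positions t (the F(s, o, t) sums)."""
    return sum(lam * f(s[t - 1], s[t], o, t)
               for t in range(1, len(s))
               for f, lam in zip(features, weights))

def p_s_given_o(s, o, labels, features, weights):
    """Formula (2): exp(score) divided by Z(o), the sum over all label sequences."""
    num = math.exp(f_score(s, o, features, weights))
    z = sum(math.exp(f_score(sp, o, features, weights))
            for sp in itertools.product(labels, repeat=len(s)))
    return num / z

# The transition feature from the running example:
# f(s_{t-1}, s_t) = delta(s_{t-1}, AV) * delta(s_t, JJ).
f_av_jj = lambda prev, cur, o, t: float(prev == "AV" and cur == "JJ")
```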
The parameters Λ = {λ_i} are trained by maximizing the penalized log-likelihood of the training data:

L_Λ = Σ_j log P_Λ(s^(j) | o^(j)) − Σ_i λ_i^2 / (2σ^2)

where the second sum is a Gaussian prior over parameters (with variance σ^2) which provides smoothing to avoid overfitting the training data.
When the training labels make the state sequence unambiguous, the likelihood function in exponential models such as CRFs is convex, so finding the global optimum is guaranteed. Parameter estimation of a CRFs model requires an iterative procedure. Various methods can be used to optimize L_Λ, including iterative scaling algorithms such as GIS and IIS (Lafferty et al. 2001) and quasi-Newton methods such as L-BFGS (Sha and Pereira 2003). Among these methods, L-BFGS is the most efficient (Malouf 2002; Sha and Pereira 2003).
L-BFGS requires only the first derivative of the function to be optimized.
Let s^(j) denote the state path of training sequence j; the first derivative of the log-likelihood is

∂L_Λ/∂λ_k = Σ_j C_k(s^(j), o^(j)) − Σ_j Σ_s P_Λ(s | o^(j)) C_k(s, o^(j)) − λ_k/σ^2

where C_k(s, o) = Σ_t f_k(s_{t−1}, s_t, t) is the count of feature f_k given s and o. The first two terms correspond to the difference between the empirical and the model expected values of feature f_k. The last term is the first derivative of the Gaussian prior.

Inference in CRFs
Given the conditional probability of the state sequence defined in (2) and the set of parameters Λ = {λ_i}, inference in CRFs is to find the most likely state sequence s*:

s* = argmax_s P_Λ(s|o)

We can efficiently calculate s* with the Viterbi algorithm (Rabiner 1989). For the Viterbi algorithm, we use a table storing the probability of the most likely path up to time t, which accounts for the first t observations and ends in state s_i. We define this probability to be ϕ_t(s_i), where ϕ_1(s_i) is the probability of starting in state s_i, and use the recursive formulation:

ϕ_{t+1}(s_i) = max_{s'} [ ϕ_t(s') exp( Σ_k λ_k f_k(s', s_i, t+1) ) ]    (6)

From s_i* = argmax_{s_i} ϕ_T(s_i), we can backtrack through the dynamic programming table to recover s*.
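The recursion above can be sketched as follows. We work in log space, so the exp in formula (6) becomes a sum, and we assume a user-supplied local score Σ_k λ_k f_k(s', s, t); the function names are our own.

```python
def viterbi(T, labels, start_score, local_score):
    """Find the most likely label path of length T.
    start_score(s): log-score of starting in state s (phi_1).
    local_score(prev, cur, t): sum_k lambda_k * f_k(prev, cur, t)."""
    phi = [{s: start_score(s) for s in labels}]  # the phi_t(s_i) table
    back = []                                    # back-pointers for recovery
    for t in range(1, T):
        col, ptr = {}, {}
        for s in labels:
            # phi_{t+1}(s) = max_{s'} [ phi_t(s') + local_score(s', s, t) ]
            prev = max(labels, key=lambda p: phi[-1][p] + local_score(p, s, t))
            col[s] = phi[-1][prev] + local_score(prev, s, t)
            ptr[s] = prev
        phi.append(col)
        back.append(ptr)
    # Backtrack from the best final state to recover s*.
    state = max(phi[-1], key=phi[-1].get)
    path = [state]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```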

Joint Conditional Random Fields
There is a limitation in applying CRFs to the three subproblems separately: if we process Task 1, Task 2, and Task 3 separately, errors nearly always cascade through the chain, causing errors in the final output. To tackle this limitation, we introduce Joint-CRFs over Task 1, Task 2, and Task 3. Our Joint-CRFs model is based on the dual-layer Conditional Random Fields developed by (Shi and Wang 2007) for segmentation and tagging. We combine the three subproblems (Task 1, Task 2, and Task 3) using a joint probability model.
Let W = {w_1, . . . , w_m} denote a sentence of m words. Let S = {S_1, . . . , S_m} denote the labels of Task 1, where S_i ∈ {a start position word (S), or a word which is not a start position word (*)}; let E = {E_1, . . . , E_m} denote the labels of Task 2, where E_i ∈ {an end position word (E), or a word which is not an end position word (*)}; and let C = {C_1, . . . , C_m} denote the labels of clauses, where C_i ∈ {the named clause labels for Task 3} (see the example in Section 5 for more detail). Our goal is to identify the start word of a clause, the end word of a clause, and the boundary label of a clause that maximize the joint probability P(S, E, C|W). We can formulate the joint problem as follows:

(S, E)*, C* = argmax_{S, E, C} P(S, E, C|W)

where (S, E)* and C* are the most likely (boundary label at the start word, boundary label at the end word) and boundary label of a clause, respectively. Applying Bayes' theorem, the joint probability P(S, E, C|W) is factorized into two terms, P(C|(S, E), W) and P(S, E|W). The first term represents the conditional probability of Task 3 given the results of Task 1 and Task 2 (Identify(S, E, W)); the second term represents the conditional probability of Task 1 and Task 2 given W. Note that P(S, E|W) ≈ P(S|W)P(E|W), assuming that identifying the start word of a clause (S) and identifying the end word of a clause (E) are independent of each other, in which P(S|W) and P(E|W) are the conditional probabilities of Task 1 and Task 2 given W, respectively.
In training, the probability P(S, E, C|W) can be rewritten (according to formula (2)) as:

P(S, E, C|W) ≈ P(C|Identify(S, E, W)) P(S|W) P(E|W)
             = (1/Z_C) exp(Σ_t F_3) · (1/Z_S) exp(Σ_t F_1) · (1/Z_E) exp(Σ_t F_2)

where F_1, F_2, and F_3 are the sums of CRFs features of Task 1, Task 2, and Task 3, respectively, and Z_S, Z_E, and Z_C are the normalizing terms of the probabilities P(S|W), P(E|W), and P(C|Identify(S, E, W)), respectively. Their properties and functions are the same as in the common CRFs described in Section 4.1.
We can consider the learning process as two steps: one for learning the first layer, Task 1 (S) and Task 2 (E), and one for learning the second layer, Task 3.

N-best List Approximation for Decoding
Adopting (Shi and Wang 2007), we also use an N-best list approximation method. We keep an N-best list Ψ = {(S_1, E_1), (S_2, E_2), . . . , (S_N, E_N)} ranked by the probabilities P(S|W) and P(E|W). The maximum of the joint probability P(S, E, C|W) can then be defined approximately as:

(S, E)*, C* ≈ argmax_{(S,E)∈Ψ, C} P(C|Identify(S, E, W)) P(S|W) P(E|W)

We obtain the N-best list of Task 1 (S) and Task 2 (E) and their corresponding probabilities P(S|W) and P(E|W) (S, E ∈ Ψ) by using a combination of forward Viterbi and backward A* search. Given a particular S and E, the most likely clause boundaries and their probability P(C|Identify(S, E, W)) can be calculated by the Viterbi algorithm in Section 4.1.
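A minimal sketch of this N-best joint decoding under the factorization above. Here psi is assumed to hold the N best (S, E) pairs with their Task 1/Task 2 probabilities, and task3 is assumed to return the best labeling C with P(C|Identify(S, E, W)); both interfaces are hypothetical, not code from the paper.

```python
def joint_decode(psi, task3, words):
    """Pick (S, E, C) maximizing P(C|Identify(S, E, W)) * P(S|W) * P(E|W)
    over the N-best list psi = [(S, p_S, E, p_E), ...]."""
    best, best_score = None, float("-inf")
    for s_seq, p_s, e_seq, p_e in psi:
        c_seq, p_c = task3(s_seq, e_seq, words)  # inner Viterbi for Task 3
        score = p_c * p_s * p_e                  # approximate joint probability
        if score > best_score:
            best, best_score = (s_seq, e_seq, c_seq), score
    return best, best_score
```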

Features
The set of features we use is the same as the set reported in (Carreras and Marquez 2005). It includes features at the word level and features at the sentence level.

Features at word level
The features are used with a window representation of size 2. For a window centered at the word x_t, we use the following features extracted from (x_{t−2}, x_{t−1}, x_t, x_{t+1}, x_{t+2}), where x can be:
• Word form (w) and POS tag (p).
• Count: the number of occurrences of a particular linguistic element in a sentence fragment.
We consider two fragments of a sentence, with separate features for each: from the beginning of the sentence to w_i (CountBegin), and from w_i to the end (CountEnd). The linguistic elements are enumerated as follows (a sketch of these count features appears after the list):
- Relative pronouns (e.g., "that", "where", "who", "which", "whom", "whose")
- Punctuation marks (. , ; :)
- Quotes
- Verb phrase chunks
- Relative phrase chunks
The feature templates at the word level are described in Table 1.
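A hedged sketch of the CountBegin/CountEnd features for one element type (relative pronouns); the other elements follow the same pattern over punctuation or chunk annotations. The function and the dictionary-of-features shape are our own illustration.

```python
RELATIVE_PRONOUNS = {"that", "where", "who", "which", "whom", "whose"}

def count_features(words, i):
    """CountBegin: occurrences from the sentence start up to w_i (inclusive);
    CountEnd: occurrences from w_i to the end of the sentence."""
    begin = sum(w.lower() in RELATIVE_PRONOUNS for w in words[: i + 1])
    end = sum(w.lower() in RELATIVE_PRONOUNS for w in words[i:])
    return {"CountBegin:relpron": begin, "CountEnd:relpron": end}
```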

Features at sentence level
These features are used for capturing long-distance dependencies and identifying the clause boundaries of a clause candidate (s, e):
• Top-most structure: a pattern representing the relevant elements of the top-most structure forming the candidate from s to e. The following elements are used to form the pattern:
- Punctuation marks
- Coordinate conjunctions (e.g., "and", "or")
- The word "that"
- Relative pronouns (e.g., "that", "which", "who", "whom", "whose")
The pattern only considers the top-most structure: the internal structure of a clause which appears in the pattern is ignored. For example, the pattern for the clause "((to raise)VP rates on containers (carrying U.S. exports to Asia)S about 10%)" is VP-%-S-%.
• The number of clauses found inside the candidate [x_s, . . . , x_e] (a minimal sketch of this count follows the list).
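A minimal sketch of the second feature, assuming clauses are represented as (start, end) spans; the strict-containment test is our own reading of "found inside".

```python
def clauses_inside(candidate, found_clauses):
    """Number of previously found clauses contained in candidate (s, e)."""
    s, e = candidate
    return sum(s <= cs and ce <= e and (cs, ce) != (s, e)
               for cs, ce in found_clauses)
```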

Decoder for Clause Splitting
As mentioned in Section 1, owing to the three weaknesses, we do not apply CRFs to Task 3 directly.
In this section, we will describe an algorithm for decoding Clause Splitting in a segment of a sentence from l to r. It is a dynamic algorithm presented in Figure 1 as a recursive function.
We use the results of Task 1 and Task 2 as the input of Task 3. In Figure 1, lines 4 to 9 of the function use two recursive calls on the sentence segment to enumerate all clause candidates (s_i, e_j) (s_i ∈ mstart[], e_j ∈ mend[]) of the segment (l, r). Line 10 of the function finds the optimal split k* for the current sentence segment.
Line 11 assigns the union of the two disjoint splits BestClause[l, k*] and BestClause[k*+1, r], which covers the segment (l, r), to BestClause[l, r]. Lines 12 and 13 treat the case in which the clause (l, r) itself is added to BestClause[l, r]. A sketch of this recursion follows.
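Since Figure 1 is not reproduced in this text, the following Python sketch is a reconstruction of the recursion just described; the names (best, starts, ends, score) and details such as restricting split points to candidate boundaries are our own assumptions.

```python
from functools import lru_cache

def make_decoder(starts, ends, score):
    """starts/ends: sets of word positions predicted by Task 1/Task 2;
    score(s, e): the Task 3 score of candidate clause (s, e)."""

    @lru_cache(maxsize=None)
    def best(l, r):
        """Return (total score, tuple of clauses) for the segment (l, r)."""
        value, clauses = 0.0, ()
        # Lines 4-10 of Figure 1: recursive calls over split points k,
        # keeping the optimal split k* for the current segment.
        for k in (p for p in sorted(starts | ends) if l <= p < r):
            lv, lc = best(l, k)
            rv, rc = best(k + 1, r)
            if lv + rv > value:
                # Line 11: BestClause[l, r] is the union of the two
                # disjoint splits BestClause[l, k*] and BestClause[k*+1, r].
                value, clauses = lv + rv, lc + rc
        # Lines 12-13: add the clause (l, r) itself when it is a candidate.
        if l in starts and r in ends and score(l, r) > 0:
            value, clauses = value + score(l, r), clauses + ((l, r),)
        return value, clauses

    return best
```

For a sentence of n words, calling best(0, n−1) returns the optimal clause set for the whole sentence.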
A sentence requires a function call for each clause candidate, and there is a quadratic number of clause candidates over the numbers of start words and end words in the sentence. The function consumes linear time for selecting the optimal split, plus the cost of the scoring function. Consequently, the computation time for identifying a clause split in a sentence is O(n^2(n + cost(score))), where n is the number of start words and end words in the sentence. Because n is small, the computation time of CS is dominated by the Viterbi algorithm calculating cost(score).

Scoring
It is essential to identify the score of a candidate clause. We use the Viterbi algorithm in the decoding process for Task 3. Denote by Ω the set of boundary labels of clauses in the outputs which the Viterbi algorithm produces to predict the labels of clauses in the segment (l, r). The score of a candidate clause (l, r) is then defined by formula (13), in which ϕ_T(s_k) is the Viterbi probability of formula (6).

Experiments
We conducted the experiments and evaluated the results with our CRFs framework, using the standard precision, recall, and F1 measures over R(x_i), the set of clauses identified for a sentence x_i.
For Task 1 and Task 2, we used the CRFs framework with the set of features in Section 4.3 as unigram feature templates. We also imposed some constraints on the Viterbi algorithm of formula (6), as follows (a sketch of this constrained scoring appears after the list):
• The start position of a clause must be the boundary of a chunk.
• The end position of a clause must be the boundary of a chunk.
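These constraints can be imposed directly in the Viterbi recursion by forbidding disallowed labels. A minimal sketch, assuming a chunk-boundary predicate and the label name "S" for a start word; neither is spelled out in the paper, and the end constraint is analogous.

```python
NEG_INF = float("-inf")

def constrain(local_score, is_chunk_boundary, start_label="S"):
    """Wrap a Viterbi local score so that a clause-start label is only
    permitted at positions that are chunk boundaries."""
    def constrained(prev, cur, t):
        if cur == start_label and not is_chunk_boundary(t):
            return NEG_INF  # a path through a non-boundary start is forbidden
        return local_score(prev, cur, t)
    return constrained
```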
We combined the outputs of Task 1 and Task 2 with the chunking tags to enrich the linguistic information. This combination is described in Table 3. The results of Task 1 and Task 2 are shown in Table 4.
We experimented on Task 1 and Task 2 with a set of features as bigram feature templates.
The results of Task 1 and Task 2 are shown in Table 5. They show that the F1 value of Task 1 on the test set improves by 0.61% and the F1 value of Task 2 on the test set improves by 0.94%.
We also experimented with Task 1, Task 2, and Task 3 using Joint-CRFs with the set of features as bigram feature templates. We chose N = 10 for the N-best list. Our results are shown in Table 6.

Table 3: Integrating the output tags of Task 1 and Task 2 with the chunking tags. The first and second columns show words and POS tags, respectively. The chunking tags are shown in the third column, in BIO notation. The fourth and fifth columns show the outputs of Task 1 and Task 2, respectively. The sixth and seventh columns show the combination of the outputs of Task 1 and Task 2 with the chunking tags.

Using CRFs and Joint-CRFs to predict score
We used CRFs and Joint-CRFs (jointly with Task 1 and Task 2) with the bigram feature templates for Task 3 presented in Section 4.3. We then used formula (6) in the Viterbi algorithm to compute score(l, r) of a clause candidate (l, r) using the score function (13). The result of Task 3 (identifying the clauses) is shown in Table 6. The F1 performance of Task 3 using Joint-CRFs improves by 0.31% compared with that of Task 3 using CRFs.

Combining linguistic information
We improved the F1 value of Task 3 by using linguistic information to smooth the score function presented in Section 4.4; the resulting score function is defined in formula (14). The result is shown in Table 7. The F1 performances are 84.09% and 84.66%, improvements of 1.25% and 1.51% over the case of using formula (13), respectively. Table 7 also shows that the F1 performance using Joint-CRFs is 0.57% higher than that using CRFs. Table 8 shows a comparison of our methods with the previous works on the same training and testing data. The results show that our method is comparable to that of (Carreras and Marquez 2005), which is the state-of-the-art result, and outperforms the other methods. The result of (Carreras and Marquez 2005) slightly outperforms ours because they combine the three tasks of CS end-to-end, while we combine the three tasks through an intermediate representation. With their error-driven method, they feed errors back during the training process.
However, our method achieves higher precision than the other methods. This is very useful when applying CS to other applications such as machine translation, because the clauses need to be identified correctly.
We carried out statistical significance tests using the t-test. A pairwise t-test showed that the precision of our results is significantly better than that of (Carreras and Marquez 2005) (p-value = 4.98 × 10^{−6}).

Relation between performance of Task 3 and results of Task 1 and Task 2
In order to test how the results of Task 1 and Task 2 affect the performance of Task 3, we conducted an experiment performing Task 3 with the gold-standard data of Task 1 and Task 2. Table 10 shows that the F1 values of Task 3 are 84.28% and 85.99%, the latter with added linguistic information. We see that using the gold-standard data of Task 1 and Task 2 (85.99%) improves on our results (84.09%). This indicates that the performance of Task 1 and Task 2 is important to the result of clause splitting. However, the main errors of Task 3 are mismatches between the starting points (the results of Task 1) and the ending points (the results of Task 2). These errors are caused by inappropriate scores in the decoding algorithm.
Our future work is focused on how to find a better scoring method for the decoding algorithm.

Conclusion
In this paper, we have presented a CRFs-based framework for clause splitting.
We have proposed a new bottom-up dynamic algorithm for decoding and some effective linguistic information for clause splitting. We compared the results of our framework with the previous works on the CoNLL-2001 shared task. The experiments show that our result is competitive with the state-of-the-art results for clause splitting.