Anna: A Dapper Open-Domain Dialogue Agent Based on a Joint Attention Network

We constructed a high-quality open-domain dialogue generation model called Anna that is composed of a hierarchical self-attention network with multiple convolution filters and a topic-augmented network. During daily conversations, humans typically respond by understanding the dialogue history and assembling their knowledge regarding the topic. However, existing dialogue generation models are weak at capturing the dependencies among words or utterances, resulting in an insufficient understanding of context and the generation of irrelevant responses. Previous works have largely ignored topic information modeling in multi-turn dialogue, making responses overly generic. Although pre-training using large-scale transformer models has recently resulted in enhanced performance, large parameter sizes complicate such models. Anna effectively captures contextual dependencies and assigns greater weight to important words and utterances to compute context representations. We incorporate topic information into our model as prior knowledge to synthesize topic representations. The two types of representations jointly determine the probability distributions of responses, which effectively simulates how people behave in real conversations. Empirical studies on both Chinese and English corpora demonstrate that Anna outperforms baseline models in terms of response quality, parameter size, and decoding speed.


Introduction
Dialogue agents can be divided into goal-driven and non-goal-driven agents. Goal-driven systems are constructed in vertical domains for specific tasks such as technical support services (Young et al. 2013; Shawar and Atwell 2007). Non-goal-driven systems aim to have natural conversations with people regarding a wide range of topics in open domains and include chatbots. This paper focuses on multi-turn dialogue response generation in an open domain in which we attempt to train a response generation model using large-scale context-response pairs. A context refers to several previous utterances (turns). In practice, a model takes the context as an input and generates a response in the next turn.

Single-turn Dialogue Generation Models
Some studies have improved the anthropomorphic characteristics of generated responses. For example, the backgrounds of personas (Mazaré et al. 2018; Madotto et al. 2019; Cui et al. 2018) and emotional information (Zhou et al. 2020; Huang et al. 2018; Rashkin et al. 2018; Li et al. 2019) have been incorporated into encoder-decoder architectures. These studies aimed to address the problem of generating generic responses, which is also one of the motivations for our study. This paper focuses on the sculpting of dialogue topic information.
However, previous studies have focused on single-turn dialogue generation, whereas we focused on response generation for multi-turn dialogue. Previously, there have been some pioneering works on the modeling of topic information (Li et al. 2015;Mou et al. 2016;Xing et al. 2017) for single-turn dialogue, but most have merely fixed a single topic word in a response.

RNN-based Model
Multi-turn dialogue response generation has attracted significant attention from academia in recent years. Various hierarchical recurrent models have been used in this area, including HRED 1 . Under the architecture of HRED, additional variants such as MrRNN (Serban et al. 2017a) and VHRED 2 (Serban et al. 2017b) have been proposed to incorporate stochastic latent variables to improve the diversity of responses. However, these models use all contexts indiscriminately, leading to unsatisfactory performance. To solve this problem, some researchers proposed HARN (Xing et al. 2018), which is a variant of HRED, by introducing a traditional attention mechanism into an RNN. However, RNN-based attention models are typically biased toward closer utterances in a context, meaning they suffer from the position bias problem (Hochreiter et al. 2001). Furthermore, double-layer dynamic attention changes the context representation in each step of the decoding process, which can lead to incoherent responses (Cui et al. 2018). (Zhang et al. 2020) detected topic-level attention relevance to handle the topic drift problem through the use of a pre-trained bi-term topic model. Because this method essentially integrates topic information into HRED, the shortcomings of the RNN-based networks discussed above are not resolved in their model. The main merit of our model compared to theirs is the effective capture of contextual dependencies.

RNN and Self-attention Hybrid Model
ReCoSa 3 (Zhang et al. 2019) focuses on modeling the relevance between response representations and contexts to allocate attention weights for utterances reasonably. However, ReCoSa is still inadequate in terms of context modeling because it uses an RNN-based utterance-level encoder that may lose important information in an utterance. Through experimentation, we also found that ReCoSa is prone to generating generic responses such as "Okay I know." These responses carry little information, making it difficult to keep a conversation going. (Zhao et al. 2020) devised a group of self-supervised auxiliary tasks that help their models produce better features for response generation. They introduced maximum likelihood estimation during learning with four self-supervised auxiliary tasks. The key concept of their method is to transfer the burden of context understanding from modeling to learning. However, they directly concatenate the utterances in a dialogue as a long sequence and feed the sequence into an encoder.

Auxiliary Task Model
It is unclear how auxiliary tasks can remedy the loss of utterance-level dependencies in this form of non-contextual modeling. Additionally, this approach makes it easy for the matrix inputs for the model to become enormous as the utterance length increases. This makes it difficult to accelerate the decoding process. This could be why utterances were truncated in their study and only the first 25 words were kept. However, the length of an utterance often exceeds 25 words in daily life when people narrate something or explain a perspective. Particularly in languages with a convention of equivocal expressions such as Japanese or Chinese, more words typically appear in an utterance. Therefore, their study was only evaluated on an English corpus. In Section 5.6.5, some comparisons between Anna and the method from (Zhao et al. 2020) are presented.
We then describe the advantages of our model in detail.

Large-scale Pre-training Model
Recently, task-agnostic pre-training using large-scale transformer models has achieved significant success in natural language generation. Meena (Adiwardana et al. 2020) scaled up the parameter size to 2.6B to obtain a human-like chatbot. Blender (Roller et al. 2021) fine-tuned a pre-trained model using human-annotated datasets, making dialogue more personalized. PLATO-2 4 (Bao et al. 2020) constructed an effective training schema via curriculum learning and exhibited state-of-the-art performance. However, its parameter size was more than triple that of ReCoSa.

Self-attention Mechanism
A vector sequence is considered as a matrix. The input matrices are denoted as $Q_{\mathrm{seq}}, K_{\mathrm{seq}}, V_{\mathrm{seq}} \in \mathbb{R}^{n \times d}$. Different linear projections are applied to these matrices to obtain the queries, keys, and values, which are denoted as $Q, K, V \in \mathbb{R}^{n \times c}$, respectively:
$$Q = Q_{\mathrm{seq}} W^{Q}, \qquad K = K_{\mathrm{seq}} W^{K}, \qquad V = V_{\mathrm{seq}} W^{V},$$
where the parameter weight matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times c}$ are called an attention head. The attention scores are calculated using the scaled dot product, and the weighted sum of $V$ is obtained as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{c}}\right) V.$$
The self-attention mechanism adopts a multi-head approach: multiple sets of weight matrices project the original sequence into different representation subspaces to focus on information from different positions. For the $i$-th head, the weighted sum of $V_i$ is calculated as
$$\mathrm{head}_i = \mathrm{Attention}(Q_{\mathrm{seq}} W^{Q}_{i}, K_{\mathrm{seq}} W^{K}_{i}, V_{\mathrm{seq}} W^{V}_{i}),$$
where the parameter weight matrices $W^{Q}_{i}, W^{K}_{i}, W^{V}_{i} \in \mathbb{R}^{d \times c/h}$ and $h$ is the number of heads. All heads are concatenated into a single matrix, and a linear projection is then used to mix the subspaces from the different heads:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},$$
where the parameter weight matrix $W^{O} \in \mathbb{R}^{c \times d}$. See (Vaswani et al. 2017) for more details.
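The projection, scaled dot-product, and head-concatenation steps described in this section can be sketched with NumPy as follows. Shapes follow the notation above (n tokens, input width d, projection width c, h heads); the random weight matrices are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(M, Wq, Wk, Wv, Wo, h):
    """M: (n, d) input sequence; Wq/Wk/Wv: (d, c); Wo: (c, d); h heads."""
    n, d = M.shape
    c = Wq.shape[1]
    Q, K, V = M @ Wq, M @ Wk, M @ Wv                 # linear projections

    def split(X):                                     # (n, c) -> (h, n, c/h)
        return X.reshape(n, h, c // h).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    # scaled dot-product attention per head
    scores = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(c // h))
    heads = scores @ Vh                               # (h, n, c/h)
    concat = heads.transpose(1, 0, 2).reshape(n, c)   # concatenate heads
    return concat @ Wo                                # mix subspaces -> (n, d)

rng = np.random.default_rng(0)
n, d, c, h = 5, 16, 16, 4
M = rng.standard_normal((n, d))
out = multi_head_attention(M,
                           rng.standard_normal((d, c)),
                           rng.standard_normal((d, c)),
                           rng.standard_normal((d, c)),
                           rng.standard_normal((c, d)), h)
print(out.shape)  # (5, 16)
```

Both the utterance-level and context-level encoders in Anna build on this same primitive.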

LDA Model
LDA is a generative probabilistic model that is typically used for the topic mining of text with unsupervised learning. The basic concept of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. This is similar to the standard bag-of-words model assumption. Fig. 1 presents the LDA model, which is constructed as follows.
(1) Draw a multinomial distribution θ m from a Dirichlet prior Dir(α) that represents the topic distribution of document m.
(2) Sample from the topic multinomial distribution θ m to generate a topic z m,n for the n-th word in document m.
(3) Draw a multinomial distribution ϕ k from a Dirichlet prior Dir(β) that represents the word distribution of topic z m,n .
(4) Sample from the word multinomial distribution ϕ k to generate word w m,n .
Each document in the corpus can be denoted as a joint probability distribution:
$$p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = \prod_{m=1}^{M} \frac{\Delta(n_{m} + \alpha)}{\Delta(\alpha)} \prod_{k=1}^{K} \frac{\Delta(n_{k} + \beta)}{\Delta(\beta)},$$
where n m is the vector of the number of words in each topic in document m and n k is the vector of the number of times each word appears in the corpus under topic k. Model training can be performed through the collapsed Gibbs sampling of z to maximize the joint probability and thereby estimate θ and ϕ. We direct readers to (Blei et al. 2003) for additional details.

Problem Formalization
Given a context C = (U 1 , ..., U m ), response R = (y 1 , ..., y n ), and set of topic words Z = (k 1 , ..., k l ), U i is the i-th utterance, y i is the i-th response word, and k i is the i-th topic word.
∀i, U i = (w i,1 , ..., w i,ni ), where w i,j is the j-th word of utterance U i . We denote a corpus as C, which consists of (C, R, Z) pairs. Our goal is to learn a mapping function P (y 1 , ..., y t | C, Z) from the corpus that can generate a response G = (g 1 , ..., g t ) when a new dialogue context C and its topic words Z are given.

Hierarchical Self-attention Network
Hierarchical self-attention consists of an utterance-level encoder and a context-level encoder, both of which adopt a self-attention mechanism. For each word w i,j in an utterance U i , the token embedding is calculated as the sum of a word embedding and its position embedding:
$$I(w_{i,j}) = WE(w_{i,j}) + PE(w_{i,j}),$$
where W E is a function for obtaining a word embedding from an embedding table and P E is a function that provides position information. {I(w i,1 ), ..., I(w i,ni )} is considered as a matrix M Ui ∈ R ni×c . To obtain an utterance representation, M Ui is fed into the utterance-level encoder in the form of queries, keys, and values by using different linear projections. The multi-head attention computes the attention representation encoding U i in the form of hidden vectors.
A convolution operation involving a filter f ∈ R h×1 is then applied to a window with a height of h and a width of one. As the filter scans, a feature s t i is generated from the t-th window as
$$s^{i}_{t} = Filter(W^{F} x_{t} + b),$$
where x t denotes the t-th window of hidden vectors, b is a bias term, Filter is a nonlinear function (Tanh was used in our experiments), and W F ∈ R 1×h denotes the parameter weight matrix. The stride of the filter along the height is k (k > 0.5h) and that along the width is one. The filter is applied to each possible window; because the width of the filter is one, this process performs convolution on each dimension of the hidden vectors to extract the semantic features of U i . In our implementation, the feature map M ap i is obtained using multiple convolution operations. A pooling operation is then applied to the resulting feature map, and the utterance representation of U i is computed as
$$S_{i} = MaxAbsPool(Map_{i}),$$
where MaxAbsPool is a function that finds the maximum absolute value along the last dimension and takes the corresponding value. The description above covers the extraction of a feature map from one filter. Our model employs multiple filters with varying window heights to obtain multiple semantic features and averages them to derive the utterance representation, as shown in Fig. 2; this computation is performed in parallel. Position information is included in S i to indicate the turn to which S i belongs.
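One plausible reading of the convolution-and-pooling step is sketched below. The filter weights and bias are illustrative, and the max-abs pooling is interpreted here as keeping, for each dimension, the window feature with the largest magnitude; the exact pooling axis in the original is ambiguous, so this is an assumption.

```python
import numpy as np

def conv_maxabs_utterance(H, window_heights=(2, 3), stride=1, seed=0):
    """H: (n, c) hidden vectors of one utterance from the utterance-level
    encoder. For each window height h, slide an (h x 1) filter down every
    dimension of H, apply Tanh, max-abs pool over windows, then average
    the per-filter vectors into one utterance representation."""
    rng = np.random.default_rng(seed)
    n, c = H.shape
    reps = []
    for h in window_heights:
        W = rng.standard_normal(h)        # filter weights (illustrative)
        b = 0.1                           # bias term (illustrative)
        feats = []
        for t in range(0, n - h + 1, stride):
            # W @ window gives one feature per hidden dimension: (c,)
            feats.append(np.tanh(W @ H[t:t + h] + b))
        fmap = np.stack(feats)            # feature map: (num_windows, c)
        # max-abs pooling: per dimension, keep the value of largest magnitude
        idx = np.abs(fmap).argmax(axis=0)
        reps.append(fmap[idx, np.arange(c)])
    return np.mean(reps, axis=0)          # average over filters -> (c,)

H = np.random.default_rng(1).standard_normal((7, 8))
S = conv_maxabs_utterance(H)
print(S.shape)  # (8,)
```

Because Tanh bounds every feature in (-1, 1), the resulting utterance vector is bounded as well, which keeps utterance representations on a comparable scale across turns.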
{I(S 1 ), ..., I(S m )} is considered as a matrix M C ∈ R m×c . To obtain the context representation O C , matrix M C is utilized as the input for the context-level encoder. The design of the context-level encoder is similar to that of the utterance-level encoder: it contains a multi-head attention layer followed by a feed-forward neural network (F F ).
To facilitate matrix operations in the decoding phase, we do not conduct dimensional reduction on O C . Therefore, O C is a vector sequence. To maintain the hierarchy of the utterance level and context level, parameter weight matrices between the utterance-level encoder and context-level encoder are not shared.

Topic-Augmented Network
We acquire the topic words of a dialogue from a pre-trained LDA model. Each context and its responses are concatenated to form a short document. Then, the collapsed Gibbs sampling algorithm is applied to each short document to estimate the parameters of the LDA model.
The pre-trained LDA model allocates the topic z with the highest probability in θ to a context.
We select the top-n words with the highest probabilities in ϕ under topic z as topic words. As discussed in Section 3.2, θ represents the topic distribution and ϕ represents the word distribution.
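The selection rule above can be sketched as follows, given an already-estimated θ (per-document topic distribution) and ϕ (per-topic word distribution). The toy matrices and vocabulary are illustrative only, not from a trained model.

```python
import numpy as np

def topic_words_for_context(theta_m, phi, vocab, top_n):
    """theta_m: (K,) topic distribution of the short document formed by
    concatenating a context and its response; phi: (K, V) word distribution
    per topic. Returns the top_n words of the most probable topic."""
    z = int(np.argmax(theta_m))                  # topic with highest prob.
    top = np.argsort(phi[z])[::-1][:top_n]       # highest-probability words
    return [vocab[i] for i in top]

vocab = ["travel", "money", "save", "beer", "sports"]
theta = np.array([0.1, 0.7, 0.2])                # topic 1 dominates
phi = np.array([[0.10, 0.10, 0.10, 0.40, 0.30],
                [0.40, 0.30, 0.20, 0.05, 0.05],
                [0.20, 0.20, 0.20, 0.20, 0.20]])
print(topic_words_for_context(theta, phi, vocab, 3))
# ['travel', 'money', 'save']
```

In our setting, the selected words form the set Z = (k 1 , ..., k l ) that the topic-augmented network consumes.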
In the topic-augmented network, given a set of topic words Z = (k 1 , ..., k l ) for the context C, the embeddings of the topic words are calculated using the function W E discussed in Section 4.2.1. With a slight abuse of notation, we also use (k 1 , ..., k l ) to denote the embeddings of the topic words. These embeddings are linearly combined into the topic embedding by the topic-augmented network. The combination weight α n of k n and the topic embedding O zi are given by
$$\alpha_{n} = \frac{\exp(\eta(W^{Z} k_{n}))}{\sum_{j=1}^{l} \exp(\eta(W^{Z} k_{j}))}, \qquad O_{z_i} = \sum_{n=1}^{l} \alpha_{n} k_{n},$$
where η is a multilayer perceptron. The topic-augmented network differs from a traditional attention network in that the parameter matrix W Z is introduced to reduce the effects of different words with similar cosine similarities on the weight calculation. The topic embeddings of the utterances in C are denoted as {O z1 , ..., O zm } and considered as a topic representation matrix O Z . Fig. 3 presents the architecture of Anna. In the next section, we will elaborate on how O C and O Z jointly affect the likelihood of the response sequence in the decoding phase.
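A minimal sketch of such an MLP-scored combination is given below. The exact form of η and how W Z enters the score are not fully specified in the text, so the concatenation of the projected topic-word embedding with an utterance representation, and the one-hidden-layer MLP, are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topic_embedding(K_emb, U_rep, W_z, w1, w2):
    """K_emb: (l, c) topic-word embeddings; U_rep: (c,) utterance
    representation; W_z: (c, c) projection; w1/w2: MLP weights (eta).
    Returns the weighted sum of topic-word embeddings."""
    scores = []
    for k in K_emb:
        # eta: score each (projected) topic word against the utterance
        hid = np.tanh(w1 @ np.concatenate([k @ W_z, U_rep]))
        scores.append(w2 @ hid)                   # scalar score per word
    alpha = softmax(np.array(scores))             # combination weights
    return alpha @ K_emb                          # topic embedding (c,)

rng = np.random.default_rng(0)
l, c, hdim = 6, 8, 10
O_z = topic_embedding(rng.standard_normal((l, c)),
                      rng.standard_normal(c),
                      rng.standard_normal((c, c)),
                      rng.standard_normal((hdim, 2 * c)),
                      rng.standard_normal(hdim))
print(O_z.shape)  # (8,)
```

The softmax guarantees the weights sum to one, so the topic embedding always lies in the convex hull of the topic-word embeddings.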

Decoder
For training, the hidden vectors of the response R are calculated using Eq. (7) and Eq. (8).
A mask matrix operator is applied to R to obtain the response representation O R . The mask operation prevents positions from attending to their subsequent positions. Another multi-head attention layer then takes O R as queries and (O C ⊕ O Z ) as keys and values to produce the decoder hidden vectors O d . The probability distribution P (y 1 , ..., y n | C, Z) is defined as
$$P(y_{1}, \ldots, y_{n} \mid C, Z) = \prod_{i=1}^{n} P(y_{i} \mid O_{d_i}, y_{1}, \ldots, y_{i-1}),$$
where P (y i | O di , y 1 , ..., y i−1 ) is given by
$$P(y_{i} \mid O_{d_i}, y_{1}, \ldots, y_{i-1}) = \mathrm{softmax}(WE^{\top} O_{d_i}) \cdot \Psi_{y_i}.$$
W E is a shared weight matrix composed of learned word embeddings, which is similar to the method proposed in (Press and Wolf 2017).
where V is the response vocabulary size. Ψ yi is a one-hot vector for y i in the response vocabulary.
Suppose that k (k < n) elements in the set of topic words Z appear in response R, meaning that k words in the response are topic words. The order in which these topic words appear in R is denoted as (y ′ 1 , ..., y ′ k ). Because the positions of the topic words appearing in R are random, O dx is an unfixed element.
For example, if the topic word y ′ 1 is the third word in R, its hidden vector O dx is O d3 . If the topic word y ′ 2 is the seventh word in R, its hidden vector O dx is O d7 . Therefore, the index of O dx depends on where the topic word appears in R. The distribution P (y ′ 1 , ..., y ′ k | C, Z) is defined as
$$P(y'_{1}, \ldots, y'_{k} \mid C, Z) = \prod_{i=1}^{k} P(y'_{i} \mid O_{d_x}),$$
where
$$P(y'_{i} \mid O_{d_x}) = \mathrm{softmax}(W O_{d_x}) \cdot \Psi_{y'_i}.$$
W is a parameter matrix that converts O dx into vectors of dimension H, where H is the topic vocabulary size. Ψ y ′ i is a one-hot vector for y ′ i in the topic vocabulary. We let Θ denote the parameter set of Anna and estimate Θ from the corpus by minimizing the following loss function:
$$L(\Theta) = -\frac{1}{N} \sum \left[ \sum_{i=1}^{n} \log P(y_{i} \mid O_{d_i}, y_{1}, \ldots, y_{i-1}) + \sum_{i=1}^{k} \log P(y'_{i} \mid O_{d_x}) \right],$$
where N refers to the number of dialogues in the corpus C .
One can see that the probability distributions of the response vocabulary and the topic vocabulary jointly determine the parameters of Anna through the above two loss items. Similarly, this joint probability distribution is used for inference. Suppose that the generated response is G = {g 1 , ..., g t }. Then, the generation probability P (g i ) is defined by
$$P(g_{i}) = \mathrm{softmax}(WE^{\top} O_{d_i}) \cdot \Psi_{g_i} + \mathrm{softmax}(W O_{d_i})\, \Psi_{mat}\, \Psi_{g_i},$$
where Ψ mat is a one-hot matrix that projects the generation probability of the topic vocabulary into the corresponding position in the response vocabulary. The columns of Ψ mat corresponding to words in the response vocabulary that do not appear in the topic vocabulary all have values of zero. Therefore, if g i is a topic word, then P (g i ) is the joint generation probability from the response vocabulary and the topic vocabulary. If g i is not a topic word, then P (g i ) is the generation probability of the response vocabulary. Therefore, topic words have a greater likelihood of appearing. For inference, the response words generated in each step are fed into the decoder as queries in the next time step.
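The role of the projection matrix Ψ mat can be illustrated numerically. Here `topic_to_resp` plays the role of Ψ mat by mapping topic-vocabulary indices to their response-vocabulary positions; the final renormalization is an assumption made so the combined result is a valid distribution.

```python
import numpy as np

def joint_generation_prob(p_resp, p_topic, topic_to_resp):
    """p_resp: (V,) distribution over the response vocabulary;
    p_topic: (H,) distribution over the topic vocabulary;
    topic_to_resp: list mapping topic-vocab index -> response-vocab index
    (the role of the one-hot matrix Psi_mat). Topic-word probabilities are
    added at their response-vocab positions, then the whole distribution
    is renormalized."""
    p = p_resp.copy()
    for t_idx, r_idx in enumerate(topic_to_resp):
        p[r_idx] += p_topic[t_idx]      # boost topic-word positions
    return p / p.sum()

rng = np.random.default_rng(0)
p_resp = rng.random(10); p_resp /= p_resp.sum()    # toy response dist.
p_topic = rng.random(3); p_topic /= p_topic.sum()  # toy topic dist.
p = joint_generation_prob(p_resp, p_topic, [2, 5, 7])
print(round(p.sum(), 6))  # 1.0
```

Only the three mapped positions (2, 5, 7 here) receive extra mass, which is exactly why topic words become more likely to appear in the generated response.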

Experiments
We comprehensively evaluated the quality of the responses generated by Anna and baseline models through automatic evaluations and human evaluations. Subsequently, we further investigated the effectiveness and superiority of Anna.

Datasets
In our experiments, Anna was trained on both Chinese and English public multi-turn dialogue datasets, which were extracted from open-domain social dialogue. Table 1 presents the statistics of the datasets. The details are discussed in the following subsections.

Chinese Dataset
The Chinese dataset was collected from the Douban Conversation Corpus, which was crawled from the Chinese social networking site Douban Group. For each dialogue, the last turn is considered as the response and the previous turns are considered as the context.
We performed meticulous preprocessing on these context-response pairs. Specifically, duplicates were first removed because they would dominate the generated results if they were included in the training data. We then used the Jieba tool 5 for word segmentation and removed pairs with any turn longer than 50 words. Finally, long pairs were split into multiple instances to make the length of each pair less than 20 turns. After preprocessing, we randomly split the pairs into training, validation, and testing sets containing 800,000, 10,000, and 1,000 pairs, respectively.
We separately constructed two vocabulary sets that were formed from all words appearing in the training data. The response vocabulary contained 56,330 words. The topic vocabulary contained 5,000 words. Words outside the response vocabulary were denoted as "UNK."

English Dataset
The English dataset was extracted from DailyDialog (Li et al. 2017). The raw data in DailyDialog were crawled from websites focusing on English dialogue practice and resemble human communications in typical scenarios. The English dataset was preprocessed in a manner similar to the Chinese dataset. We randomly split the pairs into training, validation, and testing sets that contained 74,083, 1,500, and 1,000 pairs, respectively. We also constructed two vocabulary sets. The response vocabulary contained 20,251 words and the topic vocabulary contained 1,004 words. Words outside the response vocabulary were denoted as "UNK." In contrast to previous works, contexts and responses share a vocabulary set in Anna, and the topic vocabulary can also be considered as a subset of the response vocabulary.

Baselines
The following four baseline methods were considered in our experiments: HRED, VHRED, ReCoSa, and PLATO-2, as introduced above.

Evaluation Metrics
For automatic evaluation, we employed perplexity, distinctness, parameter size, and decoding time. For human evaluation, we recruited human annotators to perform side-by-side judgment.

Perplexity (PPL)
Perplexity is a measure of how well a model generates a response. A lower perplexity generally indicates that the probability distribution learned by the model is close to that of the training data. Therefore, perplexity is typically used to evaluate whether a generated response is human-like in terms of syntax and semantics. The perplexity of a response R is defined as
$$PPL = \exp\!\left(-\frac{1}{n} \sum_{i=1}^{n} \log P(y_{i})\right),$$
where − ∑ log(P (y i )) is the cross-entropy term. Therefore, we obtain the PPL by exponentiating the loss with base e. In our experiments, the PPL for the validation sets was used to determine when to stop training and the PPL for the testing sets was used for evaluation.
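The definition above, the exponential of the per-token cross-entropy, can be computed directly:

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/n) * sum_i log P(y_i)): the exponential of the
    average negative log-probability assigned to each response token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# a model assigning probability 0.25 to every token has PPL 4:
# it is "as confused" as a uniform choice among 4 candidates per step
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

This also explains why the validation PPL is a natural early-stopping signal: it tracks the same quantity the loss optimizes.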

Distinctness
We evaluated the degrees of informativeness and diversity of the generated responses using metrics denoted as distinct-1 and distinct-2. These metrics were calculated as the numbers of distinct unigrams and bigrams divided by the total number of generated words, respectively.
Higher ratios indicate that the generated responses carry more content, which helps keep a dialogue going.
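A direct implementation of these metrics, dividing by the total number of generated words as stated above:

```python
def distinct_n(responses, n):
    """distinct-n = (# distinct n-grams) / (total # generated words),
    computed over all generated responses (lists of tokens)."""
    total_words = sum(len(toks) for toks in responses)
    ngrams = set()
    for toks in responses:
        for i in range(len(toks) - n + 1):
            ngrams.add(tuple(toks[i:i + n]))
    return len(ngrams) / max(total_words, 1)

resps = [["i", "like", "tea"], ["i", "like", "coffee"]]
print(distinct_n(resps, 1))  # 4 distinct unigrams / 6 words ~ 0.667
print(distinct_n(resps, 2))  # 3 distinct bigrams / 6 words = 0.5
```

Generic responses such as "Okay I know" repeated across the test set collapse the numerator, so models that produce them score poorly on both metrics.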

Parameter Size and Decoding Time
A model should use as few parameters as possible to maximize efficiency. Therefore, parameter size is a helpful metric for evaluating the performance of a model. The number of all trainable variables in a model is considered as the parameter size.
The length of the decoding (response) time is also an important indicator for assessing response quality. A long response time may negatively affect user experiences when a model is applied in daily life. The average response time per word was calculated as the decoding time.

Human Annotation
For human evaluations, we recruited six native speakers (three Chinese speakers and three English speakers) as human annotators. We randomly sampled 100 pairs from the Chinese and English testing datasets, and each annotator was asked to judge which response was better between Anna and the baselines. The responses were selected using a greedy search and randomly shuffled. The criteria for response A being better than B were defined as follows: A outperforms B on criteria (1) and (2), or A and B both satisfy (1) and (2) but A is more informative and interesting than B. If an annotator could not judge which response was better, they were required to record a tie.

Implementation Details
In our model, word embeddings were learned by a network initialized using the Xavier method (Glorot and Bengio 2010). The number of hidden nodes was set to 512 and the number of attention heads for Anna was set to eight. The number of topics for pre-training the LDA model was set to 100 and the number of topic words was set to 50 for each topic. The open-source platform TensorFlow-Gpu 1.14.1 was utilized for model training. The Adam algorithm was utilized for optimization with a learning rate of 0.0002. The batch size was set to 64. For fair comparisons between the baselines and Anna, the maximum number of turns for all dialogues was set to 10 and the maximum length of each utterance was set to 50. We executed all the baselines on an NVIDIA Quadro RTX8000 GPU. All models were tuned using the validation sets based on perplexity. If the perplexity did not drop for five consecutive epochs, we considered that the algorithm had converged and terminated training.

Evaluation Results

Regarding decoding time, our model is the fastest, meaning that Anna can generate responses more rapidly than the baselines. Regarding the human annotation results, the win-loss ratios were calculated by combining the annotations from the judges, as shown in Table 3 (the rates of wins, losses, and ties are rounded). The win percentage is always greater than the loss percentage for Anna, meaning our model generates higher-quality responses than the baselines. All the Kappa values exceed 0.46, indicating moderate agreement among annotators. Among the four baselines, the RNN-based models HRED and VHRED perform worse than the attention-based models ReCoSa and PLATO-2 because they lose information regarding the segments of the context that should receive focus. PLATO-2 is the closest to our model, but the average parameter size of Anna is one-fifth of that of PLATO-2. In other words, because we model context and topic information effectively, Anna achieves good performance even though it uses far fewer parameters. We conducted a statistical test on the results of the automatic evaluations and human annotations. The results revealed significant differences between our model and the baselines on both the Chinese and English datasets (Anna versus ReCoSa for human annotation on the English dataset has a p-value < 0.05; the other comparisons have p-values < 0.01). We also scrutinized the cases in which PLATO-2 lost to Anna. In 22 cases, PLATO-2 generated irrelevant responses, and in 28 cases it generated generic responses, whereas the responses generated by Anna were more relevant and diverse.

Discussions
In this section, some case studies are presented to facilitate a better understanding of Anna. To further verify the effects of the different components of Anna, we discuss the following topics: (Q1) how hierarchical self-attention lets Anna produce high-quality responses, (Q2) how the utterance-level and context-level encoders affect performance differently, and (Q3) how the topic-augmented network affects the performance of our model. Finally, analysis and additional experiments are presented to compare our model to the method proposed in (Zhao et al. 2020).

Case Study
Fig. 4 presents four cases extracted from the English testing dataset used to compare Anna to the baselines. One can see that Anna provides answers that are not only relevant to the context, but also diverse. In case 1, topic information provides the prior knowledge that overseas study typically gives the impression of loneliness and is hard for people, which helps generate a response that targets the topic of the dialogue (i.e., "study abroad," "lonely," and "friends").
In case 2, omitted mentions appear in the last turn (i.e., drinking beer will make a person fat and act stupidly). Clearly, Anna captured these dependencies and generated a proper response. In contrast, HRED and VHRED generate irrelevant responses. Although the responses from ReCoSa and PLATO-2 also echo the context, they carry little information. In case 3, Anna notices that sports are mentioned in u1, so it naturally asks the question "which sports do you like to play?" This response can extend the dialogue. In case 4, Anna associates travel and shopping in the context with relevant topic elements such as saving money, allowing it to generate an intelligent response of "I need to save money for travel." In contrast, the responses from the baselines ignore the topic information and are less specific regarding going to travel. The manner in which Anna decides which utterances and words in the context are important will be explained in Section 5.6.2.

Answer to Q1
We went one step further to analyze how Anna produces high-quality responses by visualizing the hierarchical self-attention network. The results are presented in Fig. 5. In each row and the leftmost column, a darker color indicates greater importance, which is computed from the average weight assigned by the hierarchical self-attention network. This visualization can represent the overall contributions of words and utterances. Responses are given at the bottom of each subfigure. Regarding the implementation of this visualization, we followed the method in (Xing et al. 2018). One can see that the hierarchical self-attention network pays more attention to important parts of the context. In Fig. 5(b), the word "beers" in u1 is more highlighted than the other words because the hierarchical self-attention network determined that all of the other words were strongly dependent on "beers." In this context, u1 and u4 are more important than the other elements and the word "fat" in u4 is the most important, which echoes the word "beers" in u1. This explains why the omitted information in u4 does not degrade the answer quality.
Instead, Anna answered with the phrase "drinking a little red wine," which is closely related to the context. Additionally, the highlighted word "it" in u3 refers to the "beers" mentioned in u1. The hierarchical self-attention network is well aware of this co-reference in the context.
In Fig. 5(c), the hierarchical self-attention network assigns greater weights to u2 and u3, which contain words such as "sports," "football," "game," and "courts." This leads to a response that asks about favorite sports. Overall, the parts of the context that the hierarchical self-attention network focuses on are consistent with human intuition.

Answer to Q2
We kept the main architecture of Anna and replaced the utterance-level encoder and context-level encoder with an RNN, denoting these models as "w/o UL" and "w/o CL," respectively. Additionally, the hierarchical self-attention network was completely replaced with a hierarchical RNN. We denote this model as "w/o HS." We employed PPL as an automatic evaluation metric and conducted human judgment between these models and the full Anna model on both the Chinese and English testing datasets. Because a hierarchical RNN cannot adequately capture the dependencies in a context, "w/o HS" is clearly inferior to the other models in terms of both automatic evaluations and human evaluations.

Answer to Q3
We also used model ablation to confirm the effects of the topic-augmented network in Anna.
The components of the topic loss and the extra topic probability were removed from Anna. Specifically, we replaced (O C ⊕ O Z ) in Eq. (15) with O C , removed Eqs. (21) and (24) from Eqs. (22) and (25), and denoted the resulting model as "w/o TA." We performed automatic evaluation and human judgment between "w/o TA" and the full Anna model on both the Chinese and English testing datasets. ReCoSa was also added to the judgment for longitudinal comparisons to the baselines. Table 5 reports the results of the model ablation. One can see that removing the topic-augmented network causes a performance drop for Anna, but "w/o TA" is still ahead of ReCoSa in terms of human judgment. This indicates that the hierarchical self-attention network contributes more to the improvement of the entire model than the topic-augmented network. Nevertheless, the topic-augmented network is useful for enriching generated responses when contextual dependencies are properly modeled.

Comparison of Anna to Zhao et al.'s Method
Our model possesses the following advantages.
(1) The effectiveness of Anna has been verified using two languages (English and Chinese) from different language families. Therefore, our model is more versatile than theirs. (2) Their model truncates each utterance to 25 words, whereas Anna does not. We implemented their model 7 and performed a comparative trial by setting the maximum utterance length of their model to the same value used in Anna. Fig. 6 presents the decoding times of both models on the English testing dataset when the maximum utterance length varies from 25 words to 60 words. The average decoding time of Anna is 60% of that of their model for a given utterance length. This reveals that the response speed of our model is faster than that of their model. As the utterance length increases, the gap tends to widen.
The parameter size of their model is 27.3 M on the English dataset and 64.3 M on the Chinese dataset in our implementation. Anna has more parameters than their model because Anna only reduces the parameter size marginally; minimizing parameters was not our main motivation. According to the three primary challenges mentioned in the introduction, we aimed to develop a balanced modeling design. Accordingly, Anna uses a reduced number of parameters while ensuring that contextual dependencies and topic information are effectively modeled. In summary, Anna surpasses the method from (Zhao et al. 2020) in terms of comprehensive performance according to the above analysis.
Fig. 6 Decoding time comparisons.
7 We implemented a transformer architecture but did not implement the auxiliary tasks due to time limitations.

Conclusions and Future Work
This paper presented a novel multi-turn dialogue generation model called Anna. Experimental results demonstrated that Anna achieves substantial improvements over state-of-the-art models.
The effectiveness of our model on both English and Chinese dialogue was verified. Further analysis revealed that Anna focuses on the important parts of contexts, which is consistent with human intuition. All of the components of Anna are useful. Our model successfully addresses the three primary challenges mentioned in the introduction. It can be concluded that contextual dependencies combined with topic information are useful for improving the quality of multi-turn dialogue response generation.
In future work, the following directions will be considered.
(1) Through the analysis of generation failure cases, we determined that the fluency of responses is affected by the excessive generation probability of the topic vocabulary in a few cases. Therefore, the tradeoff between topic information and the fluency of generated responses should be studied.
(2) Anna utilizes the traditional LDA model for topic mining, but Twitter LDA (Zhao et al. 2011) should also be considered. Twitter LDA belongs to a family of probabilistic topic models and its strength lies in the topic detection of short texts that are more likely to focus on one topic. Therefore, Twitter LDA may be more suitable for topic mining in multi-turn dialogue.
(3) A greedy search was employed to generate responses in this study, but the beam search (Tillmann and Ney 2003) could be incorporated into our method.
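As a sketch of direction (3), a generic beam search over a step function that returns next-token log-probabilities might look like the following. The toy model, vocabulary, and end-of-sequence handling are illustrative assumptions, not part of Anna.

```python
import math

def beam_search(step_fn, vocab_size, beam_width, max_len, eos):
    """Generic beam search: step_fn(prefix) returns a list of vocab_size
    log-probabilities for the next token. Keeps the beam_width best
    partial hypotheses by total log-probability; hypotheses ending in
    eos are carried forward unchanged."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:     # finished hypothesis
                candidates.append((prefix, score))
                continue
            logps = step_fn(prefix)
            for tok in range(vocab_size):
                candidates.append((prefix + [tok], score + logps[tok]))
        beams = sorted(candidates, key=lambda c: c[1],
                       reverse=True)[:beam_width]
    return beams[0][0]

# toy model: token (len(prefix) % 3) is always most likely; eos = 2
def toy_step(prefix):
    logp = [math.log(0.1)] * 3
    logp[len(prefix) % 3] = math.log(0.8)
    return logp

print(beam_search(toy_step, 3, beam_width=2, max_len=4, eos=2))
# [0, 1, 2]
```

Unlike the greedy search used in this study, the beam keeps several hypotheses alive, so a locally sub-optimal token can still lead to the globally best response.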