publications


2024

  1. StyEmp: Stylizing Empathetic Response Generation via Multi-Grained Prefix Encoder and Personality Reinforcement

    In Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue.

    Recent approaches for empathetic response generation mainly focus on emotional resonance and user understanding, without considering the system's personality. Consistent personality is evident in real human expression and is important for creating trustworthy systems. To address this problem, we propose StyEmp, which aims to stylize empathetic response generation with a consistent personality. Specifically, it incorporates a multi-grained prefix mechanism designed to capture the intricate relationship between a system's personality and its empathetic expressions. Furthermore, we introduce a personality reinforcement module that leverages contrastive learning to calibrate the generation model, ensuring that responses are both empathetic and reflective of a distinct personality. Automatic and human evaluations on the EMPATHETICDIALOGUES benchmark show that StyEmp outperforms competitive baselines in terms of both empathy and personality expressions.
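
    The abstract above names a contrastive calibration objective; as a rough, hypothetical sketch (not the StyEmp implementation, and with illustrative tensor names), such an objective could pull each response representation toward its target personality representation and push it away from the others, InfoNCE-style:

    ```python
    import torch
    import torch.nn.functional as F

    def personality_contrastive_loss(resp_emb, persona_emb, temperature=0.1):
        """InfoNCE-style loss: the i-th response should match the i-th personality.

        resp_emb:    (batch, dim) response representations from the generator
        persona_emb: (batch, dim) target personality representations
        """
        resp = F.normalize(resp_emb, dim=-1)
        pers = F.normalize(persona_emb, dim=-1)
        logits = resp @ pers.t() / temperature   # (batch, batch) similarity matrix
        targets = torch.arange(resp.size(0))     # positives lie on the diagonal
        return F.cross_entropy(logits, targets)

    # toy usage with random embeddings
    loss = personality_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    ```
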
  2. Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks

    In The 14th International Workshop on Spoken Dialogue Systems Technology, Sapporo, Japan. (Oral)

    Personality recognition is useful for enhancing robots' ability to tailor user-adaptive responses, thus fostering rich human-robot interactions. One of the challenges in this task is a limited number of speakers in existing dialogue corpora, which hampers the development of robust, speaker-independent personality recognition models. Additionally, accurately modeling both the interdependencies among interlocutors and the intra-dependencies within the speaker in dialogues remains a significant issue. To address the first challenge, we introduce personality trait interpolation for speaker data augmentation. For the second, we propose heterogeneous conversational graph networks to independently capture both contextual influences and inherent personality traits. Evaluations on the RealPersonaChat corpus demonstrate our method's significant improvements over existing baselines.
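
    As a hedged illustration of the speaker-augmentation idea (not the paper's code; function and variable names are assumptions), Big Five trait vectors and the utterance features of two real speakers can be blended mixup-style to synthesize an additional speaker:

    ```python
    import numpy as np

    def interpolate_speakers(traits_a, traits_b, feats_a, feats_b, alpha=0.4):
        """Synthesize an augmented speaker by linear interpolation.

        traits_*: (5,) Big Five trait vectors of two real speakers
        feats_*:  (n_utt, dim) utterance-level features of each speaker
        """
        lam = np.random.beta(alpha, alpha)              # mixing coefficient in (0, 1)
        new_traits = lam * traits_a + (1 - lam) * traits_b
        # pair utterances up to the shorter speaker and blend their features
        n = min(len(feats_a), len(feats_b))
        new_feats = lam * feats_a[:n] + (1 - lam) * feats_b[:n]
        return new_traits, new_feats

    traits, feats = interpolate_speakers(np.random.rand(5), np.random.rand(5),
                                         np.random.rand(10, 768), np.random.rand(12, 768))
    ```
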
  3. Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue

    In The 14th International Workshop on Spoken Dialogue Systems Technology, Sapporo, Japan. (Oral)

    In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces a framework designed to engender empathetic dialogue with validating responses. Our approach incorporates a tripartite module system: 1) validation timing detection, 2) users' emotional state identification, and 3) validating response generation. Utilizing the Japanese EmpatheticDialogues dataset - a text-based dialogue dataset covering 8 emotional categories from Plutchik's wheel of emotions - our Task Adaptive Pre-Training (TAPT) BERT-based model outperforms both a random baseline and ChatGPT in terms of F1-score in all modules. Further validation of our model's efficacy is confirmed in its application to the TUT Emotional Storytelling Corpus (TESC), a speech-based dialogue dataset, where it surpasses both the random baseline and ChatGPT. This consistent performance across both textual and speech-based dialogues underscores the effectiveness of our framework in fostering empathetic human-AI communication.
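
    A minimal sketch of how the tripartite pipeline could be wired together, assuming placeholder callables for the three modules (these interfaces are hypothetical, not the released models):

    ```python
    def respond(context, timing_clf, emotion_clf, generator):
        """Chain the three modules described above.

        timing_clf / emotion_clf: classifiers fine-tuned from a TAPT BERT encoder
        generator:                conditional response generator
        (all three are placeholders here)
        """
        if not timing_clf(context):                     # 1) validation timing detection
            return generator(context)                   #    fall back to a plain reply
        emotion = emotion_clf(context)                  # 2) emotional state identification
        return generator(context, acknowledge=emotion)  # 3) validating response generation
    ```
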

2023

  1. Dual variational generative model and auxiliary retrieval for empathetic response generation by conversational robot.

    Advanced Robotics, 37(21), 1406–1418, 2023.

    Empathy in human-robot conversations aims to endow the robot with the ability to comprehend user emotion and experience, and then respond to it appropriately. Generally, empathy is embodied in the aspects of both contextual understanding and affective expression, which occur when there exist content and emotion consistencies between context and response. However, previous studies only focus on either aspect. In this paper, we propose a dual variational generative model (DVG) for empathetic response generation to achieve both. Specifically, we integrate an emotion classifier and a variational autoencoder (VAE) into a dual response and context generative model to learn the emotion and content consistencies efficiently. DVG utilizes the VAE to mimic the process of context/response understanding. In addition to the generative model, our model can effectively switch to a retrieval system as a fallback solution. Automatic and human evaluations on Japanese and English EmpatheticDialogue datasets demonstrate the effectiveness of our method for empathetic response generation. Furthermore, we evaluate our model's ability in general response generation, showing that it is not limited to empathetic dialogue but also applies to chit-chat dialogue systems.
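
    Purely as an illustration of the kind of training objective such a model combines (not the DVG implementation; weights and shapes are assumptions), the loss can be pictured as reconstruction plus the VAE's KL term plus an emotion-consistency term:

    ```python
    import torch
    import torch.nn.functional as F

    def dvg_style_loss(recon_logits, target_ids, mu, logvar,
                       emo_logits, emo_labels, kl_weight=0.1, emo_weight=1.0):
        """Combine response reconstruction, the VAE KL term, and emotion consistency.

        recon_logits: (batch, seq, vocab) decoder logits; target_ids: (batch, seq)
        mu / logvar:  (batch, latent) parameters of the approximate posterior
        """
        recon = F.cross_entropy(recon_logits.transpose(1, 2), target_ids)   # token-level NLL
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())       # KL(q(z|x) || N(0, I))
        emo = F.cross_entropy(emo_logits, emo_labels)                       # emotion classifier loss
        return recon + kl_weight * kl + emo_weight * emo
    ```
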
  2. Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation.

    In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 645–656, Prague, Czechia. Association for Computational Linguistics. (Oral)

    Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user's experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user's perspective, ignoring the system's perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user's perspective (user's desires and reactions) and the system's perspective (system's intentions and reactions). We enhance ChatGPT's ability to reason for the system's perspective by integrating in-context learning with commonsense knowledge. Then, we integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.
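
    A hypothetical sketch of how a few-shot prompt for the four causality dimensions might be assembled (the paper's actual prompts and exemplars are not reproduced here):

    ```python
    def build_causality_prompt(dialogue, exemplars):
        """Assemble a few-shot prompt asking for user/system desires, reactions, and intentions.

        exemplars: list of (dialogue, explanation) pairs used for in-context learning
        """
        instruction = (
            "Given the dialogue, explain: (1) the user's desire, (2) the user's reaction, "
            "(3) an appropriate system intention, and (4) the system's likely reaction."
        )
        shots = "\n\n".join(f"Dialogue:\n{d}\nExplanation:\n{e}" for d, e in exemplars)
        return f"{instruction}\n\n{shots}\n\nDialogue:\n{dialogue}\nExplanation:\n"
    ```
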
  3. Causality Reasoning for Empathy-Enriched and Personality-Conditioned Spoken Dialogue System.
    Yahui Fu

    In Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems, pages 62–63, Prague, Czechia. Association for Computational Linguistics.

    The author’s objective centers around developing a spoken dialogue system (SDS) that can emulate the cognitive and conversational qualities of a human friend. Key attributes such as empathy, knowledge/causality reasoning, and personality are integral components of human interaction. The proposed approach involves the creation of an Empathy-enriched SDS, capable of comprehending human emotions and circumstances, thus providing companionship and assistance akin to a trusted friend. Additionally, the Causality-reasoning for SDS aims to ground the system in commonsense knowledge and equip it with the ability to reason about causalities, such as predicting user desires/reactions and system intentions/reactions, thereby enhancing the system’s intelligence and human-like behavior. Finally, the concept of a Personality-conditioned SDS involves enabling systems to exhibit distinct personalities, further enhancing the naturalness of human-robot interaction.
  4. Improving Empathetic Response Generation with Retrieval based on Emotion Recognition.

    In The 13th International Workshop on Spoken Dialogue Systems Technology, Los Angeles, USA.

    Endowing a robot with the ability to express empathy is crucial for building a human-like dialogue system. We propose to improve empathetic response generation with retrieval based on emotion recognition. In addition to a generative model, our model can effectively switch to the retrieval system based on the context and emotion of the user utterances. We incorporate an emotion classifier on top of the context encoder and use context encoding representation to select retrieval responses. Furthermore, it is straightforward to combine our model with the multimodal facial expression of the virtual agent for vivid empathy. Automatic and human evaluations on the Japanese EmpatheticDialogue dataset demonstrate that compared with the solely generative model, our model can generate empathetic responses with more diversity and better scores on the aspects of Empathy, Relevance, and Fluency. Implementing our model on the autonomous android ERICA further demonstrates the effectiveness and adaptivity of our method in achieving an empathetic attentive listening system.
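
    A simplified sketch of the generate-or-retrieve switch (not the deployed ERICA system; the threshold and interfaces are assumptions): the shared context encoding drives both the emotion classifier and a nearest-neighbour lookup over stored contexts, and a stored human response is reused when it is similar enough:

    ```python
    import torch
    import torch.nn.functional as F

    def generate_or_retrieve(ctx_vec, emo_logits, index_vecs, index_responses,
                             generator, threshold=0.8):
        """ctx_vec: (dim,) context encoding; index_vecs: (n, dim) encodings of stored contexts."""
        sims = F.cosine_similarity(ctx_vec.unsqueeze(0), index_vecs)   # (n,) similarity to the index
        best = torch.argmax(sims)
        if sims[best] >= threshold:              # close enough: reuse the stored response
            return index_responses[best]
        return generator(ctx_vec, emo_logits)    # otherwise generate, conditioned on emotion
    ```
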

2022

  1. Context- and Knowledge-Aware Graph Convolutional Network for Multimodal Emotion Recognition.

    IEEE MultiMedia, 29(3), 91–100, 2022.

    This work proposes an approach for emotion recognition in conversation that leverages context modeling, knowledge enrichment, and multimodal (text and audio) learning based on a graph convolutional network (GCN). We first construct two distinctive graphs for modeling the contextual interaction and knowledge dynamics. We then introduce an affective lexicon into knowledge graph building to enrich the emotional polarity of each concept, that is, the related knowledge of each token in an utterance. Then, we achieve a balance between the context and the affect-enriched knowledge by incorporating them into the new adjacency matrix construction of the GCN architecture, and train them jointly with multiple modalities to effectively structure the semantics-sensitive and knowledge-sensitive contextual dependence of each conversation. Our model outperforms the state-of-the-art benchmarks by over 22.6% and 11% relative error reduction in terms of weighted F1 on the IEMOCAP and MELD databases, respectively, demonstrating the superiority of our method in emotion recognition.
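
    As a rough illustration of the adjacency-balancing idea (not the paper's code; the mixing weight and the cosine-similarity edges are assumptions), context-based and knowledge-based edges between utterances can be blended into one adjacency matrix and fed to a standard GCN layer:

    ```python
    import torch
    import torch.nn.functional as F

    def combined_adjacency(ctx_feats, know_feats, lam=0.5):
        """Blend context-based and knowledge-based edges between utterances.

        ctx_feats / know_feats: (n_utt, dim) utterance representations
        """
        a_ctx = F.relu(F.cosine_similarity(ctx_feats.unsqueeze(1), ctx_feats.unsqueeze(0), dim=-1))
        a_know = F.relu(F.cosine_similarity(know_feats.unsqueeze(1), know_feats.unsqueeze(0), dim=-1))
        return lam * a_ctx + (1 - lam) * a_know

    def gcn_layer(adj, feats, weight):
        """One propagation step: symmetric normalization followed by a linear map."""
        adj = adj + torch.eye(adj.size(0))                      # add self-loops
        deg_inv_sqrt = adj.sum(-1).clamp(min=1e-6).pow(-0.5)
        adj_norm = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        return F.relu(adj_norm @ feats @ weight)
    ```
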
  2. Emotion recognition with multimodal transformer fusion framework based on acoustic and lexical information.
    Lili Guo, Longbiao Wang, Jianwu Dang, Yahui Fu, Jiaxing Liu, Shifei Ding.

    IEEE MultiMedia, 29(2), 94–103, 2022.

    People usually express emotions through paralinguistic and linguistic information in speech. How to effectively integrate linguistic and paralinguistic information for emotion recognition is a challenge. Previous studies have adopted the bidirectional long short-term memory (BLSTM) network to extract acoustic and lexical representations followed by a concatenation layer, and this has become a common method. However, the interaction and influence between different modalities are difficult to promote using simple feature fusion for each sentence. In this article, we propose an implicitly aligned multimodal transformer fusion (IA-MMTF) framework based on acoustic features and text information. This model enables the two modalities to guide and complement each other when learning emotional representations. Thereafter, weighted fusion is used to control the contributions of different modalities, so that we can obtain more complementary emotional representations. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database and the Multimodal EmotionLines Dataset (MELD) show that the proposed method outperforms the baseline BLSTM-based method.
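
    A compact sketch of the two ingredients described above, cross-modal attention followed by a learned weighting of the two streams (dimensions and module names are assumptions, not the IA-MMTF code):

    ```python
    import torch
    import torch.nn as nn

    class CrossModalWeightedFusion(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio attends to text
            self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # text attends to audio
            self.gate = nn.Linear(2 * dim, 2)                               # learned modality weights

        def forward(self, audio, text):
            """audio, text: (batch, seq, dim) unimodal representations."""
            a, _ = self.a2t(audio, text, text)          # audio enriched by text
            t, _ = self.t2a(text, audio, audio)         # text enriched by audio
            a, t = a.mean(dim=1), t.mean(dim=1)         # pool over time
            w = torch.softmax(self.gate(torch.cat([a, t], dim=-1)), dim=-1)
            return w[:, :1] * a + w[:, 1:] * t          # weighted fusion of the two streams

    fused = CrossModalWeightedFusion()(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
    ```
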

2021

  1. CONSK-GCN: conversational semantic- and knowledge-oriented graph convolutional network for multimodal emotion recognition.

    In 2021 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE. (Oral)

    Emotion recognition in conversations (ERC) has received significant attention in recent years due to its widespread applications in diverse areas, such as social media, health care, and artificial intelligence interactions. However, different from nonconversational text, it is particularly challenging to model the effective context-aware dependence for the task of ERC. To address this problem, we propose a new Conversational Semantic- and Knowledge-oriented Graph Convolutional Network (ConSK-GCN) approach that leverages both semantic dependence and commonsense knowledge. First, we construct the contextual inter-interaction and intra-dependence of the interlocutors via a conversational graph-based convolutional network based on multimodal representations. Second, we incorporate commonsense knowledge to guide ConSK-GCN to model the semantic-sensitive and knowledge-sensitive contextual dependence. The results of extensive experiments show that the proposed method outperforms the current state of the art on the IEMOCAP dataset.
  2. Multimodal emotion recognition with capsule graph convolutional based representation fusion.

    In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6339-6343. IEEE.

    Owing to its more robust characteristics compared with unimodal approaches, audio-video multimodal emotion recognition (MER) has attracted a lot of attention. The efficiency of the representation fusion algorithm often determines the performance of MER. Although there are many fusion algorithms, information redundancy and information complementarity are usually ignored. In this paper, we propose a novel representation fusion method, the Capsule Graph Convolutional Network (CapsGCN). Firstly, after unimodal representation learning, the extracted audio and video representations are distilled by a capsule network and encapsulated into multimodal capsules. Multimodal capsules can effectively reduce data redundancy through the dynamic routing algorithm. Secondly, the multimodal capsules with their inter-relations and intra-relations are treated as a graph structure. The graph structure is learned by a graph convolutional network (GCN) to obtain a hidden representation that is a good supplement for information complementarity. Finally, the multimodal capsules and the hidden relational representation learned by the GCN are fed to multi-head self-attention to balance the contributions of the source representation and the relational representation. To verify the performance, we provide visualizations of the representations, comparisons with commonly used fusion methods, and ablation studies of the proposed CapsGCN. Our proposed fusion method achieves 80.83% accuracy and an 80.23% F1 score on the eNTERFACE'05 dataset.
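
    The post-encoding pipeline can be pictured roughly as below; this simplification skips dynamic routing, and the similarity-based adjacency and shapes are assumptions rather than the CapsGCN implementation:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CapsGraphFusionSketch(nn.Module):
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.gcn_weight = nn.Parameter(torch.randn(dim, dim) * 0.02)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, capsules):
            """capsules: (batch, n_caps, dim) multimodal capsules from audio and video."""
            adj = F.softmax(capsules @ capsules.transpose(1, 2), dim=-1)   # similarity-based edges
            relational = F.relu(adj @ capsules @ self.gcn_weight)          # one GCN propagation step
            tokens = torch.cat([capsules, relational], dim=1)              # source + relational nodes
            fused, _ = self.attn(tokens, tokens, tokens)                   # balance them with self-attention
            return fused.mean(dim=1)                                       # utterance-level representation

    out = CapsGraphFusionSketch()(torch.randn(2, 8, 128))
    ```
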
  3. A sentiment similarity-oriented attention model with multi-task learning for text-based emotion recognition.

    In MultiMedia Modeling: 27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part I 27 (pp. 278-289). Springer International Publishing.

    Emotion recognition based on text modality has been one of the major topics in the field of emotion recognition in conversation. How to extract efficient emotional features is still a challenge. Previous studies utilize contextual semantics and emotion lexicons for affect modeling. However, they ignore information that may be conveyed by the emotion labels themselves. To address this problem, we propose the sentiment similarity-oriented attention (SSOA) mechanism, which uses the semantics of emotion labels to guide the model's attention when encoding the input conversations, and thus extracts emotion-related information from sentences. Then we use a convolutional neural network (CNN) to extract complex informative features. In addition, as discrete emotions are highly related to Valence, Arousal, and Dominance (VAD) in psychophysiology, we train the VAD regression and emotion classification tasks together using multi-task learning to extract more robust features. The proposed method outperforms the benchmarks by an absolute increase of over 3.65% in terms of average F1 for the emotion classification task, and also outperforms previous strategies for the VAD regression task on the IEMOCAP database.
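
    As an illustration only (names, shapes, and the loss weighting are assumptions), sentiment similarity-oriented attention can be sketched as weighting token encodings by their similarity to emotion-label embeddings, alongside a joint classification-plus-VAD-regression objective:

    ```python
    import torch
    import torch.nn.functional as F

    def ssoa_pooling(token_feats, label_embs):
        """Weight tokens by their similarity to emotion-label embeddings.

        token_feats: (seq, dim) encoded tokens; label_embs: (n_labels, dim)
        """
        sims = token_feats @ label_embs.t()                    # (seq, n_labels)
        weights = F.softmax(sims.max(dim=-1).values, dim=0)    # per-token emotion relevance
        return weights @ token_feats                           # (dim,) attended sentence vector

    def multitask_loss(emo_logits, emo_label, vad_pred, vad_target, beta=0.5):
        """Joint emotion classification and VAD regression objective."""
        return F.cross_entropy(emo_logits, emo_label) + beta * F.mse_loss(vad_pred, vad_target)
    ```
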