TY - GEN
T1 - Learning from Human Conversations: A Seq2Seq based Multi-modal Robot Facial Expression Reaction Framework in HRI
AU - Shangguan, Zhegong
AU - Hei, Xiaoxuan
AU - Li, Fangjun
AU - Yu, Chuang
AU - Song, Siyang
AU - Zhao, Jianzhuang
AU - Cangelosi, Angelo
AU - Tapus, Adriana
PY - 2025/6/16
Y1 - 2025/6/16
AB - Nonverbal communication plays a crucial role in both human-human and human-robot interactions (HRIs), where facial expressions convey emotions, intentions, and trust. Enabling humanoid robots to generate human-like facial reactions in response to human speech and facial behaviours remains a significant challenge. In this work, we leverage human-human interaction (HHI) datasets to train a humanoid robot, allowing it to learn and imitate facial reactions to both speech and facial expression inputs. Specifically, we extend a sequence-to-sequence (Seq2Seq)-based framework that enables robots to simulate human-like virtual facial expressions appropriate for responding to perceived human user behaviours. We then propose a deep neural network-based motor mapping model to translate these expressions into physical robot movements. Experiments demonstrate that our facial reaction–motor mapping framework successfully enables robotic self-reactions to various human behaviours: our model best predicts 50 frames (two seconds) of facial reactions in response to input user behaviour of the same duration, aligning with human cognitive and neuromuscular processes.
M3 - Conference contribution
BT - 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
PB - IEEE
ER -