Talking face generation aims to synthesize realistic speaking videos from a single portrait image. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models the motion discrepancy between the source and driving frames within a unified latent space. We further introduce an Identity-Adaptive Module to preserve speaker identity and a lightweight Flow-Matching Motion Generator to produce vivid implicit motion vectors from audio. Experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and lip-sync quality, achieving state-of-the-art performance.
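As a rough illustration of the core idea, the sketch below shows how a single cross-attention layer can transfer driving motion onto source appearance tokens without any explicit flow field. All module names, dimensions, and tensor shapes here are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ImplicitMotionTransfer(nn.Module):
    """Minimal sketch (assumed shapes/dims): cross-attention that injects
    driving motion into source appearance tokens, replacing flow-based warping."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, source_tokens: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
        # source_tokens: (B, N, C) appearance latents from the source image
        # motion_tokens: (B, M, C) implicit motion vectors from the driving signal
        # Queries come from the source; keys/values carry the motion discrepancy,
        # so the attention output tells each source token how it should move.
        out, _ = self.attn(query=source_tokens, key=motion_tokens, value=motion_tokens)
        return self.norm(source_tokens + out)
```

For example, with source tokens of shape (1, 1024, 512) and motion tokens of shape (1, 16, 512), the module returns fused tokens of shape (1, 1024, 512) that a decoder could render into a frame.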
Given a source image, IMTalker extracts identity and motion features. Driving motion is obtained either from a Motion Encoder (video-driven) or from the Flow-Matching Motion Generator (audio-driven). The motion vectors are refined by the Identity-Adaptive Module and, together with the source features, passed to the Implicit Motion Transfer Module, which renders the final photorealistic frame.
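A hedged end-to-end sketch of the audio-driven pipeline is given below. Every module interface, the motion latent shape, and the Euler-style flow-matching sampler are assumptions made for illustration only.

```python
import torch

def generate_frame(source_img, audio_feat, identity_encoder, motion_generator,
                   identity_adaptive, motion_transfer, renderer, num_fm_steps=10):
    """Hypothetical audio-driven inference loop mirroring the description above."""
    id_feat = identity_encoder(source_img)             # identity / appearance latents, (B, N, C)

    # Flow-matching sampling (assumed): integrate a learned velocity field from
    # Gaussian noise toward an implicit motion vector conditioned on the audio.
    motion = torch.randn_like(id_feat[:, :1])          # assumed motion latent shape (B, 1, C)
    for i in range(num_fm_steps):
        t = torch.full((motion.size(0),), i / num_fm_steps)
        motion = motion + motion_generator(motion, t, audio_feat) / num_fm_steps

    motion = identity_adaptive(motion, id_feat)        # identity-adaptive refinement
    fused = motion_transfer(id_feat, motion)           # implicit motion transfer (cross-attention)
    return renderer(fused)                             # decoded photorealistic frame
```

In the video-driven setting, the sampling loop would simply be replaced by encoding the driving frame with the Motion Encoder.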
IMTalker generates vivid facial expressions synchronized with various audio inputs.
Sample 1
Sample 2
Sample 3
Sample 4
Self-reenactment: reconstructing a video using its own driving frames.
Cross-reenactment: driving a source identity with motion from a different person.
Explicit control over head pose and eye gaze while maintaining lip synchronization.
Pose Control
Gaze Control
Stable generation over long durations without artifact accumulation.
We compare IMTalker with state-of-the-art methods. Each video below shows a side-by-side comparison.
Comparison Sample 1
Comparison Sample 2
Comparison Sample 3
Comparison Sample 4
@article{imtalker2025,
title={IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer},
author={Bo Chen and Tao Liu and Qi Chen and Xie Chen and Zilong Zheng},
journal={arXiv preprint arXiv:2511.22167},
year={2025}
}