Monkey See, Monkey Do (MoMo)

Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer

Abstract

Given the remarkable results of motion synthesis with diffusion models, a natural question arises: how can we effectively leverage these models for motion editing? Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models, which enables manipulating the latent feature space; hence, they primarily center on handling the motion space. In this work, we explore the attention mechanism of pre-trained motion diffusion models. We uncover the roles and interactions of attention elements in capturing and representing intricate human motion patterns, and carefully integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer. Editing features associated with selected motions allows us to confront a challenge observed in prior motion diffusion approaches, which use general directives (e.g., text, music) for editing, ultimately failing to convey subtle nuances effectively. Our work is inspired by how a monkey closely imitates what it sees while maintaining its unique motion patterns; hence we call it Monkey see, Monkey Do, and dub it MoMo. Employing our technique enables accomplishing tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. Furthermore, diffusion inversion is seldom employed for motions; as a result, editing efforts focus on generated motions, limiting the editability of real ones. MoMo harnesses motion inversion, extending its application to both real and generated motions. Experimental results show the advantage of our approach over the current state of the art. In particular, unlike methods tailored for specific applications through training, our approach is applied at inference time, requiring no training.

Motion Transfer

Leveraging our understanding of motion self-attention (detailed below), we have developed an innovative motion transfer framework, where the outline of a leader motion is transferred to a follower one, while preserving the motion motifs of the follower.

The term outline relates to what the character is doing, and when. It provides a visual blueprint for the sequence of actions and transitions needed to execute the motion.

The term motifs relates to how a motion is performed. It includes subtle nuances, gestures, or patterns that convey meaning and emotion.

Leader

Follower

Output (vs. Leader)

In the example above, the output motion precisely follows the steps and rhythm outline of the leader, while also incorporating the dancing motifs of the follower.

Pipeline

The input to our model is two noisy tensors, \( X_T^\text{flw} \) and \( X_T^\text{ldr} \), produced either by inverting real motions or by sampling Gaussian noise. The two tensors represent the follower and leader motions, respectively, and are given along with their associated text prompts. We initialize our output motion, \( X_T^\text{out} \), with the initial noise of the leader motion and pair it with the text prompt of the follower motion. At each timestep \( t \), the three noised motions, \( X_t^\text{ldr} \), \( X_t^\text{flw} \) and \( X_t^\text{out} \), are passed to the frozen denoising network, along with their prompts and \( t \). Within the denoising network, \( X_t^\text{out} \) undergoes mixed-attention, combining the query from the leader motion with the key and value from the follower motion. Meanwhile, \( X_t^\text{ldr} \) and \( X_t^\text{flw} \) follow a standard diffusion process.
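To make the mixed-attention step concrete, below is a minimal sketch of how queries gathered from the leader stream can attend to keys and values gathered from the follower stream inside one self-attention layer of the frozen denoiser. The function name, tensor shapes, and head count are illustrative assumptions; details such as which layers and timesteps are mixed, batching, and masking are not shown and are not the paper's exact implementation.

```python
import torch

def mixed_attention(q_ldr, k_flw, v_flw, num_heads=4):
    """Attention for the output stream X_t^out: queries are taken from the
    leader's self-attention features, keys and values from the follower's.

    q_ldr:        (frames_ldr, dim) leader queries from one attention layer
    k_flw, v_flw: (frames_flw, dim) follower keys and values from that layer
    """
    f_ldr, dim = q_ldr.shape
    f_flw = k_flw.shape[0]
    head_dim = dim // num_heads

    # split features into heads: (num_heads, frames, head_dim)
    q = q_ldr.view(f_ldr, num_heads, head_dim).transpose(0, 1)
    k = k_flw.view(f_flw, num_heads, head_dim).transpose(0, 1)
    v = v_flw.view(f_flw, num_heads, head_dim).transpose(0, 1)

    # scaled dot-product attention: leader queries attend over follower keys
    attn = torch.softmax(q @ k.transpose(-1, -2) / head_dim ** 0.5, dim=-1)
    out = attn @ v                                   # (num_heads, frames_ldr, head_dim)
    return out.transpose(0, 1).reshape(f_ldr, dim)   # merge heads: (frames_ldr, dim)
```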

Understanding Self-Attention Features

Analyzing the profound potential embedded in the self-attention features, we show that keys mainly encode a motion's motifs, while queries mainly encapsulate its outline. This key insight guides the design of MoMo.

When clustering key features, frames with distinct motifs, such as ‘standing’, ‘walking’ or ‘turning’, are grouped into different clusters.

When clustering query features, periodic steps are grouped together, indicating that outline features, such as locomotion phases, dominate over the motions’ unique motifs.
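As a rough illustration of this analysis, one could collect per-frame key and query features from a chosen attention layer and denoising timestep of the frozen network and cluster them. The helper below is a hypothetical sketch; the layer/timestep choice, feature shapes, and use of k-means are assumptions, not the paper's exact protocol.

```python
import torch
from sklearn.cluster import KMeans

def cluster_frame_features(features, n_clusters=5):
    """Cluster per-frame self-attention features to inspect what they encode.

    features: (frames, dim) keys (or queries) collected from one attention
              layer at some denoising timestep (choice is illustrative).
    Returns one cluster label per frame.
    """
    feats = features.detach().cpu().numpy()
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)

# Hypothetical usage, given dumped features `keys` and `queries` of shape (frames, dim):
# key_labels   = cluster_frame_features(keys)     # expected to group frames by motif
# query_labels = cluster_frame_features(queries)  # expected to group frames by outline phase
```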



Correspondence via attention. Follower frames are color-coded according to consecutive indices (top row). The nearest-neighbor follower frames (bottom row) are those achieving the highest mixed-attention activation, \( Q^\text{ldr} \cdot {K^\text{flw}}^T \), with respect to the leader's frames (middle row). As shown, these correspondences are semantically aligned, e.g., the leader's ``up'' and ``down'' sub-motions are consistently assigned follower frames moving ``up'' and ``down''. Some of the nearest neighbors are highlighted with arrows.
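A minimal sketch of how such correspondences could be extracted, assuming leader queries and follower keys have already been collected from the same attention layer; the function name and the layer/timestep choice are illustrative.

```python
import torch

def attention_correspondence(q_ldr, k_flw):
    """For each leader frame, return the follower frame with the highest
    mixed-attention activation Q^ldr . (K^flw)^T.

    q_ldr: (frames_ldr, dim) leader queries from a chosen attention layer
    k_flw: (frames_flw, dim) follower keys from the same layer
    """
    scores = q_ldr @ k_flw.T        # (frames_ldr, frames_flw) activation map
    return scores.argmax(dim=-1)    # nearest-neighbor follower frame per leader frame
```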

Special Cases of Motion Transfer

Our framework offers a versatile motion transfer technique, facilitating various tasks of transferring motifs from one motion to another. Below are several tasks that constitute special cases of our framework.

Leader

Follower

Output (vs. Leader)

Spatial Editing is where specific joints, such as the arms, are edited while the overall motion is preserved.

Leader

Follower

Output (vs. Leader)

Action Transfer is where the leader and follower motions are completely different, yet the output still imitates the follower's actions, in the same rhythm and limb order as the leader.

Leader

Follower

Output (vs. Leader)

Style Transfer refers to performing a given motion in a different way that conveys an emotion or a physical state, such as ``happily'' or ``like a monkey''.

Leader

Follower

Output (vs. Leader)

Out Of Distribution Synthesis entails uncommon motions that pose a challenge to the network's generalization capabilities. In this example, the follower is the network's attempt to generate a dancing gorilla; however, the resulting motion fails to dance. By applying MoMo, on the other hand, we generate a character that dances in the same outline as the leader, while adopting the motifs of the gorilla.

Inversion

Our work stands as the sole approach that utilizes DDIM inversion for motion diffusion models, extending editing to both real and generated motions.
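For context, the sketch below outlines a standard DDIM inversion loop that maps a clean motion back to an initial noise tensor \( X_T \), which can then serve as the leader or follower input above. It assumes a frozen denoiser that predicts the noise from \( (x_t, t, \text{prompt}) \); the function signature, schedule handling, and step count are assumptions rather than MoMo's exact implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, model, prompt_emb, alphas_cumprod, num_steps=50):
    """Deterministic DDIM inversion: map a (real or generated) clean motion x0
    back to an initial noise tensor X_T that reproduces it when denoised.

    x0:             clean motion tensor
    model:          frozen denoiser, assumed to return the predicted noise
                    given (x_t, t, prompt_emb) -- signature is illustrative
    alphas_cumprod: cumulative product of the diffusion noise schedule
    """
    T = len(alphas_cumprod)
    timesteps = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = model(x, t, prompt_emb)                        # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # implied clean motion
        # deterministically re-noise toward the next (noisier) timestep
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # approximately X_T for this motion
```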

Leader (from dataset)

Follower (from dataset)

Output (vs. Leader)

Note how easy it is to utilize the provided follower motion, and how challenging it would be to generate a similar motion using a text prompt alone.