Monkey See, Monkey Do (MoMo)

Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer

Abstract

Given the remarkable results of motion synthesis with diffusion models, a natural question arises: how can we effectively leverage these models for motion editing? Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models, which enables manipulating the latent feature space; hence, they primarily center on handling the motion space. In this work, we explore the attention mechanism of pre-trained motion diffusion models. We uncover the roles and interactions of attention elements in capturing and representing intricate human motion patterns, and carefully integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer. Editing features associated with selected motions allows us to confront a challenge observed in prior motion diffusion approaches, which use general directives (e.g., text, music) for editing, ultimately failing to convey subtle nuances effectively. Our work is inspired by how a monkey closely imitates what it sees while maintaining its unique motion patterns; hence we call it Monkey see, Monkey Do, and dub it MoMo. Employing our technique enables accomplishing tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. Furthermore, diffusion inversion is seldom employed for motions; as a result, editing efforts focus on generated motions, limiting the editability of real ones. MoMo harnesses motion inversion, extending its application to both real and generated motions. Experimental results show the advantage of our approach over the current state of the art. In particular, unlike methods tailored for specific applications through training, our approach is applied at inference time, requiring no training.

Motion Transfer

Leveraging our understanding of motion self-attention (detailed below), we have developed an innovative motion transfer framework, where the outline of a leader motion is transferred to a follower one, while preserving the motion motifs of the follower.

The term outline relates to what the character is doing, and when. It provides a visual blueprint for the sequence of actions and transitions needed to execute the motion.

The term motifs relates to how a motion is performed. It includes subtle nuances, gestures, or patterns that convey meaning and emotion.

Leader

Follower

Output (vs. Leader)

In the example above, the output motion precisely follows the steps and rhythm outline of the leader, while also incorporating the dancing motifs of the follower.

Pipeline

The input to our model is two noisy tensors, \( X_T^\text{flw} \) and \( X_T^\text{ldr} \), produced either by inverting real motions or by sampling Gaussian noise. The two tensors represent the follower and leader motions, respectively, and are given along with their associated text prompts. We initialize our output motion, \( X_T^\text{out} \), with the initial noise of the leader motion and pair it with the text prompt of the follower motion. At each timestep \( t \), the three noised motions, \( X_t^\text{ldr} \), \( X_t^\text{flw} \) and \( X_t^\text{out} \), are passed to the frozen denoising network, along with their prompts and \( t \). Within the denoising network, \( X_t^\text{out} \) undergoes mixed-attention, combining the query from the leader motion with the key and value from the follower motion. Meanwhile, \( X_t^\text{ldr} \) and \( X_t^\text{flw} \) follow a standard diffusion process.
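To make the mixed-attention step concrete, below is a minimal sketch of how queries gathered from the leader stream can attend to keys and values gathered from the follower stream inside one self-attention layer of the frozen denoiser. The function name, tensor shapes, and head count are illustrative assumptions; details such as which layers and timesteps are mixed, batching, and masking are not shown and are not the paper's exact implementation.

```python
import torch

def mixed_attention(q_ldr, k_flw, v_flw, num_heads=4):
    """Attention for the output stream X_t^out: queries are taken from the
    leader's self-attention features, keys and values from the follower's.

    q_ldr:        (frames_ldr, dim) leader queries from one attention layer
    k_flw, v_flw: (frames_flw, dim) follower keys and values from that layer
    """
    f_ldr, dim = q_ldr.shape
    f_flw = k_flw.shape[0]
    head_dim = dim // num_heads

    # split features into heads: (num_heads, frames, head_dim)
    q = q_ldr.view(f_ldr, num_heads, head_dim).transpose(0, 1)
    k = k_flw.view(f_flw, num_heads, head_dim).transpose(0, 1)
    v = v_flw.view(f_flw, num_heads, head_dim).transpose(0, 1)

    # scaled dot-product attention: leader queries attend over follower keys
    attn = torch.softmax(q @ k.transpose(-1, -2) / head_dim ** 0.5, dim=-1)
    out = attn @ v                                   # (num_heads, frames_ldr, head_dim)
    return out.transpose(0, 1).reshape(f_ldr, dim)   # merge heads: (frames_ldr, dim)
```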

Understanding Self-Attention Features

Analyzing the profound potential embedded in the self-attention features, we show that keys mainly encode a motion's motifs, while queries mainly encapsulate its outline. This key insight guides the design of MoMo.

When clustering key features, frames with distinct motifs, such as ‘standing’, ‘walking’ or ‘turning’, are grouped into different clusters.

When clustering query features, periodic steps are grouped together, indicating that outline features, such as locomotion phases, dominate over the motions’ unique motifs.
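As a rough illustration of this analysis, one could collect per-frame key and query features from a chosen attention layer and denoising timestep of the frozen network and cluster them. The helper below is a hypothetical sketch; the layer/timestep choice, feature shapes, and use of k-means are assumptions, not the paper's exact protocol.

```python
import torch
from sklearn.cluster import KMeans

def cluster_frame_features(features, n_clusters=5):
    """Cluster per-frame self-attention features to inspect what they encode.

    features: (frames, dim) keys (or queries) collected from one attention
              layer at some denoising timestep (choice is illustrative).
    Returns one cluster label per frame.
    """
    feats = features.detach().cpu().numpy()
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)

# Hypothetical usage, given dumped features `keys` and `queries` of shape (frames, dim):
# key_labels   = cluster_frame_features(keys)     # expected to group frames by motif
# query_labels = cluster_frame_features(queries)  # expected to group frames by outline phase
```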



Correspondence via attention. Follower frames are color-coded according to consecutive indices (top row). The nearest-neighbor follower frames (bottom row) are those achieving the highest mixed-attention activation, \( Q^\text{ldr} \cdot {K^\text{flw}}^T \), with respect to the leader's frames (middle row). As shown, these correspondences are semantically aligned, e.g., the leader's ``up'' and ``down'' sub-motions are consistently assigned follower frames moving ``up'' and ``down''. Some of the nearest neighbors are highlighted with arrows.
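A minimal sketch of how such correspondences could be extracted, assuming leader queries and follower keys have already been collected from the same attention layer; the function name and the layer/timestep choice are illustrative.

```python
import torch

def attention_correspondence(q_ldr, k_flw):
    """For each leader frame, return the follower frame with the highest
    mixed-attention activation Q^ldr . (K^flw)^T.

    q_ldr: (frames_ldr, dim) leader queries from a chosen attention layer
    k_flw: (frames_flw, dim) follower keys from the same layer
    """
    scores = q_ldr @ k_flw.T        # (frames_ldr, frames_flw) activation map
    return scores.argmax(dim=-1)    # nearest-neighbor follower frame per leader frame
```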

Special Cases of Motion Transfer

Our framework offers a versatile motion transfer technique, facilitating various tasks of transferring motifs from one motion to another. Below are several tasks that constitute special cases of our framework.

Leader

Follower

Output (vs. Leader)

Spatial Editing is where specific joints, such as the arms, are edited while the overall motion is preserved.

Leader

Follower

Output (vs. Leader)

Action Transfer is where the leader and follower motions are completely different, yet the output still imitates the follower's actions, in the same rhythm and limb order as the leader.

Leader

Follower

Output (vs. Leader)

Style Transfer refers to performing a given motion in a different way that conveys an emotion or a physical state, such as ``happily'' or ``like a monkey''.

Leader

Follower

Output (vs. Leader)

Out Of Distribution Synthesis entails uncommon motions that pose a challenge to the network's generalization capabilities. In this example, the follower is the network's attempt to generate a dancing gorilla; however, the resulting motion fails to dance. By applying MoMo, on the other hand, we generate a character that dances in the same outline as the leader, while adopting the motifs of the gorilla.

Inversion

Our work stands as the sole approach that utilizes DDIM inversion for motion diffusion models, extending editing to both real and generated motions.
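For context, the sketch below outlines a standard DDIM inversion loop that maps a clean motion back to an initial noise tensor \( X_T \), which can then serve as the leader or follower input above. It assumes a frozen denoiser that predicts the noise from \( (x_t, t, \text{prompt}) \); the function signature, schedule handling, and step count are assumptions rather than MoMo's exact implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, model, prompt_emb, alphas_cumprod, num_steps=50):
    """Deterministic DDIM inversion: map a (real or generated) clean motion x0
    back to an initial noise tensor X_T that reproduces it when denoised.

    x0:             clean motion tensor
    model:          frozen denoiser, assumed to return the predicted noise
                    given (x_t, t, prompt_emb) -- signature is illustrative
    alphas_cumprod: cumulative product of the diffusion noise schedule
    """
    T = len(alphas_cumprod)
    timesteps = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = model(x, t, prompt_emb)                        # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # implied clean motion
        # deterministically re-noise toward the next (noisier) timestep
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # approximately X_T for this motion
```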

Leader (from dataset)

Follower (from dataset)

Output (vs. Leader)

Note how easy it is to utilize the provided follower motion, and how challenging it would be to generate a similar motion using a text prompt alone.