OmniHuman-1 📽️

An end-to-end multimodality-conditioned human video generation framework from ByteDance

ByteDance just released OmniHuman-1, an end-to-end multimodality-conditioned human video generation framework that can generate human videos from a single human image and motion signals (e.g., audio only, video only, or a combination of audio and video).

What does this mean, exactly?

It is a framework for generating human-like videos with GenAI, with some really impressive examples/demos. Check out the project page (paper + demos) here:
https://omnihuman-lab.github.io/

Some highlights:
1. Aspect ratios: OmniHuman supports any aspect ratio (portrait, half-body, and full-body, all in one model) and various visual and audio styles, with realistic lighting, motion, and texture details.

2. Gesture control: OmniHuman supports speech-driven input at any aspect ratio. It significantly improves the handling of gestures, a known weak point of existing methods, and produces highly realistic results.

3. Diversity: OmniHuman supports cartoons, artificial objects, animals, and challenging poses, making sure the motion characteristics match each style's unique features. Now you can create anime from a single image 😁

4. Singing: OmniHuman can support various music styles and accommodate multiple body poses and singing forms. It can handle high-pitched songs and display different motion styles for different types of music.

5. Compatibility with video driving: thanks to its mixed-condition training, OmniHuman supports not only audio driving but also video driving to mimic specific video actions.
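
As far as I can tell, there is no public code or API yet, so purely to illustrate what point 5's mixed conditioning means in practice, here is a minimal, hypothetical Python sketch (the function, class, and file names are all made up): a single reference image plus any combination of audio and/or video driving signals.

```python
# Hypothetical sketch only: OmniHuman has no public code/API that I know of.
# It just illustrates the "mixed condition" idea: one reference image plus
# any combination of audio and/or video as the driving signal.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MotionConditions:
    audio_path: Optional[str] = None   # e.g., a speech or singing track
    video_path: Optional[str] = None   # e.g., a clip whose actions should be mimicked

    def validate(self) -> None:
        # At least one motion signal is needed to drive the generated video.
        if self.audio_path is None and self.video_path is None:
            raise ValueError("Provide audio, video, or both as the driving signal.")


def generate_human_video(reference_image: str, conditions: MotionConditions) -> str:
    """Placeholder for a model call; returns a made-up output path."""
    conditions.validate()
    # A real implementation would load the model and run inference here.
    return "output/generated_clip.mp4"


# Audio-only driving (talking/singing), which most of the demos show:
print(generate_human_video("person.jpg", MotionConditions(audio_path="speech.wav")))

# Combined audio + video driving, where the video pins down specific actions:
print(generate_human_video(
    "person.jpg",
    MotionConditions(audio_path="speech.wav", video_path="reference_motion.mp4"),
))
```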

What really impressed me:
1. Background animation control: it doesn't just animate the human subject; it also animates motion in the background, such as moving accessories, cars driving past, and so on.
2. Hand gestures: the generated humans mostly have all their fingers, which has funnily become a criterion for judging the quality of GenAI videos 😁. OmniHuman does this well.

Training data:
1. Some input images and audio come from TED, Pexels, and AIGC (AI-generated content).
2. Test samples come from the CelebV-HQ dataset. (https://celebv-hq.github.io/)

I am still exploring the technical details of this new framework. Personally, I am quite impressed with the quality of the videos; after Google's Veo 2, I feel this generates the most realistic videos.

Paper / project page:
https://omnihuman-lab.github.io/

Check out this short example of how impressive the videos are: