Wav2ARKit - Audio to Facial Expression (ONNX)

A fused, end-to-end ONNX model that converts raw audio waveforms directly into 52 ARKit-compatible facial blendshapes. Based on the Facebook Wav2Vec2 and LAM Audio2Expression models, optimized for real-time CPU inference.

Features

| Feature   | Value                         |
|-----------|-------------------------------|
| Input     | Raw 16kHz audio waveform      |
| Output    | 52 ARKit blendshapes @ 30fps  |
| Inference | ~45 ms per second of audio    |
| Speed     | 22× faster than real time     |
| Size      | 1.8 MB                        |
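
The timing figures above can be checked on your own hardware with a quick benchmark. A minimal sketch, assuming the input/output names documented below; the 10-second dummy input and single warm-up run are arbitrary choices:

import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# 10 seconds of dummy 16kHz audio
audio = np.zeros((1, 160000), dtype=np.float32)
session.run(None, {"audio_waveform": audio})  # warm-up run

start = time.perf_counter()
session.run(None, {"audio_waveform": audio})
elapsed = time.perf_counter() - start
print(f"{elapsed * 1000:.0f} ms for 10 s of audio ({10.0 / elapsed:.1f}x real time)")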

Quick Start

import onnxruntime as ort
import numpy as np

# Load model
session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# Load audio (16kHz, mono, float32)
# Example: 1 second = 16000 samples
audio = np.random.randn(1, 16000).astype(np.float32)

# Run inference
blendshapes = session.run(None, {"audio_waveform": audio})[0]

# Output: (1, 30, 52) - 30 frames at 30fps, 52 blendshapes
print(blendshapes.shape)

Model Specification

Input

| Name           | Type    | Shape            | Description         |
|----------------|---------|------------------|---------------------|
| audio_waveform | float32 | [batch, samples] | Raw audio at 16kHz  |

Output

| Name        | Type    | Shape               | Description                        |
|-------------|---------|---------------------|------------------------------------|
| blendshapes | float32 | [batch, frames, 52] | ARKit blendshape weights in [0, 1] |
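
Because the output is fixed at 30fps, frame index t maps to roughly t / 30 seconds of audio. A small sketch of that bookkeeping; whether a frame sits at the start or the centre of its 1/30 s window is not specified by the model card, so this assumes frame starts:

import numpy as np

fps = 30
blendshapes = np.zeros((1, 45, 52), dtype=np.float32)  # dummy output for 1.5 s of audio

# Timestamp (in seconds) of each output frame, assuming frame t starts at t / fps
timestamps = np.arange(blendshapes.shape[1]) / fps      # 0.000, 0.033, 0.067, ...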

Frame Calculation

output_frames = ceil(30 × (num_samples / 16000))

Example: 1 second of audio (16000 samples) → 30 frames
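
The same formula in code, useful for pre-allocating animation buffers. The helper name and default arguments are illustrative, not part of the model API:

import math

def expected_frames(num_samples: int, sample_rate: int = 16000, fps: int = 30) -> int:
    """Number of output frames produced for a given number of input samples."""
    return math.ceil(fps * num_samples / sample_rate)

assert expected_frames(16000) == 30   # 1 s   -> 30 frames
assert expected_frames(24000) == 45   # 1.5 s -> 45 frames
assert expected_frames(16001) == 31   # partial frames round up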

ARKit Blendshapes

All 52 blendshape indices:

| Idx | Name              | Idx | Name                |
|-----|-------------------|-----|---------------------|
| 0   | browDownLeft      | 26  | mouthFrownRight     |
| 1   | browDownRight     | 27  | mouthFunnel         |
| 2   | browInnerUp       | 28  | mouthLeft           |
| 3   | browOuterUpLeft   | 29  | mouthLowerDownLeft  |
| 4   | browOuterUpRight  | 30  | mouthLowerDownRight |
| 5   | cheekPuff         | 31  | mouthPressLeft      |
| 6   | cheekSquintLeft   | 32  | mouthPressRight     |
| 7   | cheekSquintRight  | 33  | mouthPucker         |
| 8   | eyeBlinkLeft      | 34  | mouthRight          |
| 9   | eyeBlinkRight     | 35  | mouthRollLower      |
| 10  | eyeLookDownLeft   | 36  | mouthRollUpper      |
| 11  | eyeLookDownRight  | 37  | mouthShrugLower     |
| 12  | eyeLookInLeft     | 38  | mouthShrugUpper     |
| 13  | eyeLookInRight    | 39  | mouthSmileLeft      |
| 14  | eyeLookOutLeft    | 40  | mouthSmileRight     |
| 15  | eyeLookOutRight   | 41  | mouthStretchLeft    |
| 16  | eyeLookUpLeft     | 42  | mouthStretchRight   |
| 17  | eyeLookUpRight    | 43  | mouthUpperUpLeft    |
| 18  | eyeSquintLeft     | 44  | mouthUpperUpRight   |
| 19  | eyeSquintRight    | 45  | noseSneerLeft       |
| 20  | eyeWideLeft       | 46  | noseSneerRight      |
| 21  | eyeWideRight      | 47  | tongueOut           |
| 22  | jawForward        | 48  | mouthClose          |
| 23  | jawLeft           | 49  | mouthDimpleLeft     |
| 24  | jawOpen           | 50  | mouthDimpleRight    |
| 25  | mouthFrownLeft    | 51  | jawRight            |
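
When driving an avatar, individual curves can be pulled out by the indices in the table above. A minimal sketch; the index constants and helper are typed out from the table, not embedded in the ONNX file:

import numpy as np

# Indices taken from the table above (assumed; the model only outputs raw values)
JAW_OPEN = 24
EYE_BLINK_LEFT = 8

def get_curve(blendshapes: np.ndarray, index: int) -> np.ndarray:
    """Per-frame weights of a single blendshape, shape (frames,)."""
    return blendshapes[0, :, index]

# Dummy output with the documented shape (1, frames, 52); in practice use session.run(...)
blendshapes = np.zeros((1, 30, 52), dtype=np.float32)
jaw_curve = get_curve(blendshapes, JAW_OPEN)   # 30 values, one per 30fps frame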

Usage Examples

Python with audio file

import onnxruntime as ort
import numpy as np
import soundfile as sf

session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# Load and resample audio to 16kHz if needed
audio, sr = sf.read("speech.wav")
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

# Ensure mono
if len(audio.shape) > 1:
    audio = audio.mean(axis=1)

# Run inference
audio_input = audio.astype(np.float32).reshape(1, -1)
blendshapes = session.run(None, {"audio_waveform": audio_input})[0]

print(f"Duration: {len(audio)/16000:.2f}s โ†’ {blendshapes.shape[1]} frames")

C++

#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "Wav2ARKit");
    // ORT_TSTR handles the wide-character path required on Windows
    Ort::Session session(env, ORT_TSTR("wav2arkit_cpu.onnx"), Ort::SessionOptions{});

    std::vector<float> audio(16000, 0.0f);  // 1 second of 16kHz audio
    std::vector<int64_t> shape = {1, 16000};

    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(mem, audio.data(), audio.size(), shape.data(), shape.size());

    const char* input_names[] = {"audio_waveform"};
    const char* output_names[] = {"blendshapes"};
    auto output = session.Run(Ort::RunOptions{nullptr}, input_names, &input, 1, output_names, 1);

    // output[0] has shape [1, frames, 52]
    const float* blendshapes = output[0].GetTensorData<float>();
    return 0;
}

JavaScript (onnxruntime-web/node)

const ort = require('onnxruntime-node');

// Run inside an async function (or use top-level await in an ES module)
const session = await ort.InferenceSession.create('wav2arkit_cpu.onnx');

// 16kHz mono samples as a Float32Array (here: 1 second of silence)
const audioData = new Float32Array(16000);
const audioTensor = new ort.Tensor('float32', audioData, [1, audioData.length]);
const { blendshapes } = await session.run({ audio_waveform: audioTensor });
// blendshapes.dims -> [1, 30, 52]; blendshapes.data is a flat Float32Array

Architecture

[Figure: model architecture diagram]

Note: The identity encoder supports 12 speaker identities (0-11). This ONNX export bakes in identity 11 for single-speaker inference.

License

Apache 2.0. Based on Facebook Wav2Vec2 and LAM Audio2Expression.
