# Wav2ARKit - Audio to Facial Expression (ONNX)
A fused, end-to-end ONNX model that converts raw audio waveforms directly into 52 ARKit-compatible facial blendshapes. It combines the Facebook Wav2Vec2 and LAM Audio2Expression models and is optimized for real-time CPU inference.
## Features
| Feature | Value |
|---|---|
| Input | Raw 16kHz audio waveform |
| Output | 52 ARKit blendshapes @ 30fps |
| Inference | ~45ms per second of audio |
| Speed | 22× faster than real time |
| Size | 1.8 MB |
## Quick Start
```python
import onnxruntime as ort
import numpy as np

# Load model
session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# Load audio (16 kHz, mono, float32)
# Example: 1 second = 16000 samples
audio = np.random.randn(1, 16000).astype(np.float32)

# Run inference
blendshapes = session.run(None, {"audio_waveform": audio})[0]

# Output shape: (1, 30, 52) - 30 frames at 30 fps, 52 blendshapes
print(blendshapes.shape)
```
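The speed figures in the table above are easy to sanity-check on your own hardware. A minimal timing sketch, reusing the `session` and `audio` objects from the snippet above:

```python
import time

# Warm-up run (first call includes one-time initialization cost)
session.run(None, {"audio_waveform": audio})

# Time a single inference over 1 second of audio
t0 = time.perf_counter()
session.run(None, {"audio_waveform": audio})
elapsed = time.perf_counter() - t0
print(f"{elapsed * 1000:.1f} ms per 1 s of audio ({1.0 / elapsed:.0f}x realtime)")
```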
## Model Specification

### Input

| Name | Type | Shape | Description |
|---|---|---|---|
| `audio_waveform` | float32 | `[batch, samples]` | Raw audio at 16 kHz |

### Output

| Name | Type | Shape | Description |
|---|---|---|---|
| `blendshapes` | float32 | `[batch, frames, 52]` | ARKit blendshape weights in [0, 1] |
### Frame Calculation

```
output_frames = ceil(30 × (num_samples / 16000))
```

Example: 1 second of audio (16000 samples) → 30 frames.
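The same formula in Python, for example to pre-allocate buffers before inference (a small helper sketch; `expected_frames` is not part of the model API):

```python
import math

def expected_frames(num_samples: int, sample_rate: int = 16000, fps: int = 30) -> int:
    # output_frames = ceil(30 × (num_samples / 16000))
    return math.ceil(fps * num_samples / sample_rate)

assert expected_frames(16000) == 30   # 1 s   → 30 frames
assert expected_frames(24000) == 45   # 1.5 s → 45 frames
```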
## ARKit Blendshapes

The 52 output indices map to ARKit blendshape names as follows:
| Idx | Name | Idx | Name |
|---|---|---|---|
| 0 | browDownLeft | 26 | mouthFrownRight |
| 1 | browDownRight | 27 | mouthFunnel |
| 2 | browInnerUp | 28 | mouthLeft |
| 3 | browOuterUpLeft | 29 | mouthLowerDownLeft |
| 4 | browOuterUpRight | 30 | mouthLowerDownRight |
| 5 | cheekPuff | 31 | mouthPressLeft |
| 6 | cheekSquintLeft | 32 | mouthPressRight |
| 7 | cheekSquintRight | 33 | mouthPucker |
| 8 | eyeBlinkLeft | 34 | mouthRight |
| 9 | eyeBlinkRight | 35 | mouthRollLower |
| 10 | eyeLookDownLeft | 36 | mouthRollUpper |
| 11 | eyeLookDownRight | 37 | mouthShrugLower |
| 12 | eyeLookInLeft | 38 | mouthShrugUpper |
| 13 | eyeLookInRight | 39 | mouthSmileLeft |
| 14 | eyeLookOutLeft | 40 | mouthSmileRight |
| 15 | eyeLookOutRight | 41 | mouthStretchLeft |
| 16 | eyeLookUpLeft | 42 | mouthStretchRight |
| 17 | eyeLookUpRight | 43 | mouthUpperUpLeft |
| 18 | eyeSquintLeft | 44 | mouthUpperUpRight |
| 19 | eyeSquintRight | 45 | noseSneerLeft |
| 20 | eyeWideLeft | 46 | noseSneerRight |
| 21 | eyeWideRight | 47 | tongueOut |
| 22 | jawForward | 48 | mouthClose |
| 23 | jawLeft | 49 | mouthDimpleLeft |
| 24 | jawOpen | 50 | mouthDimpleRight |
| 25 | mouthFrownLeft | 51 | jawRight |
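As a minimal usage sketch (assuming the `blendshapes` array returned by the Quick Start snippet), individual animation curves can be sliced out of the last output dimension using the indices above:

```python
JAW_OPEN = 24        # index from the table above
EYE_BLINK_LEFT = 8

# blendshapes has shape (1, frames, 52); slice one weight across all frames
jaw_curve = blendshapes[0, :, JAW_OPEN]          # jawOpen weight per frame, in [0, 1]
blink_curve = blendshapes[0, :, EYE_BLINK_LEFT]  # eyeBlinkLeft weight per frame
print(f"peak jawOpen: {jaw_curve.max():.2f}")
```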
## Usage Examples

### Python with audio file
```python
import onnxruntime as ort
import numpy as np
import soundfile as sf

session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# Load audio and downmix to mono before resampling
audio, sr = sf.read("speech.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# Resample to 16 kHz if needed
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

# Run inference
audio_input = audio.astype(np.float32).reshape(1, -1)
blendshapes = session.run(None, {"audio_waveform": audio_input})[0]

print(f"Duration: {len(audio)/16000:.2f}s → {blendshapes.shape[1]} frames")
```
### C++
```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "Wav2ARKit");
// Wide-string path is for Windows; use a narrow string on Linux/macOS.
Ort::Session session(env, L"wav2arkit_cpu.onnx", Ort::SessionOptions{});

std::vector<float> audio(16000);  // 1 second of 16 kHz audio
std::vector<int64_t> shape = {1, 16000};

Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input = Ort::Value::CreateTensor<float>(mem, audio.data(), audio.size(), shape.data(), shape.size());

const char* input_names[] = {"audio_waveform"};
const char* output_names[] = {"blendshapes"};
auto output = session.Run(Ort::RunOptions{nullptr}, input_names, &input, 1, output_names, 1);

// Blendshape weights, laid out as [1, frames, 52]
float* blendshapes = output[0].GetTensorMutableData<float>();
```
### JavaScript (onnxruntime-web/node)
```javascript
const ort = require('onnxruntime-node');

const session = await ort.InferenceSession.create('wav2arkit_cpu.onnx');

// audioData: Float32Array of 16 kHz mono samples
const audioTensor = new ort.Tensor('float32', audioData, [1, audioData.length]);
const { blendshapes } = await session.run({ audio_waveform: audioTensor });
// blendshapes.dims === [1, frames, 52]
```
## Architecture

Note: The identity encoder supports 12 speaker identities (0-11). This ONNX export bakes in identity 11 for single-speaker inference.
## License

Apache 2.0. Based on:

- Facebook Wav2Vec2
- 3DAIGC/LAM_audio2exp (LAM Audio2Expression)