Wav2ARKit - Audio to Facial Expression (ONNX)

A fused, end-to-end ONNX model that converts raw audio waveforms directly into 52 ARKit-compatible facial blendshapes. Based on the Facebook Wav2Vec2 and LAM Audio2Expression models, optimized for real-time CPU inference.

Features

| Feature   | Value                         |
|-----------|-------------------------------|
| Input     | Raw 16kHz audio waveform      |
| Output    | 52 ARKit blendshapes @ 30fps  |
| Inference | ~45 ms per second of audio    |
| Speed     | 22× faster than real time     |
| Size      | 1.8 MB                        |
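
The timing figures above can be checked on your own hardware with a quick benchmark. A minimal sketch, assuming the input/output names documented below; the 10-second dummy input and single warm-up run are arbitrary choices:

import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# 10 seconds of dummy 16kHz audio
audio = np.zeros((1, 160000), dtype=np.float32)
session.run(None, {"audio_waveform": audio})  # warm-up run

start = time.perf_counter()
session.run(None, {"audio_waveform": audio})
elapsed = time.perf_counter() - start
print(f"{elapsed * 1000:.0f} ms for 10 s of audio ({10.0 / elapsed:.1f}x real time)")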

Quick Start

import onnxruntime as ort
import numpy as np

# Load model
session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# Load audio (16kHz, mono, float32)
# Example: 1 second = 16000 samples
audio = np.random.randn(1, 16000).astype(np.float32)

# Run inference
blendshapes = session.run(None, {"audio_waveform": audio})[0]

# Output: (1, 30, 52) - 30 frames at 30fps, 52 blendshapes
print(blendshapes.shape)

Model Specification

Input

| Name           | Type    | Shape            | Description         |
|----------------|---------|------------------|---------------------|
| audio_waveform | float32 | [batch, samples] | Raw audio at 16kHz  |

Output

| Name        | Type    | Shape               | Description                        |
|-------------|---------|---------------------|------------------------------------|
| blendshapes | float32 | [batch, frames, 52] | ARKit blendshape weights in [0, 1] |
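
Because the output is fixed at 30fps, frame index t maps to roughly t / 30 seconds of audio. A small sketch of that bookkeeping; whether a frame sits at the start or the centre of its 1/30 s window is not specified by the model card, so this assumes frame starts:

import numpy as np

fps = 30
blendshapes = np.zeros((1, 45, 52), dtype=np.float32)  # dummy output for 1.5 s of audio

# Timestamp (in seconds) of each output frame, assuming frame t starts at t / fps
timestamps = np.arange(blendshapes.shape[1]) / fps      # 0.000, 0.033, 0.067, ...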

Frame Calculation

output_frames = ceil(30 × (num_samples / 16000))

Example: 1 second of audio (16000 samples) → 30 frames
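
The same formula in code, useful for pre-allocating animation buffers. The helper name and default arguments are illustrative, not part of the model API:

import math

def expected_frames(num_samples: int, sample_rate: int = 16000, fps: int = 30) -> int:
    """Number of output frames produced for a given number of input samples."""
    return math.ceil(fps * num_samples / sample_rate)

assert expected_frames(16000) == 30   # 1 s   -> 30 frames
assert expected_frames(24000) == 45   # 1.5 s -> 45 frames
assert expected_frames(16001) == 31   # partial frames round up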

ARKit Blendshapes

All 52 blendshape indices:

| Idx | Name              | Idx | Name                |
|-----|-------------------|-----|---------------------|
| 0   | browDownLeft      | 26  | mouthFrownRight     |
| 1   | browDownRight     | 27  | mouthFunnel         |
| 2   | browInnerUp       | 28  | mouthLeft           |
| 3   | browOuterUpLeft   | 29  | mouthLowerDownLeft  |
| 4   | browOuterUpRight  | 30  | mouthLowerDownRight |
| 5   | cheekPuff         | 31  | mouthPressLeft      |
| 6   | cheekSquintLeft   | 32  | mouthPressRight     |
| 7   | cheekSquintRight  | 33  | mouthPucker         |
| 8   | eyeBlinkLeft      | 34  | mouthRight          |
| 9   | eyeBlinkRight     | 35  | mouthRollLower      |
| 10  | eyeLookDownLeft   | 36  | mouthRollUpper      |
| 11  | eyeLookDownRight  | 37  | mouthShrugLower     |
| 12  | eyeLookInLeft     | 38  | mouthShrugUpper     |
| 13  | eyeLookInRight    | 39  | mouthSmileLeft      |
| 14  | eyeLookOutLeft    | 40  | mouthSmileRight     |
| 15  | eyeLookOutRight   | 41  | mouthStretchLeft    |
| 16  | eyeLookUpLeft     | 42  | mouthStretchRight   |
| 17  | eyeLookUpRight    | 43  | mouthUpperUpLeft    |
| 18  | eyeSquintLeft     | 44  | mouthUpperUpRight   |
| 19  | eyeSquintRight    | 45  | noseSneerLeft       |
| 20  | eyeWideLeft       | 46  | noseSneerRight      |
| 21  | eyeWideRight      | 47  | tongueOut           |
| 22  | jawForward        | 48  | mouthClose          |
| 23  | jawLeft           | 49  | mouthDimpleLeft     |
| 24  | jawOpen           | 50  | mouthDimpleRight    |
| 25  | mouthFrownLeft    | 51  | jawRight            |
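
When driving an avatar, individual curves can be pulled out by the indices in the table above. A minimal sketch; the index constants and helper are typed out from the table, not embedded in the ONNX file:

import numpy as np

# Indices taken from the table above (assumed; the model only outputs raw values)
JAW_OPEN = 24
EYE_BLINK_LEFT = 8

def get_curve(blendshapes: np.ndarray, index: int) -> np.ndarray:
    """Per-frame weights of a single blendshape, shape (frames,)."""
    return blendshapes[0, :, index]

# Dummy output with the documented shape (1, frames, 52); in practice use session.run(...)
blendshapes = np.zeros((1, 30, 52), dtype=np.float32)
jaw_curve = get_curve(blendshapes, JAW_OPEN)   # 30 values, one per 30fps frame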

Usage Examples

Python with audio file

import onnxruntime as ort
import numpy as np
import soundfile as sf

session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])

# Load and resample audio to 16kHz if needed
audio, sr = sf.read("speech.wav")
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

# Ensure mono
if len(audio.shape) > 1:
    audio = audio.mean(axis=1)

# Run inference
audio_input = audio.astype(np.float32).reshape(1, -1)
blendshapes = session.run(None, {"audio_waveform": audio_input})[0]

print(f"Duration: {len(audio)/16000:.2f}s โ†’ {blendshapes.shape[1]} frames")

C++

#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "Wav2ARKit");
    // ORT_TSTR handles the wide-character path required on Windows
    Ort::Session session(env, ORT_TSTR("wav2arkit_cpu.onnx"), Ort::SessionOptions{});

    std::vector<float> audio(16000, 0.0f);  // 1 second of 16kHz audio
    std::vector<int64_t> shape = {1, 16000};

    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(mem, audio.data(), audio.size(), shape.data(), shape.size());

    const char* input_names[] = {"audio_waveform"};
    const char* output_names[] = {"blendshapes"};
    auto output = session.Run(Ort::RunOptions{nullptr}, input_names, &input, 1, output_names, 1);

    // output[0] has shape [1, frames, 52]
    const float* blendshapes = output[0].GetTensorData<float>();
    return 0;
}

JavaScript (onnxruntime-web/node)

const ort = require('onnxruntime-node');

// Run inside an async function (or use top-level await in an ES module)
const session = await ort.InferenceSession.create('wav2arkit_cpu.onnx');

// 16kHz mono samples as a Float32Array (here: 1 second of silence)
const audioData = new Float32Array(16000);
const audioTensor = new ort.Tensor('float32', audioData, [1, audioData.length]);
const { blendshapes } = await session.run({ audio_waveform: audioTensor });
// blendshapes.dims -> [1, 30, 52]; blendshapes.data is a flat Float32Array

Architecture

[Figure: model architecture diagram]

Note: The identity encoder supports 12 speaker identities (0-11). This ONNX export bakes in identity 11 for single-speaker inference.

License

Apache 2.0. Based on Facebook Wav2Vec2 and LAM Audio2Expression.
