-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 26 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 43 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 22
Collections
Discover the best community collections!
Collections including paper arxiv:2410.15316
-
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
Paper • 2502.20583 • Published • 11 -
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper • 2410.15316 • Published • 10 -
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Paper • 2503.01710 • Published • 3
-
80
Dailypapershackernews
📈 -
Prithvi WxC: Foundation Model for Weather and Climate
Paper • 2409.13598 • Published • 42 -
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Paper • 2410.05262 • Published • 10 -
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper • 2410.15316 • Published • 10
-
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
Paper • 2405.18503 • Published • 9 -
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Paper • 2405.20289 • Published • 11 -
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Paper • 2406.02897 • Published • 16 -
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Paper • 2406.03344 • Published • 21