- AclNet: efficient end-to-end audio classification CNN We propose an efficient end-to-end convolutional neural network architecture, AclNet, for audio classification. When trained with our data augmentation and regularization, we achieved state-of-the-art performance on the ESC-50 corpus with 85:65% accuracy. Our network allows configurations such that memory and compute requirements are drastically reduced, and a tradeoff analysis of accuracy and complexity is presented. The analysis shows high accuracy at significantly reduced computational complexity compared to existing solutions. For example, a configuration with only 155k parameters and 49:3 million multiply-adds per second is 81:75%, exceeding human accuracy of 81:3%. This improved efficiency can enable always-on inference in energy-efficient platforms. 2 authors · Nov 15, 2018
1 FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLM's computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLMs inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution that the computation and memory overhead of LLMs can be solved by utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory hierarchy). We propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0times higher energy efficiency and 1.8times better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant under the batch size of one. FlightLLM beats NVIDIA A100 GPU with 1.2times higher throughput using the latest Versal VHK158 FPGA. 17 authors · Jan 8, 2024
1 Concurrent Self-testing of Neural Networks Using Uncertainty Fingerprint Neural networks (NNs) are increasingly used in always-on safety-critical applications deployed on hardware accelerators (NN-HAs) employing various memory technologies. Reliable continuous operation of NN is essential for safety-critical applications. During online operation, NNs are susceptible to single and multiple permanent and soft errors due to factors such as radiation, aging, and thermal effects. Explicit NN-HA testing methods cannot detect transient faults during inference, are unsuitable for always-on applications, and require extensive test vector generation and storage. Therefore, in this paper, we propose the uncertainty fingerprint approach representing the online fault status of NN. Furthermore, we propose a dual head NN topology specifically designed to produce uncertainty fingerprints and the primary prediction of the NN in a single shot. During the online operation, by matching the uncertainty fingerprint, we can concurrently self-test NNs with up to 100% coverage with a low false positive rate while maintaining a similar performance of the primary task. Compared to existing works, memory overhead is reduced by up to 243.7 MB, multiply and accumulate (MAC) operation is reduced by up to 10000times, and false-positive rates are reduced by up to 89%. 2 authors · Jan 2, 2024
- BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand by combining high-level scene understanding from the VLM with low-level control and spatial signals. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation helps learning to be more efficient and generalizable. Our method is different from the approaches that only use semantic models to shape rewards. Instead, it directly embeds semantic features into the state representation. This cuts down on expensive runtime inference and makes sure that semantic guidance is always available. The simulation results show that the proposed model is better at LK stability and adaptability than the best vision-based and multimodal RL baselines in a wide range of difficult driving situations. We make our code publicly available. 3 authors · Oct 25
- Dataset Inference: Ownership Resolution in Machine Learning With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks in a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce dataset inference, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset as a matter of fact) was stolen, despite only exposing 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. Unlike prior work, it does not require retraining or overfitting the defended model. 3 authors · Apr 21, 2021
- EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) low activated parameters cannot be equivalently translated into inference acceleration effects. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet causing inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation. 4 authors · Aug 3
2 Zorro: the masked multimodal transformer Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50. 11 authors · Jan 23, 2023
- Stochastic LLMs do not Understand Language: Towards Symbolic, Explainable and Ontologically Based LLMs In our opinion the exuberance surrounding the relative success of data-driven large language models (LLMs) is slightly misguided and for several reasons (i) LLMs cannot be relied upon for factual information since for LLMs all ingested text (factual or non-factual) was created equal; (ii) due to their subsymbolic na-ture, whatever 'knowledge' these models acquire about language will always be buried in billions of microfeatures (weights), none of which is meaningful on its own; and (iii) LLMs will often fail to make the correct inferences in several linguistic contexts (e.g., nominal compounds, copredication, quantifier scope ambi-guities, intensional contexts. Since we believe the relative success of data-driven large language models (LLMs) is not a reflection on the symbolic vs. subsymbol-ic debate but a reflection on applying the successful strategy of a bottom-up reverse engineering of language at scale, we suggest in this paper applying the effective bottom-up strategy in a symbolic setting resulting in symbolic, explainable, and ontologically grounded language models. 1 authors · Sep 11, 2023
1 Unveiling the Multi-Annotation Process: Examining the Influence of Annotation Quantity and Instance Difficulty on Model Performance The NLP community has long advocated for the construction of multi-annotator datasets to better capture the nuances of language interpretation, subjectivity, and ambiguity. This paper conducts a retrospective study to show how performance scores can vary when a dataset expands from a single annotation per instance to multiple annotations. We propose a novel multi-annotator simulation process to generate datasets with varying annotation budgets. We show that similar datasets with the same annotation budget can lead to varying performance gains. Our findings challenge the popular belief that models trained on multi-annotation examples always lead to better performance than models trained on single or few-annotation examples. 2 authors · Oct 23, 2023
1 Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM's process for solving a task. However, we find that CoT explanations can systematically misrepresent the true reason for a model's prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(A)" -- which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations supporting those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. CoT is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness. 4 authors · May 7, 2023
33 Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation? Recent work has demonstrated that Chain-of-Thought (CoT) often yields limited gains for soft-reasoning problems such as analytical and commonsense reasoning. CoT can also be unfaithful to a model's actual reasoning. We investigate the dynamics and faithfulness of CoT in soft-reasoning tasks across instruction-tuned, reasoning and reasoning-distilled models. Our findings reveal differences in how these models rely on CoT, and show that CoT influence and faithfulness are not always aligned. 4 authors · Aug 27 2
- A Model Generalization Study in Localizing Indoor Cows with COw LOcalization (COLO) dataset Precision livestock farming (PLF) increasingly relies on advanced object localization techniques to monitor livestock health and optimize resource management. This study investigates the generalization capabilities of YOLOv8 and YOLOv9 models for cow detection in indoor free-stall barn settings, focusing on varying training data characteristics such as view angles and lighting, and model complexities. Leveraging the newly released public dataset, COws LOcalization (COLO) dataset, we explore three key hypotheses: (1) Model generalization is equally influenced by changes in lighting conditions and camera angles; (2) Higher model complexity guarantees better generalization performance; (3) Fine-tuning with custom initial weights trained on relevant tasks always brings advantages to detection tasks. Our findings reveal considerable challenges in detecting cows in images taken from side views and underscore the importance of including diverse camera angles in building a detection model. Furthermore, our results emphasize that higher model complexity does not necessarily lead to better performance. The optimal model configuration heavily depends on the specific task and dataset. Lastly, while fine-tuning with custom initial weights trained on relevant tasks offers advantages to detection tasks, simpler models do not benefit similarly from this approach. It is more efficient to train a simple model with pre-trained weights without relying on prior relevant information, which can require intensive labor efforts. Future work should focus on adaptive methods and advanced data augmentation to improve generalization and robustness. This study provides practical guidelines for PLF researchers on deploying computer vision models from existing studies, highlights generalization issues, and contributes the COLO dataset containing 1254 images and 11818 cow instances for further research. 3 authors · Jul 29, 2024