Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeLearning from Aggregate responses: Instance Level versus Bag Level Loss Functions
Due to the rise of privacy concerns, in many practical applications the training data is aggregated before being shared with the learner, in order to protect privacy of users' sensitive responses. In an aggregate learning framework, the dataset is grouped into bags of samples, where each bag is available only with an aggregate response, providing a summary of individuals' responses in that bag. In this paper, we study two natural loss functions for learning from aggregate responses: bag-level loss and the instance-level loss. In the former, the model is learnt by minimizing a loss between aggregate responses and aggregate model predictions, while in the latter the model aims to fit individual predictions to the aggregate responses. In this work, we show that the instance-level loss can be perceived as a regularized form of the bag-level loss. This observation lets us compare the two approaches with respect to bias and variance of the resulting estimators, and introduce a novel interpolating estimator which combines the two approaches. For linear regression tasks, we provide a precise characterization of the risk of the interpolating estimator in an asymptotic regime where the size of the training set grows in proportion to the features dimension. Our analysis allows us to theoretically understand the effect of different factors, such as bag size on the model prediction risk. In addition, we propose a mechanism for differentially private learning from aggregate responses and derive the optimal bag size in terms of prediction risk-privacy trade-off. We also carry out thorough experiments to corroborate our theory and show the efficacy of the interpolating estimator.
Few-Shot Segmentation Without Meta-Learning: A Good Transductive Inference Is All You Need?
We show that the way inference is performed in few-shot segmentation tasks has a substantial effect on performances -- an aspect often overlooked in the literature in favor of the meta-learning paradigm. We introduce a transductive inference for a given query image, leveraging the statistics of its unlabeled pixels, by optimizing a new loss containing three complementary terms: i) the cross-entropy on the labeled support pixels; ii) the Shannon entropy of the posteriors on the unlabeled query-image pixels; and iii) a global KL-divergence regularizer based on the proportion of the predicted foreground. As our inference uses a simple linear classifier of the extracted features, its computational load is comparable to inductive inference and can be used on top of any base training. Foregoing episodic training and using only standard cross-entropy training on the base classes, our inference yields competitive performances on standard benchmarks in the 1-shot scenarios. As the number of available shots increases, the gap in performances widens: on PASCAL-5i, our method brings about 5% and 6% improvements over the state-of-the-art, in the 5- and 10-shot scenarios, respectively. Furthermore, we introduce a new setting that includes domain shifts, where the base and novel classes are drawn from different datasets. Our method achieves the best performances in this more realistic setting. Our code is freely available online: https://github.com/mboudiaf/RePRI-for-Few-Shot-Segmentation.
Mixture Proportion Estimation Beyond Irreducibility
The task of mixture proportion estimation (MPE) is to estimate the weight of a component distribution in a mixture, given observations from both the component and mixture. Previous work on MPE adopts the irreducibility assumption, which ensures identifiablity of the mixture proportion. In this paper, we propose a more general sufficient condition that accommodates several settings of interest where irreducibility does not hold. We further present a resampling-based meta-algorithm that takes any existing MPE algorithm designed to work under irreducibility and adapts it to work under our more general condition. Our approach empirically exhibits improved estimation performance relative to baseline methods and to a recently proposed regrouping-based algorithm.
Proper losses for discrete generative models
We initiate the study of proper losses for evaluating generative models in the discrete setting. Unlike traditional proper losses, we treat both the generative model and the target distribution as black-boxes, only assuming ability to draw i.i.d. samples. We define a loss to be black-box proper if the generative distribution that minimizes expected loss is equal to the target distribution. Using techniques from statistical estimation theory, we give a general construction and characterization of black-box proper losses: they must take a polynomial form, and the number of draws from the model and target distribution must exceed the degree of the polynomial. The characterization rules out a loss whose expectation is the cross-entropy between the target distribution and the model. By extending the construction to arbitrary sampling schemes such as Poisson sampling, however, we show that one can construct such a loss.
Simpson's Bias in NLP Training
In most machine learning tasks, we evaluate a model M on a given data population S by measuring a population-level metric F(S;M). Examples of such evaluation metric F include precision/recall for (binary) recognition, the F1 score for multi-class classification, and the BLEU metric for language generation. On the other hand, the model M is trained by optimizing a sample-level loss G(S_t;M) at each learning step t, where S_t is a subset of S (a.k.a. the mini-batch). Popular choices of G include cross-entropy loss, the Dice loss, and sentence-level BLEU scores. A fundamental assumption behind this paradigm is that the mean value of the sample-level loss G, if averaged over all possible samples, should effectively represent the population-level metric F of the task, such as, that E[ G(S_t;M) ] approx F(S;M). In this paper, we systematically investigate the above assumption in several NLP tasks. We show, both theoretically and experimentally, that some popular designs of the sample-level loss G may be inconsistent with the true population-level metric F of the task, so that models trained to optimize the former can be substantially sub-optimal to the latter, a phenomenon we call it, Simpson's bias, due to its deep connections with the classic paradox known as Simpson's reversal paradox in statistics and social sciences.
Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetics. This has particular relevance in scientific datasets where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the CE loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, we here present two versions of a number token loss. The first is based on an L_p loss between the ground truth token value and the weighted sum of the predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
A Neural Scaling Law from Lottery Ticket Ensembling
Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predict that MSE losses decay as N^{-alpha}, alpha=4/d, where N is the number of model parameters, and d is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem y=x^2 manifests a different scaling law (alpha=1) from their predictions (alpha=4). We opened the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the N^{-1} scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
Loss-to-Loss Prediction: Scaling Laws for All Datasets
While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
Predicting Rare Events by Shrinking Towards Proportional Odds
Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate separate weights for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.
Improve Representation for Imbalanced Regression through Geometric Constraints
In representation learning, uniformity refers to the uniform feature distribution in the latent space (i.e., unit hypersphere). Previous work has shown that improving uniformity contributes to the learning of under-represented classes. However, most of the previous work focused on classification; the representation space of imbalanced regression remains unexplored. Classification-based methods are not suitable for regression tasks because they cluster features into distinct groups without considering the continuous and ordered nature essential for regression. In a geometric aspect, we uniquely focus on ensuring uniformity in the latent space for imbalanced regression through two key losses: enveloping and homogeneity. The enveloping loss encourages the induced trace to uniformly occupy the surface of a hypersphere, while the homogeneity loss ensures smoothness, with representations evenly spaced at consistent intervals. Our method integrates these geometric principles into the data representations via a Surrogate-driven Representation Learning (SRL) framework. Experiments with real-world regression and operator learning tasks highlight the importance of uniformity in imbalanced regression and validate the efficacy of our geometry-based loss functions.
"Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts
Machine learning models frequently experience performance drops under distribution shifts. The underlying cause of such shifts may be multiple simultaneous factors such as changes in data quality, differences in specific covariate distributions, or changes in the relationship between label and features. When a model does fail during deployment, attributing performance change to these factors is critical for the model developer to identify the root cause and take mitigating actions. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
Learning from Label Proportions: Bootstrapping Supervised Learners via Belief Propagation
Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines (up to 15%) for the LLP Binary Classification problem on various dataset types - tabular and Image. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples.
Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology
Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieve relevant documents to augment LLMs during inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses including over 500 references of LLMs with and without RAG on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of which, 45.3% hallucinated, 34.1% consisted of minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in the responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.
Fine-Tuning Medical Language Models for Enhanced Long-Contextual Understanding and Domain Expertise
Large Language Models (LLMs) have been widely applied in various professional fields. By fine-tuning the models using domain specific question and answer datasets, the professional domain knowledge and Q\&A abilities of these models have significantly improved, for example, medical professional LLMs that use fine-tuning of doctor-patient Q\&A data exhibit extraordinary disease diagnostic abilities. However, we observed that despite improvements in specific domain knowledge, the performance of medical LLM in long-context understanding has significantly declined, especially compared to general language models with similar parameters. The purpose of this study is to investigate the phenomenon of reduced performance in understanding long-context in medical LLM. We designed a series of experiments to conduct open-book professional knowledge exams on all models to evaluate their ability to read long-context. By adjusting the proportion and quantity of general data and medical data in the process of fine-tuning, we can determine the best data composition to optimize the professional model and achieve a balance between long-context performance and specific domain knowledge.
FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
Less than one percent of words would be affected by gender-inclusive language in German press texts
Research on gender and language is tightly knitted to social debates on gender equality and non-discriminatory language use. Psycholinguistic scholars have made significant contributions in this field. However, corpus-based studies that investigate these matters within the context of language use are still rare. In our study, we address the question of how much textual material would actually have to be changed if non-gender-inclusive texts were rewritten to be gender-inclusive. This quantitative measure is an important empirical insight, as a recurring argument against the use of gender-inclusive German is that it supposedly makes written texts too long and complicated. It is also argued that gender-inclusive language has negative effects on language learners. However, such effects are only likely if gender-inclusive texts are very different from those that are not gender-inclusive. In our corpus-linguistic study, we manually annotated German press texts to identify the parts that would have to be changed. Our results show that, on average, less than 1% of all tokens would be affected by gender-inclusive language. This small proportion calls into question whether gender-inclusive German presents a substantial barrier to understanding and learning the language, particularly when we take into account the potential complexities of interpreting masculine generics.
Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study
The rise of machine learning (ML) systems has exacerbated their carbon footprint due to increased capabilities and model sizes. However, there is scarce knowledge on how the carbon footprint of ML models is actually measured, reported, and evaluated. In light of this, the paper aims to analyze the measurement of the carbon footprint of 1,417 ML models and associated datasets on Hugging Face, which is the most popular repository for pretrained ML models. The goal is to provide insights and recommendations on how to report and optimize the carbon efficiency of ML models. The study includes the first repository mining study on the Hugging Face Hub API on carbon emissions. This study seeks to answer two research questions: (1) how do ML model creators measure and report carbon emissions on Hugging Face Hub?, and (2) what aspects impact the carbon emissions of training ML models? The study yielded several key findings. These include a stalled proportion of carbon emissions-reporting models, a slight decrease in reported carbon footprint on Hugging Face over the past 2 years, and a continued dominance of NLP as the main application domain. Furthermore, the study uncovers correlations between carbon emissions and various attributes such as model size, dataset size, and ML application domains. These results highlight the need for software measurements to improve energy reporting practices and promote carbon-efficient model development within the Hugging Face community. In response to this issue, two classifications are proposed: one for categorizing models based on their carbon emission reporting practices and another for their carbon efficiency. The aim of these classification proposals is to foster transparency and sustainable model development within the ML community.
A study of a deterministic model for meningitis epidemic
A compartmental deterministic model that allows (1) immunity from two stages of infection and carriage, and (2) disease induced death, is used in studying the dynamics of meningitis epidemic process in a closed population. It allows for difference in the transmission rate of infection to a susceptible by a carrier and an infective. It is generalized to allow a proportion ({\phi}) of those susceptibles infected to progress directly to infectives in stage I. Both models are used in this study. The threshold conditions for the spread of carrier and infectives in stage I are derived for the two models. Sensitivity analysis is performed on the reproductive number derived from the next generation matrix. The case-carrier ratio profile for various parameters and threshold values are shown. So also are the graphs of the total number ever infected as influenced by {\epsilon} and {\phi}. The infection transmission rate (eta), the odds in favor of a carrier, over an infective, in transmitting an infection to a susceptible ({\epsilon}) and the carrier conversion rate ({\phi}) to an infective in stage I, are identified as key parameters that should be subject of attention for any control intervention strategy. The case-carrier ratio profiles provide evidence of a critical case-carrier ratio attained before the number of reported cases grows to an epidemic level. They also provide visual evidence of epidemiological context, in this case, epidemic incidence (in later part of dry season) and endemic incidence (during rainy season). Results from total proportion ever infected suggest that the model, in which {\phi}=0 obtained, can adequately represent, in essence, the generalized model for this study.
Focus on Your Target: A Dual Teacher-Student Framework for Domain-adaptive Semantic Segmentation
We study unsupervised domain adaptation (UDA) for semantic segmentation. Currently, a popular UDA framework lies in self-training which endows the model with two-fold abilities: (i) learning reliable semantics from the labeled images in the source domain, and (ii) adapting to the target domain via generating pseudo labels on the unlabeled images. We find that, by decreasing/increasing the proportion of training samples from the target domain, the 'learning ability' is strengthened/weakened while the 'adapting ability' goes in the opposite direction, implying a conflict between these two abilities, especially for a single model. To alleviate the issue, we propose a novel dual teacher-student (DTS) framework and equip it with a bidirectional learning strategy. By increasing the proportion of target-domain data, the second teacher-student model learns to 'Focus on Your Target' while the first model is not affected. DTS is easily plugged into existing self-training approaches. In a standard UDA scenario (training on synthetic, labeled data and real, unlabeled data), DTS shows consistent gains over the baselines and sets new state-of-the-art results of 76.5\% and 75.1\% mIoUs on GTAvrightarrowCityscapes and SYNTHIArightarrowCityscapes, respectively.
Regression with Label Permutation in Generalized Linear Model
The assumption that response and predictor belong to the same statistical unit may be violated in practice. Unbiased estimation and recovery of true label ordering based on unlabeled data are challenging tasks and have attracted increasing attentions in the recent literature. In this paper, we present a relatively complete analysis of label permutation problem for the generalized linear model with multivariate responses. The theory is established under different scenarios, with knowledge of true parameters, with partial knowledge of underlying label permutation matrix and without any knowledge. Our results remove the stringent conditions required by the current literature and are further extended to the missing observation setting which has never been considered in the field of label permutation problem. On computational side, we propose two methods, "maximum likelihood estimation" algorithm and "two-step estimation" algorithm, to accommodate for different settings. When the proportion of permuted labels is moderate, both methods work effectively. Multiple numerical experiments are provided and corroborate our theoretical findings.
SimpleX: A Simple and Strong Baseline for Collaborative Filtering
Collaborative filtering (CF) is a widely studied research topic in recommender systems. The learning of a CF model generally depends on three major components, namely interaction encoder, loss function, and negative sampling. While many existing studies focus on the design of more powerful interaction encoders, the impacts of loss functions and negative sampling ratios have not yet been well explored. In this work, we show that the choice of loss function as well as negative sampling ratio is equivalently important. More specifically, we propose the cosine contrastive loss (CCL) and further incorporate it to a simple unified CF model, dubbed SimpleX. Extensive experiments have been conducted on 11 benchmark datasets and compared with 29 existing CF models in total. Surprisingly, the results show that, under our CCL loss and a large negative sampling ratio, SimpleX can surpass most sophisticated state-of-the-art models by a large margin (e.g., max 48.5% improvement in NDCG@20 over LightGCN). We believe that SimpleX could not only serve as a simple strong baseline to foster future research on CF, but also shed light on the potential research direction towards improving loss function and negative sampling. Our source code will be available at https://reczoo.github.io/SimpleX.
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning
The extensive memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings, such as cloud environments or on-device. PLMs use embedding matrices to represent extensive vocabularies, forming a large proportion of the model parameters. While previous work towards parameter-efficient PLM development has considered pruning parameters within the transformer layers, pruning the embedding matrix as part of fine-tuning or inference has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused in these scenarios. We then propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix. We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach maintains equivalent downstream task performance while allowing a more efficient use of compute resources.
Industrial Application of 6D Pose Estimation for Robotic Manipulation in Automotive Internal Logistics
Despite the advances in robotics a large proportion of the of parts handling tasks in the automotive industry's internal logistics are not automated but still performed by humans. A key component to competitively automate these processes is a 6D pose estimation that can handle a large number of different parts, is adaptable to new parts with little manual effort, and is sufficiently accurate and robust with respect to industry requirements. In this context, the question arises as to the current status quo with respect to these measures. To address this we built a representative 6D pose estimation pipeline with state-of-the-art components from economically scalable real to synthetic data generation to pose estimators and evaluated it on automotive parts with regards to a realistic sequencing process. We found that using the data generation approaches, the performance of the trained 6D pose estimators are promising, but do not meet industry requirements. We reveal that the reason for this is the inability of the estimators to provide reliable uncertainties for their poses, rather than the ability of to provide sufficiently accurate poses. In this context we further analyzed how RGB- and RGB-D-based approaches compare against this background and show that they are differently vulnerable to the domain gap induced by synthetic data.
BabyLM Challenge: Exploring the Effect of Variation Sets on Language Model Training Efficiency
While current large language models have achieved a remarkable success, their data efficiency remains a challenge to overcome. Recently it has been suggested that child-directed speech (CDS) can improve training data efficiency of modern language models based on Transformer neural networks. However, it is not yet understood which specific properties of CDS are effective for training these models. In the context of the BabyLM Challenge, we focus on Variation Sets (VSs), sets of consecutive utterances expressing a similar intent with slightly different words and structures, which are ubiquitous in CDS. To assess the impact of VSs on training data efficiency, we augment CDS data with different proportions of artificial VSs and use these datasets to train an auto-regressive model, GPT-2. We find that the best proportion of VSs depends on the evaluation benchmark: BLiMP and GLUE scores benefit from the presence of VSs, but EWOK scores do not. Additionally, the results vary depending on multiple factors such as the number of epochs and the order of utterance presentation. Taken together, these findings suggest that VSs can have a beneficial influence on language models, while leaving room for further investigation.
Teacher-Student Architecture for Mixed Supervised Lung Tumor Segmentation
Purpose: Automating tasks such as lung tumor localization and segmentation in radiological images can free valuable time for radiologists and other clinical personnel. Convolutional neural networks may be suited for such tasks, but require substantial amounts of labeled data to train. Obtaining labeled data is a challenge, especially in the medical domain. Methods: This paper investigates the use of a teacher-student design to utilize datasets with different types of supervision to train an automatic model performing pulmonary tumor segmentation on computed tomography images. The framework consists of two models: the student that performs end-to-end automatic tumor segmentation and the teacher that supplies the student additional pseudo-annotated data during training. Results: Using only a small proportion of semantically labeled data and a large number of bounding box annotated data, we achieved competitive performance using a teacher-student design. Models trained on larger amounts of semantic annotations did not perform better than those trained on teacher-annotated data. Conclusions: Our results demonstrate the potential of utilizing teacher-student designs to reduce the annotation load, as less supervised annotation schemes may be performed, without any real degradation in segmentation accuracy.
Grounding Conversations with Improvised Dialogues
Effective dialogue involves grounding, the process of establishing mutual knowledge that is essential for communication between people. Modern dialogue systems are not explicitly trained to build common ground, and therefore overlook this important aspect of communication. Improvisational theater (improv) intrinsically contains a high proportion of dialogue focused on building common ground, and makes use of the yes-and principle, a strong grounding speech act, to establish coherence and an actionable objective reality. We collect a corpus of more than 26,000 yes-and turns, transcribing them from improv dialogues and extracting them from larger, but more sparsely populated movie script dialogue corpora, via a bootstrapped classifier. We fine-tune chit-chat dialogue systems with our corpus to encourage more grounded, relevant conversation and confirm these findings with human evaluations.
Towards VQA Models That Can Read
Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
CartoMark: a benchmark dataset for map pattern recognition and 1 map content retrieval with machine intelligence
Maps are fundamental medium to visualize and represent the real word in a simple and 16 philosophical way. The emergence of the 3rd wave information has made a proportion of maps are available to be generated ubiquitously, which would significantly enrich the dimensions and perspectives to understand the characteristics of the real world. However, a majority of map dataset have never been discovered, acquired and effectively used, and the map data used in many applications might not be completely fitted for the authentic demands of these applications. This challenge is emerged due to the lack of numerous well-labelled benchmark datasets for implementing the deep learning approaches into identifying complicated map content. Thus, we develop a large-scale benchmark dataset that includes well-labelled dataset for map text annotation recognition, map scene classification, map super-resolution reconstruction, and map style transferring. Furthermore, these well-labelled datasets would facilitate the state-of-the-art machine intelligence technologies to conduct map feature detection, map pattern recognition and map content retrieval. We hope our efforts would be useful for AI-enhanced cartographical applications.
RACE: Large-scale ReAding Comprehension Dataset From Examinations
We present RACE, a new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students in the age range between 12 to 18, RACE consists of near 28,000 passages and near 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students' ability in understanding and reasoning. In particular, the proportion of questions that requires reasoning is much larger in RACE than that in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models (43%) and the ceiling human performance (95%). We hope this new dataset can serve as a valuable resource for research and evaluation in machine comprehension. The dataset is freely available at http://www.cs.cmu.edu/~glai1/data/race/ and the code is available at https://github.com/qizhex/RACE_AR_baselines.
Automatic Instruction Optimization for Open-source LLM Instruction Tuning
Instruction tuning is crucial for enabling Language Learning Models (LLMs) in responding to human instructions. The quality of instruction pairs used for tuning greatly affects the performance of LLMs. However, the manual creation of high-quality instruction datasets is costly, leading to the adoption of automatic generation of instruction pairs by LLMs as a popular alternative in the training of open-source LLMs. To ensure the high quality of LLM-generated instruction datasets, several approaches have been proposed. Nevertheless, existing methods either compromise dataset integrity by filtering a large proportion of samples, or are unsuitable for industrial applications. In this paper, instead of discarding low-quality samples, we propose CoachLM, a novel approach to enhance the quality of instruction datasets through automatic revisions on samples in the dataset. CoachLM is trained from the samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset from 17.7% to 78.9%. The effectiveness of CoachLM is further assessed on various real-world instruction test sets. The results show that CoachLM improves the instruction-following capabilities of the instruction-tuned LLM by an average of 29.9%, which even surpasses larger LLMs with nearly twice the number of parameters. Furthermore, CoachLM is successfully deployed in a data management system for LLMs at Huawei, resulting in an efficiency improvement of up to 20% in the cleaning of 40k real-world instruction pairs. We release the training data and code of CoachLM (https://github.com/lunyiliu/CoachLM).
Characterizing Mechanisms for Factual Recall in Language Models
Language Models (LMs) often must integrate facts they memorized in pretraining with new information that appears in a given context. These two sources can disagree, causing competition within the model, and it is unclear how an LM will resolve the conflict. On a dataset that queries for knowledge of world capitals, we investigate both distributional and mechanistic determinants of LM behavior in such situations. Specifically, we measure the proportion of the time an LM will use a counterfactual prefix (e.g., "The capital of Poland is London") to overwrite what it learned in pretraining ("Warsaw"). On Pythia and GPT2, the training frequency of both the query country ("Poland") and the in-context city ("London") highly affect the models' likelihood of using the counterfactual. We then use head attribution to identify individual attention heads that either promote the memorized answer or the in-context answer in the logits. By scaling up or down the value vector of these heads, we can control the likelihood of using the in-context answer on new data. This method can increase the rate of generating the in-context answer to 88\% of the time simply by scaling a single head at runtime. Our work contributes to a body of evidence showing that we can often localize model behaviors to specific components and provides a proof of concept for how future methods might control model behavior dynamically at runtime.
Devil in the Number: Towards Robust Multi-modality Data Filter
In order to appropriately filter multi-modality data sets on a web-scale, it becomes crucial to employ suitable filtering methods to boost performance and reduce training costs. For instance, LAION papers employs the CLIP score filter to select data with CLIP scores surpassing a certain threshold. On the other hand, T-MARS achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Through analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, present in the textual content. Our experiments on a subset of the data unveil the profound impact of these redundant elements on the CLIP scores. A logical approach would involve reevaluating the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the ``small scale" of DataComp (a data filtering benchmark) on ImageNet distribution shifts, achieving a 3.6% performance improvement. The results also demonstrate that our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data. The impact of numbers on CLIP and their handling provide valuable insights for improving the effectiveness of CLIP training, including language rewrite techniques.
DEPAC: a Corpus for Depression and Anxiety Detection from Speech
Mental distress like depression and anxiety contribute to the largest proportion of the global burden of diseases. Automated diagnosis systems of such disorders, empowered by recent innovations in Artificial Intelligence, can pave the way to reduce the sufferings of the affected individuals. Development of such systems requires information-rich and balanced corpora. In this work, we introduce a novel mental distress analysis audio dataset DEPAC, labeled based on established thresholds on depression and anxiety standard screening tools. This large dataset comprises multiple speech tasks per individual, as well as relevant demographic information. Alongside, we present a feature set consisting of hand-curated acoustic and linguistic features, which were found effective in identifying signs of mental illnesses in human speech. Finally, we justify the quality and effectiveness of our proposed audio corpus and feature set in predicting depression severity by comparing the performance of baseline machine learning models built on this dataset with baseline models trained on other well-known depression corpora.
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
State-of-the-art Text-to-Image models like Stable Diffusion and DALLEcdot2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that these models can generate a substantial percentage of unsafe images; across four models and four prompt datasets, 14.56% of all generated images are unsafe. When comparing the four models, we find different risk levels, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all generated images are unsafe). Given Stable Diffusion's tendency to generate more unsafe content, we evaluate its potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. We employ three image editing methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by Stable Diffusion. Our evaluation result shows that 24% of the generated images using DreamBooth are hateful meme variants that present the features of the original hateful meme and the target individual/community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and encourage better safeguard tools to be developed to prevent unsafe generation.
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
A key goal for the advancement of AI is to develop technologies that serve the needs not just of one group but of all communities regardless of their geographical region. In fact, a significant proportion of knowledge is locally shared by people from certain regions but may not apply equally in other regions because of cultural differences. If a model is unaware of regional characteristics, it may lead to performance disparity across regions and result in bias against underrepresented groups. We propose GIVL, a Geographically Inclusive Vision-and-Language Pre-trained model. There are two attributes of geo-diverse visual concepts which can help to learn geo-diverse knowledge: 1) concepts under similar categories have unique knowledge and visual characteristics, 2) concepts with similar visual features may fall in completely different categories. Motivated by the attributes, we design new pre-training objectives Image Knowledge Matching (IKM) and Image Edit Checking (IEC) to pre-train GIVL. Compared with similar-size models pre-trained with similar scale of data, GIVL achieves state-of-the-art (SOTA) and more balanced performance on geo-diverse V&L tasks.
One-connection rule for structural equation models
Linear structural equation models are multivariate statistical models encoded by mixed graphs. In particular, the set of covariance matrices for distributions belonging to a linear structural equation model for a fixed mixed graph G=(V, D,B) is parameterized by a rational function with parameters for each vertex and edge in G. This rational parametrization naturally allows for the study of these models from an algebraic and combinatorial point of view. Indeed, this point of view has led to a collection of results in the literature, mainly focusing on questions related to identifiability and determining relationships between covariances (i.e., finding polynomials in the Gaussian vanishing ideal). So far, a large proportion of these results has focused on the case when D, the directed part of the mixed graph G, is acyclic. This is due to the fact that in the acyclic case, the parametrization becomes polynomial and there is a description of the entries of the covariance matrices in terms of a finite sum. We move beyond the acyclic case and give a closed form expression for the entries of the covariance matrices in terms of the one-connections in a graph obtained from D through some small operations. This closed form expression then allows us to show that if G is simple, then the parametrization map is generically finite-to-one. Finally, having a closed form expression for the covariance matrices allows for the development of an algorithm for systematically exploring possible polynomials in the Gaussian vanishing ideal.
kNN-Embed: Locally Smoothed Embedding Mixtures For Multi-interest Candidate Retrieval
Candidate generation is the first stage in recommendation systems, where a light-weight system is used to retrieve potentially relevant items for an input user. These candidate items are then ranked and pruned in later stages of recommender systems using a more complex ranking model. Since candidate generation is the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downstream ranking models. A common approach for candidate generation is to leverage approximate nearest neighbor (ANN) search from a single dense query embedding; however, this approach this can yield a low-diversity result set with many near duplicates. As users often have multiple interests, candidate retrieval should ideally return a diverse set of candidates reflective of the user's multiple interests. To this end, we introduce kNN-Embed, a general approach to improving diversity in dense ANN-based retrieval. kNN-Embed represents each user as a smoothed mixture over learned item clusters that represent distinct `interests' of the user. By querying each of a user's mixture component in proportion to their mixture weights, we retrieve a high-diversity set of candidates reflecting elements from each of a user's interests. We experimentally compare kNN-Embed to standard ANN candidate retrieval, and show significant improvements in overall recall and improved diversity across three datasets. Accompanying this work, we open source a large Twitter follow-graph dataset, to spur further research in graph-mining and representation learning for recommender systems.
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
Mask Transfiner for High-Quality Instance Segmentation
Two-stage and query-based instance segmentation methods have achieved remarkable results. However, their segmented masks are still very coarse. In this paper, we present Mask Transfiner for high-quality and efficient instance segmentation. Instead of operating on regular dense tensors, our Mask Transfiner decomposes and represents the image regions as a quadtree. Our transformer-based approach only processes detected error-prone tree nodes and self-corrects their errors in parallel. While these sparse pixels only constitute a small proportion of the total number, they are critical to the final mask quality. This allows Mask Transfiner to predict highly accurate instance masks, at a low computational cost. Extensive experiments demonstrate that Mask Transfiner outperforms current instance segmentation methods on three popular benchmarks, significantly improving both two-stage and query-based frameworks by a large margin of +3.0 mask AP on COCO and BDD100K, and +6.6 boundary AP on Cityscapes. Our code and trained models will be available at http://vis.xyz/pub/transfiner.
SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data
Data mixing augmentation has proved effective in training deep models. Recent methods mix labels mainly based on the mixture proportion of image pixels. As the main discriminative information of a fine-grained image usually resides in subtle regions, methods along this line are prone to heavy label noise in fine-grained recognition. We propose in this paper a novel scheme, termed as Semantically Proportional Mixing (SnapMix), which exploits class activation map (CAM) to lessen the label noise in augmenting fine-grained data. SnapMix generates the target label for a mixed image by estimating its intrinsic semantic composition, and allows for asymmetric mixing operations and ensures semantic correspondence between synthetic images and target labels. Experiments show that our method consistently outperforms existing mixed-based approaches on various datasets and under different network depths. Furthermore, by incorporating the mid-level features, the proposed SnapMix achieves top-level performance, demonstrating its potential to serve as a solid baseline for fine-grained recognition. Our code is available at https://github.com/Shaoli-Huang/SnapMix.git.
LegendreTron: Uprising Proper Multiclass Loss Learning
Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as properness, which asserts that Bayes' rule is optimal. Recent works have sought to learn losses and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps R to [0,1] to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between R^{C-1} and the projected probability simplex Delta^{C-1} by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns proper canonical losses and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a t-test at 99% significance on all datasets with greater than 10 classes.
Masked Autoencoders Are Scalable Vision Learners
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or qualitative strategies to tune the proportions, we discover the quantitative predictability of model performance regarding the mixture proportions in function forms, which we refer to as the data mixing laws. Fitting such functions on sample mixtures unveils model performance on unseen mixtures before actual runs, thus guiding the selection of an ideal data mixture. Furthermore, we propose nested use of the scaling laws of training steps, model sizes, and our data mixing law to enable predicting the performance of large models trained on massive data under various mixtures with only small-scale training. Moreover, experimental results verify that our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama, reaching a performance comparable to the one trained for 48% more steps on the default mixture. Extending the application of data mixing laws to continual training accurately predicts the critical mixture proportion that avoids catastrophic forgetting and outlooks the potential for dynamic data schedules
Visually-Aware Context Modeling for News Image Captioning
News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking human thought process of linking articles to images. Furthermore, to tackle the problem of the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method Contrasting with Language Model backbone (CoLaM) to the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We out-perform the previous state-of-the-art (without external data) by 7.97/5.80 CIDEr scores on GoodNews/NYTimes800k. Our code is available at https://github.com/tingyu215/VACNIC.
Deriving pulsar pair-production multiplicities from pulsar wind nebulae using H.E.S.S. and LHAASO observations
Pulsar Wind Nebulae (PWNe) dominate the galactic gamma-ray sky at very high energies, and are major contributors to the leptonic cosmic ray flux. However, whether or not pulsars also accelerate ions to comparable energies is not yet experimentally confirmed. We aim to constrain the birth period and pair-production multiplicity for a set of pulsars. In doing so, we aim to constrain the proportion of ions in the pulsar magnetosphere and hence the proportion of ions that could enter the pulsar wind. We estimate possible ranges of the value of the average pair production multiplicity for a sample of 26 pulsars in the Australia Telescope National Facility (ATNF) catalogue, which have also been observed by the High Energy Stereoscopic System (H.E.S.S.) telescopes. We then derive lower limits for the pulsar birth periods and average pair production multiplicities for a subset of these sources where the extent of the pulsar wind nebula and surrounding supernova shell have been measured in the radio. We also derive curves for the average pair production multiplicities as a function of birth period for sources recently observed by the Large High Altitude Air Shower Observatory (LHAASO). We show that there is a potential for hadrons entering the pulsar wind for most of the H.E.S.S. and LHAASO sources we consider, dependent upon the efficiency of luminosity conversion into particles. We also present estimates of the pulsar birth period for six of these sources, which all fall into the range of simeq10-50 ms.
Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic
Large Language Models (LLMs) have shown impressive results in multiple domains of natural language processing (NLP) but are mainly focused on the English language. Recently, more LLMs have incorporated a larger proportion of multilingual text to represent low-resource languages. In Arabic NLP, several Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the past two years. However, most Arabic LLMs have more than 7 billion parameters, which increases their hardware requirements and inference latency, when compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable LM 1.6B chat model achieves impressive results on several benchmarks beating multiple models with up to 8x the parameters. In addition, we show the benefit of mixing in synthetic instruction tuning data by augmenting our fine-tuning data with a large synthetic dialogue dataset.
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.
Where Are Large Language Models for Code Generation on GitHub?
The increasing use of Large Language Models (LLMs) in software development has garnered significant attention from researchers assessing the quality of the code they generate. However, much of the research focuses on controlled datasets such as HumanEval, which fail to adequately represent how developers actually utilize LLMs' code generation capabilities or clarify the characteristics of LLM-generated code in real-world development scenarios. To bridge this gap, our study investigates the characteristics of LLM-generated code and its corresponding projects hosted on GitHub. Our findings reveal several key insights: (1) ChatGPT and Copilot are the most frequently utilized for generating code on GitHub. In contrast, there is very little code generated by other LLMs on GitHub. (2) Projects containing ChatGPT/Copilot-generated code are often small and less known, led by individuals or small teams. Despite this, most projects are continuously evolving and improving. (3) ChatGPT/Copilot is mainly utilized for generating Python, Java, and TypeScript scripts for data processing and transformation. C/C++ and JavaScript code generation focuses on algorithm and data structure implementation and user interface code. Most ChatGPT/Copilot-generated code snippets are relatively short and exhibit low complexity. (4) Compared to human-written code, ChatGPT/Copilot-generated code exists in a small proportion of projects and generally undergoes fewer modifications. Additionally, modifications due to bugs are even fewer, ranging from just 3% to 8% across different languages. (5) Most comments on ChatGPT/Copilot-generated code lack detailed information, often only stating the code's origin without mentioning prompts, human modifications, or testing status. Based on these findings, we discuss the implications for researchers and practitioners.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
Boldly Going Where No Benchmark Has Gone Before: Exposing Bias and Shortcomings in Code Generation Evaluation
Motivated by the increasing popularity of code generation from human descriptions using large language models (LLMs), several benchmarks have been proposed to assess the capabilities of existing and emerging models. This study presents a large-scale human evaluation of HumanEval and MBPP, two widely used benchmarks for Python code generation, focusing on their diversity and difficulty. Our findings reveal a significant bias towards a limited number of programming concepts, with negligible or no representation of most concepts. Additionally, we identify a concerningly high proportion of easy programming questions, potentially leading to an overestimation of model performance on code generation tasks.
Study of the effectiveness of incentive measures on Covid-19 vaccination in the United States of America
With COVID-19 having emerged as the most widespread human pandemic disease in a century, the need to control its spread to avoid massive loss of life became more than necessary, and extremely fast. Several vaccines were developed and the task of policy makers was suddenly to convince the reluctant population to be vaccinated by various means. While some countries have chosen a policy of mandatory vaccination or punitive incentives, many states in the United States have adopted various incentives to try to increase vaccination coverage. A study we conducted in recent months quantified the effect of these measures on the proportion of the population vaccinated, using the synthetic control method, by simulating what would have happened without these measures. The aim now is to generalize this study to smaller scales, to improve the results of our previous study, to quantify their robustness and to provide a tool that can be used by policy makers to adapt their behavior in light of the results obtained.
Delving into Masked Autoencoders for Multi-Label Thorax Disease Classification
Vision Transformer (ViT) has become one of the most popular neural architectures due to its great scalability, computational efficiency, and compelling performance in many vision tasks. However, ViT has shown inferior performance to Convolutional Neural Network (CNN) on medical tasks due to its data-hungry nature and the lack of annotated medical data. In this paper, we pre-train ViTs on 266,340 chest X-rays using Masked Autoencoders (MAE) which reconstruct missing pixels from a small part of each image. For comparison, CNNs are also pre-trained on the same 266,340 X-rays using advanced self-supervised methods (e.g., MoCo v2). The results show that our pre-trained ViT performs comparably (sometimes better) to the state-of-the-art CNN (DenseNet-121) for multi-label thorax disease classification. This performance is attributed to the strong recipes extracted from our empirical studies for pre-training and fine-tuning ViT. The pre-training recipe signifies that medical reconstruction requires a much smaller proportion of an image (10% vs. 25%) and a more moderate random resized crop range (0.5~1.0 vs. 0.2~1.0) compared with natural imaging. Furthermore, we remark that in-domain transfer learning is preferred whenever possible. The fine-tuning recipe discloses that layer-wise LR decay, RandAug magnitude, and DropPath rate are significant factors to consider. We hope that this study can direct future research on the application of Transformers to a larger variety of medical imaging tasks.
FuzzCoder: Byte-level Fuzzing Test via Large Language Model
Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework to leverage the code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as the sequence-to-sequence modeling, where LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies locations in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gain significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.
Watermarking Makes Language Models Radioactive
This paper investigates the radioactivity of LLM-generated texts, i.e. whether it is possible to detect that such input was used as training data. Conventional methods like membership inference can carry out this detection with some level of accuracy. We show that watermarked training data leaves traces easier to detect and much more reliable than membership inference. We link the contamination level to the watermark robustness, its proportion in the training set, and the fine-tuning process. We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when as little as 5% of training text is watermarked. Thus, LLM watermarking, originally designed for detecting machine-generated text, gives the ability to easily identify if the outputs of a watermarked LLM were used to fine-tune another LLM.
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
In this paper, we present Cross Language Agent -- Simultaneous Interpretation, CLASI, a high-quality and human-like Simultaneous Speech Translation (SiST) System. Inspired by professional human interpreters, we utilize a novel data-driven read-write strategy to balance the translation quality and latency. To address the challenge of translating in-domain terminologies, CLASI employs a multi-modal retrieving module to obtain relevant information to augment the translation. Supported by LLMs, our approach can generate error-tolerated translation by considering the input audio, historical context, and retrieved information. Experimental results show that our system outperforms other systems by significant margins. Aligned with professional human interpreters, we evaluate CLASI with a better human evaluation metric, valid information proportion (VIP), which measures the amount of information that can be successfully conveyed to the listeners. In the real-world scenarios, where the speeches are often disfluent, informal, and unclear, CLASI achieves VIP of 81.3% and 78.0% for Chinese-to-English and English-to-Chinese translation directions, respectively. In contrast, state-of-the-art commercial or open-source systems only achieve 35.4% and 41.6%. On the extremely hard dataset, where other systems achieve under 13% VIP, CLASI can still achieve 70% VIP.
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce the proportion between the variations of an arbitrary input latent and that of the output image is a constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion.
Focused Transformer: Contrastive Training for Context Scaling
Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which comprises of (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a 256 k context length for passkey retrieval.
Steel-LLM:From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.
WikiNER-fr-gold: A Gold-Standard NER Corpus
We address in this article the the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner i.e. no manual verification has been carried out a posteriori. Such corpus is called silver-standard. In this paper we propose WikiNER-fr-gold which is a revised version of the French proportion of WikiNER. Our corpus consists of randomly sampled 20% of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally we present an analysis of errors and inconsistency observed in the WikiNER-fr corpus, and we discuss potential future work directions.
Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature?
ImageNet, an influential dataset in computer vision, is traditionally evaluated using single-label classification, which assumes that an image can be adequately described by a single concept or label. However, this approach may not fully capture the complex semantics within the images available in ImageNet, potentially hindering the development of models that effectively learn these intricacies. This study critically examines the prevalent single-label benchmarking approach and advocates for a shift to multi-label benchmarking for ImageNet. This shift would enable a more comprehensive assessment of the capabilities of deep neural network (DNN) models. We analyze the effectiveness of pre-trained state-of-the-art DNNs on ImageNet and one of its variants, ImageNetV2. Studies in the literature have reported unexpected accuracy drops of 11% to 14% on ImageNetV2. Our findings show that these reported declines are largely attributable to a characteristic of the dataset that has not received sufficient attention -- the proportion of images with multiple labels. Taking this characteristic into account, the results of our experiments provide evidence that there is no substantial degradation in effectiveness on ImageNetV2. Furthermore, we acknowledge that ImageNet pre-trained models exhibit some capability at capturing the multi-label nature of the dataset even though they were trained under the single-label assumption. Consequently, we propose a new evaluation approach to augment existing approaches that assess this capability. Our findings highlight the importance of considering the multi-label nature of the ImageNet dataset during benchmarking. Failing to do so could lead to incorrect conclusions regarding the effectiveness of DNNs and divert research efforts from addressing other substantial challenges related to the reliability and robustness of these models.
PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback
Large Language Models (LLMs) are widely adopted for assisting in software development tasks, yet their performance evaluations have narrowly focused on the functional correctness of generated code. Human programmers, however, require LLM-generated code to be not only correct but also optimally efficient. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code by incorporating feedback based on runtime during test case execution into the self-refinement iterations. With PerfCodeGen, we achieve speedups for a significantly higher proportion of problems compared to using the base LLM with sophisticated prompting techniques. Applied to open language models like Phi-3-mini, PerfCodeGen achieves runtime efficiency comparable to prompting powerful closed models like GPT-4. We achieve state-of-the-art runtime efficiency on benchmarks such as HumanEval, MBPP, and APPS, frequently surpassing the ground truth reference solutions with PerfCodeGen using GPT-3.5 and GPT-4. Additionally, we demonstrate the effectiveness of our approach in enhancing code quality across a range of open LLMs of varying sizes including Phi-3-mini, Llama 3 8B, Mixtral 8x7B, Command R, and Llama 3 70B.
Learning to Edit: Aligning LLMs with Knowledge Editing
Knowledge editing techniques, aiming to efficiently modify a minor proportion of knowledge in large language models (LLMs) without negatively impacting performance across other inputs, have garnered widespread attention. However, existing methods predominantly rely on memorizing the updated knowledge, impeding LLMs from effectively combining the new knowledge with their inherent knowledge when answering questions. To this end, we propose a Learning to Edit (LTE) framework, focusing on teaching LLMs to apply updated knowledge into input questions, inspired by the philosophy of "Teach a man to fish." LTE features a two-phase process: (i) the Alignment Phase, which fine-tunes LLMs on a meticulously curated parallel dataset to make reliable, in-scope edits while preserving out-of-scope information and linguistic proficiency; and (ii) the Inference Phase, which employs a retrieval-based mechanism for real-time and mass knowledge editing. By comparing our approach with seven advanced baselines across four popular knowledge editing benchmarks and two LLM architectures, we demonstrate LTE's superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds. The data and code are available at https://github.com/YJiangcm/LTE.
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description. Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection and cross-modal matching, which is limited by the isolated architecture. In such a paradigm, the detector needs to sample keypoints from raw point clouds due to the inherent properties of 3D point clouds (irregular and large-scale), to generate the corresponding object proposal for each keypoint. However, sparse proposals may leave out the target in detection, while dense proposals may confuse the matching model. Moreover, the language-irrelevant detection stage can only sample a small proportion of keypoints on the target, deteriorating the target prediction. In this paper, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method, which progressively selects keypoints with the guidance of language and directly locates the target. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on the points of language-relevant objects, which are significant clues for grounding. Besides, we devise a Target-oriented Progressive Mining (TPM) module to finely concentrate on the points of the target, which is enabled by progressive intra-modal relation modeling and inter-modal target mining. 3D-SPS bridges the gap between detection and matching in the 3D visual grounding task, localizing the target at a single stage. Experiments demonstrate that 3D-SPS achieves state-of-the-art performance on both ScanRefer and Nr3D/Sr3D datasets.
Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization
While generalization over tasks from easy to hard is crucial to profile language models (LLMs), the datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still blank. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with numerical difficulty scores. To systematically estimate problem difficulties, we collect abundant performance data on attempts to each problem by humans in the real world or LLMs on the prominent leaderboard. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at https://huggingface.co/datasets/furonghuang-lab/Easy2Hard-Bench.
KS-APR: Keyframe Selection for Robust Absolute Pose Regression
Markerless Mobile Augmented Reality (AR) aims to anchor digital content in the physical world without using specific 2D or 3D objects. Absolute Pose Regressors (APR) are end-to-end machine learning solutions that infer the device's pose from a single monocular image. Thanks to their low computation cost, they can be directly executed on the constrained hardware of mobile AR devices. However, APR methods tend to yield significant inaccuracies for input images that are too distant from the training set. This paper introduces KS-APR, a pipeline that assesses the reliability of an estimated pose with minimal overhead by combining the inference results of the APR and the prior images in the training set. Mobile AR systems tend to rely upon visual-inertial odometry to track the relative pose of the device during the experience. As such, KS-APR favours reliability over frequency, discarding unreliable poses. This pipeline can integrate most existing APR methods to improve accuracy by filtering unreliable images with their pose estimates. We implement the pipeline on three types of APR models on indoor and outdoor datasets. The median error on position and orientation is reduced for all models, and the proportion of large errors is minimized across datasets. Our method enables state-of-the-art APRs such as DFNetdm to outperform single-image and sequential APR methods. These results demonstrate the scalability and effectiveness of KS-APR for visual localization tasks that do not require one-shot decisions.
Geometry-Aware Adaptation for Pretrained Models
Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes -- or, in the case of zero-shot prediction, to improve its performance -- without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping argmax with the Fr\'echet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
Causes and Cures for Interference in Multilingual Translation
Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) are primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with less than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low and high resource language pairs effectively, and can lead to superior performance overall.
Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection
Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding, so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor of a given unlabeled dataset. We leverage on outputs of several anomaly detectors as a representation that already captures the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the anomaly detectors' performance over several alternative methods. All code is publicly available for full reproducibility.
Do Multilingual Language Models Capture Differing Moral Norms?
Massively multilingual sentence representations are trained on large corpora of uncurated data, with a very imbalanced proportion of languages included in the training. This may cause the models to grasp cultural values including moral judgments from the high-resource languages and impose them on the low-resource languages. The lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs. Both these issues can negatively influence zero-shot cross-lingual model transfer and potentially lead to harmful outcomes. Therefore, we aim to (1) detect and quantify these issues by comparing different models in different languages, (2) develop methods for improving undesirable properties of the models. Our initial experiments using the multilingual model XLM-R show that indeed multilingual LMs capture moral norms, even with potentially higher human-agreement than monolingual ones. However, it is not yet clear to what extent these moral norms differ between languages.
Molecular Graph Generation via Geometric Scattering
Graph neural networks (GNNs) have been used extensively for addressing problems in drug design and discovery. Both ligand and target molecules are represented as graphs with node and edge features encoding information about atomic elements and bonds respectively. Although existing deep learning models perform remarkably well at predicting physicochemical properties and binding affinities, the generation of new molecules with optimized properties remains challenging. Inherently, most GNNs perform poorly in whole-graph representation due to the limitations of the message-passing paradigm. Furthermore, step-by-step graph generation frameworks that use reinforcement learning or other sequential processing can be slow and result in a high proportion of invalid molecules with substantial post-processing needed in order to satisfy the principles of stoichiometry. To address these issues, we propose a representation-first approach to molecular graph generation. We guide the latent representation of an autoencoder by capturing graph structure information with the geometric scattering transform and apply penalties that structure the representation also by molecular properties. We show that this highly structured latent space can be directly used for molecular graph generation by the use of a GAN. We demonstrate that our architecture learns meaningful representations of drug datasets and provides a platform for goal-directed drug synthesis.
How to Synthesize Text Data without Model Collapse?
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We initially observe that by introducing a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zeros without affecting the capabilities of SFT LMs and larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify delta parameters of multiple SFT homologous models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results show that: (1) The delta parameter value ranges for SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We have also tried to remove fine-tuned instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates the abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, the merger of WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original 64.2 performance. Codes are available at https://github.com/yule-BUAA/MergeLM.
PolyLM: An Open Source Polyglot Large Language Model
Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following nature language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, avaliable in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, alone with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque. In particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent to which tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
OpenChat: Advancing Open-source Language Models with Mixed-Quality Data
Nowadays, open-source large language models like LLaMA have emerged. Recent developments have incorporated supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to align these models with human goals. However, SFT methods treat all training data with mixed quality equally, while RLFT methods require high-quality pairwise or ranking-based preference data. In this study, we present a novel framework, named OpenChat, to advance open-source language models with mixed-quality data. Specifically, we consider the general SFT training data, consisting of a small amount of expert data mixed with a large proportion of sub-optimal data, without any preference labels. We propose the C(onditioned)-RLFT, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy to leverage complementary data quality information. Interestingly, the optimal policy in C-RLFT can be easily solved through single-stage, RL-free supervised learning, which is lightweight and avoids costly human preference labeling. Through extensive experiments on three standard benchmarks, our openchat-13b fine-tuned with C-RLFT achieves the highest average performance among all 13b open-source language models. Moreover, we use AGIEval to validate the model generalization performance, in which only openchat-13b surpasses the base model. Finally, we conduct a series of analyses to shed light on the effectiveness and robustness of OpenChat. Our code, data, and models are publicly available at https://github.com/imoneoi/openchat.
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk in introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.
LIVS: A Pluralistic Alignment Dataset for Inclusive Public Spaces
We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a benchmark for multi-criteria alignment of text-to-image (T2I) models in inclusive urban planning. Developed through a two-year participatory process with 30 community organizations, LIVS encodes diverse spatial preferences across 634 initial concepts, consolidated into six core criteria: Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity, through 37,710 pairwise comparisons. Using Direct Preference Optimization (DPO) to fine-tune Stable Diffusion XL, we observed a measurable increase in alignment with community preferences, though a significant proportion of neutral ratings highlights the complexity of modeling intersectional needs. Additionally, as annotation volume increases, accuracy shifts further toward the DPO-tuned model, suggesting that larger-scale preference data enhances fine-tuning effectiveness. LIVS underscores the necessity of integrating context-specific, stakeholder-driven criteria into generative modeling and provides a resource for evaluating AI alignment methodologies across diverse socio-spatial contexts.
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face
Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face -- one of the largest platforms for sharing and collaborating on ML models and datasets -- as a prominent case study. By analyzing all 7,433 dataset documentation on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity. (2) A granular examination of each section within the dataset card reveals that the practitioners seem to prioritize Dataset Description and Dataset Structure sections, while the Considerations for Using the Data section receives the lowest proportion of content. (3) By analyzing the subsections within each section and utilizing topic modeling to identify key topics, we uncover what is discussed in each section, and underscore significant themes encompassing both technical and social impacts, as well as limitations within the Considerations for Using the Data section. (4) Our findings also highlight the need for improved accessibility and reproducibility of datasets in the Usage sections. (5) In addition, our human annotation evaluation emphasizes the pivotal role of comprehensive dataset content in shaping individuals' perceptions of a dataset card's overall quality. Overall, our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis and underlines the need for more thorough dataset documentation in machine learning research.
Proxy Anchor-based Unsupervised Learning for Continuous Generalized Category Discovery
Recent advances in deep learning have significantly improved the performance of various computer vision applications. However, discovering novel categories in an incremental learning scenario remains a challenging problem due to the lack of prior knowledge about the number and nature of new categories. Existing methods for novel category discovery are limited by their reliance on labeled datasets and prior knowledge about the number of novel categories and the proportion of novel samples in the batch. To address the limitations and more accurately reflect real-world scenarios, in this paper, we propose a novel unsupervised class incremental learning approach for discovering novel categories on unlabeled sets without prior knowledge. The proposed method fine-tunes the feature extractor and proxy anchors on labeled sets, then splits samples into old and novel categories and clusters on the unlabeled dataset. Furthermore, the proxy anchors-based exemplar generates representative category vectors to mitigate catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms the state-of-the-art methods on fine-grained datasets under real-world scenarios.
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture
In this paper, we propose a highly parameter-efficient approach to scaling pre-trained language models (PLMs) to a deeper model depth. Unlike prior work that shares all parameters or uses extra blocks, we design a more capable parameter-sharing architecture based on matrix product operator (MPO). MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts: the major part that contains the major information (central tensor) and the supplementary part that only has a small proportion of parameters (auxiliary tensors). Based on such a decomposition, our architecture shares the central tensor across all layers for reducing the model size and meanwhile keeps layer-specific auxiliary tensors (also using adapters) for enhancing the adaptation flexibility. To improve the model training, we further propose a stable initialization algorithm tailored for the MPO-based architecture. Extensive experiments have demonstrated the effectiveness of our proposed model in reducing the model size and achieving highly competitive performance.
From heavy rain removal to detail restoration: A faster and better network
The profound accumulation of precipitation during intense rainfall events can markedly degrade the quality of images, leading to the erosion of textural details. Despite the improvements observed in existing learning-based methods specialized for heavy rain removal, it is discerned that a significant proportion of these methods tend to overlook the precise reconstruction of the intricate details. In this work, we introduce a simple dual-stage progressive enhancement network, denoted as DPENet, aiming to achieve effective deraining while preserving the structural accuracy of rain-free images. This approach comprises two key modules, a rain streaks removal network (R^2Net) focusing on accurate rain removal, and a details reconstruction network (DRNet) designed to recover the textural details of rain-free images. Firstly, we introduce a dilated dense residual block (DDRB) within R^2Net, enabling the aggregation of high-level and low-level features. Secondly, an enhanced residual pixel-wise attention block (ERPAB) is integrated into DRNet to facilitate the incorporation of contextual information. To further enhance the fidelity of our approach, we employ a comprehensive loss function that accentuates both the marginal and regional accuracy of rain-free images. Extensive experiments conducted on publicly available benchmarks demonstrates the noteworthy efficiency and effectiveness of our proposed DPENet. The source code and pre-trained models are currently available at https://github.com/chdwyb/DPENet.
WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets for hyperspectral image classification
Classification is an important aspect of hyperspectral images processing and application. At present, the researchers mostly use the classic airborne hyperspectral imagery as the benchmark dataset. However, existing datasets suffer from three bottlenecks: (1) low spatial resolution; (2) low labeled pixels proportion; (3) low degree of subclasses distinction. In this paper, a new benchmark dataset named the Wuhan UAV-borne hyperspectral image (WHU-Hi) dataset was built for hyperspectral image classification. The WHU-Hi dataset with a high spectral resolution (nm level) and a very high spatial resolution (cm level), which we refer to here as H2 imager. Besides, the WHU-Hi dataset has a higher pixel labeling ratio and finer subclasses. Some start-of-art hyperspectral image classification methods benchmarked the WHU-Hi dataset, and the experimental results show that WHU-Hi is a challenging dataset. We hope WHU-Hi dataset can become a strong benchmark to accelerate future research.
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage
Recent advancements in large language models (LLM) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 2,648 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80\%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, achieving approximately \textasciitilde42\% time and \textasciitilde30\% memory savings for the entire training. Moreover, our method reduces the computation in the context and avoid decreasing the vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets.
A New Dataset and Framework for Real-World Blurred Images Super-Resolution
Recent Blind Image Super-Resolution (BSR) methods have shown proficiency in general images. However, we find that the efficacy of recent methods obviously diminishes when employed on image data with blur, while image data with intentional blur constitute a substantial proportion of general data. To further investigate and address this issue, we developed a new super-resolution dataset specifically tailored for blur images, named the Real-world Blur-kept Super-Resolution (ReBlurSR) dataset, which consists of nearly 3000 defocus and motion blur image samples with diverse blur sizes and varying blur intensities. Furthermore, we propose a new BSR framework for blur images called Perceptual-Blur-adaptive Super-Resolution (PBaSR), which comprises two main modules: the Cross Disentanglement Module (CDM) and the Cross Fusion Module (CFM). The CDM utilizes a dual-branch parallelism to isolate conflicting blur and general data during optimization. The CFM fuses the well-optimized prior from these distinct domains cost-effectively and efficiently based on model interpolation. By integrating these two modules, PBaSR achieves commendable performance on both general and blur data without any additional inference and deployment cost and is generalizable across multiple model architectures. Rich experiments show that PBaSR achieves state-of-the-art performance across various metrics without incurring extra inference costs. Within the widely adopted LPIPS metrics, PBaSR achieves an improvement range of approximately 0.02-0.10 with diverse anchor methods and blur types, across both the ReBlurSR and multiple common general BSR benchmarks. Code here: https://github.com/Imalne/PBaSR.
Eliminating Feature Ambiguity for Few-Shot Segmentation
Recent advancements in few-shot segmentation (FSS) have exploited pixel-by-pixel matching between query and support features, typically based on cross attention, which selectively activate query foreground (FG) features that correspond to the same-class support FG features. However, due to the large receptive fields in deep layers of the backbone, the extracted query and support FG features are inevitably mingled with background (BG) features, impeding the FG-FG matching in cross attention. Hence, the query FG features are fused with less support FG features, i.e., the support information is not well utilized. This paper presents a novel plug-in termed ambiguity elimination network (AENet), which can be plugged into any existing cross attention-based FSS methods. The main idea is to mine discriminative query FG regions to rectify the ambiguous FG features, increasing the proportion of FG information, so as to suppress the negative impacts of the doped BG features. In this way, the FG-FG matching is naturally enhanced. We plug AENet into three baselines CyCTR, SCCAN and HDMNet for evaluation, and their scores are improved by large margins, e.g., the 1-shot performance of SCCAN can be improved by 3.0%+ on both PASCAL-5^i and COCO-20^i. The code is available at https://github.com/Sam1224/AENet.
Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages
The development of Large Language Models (LLMs) relies on extensive text corpora, which are often unevenly distributed across languages. This imbalance results in LLMs performing significantly better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. Currently, there is a lack of quantitative methods to evaluate the performance of LLMs in these low-resource languages. To address this gap, we propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. By comparing the LLM's internal representation of various languages against a baseline derived from English, we can assess the model's multilingual capabilities in a robust and language-agnostic manner. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, demonstrating superior performance, while low-resource languages show lower similarity scores, underscoring the effectiveness of our metric in assessing language-specific capabilities. Besides, the experiments show that there is a strong correlation between the LLM's performance in different languages and the proportion of those languages in its pre-training corpus. These insights underscore the efficacy of the Language Ranker as a tool for evaluating LLM performance across different languages, particularly those with limited resources.
Synthesizing Realistic Data for Table Recognition
To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic styles found in the target domain. By leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain. We used this dataset to train several recent deep learning-based end-to-end table recognition models. Additionally, we have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data, thereby effectively validating our method's practicality and effectiveness. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive improvements in performance, especially in the recognition of tables with multiple spanning cells.
PINNACLE: PINN Adaptive ColLocation and Experimental points selection
Physics-Informed Neural Networks (PINNs), which incorporate PDEs as soft constraints, train with a composite loss function that contains multiple training point types: different types of collocation points chosen during training to enforce each PDE and initial/boundary conditions, and experimental points which are usually costly to obtain via experiments or simulations. Training PINNs using this loss function is challenging as it typically requires selecting large numbers of points of different types, each with different training dynamics. Unlike past works that focused on the selection of either collocation or experimental points, this work introduces PINN Adaptive ColLocation and Experimental points selection (PINNACLE), the first algorithm that jointly optimizes the selection of all training point types, while automatically adjusting the proportion of collocation point types as training progresses. PINNACLE uses information on the interaction among training point types, which had not been considered before, based on an analysis of PINN training dynamics via the Neural Tangent Kernel (NTK). We theoretically show that the criterion used by PINNACLE is related to the PINN generalization error, and empirically demonstrate that PINNACLE is able to outperform existing point selection methods for forward, inverse, and transfer learning problems.
Repoformer: Selective Retrieval for Repository-Level Code Completion
Recent advances in retrieval-augmented generation (RAG) have initiated a new era in repository-level code completion. However, the invariable use of retrieval in existing methods exposes issues in both efficiency and robustness, with a large proportion of the retrieved contexts proving unhelpful or harmful to code language models (code LMs). In this paper, we propose a selective RAG framework to avoid retrieval when unnecessary. To power this framework, we design a self-supervised learning approach to enable a code LM to accurately self-evaluate whether retrieval can improve its output quality and robustly leverage the potentially noisy retrieved contexts. Using this LM as both the selective RAG policy and the generation model, our framework achieves state-of-the-art repository-level code completion performance on diverse benchmarks including RepoEval, CrossCodeEval, and CrossCodeLongEval, a new long-form code completion benchmark. Meanwhile, our analyses show that selectively retrieving brings as much as 70% inference speedup in the online serving setting without harming the performance. We further demonstrate that our framework is able to accommodate different generation models, retrievers, and programming languages. These advancements position our framework as an important step towards more accurate and efficient repository-level code completion.
Less is More: Data Value Estimation for Visual Instruction Tuning
Visual instruction tuning is the key to building multimodal large language models (MLLMs), which greatly improves the reasoning capabilities of large language models (LLMs) in vision scenario. However, existing MLLMs mostly rely on a mixture of multiple highly diverse visual instruction datasets for training (even more than a million instructions), which may introduce data redundancy. To investigate this issue, we conduct a series of empirical studies, which reveal a significant redundancy within the visual instruction datasets, and show that greatly reducing the amount of several instruction dataset even do not affect the performance. Based on the findings, we propose a new data selection approach TIVE, to eliminate redundancy within visual instruction data. TIVE first estimates the task-level and instance-level value of the visual instructions based on computed gradients. Then, according to the estimated values, TIVE determines the task proportion within the visual instructions, and selects representative instances to compose a smaller visual instruction subset for training. Experiments on LLaVA-1.5 show that our approach using only about 7.5% data can achieve comparable performance as the full-data fine-tuned model across seven benchmarks, even surpassing it on four of the benchmarks. Our code and data will be publicly released.
Revisiting fixed-point quantum search: proof of the quasi-Chebyshev lemma
The original Grover's algorithm suffers from the souffle problem, which means that the success probability of quantum search decreases dramatically if the iteration time is too small or too large from the right time. To overcome the souffle problem, the fixed-point quantum search with an optimal number of queries was proposed [Phys. Rev. Lett. 113, 210501 (2014)], which always finds a marked state with a high probability when a lower bound of the proportion of marked states is given. The fixed-point quantum search relies on a key lemma regarding the explicit formula of recursive quasi-Chebyshev polynomials, but its proof is not given explicitly. In this work, we give a detailed proof of this lemma, thus providing a sound foundation for the correctness of the fixed-point quantum search. This lemma may be of independent interest as well, since it expands the mathematical form of the recursive relation of Chebyshev polynomials of the first kind, and it also constitutes a key component in overcoming the souffle problem of quantum walk-based search algorithms, for example, robust quantum walk search on complete bipartite graphs [Phys. Rev. A 106, 052207 (2022)]. Hopefully, more applications of the lemma will be found in the future.
POBEVM: Real-time Video Matting via Progressively Optimize the Target Body and Edge
Deep convolutional neural networks (CNNs) based approaches have achieved great performance in video matting. Many of these methods can produce accurate alpha estimation for the target body but typically yield fuzzy or incorrect target edges. This is usually caused by the following reasons: 1) The current methods always treat the target body and edge indiscriminately; 2) Target body dominates the whole target with only a tiny proportion target edge. For the first problem, we propose a CNN-based module that separately optimizes the matting target body and edge (SOBE). And on this basis, we introduce a real-time, trimap-free video matting method via progressively optimizing the matting target body and edge (POBEVM) that is much lighter than previous approaches and achieves significant improvements in the predicted target edge. For the second problem, we propose an Edge-L1-Loss (ELL) function that enforces our network on the matting target edge. Experiments demonstrate our method outperforms prior trimap-free matting methods on both Distinctions-646 (D646) and VideoMatte240K(VM) dataset, especially in edge optimization.
LMD: Faster Image Reconstruction with Latent Masking Diffusion
As a class of fruitful approaches, diffusion probabilistic models (DPMs) have shown excellent advantages in high-resolution image reconstruction. On the other hand, masked autoencoders (MAEs), as popular self-supervised vision learners, have demonstrated simpler and more effective image reconstruction and transfer capabilities on downstream tasks. However, they all require extremely high training costs, either due to inherent high temporal-dependence (i.e., excessively long diffusion steps) or due to artificially low spatial-dependence (i.e., human-formulated high mask ratio, such as 0.75). To the end, this paper presents LMD, a faster image reconstruction framework with latent masking diffusion. First, we propose to project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than in the pixel-based space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion by three different schedulers and reconstructs the latent features from simple to difficult, without sequentially performing denoising diffusion as in DPMs or using fixed high masking ratio as in MAEs, so as to alleviate the high training time-consumption predicament. Our approach allows for learning high-capacity models and accelerate their training (by 3x or more) and barely reduces the original accuracy. Inference speed in downstream tasks also significantly outperforms the previous approaches.
A Quadratic Synchronization Rule for Distributed Deep Learning
In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for H steps without synchronizing with others, hence reducing communication frequency. While H has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper H value can lead to generalization improvement. Yet, selecting a proper H is elusive. This work proposes a theory-grounded method for determining H, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting H in proportion to 1{eta^2} as the learning rate eta decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with the standard data parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves 1.16% or 0.84% higher top-1 validation accuracy.
BooookScore: A systematic exploration of book-length summarization in the era of LLMs
Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K USD and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than those generated by open-source models. While LLaMA 2 falls behind other models, Mixtral achieves performance on par with GPT-3.5-Turbo. Incremental updating yields lower BooookScore but higher level of detail than hierarchical merging, a trade-off sometimes preferred by annotators.
On the Stability of Iterative Retraining of Generative Models on their own Data
Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models must contend with the reality that their training is curated from both clean data and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets (of real and synthetic data) on their stability. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.
Order-Preserving GFlowNets
Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates with probabilities proportional to a given reward. However, GFlowNets can only be used with a predefined scalar reward, which can be either computationally expensive or not directly accessible, in the case of multi-objective optimization (MOO) tasks for example. Moreover, to prioritize identifying high-reward candidates, the conventional practice is to raise the reward to a higher exponent, the optimal choice of which may vary across different environments. To address these issues, we propose Order-Preserving GFlowNets (OP-GFNs), which sample with probabilities in proportion to a learned reward function that is consistent with a provided (partial) order on the candidates, thus eliminating the need for an explicit formulation of the reward function. We theoretically prove that the training process of OP-GFNs gradually sparsifies the learned reward landscape in single-objective maximization tasks. The sparsification concentrates on candidates of a higher hierarchy in the ordering, ensuring exploration at the beginning and exploitation towards the end of the training. We demonstrate OP-GFN's state-of-the-art performance in single-objective maximization (totally ordered) and multi-objective Pareto front approximation (partially ordered) tasks, including synthetic datasets, molecule generation, and neural architecture search.
Research on the Impact of Executive Shareholding on New Investment in Enterprises Based on Multivariable Linear Regression Model
Based on principal-agent theory and optimal contract theory, companies use the method of increasing executives' shareholding to stimulate collaborative innovation. However, from the aspect of agency costs between management and shareholders (i.e. the first type) and between major shareholders and minority shareholders (i.e. the second type), the interests of management, shareholders and creditors will be unbalanced with the change of the marginal utility of executive equity incentives.In order to establish the correlation between the proportion of shares held by executives and investments in corporate innovation, we have chosen a range of publicly listed companies within China's A-share market as the focus of our study. Employing a multi-variable linear regression model, we aim to analyze this relationship thoroughly.The following models were developed: (1) the impact model of executive shareholding on corporate innovation investment; (2) the impact model of executive shareholding on two types of agency costs; (3)The model is employed to examine the mediating influence of the two categories of agency costs. Following both correlation and regression analyses, the findings confirm a meaningful and positive correlation between executives' shareholding and the augmentation of corporate innovation investments. Additionally, the results indicate that executive shareholding contributes to the reduction of the first type of agency cost, thereby fostering corporate innovation investment. However, simultaneously, it leads to an escalation in the second type of agency cost, thus impeding corporate innovation investment.
HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations
Generating both plausible and accurate full body avatar motion is the key to the quality of immersive experiences in mixed reality scenarios. Head-Mounted Devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF. Recently, different approaches achieved impressive performance in generating full body motion given only head and hands signal. However, to the best of our knowledge, all existing approaches rely on full hand visibility. While this is the case when, e.g., using motion controllers, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility owing to the restricted field of view of the HMD. In this paper, we propose the first unified approach, HMD-NeMo, that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts the full body motion in an online and real-time fashion. At the heart of HMD-NeMo is the spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. We perform extensive analysis of the impact of different components in HMD-NeMo and introduce a new state-of-the-art on AMASS dataset through our evaluation.
SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection
Recently, the pure camera-based Bird's-Eye-View (BEV) perception provides a feasible solution for economical autonomous driving. However, the existing BEV-based multi-view 3D detectors generally transform all image features into BEV features, without considering the problem that the large proportion of background information may submerge the object information. In this paper, we propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features and transform image features into semantic-aware BEV features. Accordingly, we propose BEV-Paste, an effective data augmentation strategy that closely matches with semantic-aware BEV feature. In addition, we design a Multi-Scale Cross-Task (MSCT) head, which combines task-specific and cross-task information to predict depth distribution and semantic segmentation more accurately, further improving the quality of semantic-aware BEV feature. Finally, we integrate the above modules into a novel multi-view 3D object detection framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance. Code has been available at https://github.com/mengtan00/SA-BEV.git.
Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis
Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis. Abiding by the tradition in deep learning, DMs are trained and evaluated on the images with fixed sizes. However, users are demanding for various images with specific sizes and various aspect ratio. This paper focuses on adapting text-to-image diffusion models to handle such variety while maintaining visual fidelity. First we observe that, during the synthesis, lower resolution images suffer from incomplete object portrayal, while higher resolution images exhibit repetitively disordered presentation. Next, we establish a statistical relationship indicating that attention entropy changes with token quantity, suggesting that models aggregate spatial information in proportion to image resolution. The subsequent interpretation on our observations is that objects are incompletely depicted due to limited spatial information for low resolutions, while repetitively disorganized presentation arises from redundant spatial information for high resolutions. From this perspective, we propose a scaling factor to alleviate the change of attention entropy and mitigate the defective pattern observed. Extensive experimental results validate the efficacy of the proposed scaling factor, enabling models to achieve better visual effects, image quality, and text alignment. Notably, these improvements are achieved without additional training or fine-tuning techniques.
Removing Human Bottlenecks in Bird Classification Using Camera Trap Images and Deep Learning
Birds are important indicators for monitoring both biodiversity and habitat health; they also play a crucial role in ecosystem management. Decline in bird populations can result in reduced eco-system services, including seed dispersal, pollination and pest control. Accurate and long-term monitoring of birds to identify species of concern while measuring the success of conservation interventions is essential for ecologists. However, monitoring is time consuming, costly and often difficult to manage over long durations and at meaningfully large spatial scales. Technology such as camera traps, acoustic monitors and drones provide methods for non-invasive monitoring. There are two main problems with using camera traps for monitoring: a) cameras generate many images, making it difficult to process and analyse the data in a timely manner; and b) the high proportion of false positives hinders the processing and analysis for reporting. In this paper, we outline an approach for overcoming these issues by utilising deep learning for real-time classi-fication of bird species and automated removal of false positives in camera trap data. Images are classified in real-time using a Faster-RCNN architecture. Images are transmitted over 3/4G cam-eras and processed using Graphical Processing Units (GPUs) to provide conservationists with key detection metrics therefore removing the requirement for manual observations. Our models achieved an average sensitivity of 88.79%, a specificity of 98.16% and accuracy of 96.71%. This demonstrates the effectiveness of using deep learning for automatic bird monitoring.
Design, Integration, and Field Evaluation of a Robotic Blossom Thinning System for Tree Fruit Crops
The US apple industry relies heavily on semi-skilled manual labor force for essential field operations such as training, pruning, blossom and green fruit thinning, and harvesting. Blossom thinning is one of the crucial crop load management practices to achieve desired crop load, fruit quality, and return bloom. While several techniques such as chemical, and mechanical thinning are available for large-scale blossom thinning such approaches often yield unpredictable thinning results and may cause damage the canopy, spurs, and leaf tissue. Hence, growers still depend on laborious, labor intensive and expensive manual hand blossom thinning for desired thinning outcomes. This research presents a robotic solution for blossom thinning in apple orchards using a computer vision system with artificial intelligence, a six degrees of freedom robotic manipulator, and an electrically actuated miniature end-effector for robotic blossom thinning. The integrated robotic system was evaluated in a commercial apple orchard which showed promising results for targeted and selective blossom thinning. Two thinning approaches, center and boundary thinning, were investigated to evaluate the system ability to remove varying proportion of flowers from apple flower clusters. During boundary thinning the end effector was actuated around the cluster boundary while center thinning involved end-effector actuation only at the cluster centroid for a fixed duration of 2 seconds. The boundary thinning approach thinned 67.2% of flowers from the targeted clusters with a cycle time of 9.0 seconds per cluster, whereas center thinning approach thinned 59.4% of flowers with a cycle time of 7.2 seconds per cluster. When commercially adopted, the proposed system could help address problems faced by apple growers with current hand, chemical, and mechanical blossom thinning approaches.
Computational Assessment of Hyperpartisanship in News Titles
We first adopt a human-guided machine learning framework to develop a new dataset for hyperpartisan news title detection with 2,200 manually labeled and 1.8 million machine-labeled titles that were posted from 2014 to the present by nine representative media organizations across three media bias groups - Left, Central, and Right in an active learning manner. The fine-tuned transformer-based language model achieves an overall accuracy of 0.84 and an F1 score of 0.78 on an external validation set. Next, we conduct a computational analysis to quantify the extent and dynamics of partisanship in news titles. While some aspects are as expected, our study reveals new or nuanced differences between the three media groups. We find that overall the Right media tends to use proportionally more hyperpartisan titles. Roughly around the 2016 Presidential Election, the proportions of hyperpartisan titles increased in all media bias groups where the relative increase in the proportion of hyperpartisan titles of the Left media was the most. We identify three major topics including foreign issues, political systems, and societal issues that are suggestive of hyperpartisanship in news titles using logistic regression models and the Shapley values. Through an analysis of the topic distribution, we find that societal issues gradually receive more attention from all media groups. We further apply a lexicon-based language analysis tool to the titles of each topic and quantify the linguistic distance between any pairs of the three media groups. Three distinct patterns are discovered. The Left media is linguistically more different from Central and Right in terms of foreign issues. The linguistic distance between the three media groups becomes smaller over recent years. In addition, a seasonal pattern where linguistic difference is associated with elections is observed for societal issues.
A Comparative Survey of Deep Active Learning
While deep learning (DL) is data-hungry and usually relies on extensive labeled data to deliver good performance, Active Learning (AL) reduces labeling costs by selecting a small proportion of samples from unlabeled data for labeling and training. Therefore, Deep Active Learning (DAL) has risen as a feasible solution for maximizing model performance under a limited labeling cost/budget in recent years. Although abundant methods of DAL have been developed and various literature reviews conducted, the performance evaluation of DAL methods under fair comparison settings is not yet available. Our work intends to fill this gap. In this work, We construct a DAL toolkit, DeepAL+, by re-implementing 19 highly-cited DAL methods. We survey and categorize DAL-related works and construct comparative experiments across frequently used datasets and DAL algorithms. Additionally, we explore some factors (e.g., batch size, number of epochs in the training process) that influence the efficacy of DAL, which provides better references for researchers to design their DAL experiments or carry out DAL-related applications.
PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management
The pre-trained model (PTM) is revolutionizing Artificial Intelligence (AI) technology. However, the hardware requirement of PTM training is prohibitively high, making it a game for a small proportion of people. Therefore, we proposed PatrickStar system to lower the hardware requirements of PTMs and make them accessible to everyone. PatrickStar uses the CPU-GPU heterogeneous memory space to store the model data. Different from existing works, we organize the model data in memory chunks and dynamically distribute them in the heterogeneous memory. Guided by the runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory and generate lower CPU-GPU data transmission volume and higher bandwidth utilization. Symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs on multiple nodes. % using data parallelism. The system can train tasks on bigger models and larger batch sizes, which cannot be accomplished by existing works. Experimental results show that PatrickStar extends model scales 2.27 and 2.5 times of DeepSpeed, and consistently exhibits significantly higher execution speed. PatricStar also successfully runs the 175B GPT3 training task on a 32 GPU cluster. Our code is publicly available at https://github.com/Tencent/PatrickStar.
VoxLingua107: a Dataset for Spoken Language Recognition
This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The size of the resulting training set (VoxLingua107) is 6628 hours (62 hours per language on the average) and it is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives competitive results to using hand-labeled proprietary datasets. The dataset is publicly available.
Did You Really Just Have a Heart Attack? Towards Robust Detection of Personal Health Mentions in Social Media
Millions of users share their experiences on social media sites, such as Twitter, which in turn generate valuable data for public health monitoring, digital epidemiology, and other analyses of population health at global scale. The first, critical, task for these applications is classifying whether a personal health event was mentioned, which we call the (PHM) problem. This task is challenging for many reasons, including typically short length of social media posts, inventive spelling and lexicons, and figurative language, including hyperbole using diseases like "heart attack" or "cancer" for emphasis, and not as a health self-report. This problem is even more challenging for rarely reported, or frequent but ambiguously expressed conditions, such as "stroke". To address this problem, we propose a general, robust method for detecting PHMs in social media, which we call WESPAD, that combines lexical, syntactic, word embedding-based, and context-based features. WESPAD is able to generalize from few examples by automatically distorting the word embedding space to most effectively detect the true health mentions. Unlike previously proposed state-of-the-art supervised and deep-learning techniques, WESPAD requires relatively little training data, which makes it possible to adapt, with minimal effort, to each new disease and condition. We evaluate WESPAD on both an established publicly available Flu detection benchmark, and on a new dataset that we have constructed with mentions of multiple health conditions. Our experiments show that WESPAD outperforms the baselines and state-of-the-art methods, especially in cases when the number and proportion of true health mentions in the training data is small.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.
Measuring the Robustness of Natural Language Processing Models to Domain Shifts
Existing research on Domain Robustness (DR) suffers from disparate setups, lack of evaluation task variety, and reliance on challenge sets. In this paper, we pose a fundamental question: What is the state of affairs of the DR challenge in the era of Large Language Models (LLMs)? To this end, we construct a DR benchmark comprising diverse NLP tasks, including sentence and token-level classification, QA, and generation, each task consists of several domains. We explore the DR challenge of fine-tuned and few-shot learning models in natural domain shift settings and devise two diagnostic metrics of Out-of-Distribution (OOD) performance degradation: The commonly used Source Drop (SD) and the overlooked Target Drop (TD). Our findings reveal important insights: First, despite their capabilities, zero-to-few shot LLMs and fine-tuning approaches still fail to meet satisfactory performance in the OOD context; Second, TD approximates better than SD the average OOD degradation; Third, in a significant proportion of domain shifts, either SD or TD is positive, but not both, and therefore disregarding one can lead to incorrect DR conclusions.
PI-RADS v2 Compliant Automated Segmentation of Prostate Zones Using co-training Motivated Multi-task Dual-Path CNN
The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To provide standardized acquisition, interpretation and usage of the complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones, PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others and is present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from different branches act complementary to each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for PZ, TZ, DPU and AFS zones respectively.
Cut your Losses with Squentropy
Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.
A step towards understanding why classification helps regression
A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss.
Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head
Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition to predicted probabilities from logits would obscure certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, for exploiting the latent information of logits. Unfortunately, we empirically find that the amalgamation of the newly introduced logit-level loss and the previous probability-level loss will lead to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on the neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing from the theoretical analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts.