new

Get trending papers in your email inbox!

Subscribe

byAK and the research community

Mar 14

Latent Diffusion Model for Medical Image Standardization and Enhancement

Computed tomography (CT) serves as an effective tool for lung cancer screening, diagnosis, treatment, and prognosis, providing a rich source of features to quantify temporal and spatial tumor changes. Nonetheless, the diversity of CT scanners and customized acquisition protocols can introduce significant inconsistencies in texture features, even when assessing the same patient. This variability poses a fundamental challenge for subsequent research that relies on consistent image features. Existing CT image standardization models predominantly utilize GAN-based supervised or semi-supervised learning, but their performance remains limited. We present DiffusionCT, an innovative score-based DDPM model that operates in the latent space to transform disparate non-standard distributions into a standardized form. The architecture comprises a U-Net-based encoder-decoder, augmented by a DDPM model integrated at the bottleneck position. First, the encoder-decoder is trained independently, without embedding DDPM, to capture the latent representation of the input data. Second, the latent DDPM model is trained while keeping the encoder-decoder parameters fixed. Finally, the decoder uses the transformed latent representation to generate a standardized CT image, providing a more consistent basis for downstream analysis. Empirical tests on patient CT images indicate notable improvements in image standardization using DiffusionCT. Additionally, the model significantly reduces image noise in SPAD images, further validating the effectiveness of DiffusionCT for advanced imaging tasks.

(Dynamic) Prompting might be all you need to repair Compressed LLMs

Large language models (LLMs), while transformative for NLP, come with significant computational demands, underlining the need for efficient, training-free compression. Notably, the reliability of perplexity as a benchmark for compressed model efficacy is in question, as our tests using LLaMA-7B and OPT-6.7b reveal a significant performance drop in several realistic downstream tasks, underscoring the disparity between perplexity as a performance indicator and real-world performance. Investigation into the trade-off between resource-intensive post-compression re-training highlights the prospect of prompt-driven recovery as a lightweight adaption tool. However, existing studies, confined mainly to perplexity evaluations and simple tasks, fail to offer unequivocal confidence in the scalability and generalizability of prompting. We tackle this uncertainty in two key ways. First, we uncover the vulnerability of naive prompts in LLM compression as an over-reliance on a singular prompt per input. In response, we propose inference-time dynamic prompting (IDP), a mechanism that autonomously chooses from a set of curated prompts based on the context of each individual input. Second, we delve into a scientific understanding of why ``prompting might be all you need post-LLM compression". Our findings suggest that compression doesn't irretrievably erase LLM model knowledge but displace it, necessitating a new inference path. IDP effectively redirects this path, enabling the model to tap into its inherent yet displaced knowledge and thereby recover performance. Empirical tests affirm the value of IDP, demonstrating an average performance improvement of 1.24% across nine varied tasks spanning multiple knowledge domains.

An Empirical Study of Flaky Tests in Python

Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness is well established, prior research focused on Java projects, thus raising the question of how the findings generalize. In order to provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, one of the currently most popular programming languages. For this, we sampled 22352 open source projects from the popular PyPI package index, and analyzed their 876186 test cases for flakiness. Our investigation suggests that flakiness is equally prevalent in Python as it is in Java. The reasons, however, are different: Order dependency is a much more dominant problem in Python, causing 59% of the 7571 flaky tests in our dataset. Another 28% were caused by test infrastructure problems, which represent a previously undocumented cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs by the projects, which is indicative of the type of software commonly written in Python. Our data also suggests that finding flaky tests requires more runs than are often done in the literature: A 95% confidence that a passing test case is not flaky on average would require 170 reruns.

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.

B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved by using some reliable validators (e.g., developer-written test cases) for assistance. Since reliable test cases are not always available and can be expensive to build in practice, researchers propose to automatically generate test cases to assess code solutions. However, when both code solutions and test cases are plausible and not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee and it is still an open question whether an optimal selection strategy exists. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge to tailor code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy B4 significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement by up to 50% over the strongest heuristic and 246% over the random selection in the most challenging scenarios. Our code is publicly available at https://github.com/ZJU-CTAG/B4.

No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation

Unit testing is essential in detecting bugs in functionally-discrete program units. Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques can generate tests with reasonable coverage, they exhibit low readability and cannot be directly adopted by developers. Recent work has shown the large potential of large language models (LLMs) in unit test generation, which can generate more human-like and meaningful test code. ChatGPT, the latest LLM incorporating instruction tuning and reinforcement learning, has performed well in various domains. However, It remains unclear how effective ChatGPT is in unit test generation. In this work, we perform the first empirical study to evaluate ChatGPT's capability of unit test generation. Specifically, we conduct a quantitative analysis and a user study to systematically investigate the quality of its generated tests regarding the correctness, sufficiency, readability, and usability. The tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures. Still, the passing tests generated by ChatGPT resemble manually-written tests by achieving comparable coverage, readability, and even sometimes developers' preference. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved. Inspired by our findings above, we propose ChatTESTER, a novel ChatGPT-based unit test generation approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTESTER incorporates an initial test generator and an iterative test refiner. Our evaluation demonstrates the effectiveness of ChatTESTER by generating 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT.

Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.

ASTER: Natural and Multi-language Unit Test Generation with LLMs

Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how large language models (LLMs) can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness -- evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.

An OFDM Signal Identification Method for Wireless Communications Systems

Distinction of OFDM signals from single carrier signals is highly important for adaptive receiver algorithms and signal identification applications. OFDM signals exhibit Gaussian characteristics in time domain and fourth order cumulants of Gaussian distributed signals vanish in contrary to the cumulants of other signals. Thus fourth order cumulants can be utilized for OFDM signal identification. In this paper, first, formulations of the estimates of the fourth order cumulants for OFDM signals are provided. Then it is shown these estimates are affected significantly from the wireless channel impairments, frequency offset, phase offset and sampling mismatch. To overcome these problems, a general chi-square constant false alarm rate Gaussianity test which employs estimates of cumulants and their covariances is adapted to the specific case of wireless OFDM signals. Estimation of the covariance matrix of the fourth order cumulants are greatly simplified peculiar to the OFDM signals. A measurement setup is developed to analyze the performance of the identification method and for comparison purposes. A parametric measurement analysis is provided depending on modulation order, signal to noise ratio, number of symbols, and degree of freedom of the underlying test. The proposed method outperforms statistical tests which are based on fixed thresholds or empirical values, while a priori information requirement and complexity of the proposed method are lower than the coherent identification techniques.

Transfer Q Star: Principled Decoding for LLM Alignment

Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward r, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function (Q^*), which is often unavailable in practice. Hence, prior SoTA methods either approximate this Q^* using Q^{pi_{sft}} (derived from the reference SFT model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose Transfer Q^*, which implicitly estimates the optimal value function for a target reward r through a baseline model rho_{BL} aligned with a baseline reward rho_{BL} (which can be different from the target reward r). Theoretical analyses of Transfer Q^* provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference SFT model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.

The ELEVATE-AI LLMs Framework: An Evaluation Framework for Use of Large Language Models in HEOR: an ISPOR Working Group Report

Introduction. Generative Artificial Intelligence, particularly large language models (LLMs), offers transformative potential for Health Economics and Outcomes Research (HEOR). However, evaluating the quality, transparency, and rigor of LLM-assisted research lacks standardized guidance. This article introduces the ELEVATE AI LLMs framework and checklist, designed to support researchers and reviewers in assessing LLM use in HEOR. Methods. The ELEVATE AI LLMs framework was developed through a targeted review of existing guidelines and evaluation frameworks. The framework comprises ten evaluation domains, including model characteristics, accuracy, comprehensiveness, and fairness. The accompanying checklist operationalizes the framework. To validate the framework, we applied it to two published studies, demonstrating its usability across different HEOR tasks. Results. The ELEVATE AI LLMs framework provides a comprehensive structure for evaluating LLM-assisted research, while the checklist facilitates practical application. Validation of the framework and checklist on studies of systematic literature reviews and health economic modeling highlighted their ability to identify strengths and gaps in reporting. Limitations. While the ELEVATE AI LLMs framework provides robust guidance, its broader generalizability and applicability to diverse HEOR tasks require further empirical testing. Additionally, several metrics adapted from computer science need further validation in HEOR contexts. Conclusion. The ELEVATE AI LLMs framework and checklist fill a critical gap in HEOR by offering structured guidance for evaluating LLM-assisted research. By promoting transparency, accuracy, and reproducibility, they aim to standardize and improve the integration of LLMs into HEOR, ensuring their outputs meet the field's rigorous standards.

Preserving Statistical Validity in Adaptive Data Analysis

A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.

Question-Answering Model for Schizophrenia Symptoms and Their Impact on Daily Life using Mental Health Forums Data

In recent years, there is strong emphasis on mining medical data using machine learning techniques. A common problem is to obtain a noiseless set of textual documents, with a relevant content for the research question, and developing a Question Answering (QA) model for a specific medical field. The purpose of this paper is to present a new methodology for building a medical dataset and obtain a QA model for analysis of symptoms and impact on daily life for a specific disease domain. The ``Mental Health'' forum was used, a forum dedicated to people suffering from schizophrenia and different mental disorders. Relevant posts of active users, who regularly participate, were extrapolated providing a new method of obtaining low-bias content and without privacy issues. Furthermore, it is shown how to pre-process the dataset to convert it into a QA dataset. The Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, RoBERTa, and BioBERT models were fine-tuned and evaluated via F1-Score, Exact Match, Precision and Recall. Accurate empirical experiments demonstrated the effectiveness of the proposed method for obtaining an accurate dataset for QA model implementation. By fine-tuning the BioBERT QA model, we achieved an F1 score of 0.885, showing a considerable improvement and outperforming the state-of-the-art model for mental disorders domain.

Methods2Test: A dataset of focal methods mapped to test cases

Unit testing is an essential part of the software development process, which helps to identify issues with source code in early stages of development and prevent regressions. Machine learning has emerged as viable approach to help software developers generate automated unit tests. However, generating reliable unit test cases that are semantically correct and capable of catching software bugs or unintended behavior via machine learning requires large, metadata-rich, datasets. In this paper we present Methods2Test: A dataset of focal methods mapped to test cases: a large, supervised dataset of test cases mapped to corresponding methods under test (i.e., focal methods). This dataset contains 780,944 pairs of JUnit tests and focal methods, extracted from a total of 91,385 Java open source projects hosted on GitHub with licenses permitting re-distribution. The main challenge behind the creation of the Methods2Test was to establish a reliable mapping between a test case and the relevant focal method. To this aim, we designed a set of heuristics, based on developers' best practices in software testing, which identify the likely focal method for a given test case. To facilitate further analysis, we store a rich set of metadata for each method-test pair in JSON-formatted files. Additionally, we extract textual corpus from the dataset at different context levels, which we provide both in raw and tokenized forms, in order to enable researchers to train and evaluate machine learning models for Automated Test Generation. Methods2Test is publicly available at: https://github.com/microsoft/methods2test

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning

The correct use of model evaluation, model selection, and algorithm selection techniques is vital in academic machine learning research as well as in many industrial settings. This article reviews different techniques that can be used for each of these three subtasks and discusses the main advantages and disadvantages of each technique with references to theoretical and empirical studies. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning. Common methods such as the holdout method for model evaluation and selection are covered, which are not recommended when working with small datasets. Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible. Common cross-validation techniques such as leave-one-out cross-validation and k-fold cross-validation are reviewed, the bias-variance trade-off for choosing k is discussed, and practical tips for the optimal choice of k are given based on empirical evidence. Different statistical tests for algorithm comparisons are presented, and strategies for dealing with multiple comparisons such as omnibus tests and multiple-comparison corrections are discussed. Finally, alternative methods for algorithm selection, such as the combined F-test 5x2 cross-validation and nested cross-validation, are recommended for comparing machine learning algorithms when datasets are small.

Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency

Developing an educational test can be expensive and time-consuming, as each item must be written by experts and then evaluated by collecting hundreds of student responses. Moreover, many tests require multiple distinct sets of questions administered throughout the school year to closely monitor students' progress, known as parallel tests. In this study, we focus on tests of silent sentence reading efficiency, used to assess students' reading ability over time. To generate high-quality parallel tests, we propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items. With these simulated responses, we can estimate each item's difficulty and ambiguity. We first use GPT-4 to generate new test items following a list of expert-developed rules and then apply a fine-tuned LLM to filter the items based on criteria from psychological measurements. We also propose an optimal-transport-inspired technique for generating parallel tests and show the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses. Our evaluation of a generated test with 234 students from grades 2 to 8 produces test scores highly correlated (r=0.93) to those of a standard test form written by human experts and evaluated across thousands of K-12 students.

Empirical Study of PEFT techniques for Winter Wheat Segmentation

Parameter Efficient Fine Tuning (PEFT) techniques have recently experienced significant growth and have been extensively employed to adapt large vision and language models to various domains, enabling satisfactory model performance with minimal computational needs. Despite these advances, more research has yet to delve into potential PEFT applications in real-life scenarios, particularly in the critical domains of remote sensing and crop monitoring. The diversity of climates across different regions and the need for comprehensive large-scale datasets have posed significant obstacles to accurately identify crop types across varying geographic locations and changing growing seasons. This study seeks to bridge this gap by comprehensively exploring the feasibility of cross-area and cross-year out-of-distribution generalization using the State-of-the-Art (SOTA) wheat crop monitoring model. The aim of this work is to explore PEFT approaches for crop monitoring. Specifically, we focus on adapting the SOTA TSViT model to address winter wheat field segmentation, a critical task for crop monitoring and food security. This adaptation process involves integrating different PEFT techniques, including BigFit, LoRA, Adaptformer, and prompt tuning. Using PEFT techniques, we achieved notable results comparable to those achieved using full fine-tuning methods while training only a mere 0.7% parameters of the whole TSViT architecture. The in-house labeled data-set, referred to as the Beqaa-Lebanon dataset, comprises high-quality annotated polygons for wheat and non-wheat classes with a total surface of 170 kmsq, over five consecutive years. Using Sentinel-2 images, our model achieved a 84% F1-score. We intend to publicly release the Lebanese winter wheat data set, code repository, and model weights.

Empirical and Experimental Insights into Machine Learning-Based Defect Classification in Semiconductor Wafers

This survey paper offers a comprehensive review of methodologies utilizing machine learning (ML) classification techniques for identifying wafer defects in semiconductor manufacturing. Despite the growing body of research demonstrating the effectiveness of ML in wafer defect identification, there is a noticeable absence of comprehensive reviews on this subject. This survey attempts to fill this void by amalgamating available literature and providing an in-depth analysis of the advantages, limitations, and potential applications of various ML classification algorithms in the realm of wafer defect detection. An innovative taxonomy of methodologies that we present provides a detailed classification of algorithms into more refined categories and techniques. This taxonomy follows a three-tier structure, starting from broad methodology categories and ending with specific techniques. It aids researchers in comprehending the complex relationships between different algorithms and their techniques. We employ a rigorous empirical and experimental evaluation to rank these varying techniques. For the empirical evaluation, we assess techniques based on a set of five criteria. The experimental evaluation ranks the algorithms employing the same techniques, sub-categories, and categories. Also the paper illuminates the future prospects of ML classification techniques for wafer defect identification, underscoring potential advancements and opportunities for further research in this field

Empirical analysis of Binding Precedent efficiency in the Brazilian Supreme Court via Similar Case Retrieval

Binding precedents (S\'umulas Vinculantes) constitute a juridical instrument unique to the Brazilian legal system and whose objectives include the protection of the Federal Supreme Court against repetitive demands. Studies of the effectiveness of these instruments in decreasing the Court's exposure to similar cases, however, indicate that they tend to fail in such a direction, with some of the binding precedents seemingly creating new demands. We empirically assess the legal impact of five binding precedents, 11, 14, 17, 26 and 37, at the highest court level through their effects on the legal subjects they address. This analysis is only possible through the comparison of the Court's ruling about the precedents' themes before they are created, which means that these decisions should be detected through techniques of Similar Case Retrieval. The contributions of this article are therefore twofold: on the mathematical side, we compare the uses of different methods of Natural Language Processing -- TF-IDF, LSTM, BERT, and regex -- for Similar Case Retrieval, whereas on the legal side, we contrast the inefficiency of these binding precedents with a set of hypotheses that may justify their repeated usage. We observe that the deep learning models performed significantly worse in the specific Similar Case Retrieval task and that the reasons for binding precedents to fail in responding to repetitive demand are heterogeneous and case-dependent, making it impossible to single out a specific cause.

Empirical study of Machine Learning Classifier Evaluation Metrics behavior in Massively Imbalanced and Noisy data

With growing credit card transaction volumes, the fraud percentages are also rising, including overhead costs for institutions to combat and compensate victims. The use of machine learning into the financial sector permits more effective protection against fraud and other economic crime. Suitably trained machine learning classifiers help proactive fraud detection, improving stakeholder trust and robustness against illicit transactions. However, the design of machine learning based fraud detection algorithms has been challenging and slow due the massively unbalanced nature of fraud data and the challenges of identifying the frauds accurately and completely to create a gold standard ground truth. Furthermore, there are no benchmarks or standard classifier evaluation metrics to measure and identify better performing classifiers, thus keeping researchers in the dark. In this work, we develop a theoretical foundation to model human annotation errors and extreme imbalance typical in real world fraud detection data sets. By conducting empirical experiments on a hypothetical classifier, with a synthetic data distribution approximated to a popular real world credit card fraud data set, we simulate human annotation errors and extreme imbalance to observe the behavior of popular machine learning classifier evaluation matrices. We demonstrate that a combined F1 score and g-mean, in that specific order, is the best evaluation metric for typical imbalanced fraud detection model classification.

Empirical Study of Market Impact Conditional on Order-Flow Imbalance

In this research, we have empirically investigated the key drivers affecting liquidity in equity markets. We illustrated how theoretical models, such as Kyle's model, of agents' interplay in the financial markets, are aligned with the phenomena observed in publicly available trades and quotes data. Specifically, we confirmed that for small signed order-flows, the price impact grows linearly with increase in the order-flow imbalance. We have, further, implemented a machine learning algorithm to forecast market impact given a signed order-flow. Our findings suggest that machine learning models can be used in estimation of financial variables; and predictive accuracy of such learning algorithms can surpass the performance of traditional statistical approaches. Understanding the determinants of price impact is crucial for several reasons. From a theoretical stance, modelling the impact provides a statistical measure of liquidity. Practitioners adopt impact models as a pre-trade tool to estimate expected transaction costs and optimize the execution of their strategies. This further serves as a post-trade valuation benchmark as suboptimal execution can significantly deteriorate a portfolio performance. More broadly, the price impact reflects the balance of liquidity across markets. This is of central importance to regulators as it provides an all-encompassing explanation of the correlation between market design and systemic risk, enabling regulators to design more stable and efficient markets.

An Empirical Study of NetOps Capability of Pre-Trained Large Language Models

Large language models (LLMs) can respond to human language queries and have shown powerful potential applications in network operations (NetOps). Thanks to the large amount of commonsense knowledge inherent, LLMs achieve much better inference accuracy than traditional models and emerge with strong abilities in generalization, reasoning, and code generation. These abilities may have a crucial boost to automated and intelligent NetOps. However, it remains under-explored how well LLMs perform in various NetOps tasks. In this work, we make a systematic assessment of the capabilities, strengths, and limitations of selected LLMs in the field of NetOps. The evaluation is conducted on a collection of 5,732 questions about NetOps, encompassing 26 publicly available general-domain LLMs, including ChatGPT, LLaMA, Falcon, etc. We also finetune some of these LLMs with our collected NetOps corpus and evaluate the resulting models. The evaluation method follows the widely adopted benchmarks for general-domain LLMs, combined with Chain-of-Thought Prompts and Retrieval-Augmented Generation. The results show that only GPT-4 achieves high accuracy equivalent to passing the NetOps certification exam for humans, while all the other LLMs have much lower accuracy. However, some open models like LLaMA 2 still demonstrate significant potential. Furthermore, we evaluate the impact of factors such as model parameters, prompt engineering, instruction fine-tuning etc. This work shall be treated as the initial effort to systematic evaluation of LLMs in NetOps, and a more rigorous study is required for production use. The evaluation code and dataset will be released to benefit future research.

An Empirical Study of Mamba-based Language Models

Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.

Comparing Software Developers with ChatGPT: An Empirical Investigation

The advent of automation in particular Software Engineering (SE) tasks has transitioned from theory to reality. Numerous scholarly articles have documented the successful application of Artificial Intelligence to address issues in areas such as project management, modeling, testing, and development. A recent innovation is the introduction of ChatGPT, an ML-infused chatbot, touted as a resource proficient in generating programming codes and formulating software testing strategies for developers and testers respectively. Although there is speculation that AI-based computation can increase productivity and even substitute software engineers in software development, there is currently a lack of empirical evidence to verify this. Moreover, despite the primary focus on enhancing the accuracy of AI systems, non-functional requirements including energy efficiency, vulnerability, fairness (i.e., human bias), and safety frequently receive insufficient attention. This paper posits that a comprehensive comparison of software engineers and AI-based solutions, considering various evaluation criteria, is pivotal in fostering human-machine collaboration, enhancing the reliability of AI-based methods, and understanding task suitability for humans or AI. Furthermore, it facilitates the effective implementation of cooperative work structures and human-in-the-loop processes. This paper conducts an empirical investigation, contrasting the performance of software engineers and AI systems, like ChatGPT, across different evaluation metrics. The empirical study includes a case of assessing ChatGPT-generated code versus code produced by developers and uploaded in Leetcode.

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.

An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities

Code generation aims to automatically generate code snippets of specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code generation task to achieve remarkable performance. One main challenge of pre-trained models for code generation is the semantic gap between natural language requirements and source code. To address the issue, prior studies typically adopt a retrieval-augmented framework for the task, where the similar code snippets collected by a retrieval process can be leveraged to help understand the requirements and provide guidance for the generation process. However, there is a lack of systematic study on the application of this framework for code generation, including the impact of the final generated results and the specific usage of the framework. In this paper, we choose three popular pre-trained code models, namely CodeGen, UniXcoder, and CodeT5, to assess the impact of the quality and utilization of retrieved code on the retrieval-augmented framework. Our analysis shows that the retrieval-augmented framework is beneficial for improving the performance of the existing pre-trained models. We also provide suggestions on the utilization of the retrieval-augmented code generation framework: BM25 and Sequential Integration Fusion are recommended due to their convenience and superior performance. Sketch Filling Fusion, which extracts a sketch of relevant code, could help the model improve its performance further. Additionally, we conduct experiments to investigate the influence of the retrieval-augmented framework on large language models for code generation, showing the effectiveness of the framework, and we discuss the trade-off between performance improvement and computational costs in each phase within the framework.

An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models

Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages. This problem is commonly tackled by continually pre-training and fine-tuning these models for said languages. A significant issue in this process is the limited vocabulary coverage in the original model's tokenizer, leading to inadequate representation of new languages and necessitating an expansion of the tokenizer. The initialization of the embeddings corresponding to new vocabulary items presents a further challenge. Current strategies require cross-lingual embeddings and lack a solid theoretical foundation as well as comparisons with strong baselines. In this paper, we first establish theoretically that initializing within the convex hull of existing embeddings is a good initialization, followed by a novel but simple approach, Constrained Word2Vec (CW2V), which does not require cross-lingual embeddings. Our study evaluates different initialization methods for expanding RoBERTa and LLaMA 2 across four languages and five tasks. The results show that CW2V performs equally well or even better than more advanced techniques. Additionally, simpler approaches like multivariate initialization perform on par with these advanced methods indicating that efficient large-scale multilingual continued pretraining can be achieved even with simpler initialization methods.

An Empirical Study on Developers Shared Conversations with ChatGPT in GitHub Pull Requests and Issues

ChatGPT has significantly impacted software development practices, providing substantial assistance to developers in a variety of tasks, including coding, testing, and debugging. Despite its widespread adoption, the impact of ChatGPT as an assistant in collaborative coding remains largely unexplored. In this paper, we analyze a dataset of 210 and 370 developers shared conversations with ChatGPT in GitHub pull requests (PRs) and issues. We manually examined the content of the conversations and characterized the dynamics of the sharing behavior, i.e., understanding the rationale behind the sharing, identifying the locations where the conversations were shared, and determining the roles of the developers who shared them. Our main observations are: (1) Developers seek ChatGPT assistance across 16 types of software engineering inquiries. In both conversations shared in PRs and issues, the most frequently encountered inquiry categories include code generation, conceptual questions, how-to guides, issue resolution, and code review. (2) Developers frequently engage with ChatGPT via multi-turn conversations where each prompt can fulfill various roles, such as unveiling initial or new tasks, iterative follow-up, and prompt refinement. Multi-turn conversations account for 33.2% of the conversations shared in PRs and 36.9% in issues. (3) In collaborative coding, developers leverage shared conversations with ChatGPT to facilitate their role-specific contributions, whether as authors of PRs or issues, code reviewers, or collaborators on issues. Our work serves as the first step towards understanding the dynamics between developers and ChatGPT in collaborative software development and opens up new directions for future research on the topic.

A Large-scale Empirical Study on Improving the Fairness of Deep Learning Models

Fairness has been a critical issue that affects the adoption of deep learning models in real practice. To improve model fairness, many existing methods have been proposed and evaluated to be effective in their own contexts. However, there is still no systematic evaluation among them for a comprehensive comparison under the same context, which makes it hard to understand the performance distinction among them, hindering the research progress and practical adoption of them. To fill this gap, this paper endeavours to conduct the first large-scale empirical study to comprehensively compare the performance of existing state-of-the-art fairness improving techniques. Specifically, we target the widely-used application scenario of image classification, and utilized three different datasets and five commonly-used performance metrics to assess in total 13 methods from diverse categories. Our findings reveal substantial variations in the performance of each method across different datasets and sensitive attributes, indicating over-fitting on specific datasets by many existing methods. Furthermore, different fairness evaluation metrics, due to their distinct focuses, yield significantly different assessment results. Overall, we observe that pre-processing methods and in-processing methods outperform post-processing methods, with pre-processing methods exhibiting the best performance. Our empirical study offers comprehensive recommendations for enhancing fairness in deep learning models. We approach the problem from multiple dimensions, aiming to provide a uniform evaluation platform and inspire researchers to explore more effective fairness solutions via a set of implications.

An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry

Deep Neural Networks (DNNs) are being adopted as components in software systems. Creating and specializing DNNs from scratch has grown increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, machine learning engineers have begun to reuse large-scale pre-trained models (PTMs) and fine-tune these models for downstream tasks. Prior works have studied reuse practices for traditional software packages to guide software engineers towards better package maintenance and dependency management. We lack a similar foundation of knowledge to guide behaviors in pre-trained model ecosystems. In this work, we present the first empirical investigation of PTM reuse. We interviewed 12 practitioners from the most popular PTM ecosystem, Hugging Face, to learn the practices and challenges of PTM reuse. From this data, we model the decision-making process for PTM reuse. Based on the identified practices, we describe useful attributes for model reuse, including provenance, reproducibility, and portability. Three challenges for PTM reuse are missing attributes, discrepancies between claimed and actual performance, and model risks. We substantiate these identified challenges with systematic measurements in the Hugging Face ecosystem. Our work informs future directions on optimizing deep learning ecosystems by automated measuring useful attributes and potential attacks, and envision future research on infrastructure and standardization for model registries.

Inference Scaling scriptsizeFLaws: The Limits of LLM Resampling with Imperfect Verifiers

Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models, such as by repeatedly sampling solutions to a coding problem until it passes unit tests. The central thesis of this paper is that there is no free lunch for inference scaling: indefinite accuracy improvement through resampling can only be realized if the "verifier" (in this case, a set of unit tests) is perfect. When the verifier is imperfect, as it almost always is in domains such as reasoning or coding (for example, unit tests have imperfect coverage), there is a nonzero probability of false positives: incorrect solutions that pass the verifier. Resampling cannot decrease this probability, so it imposes an upper bound to the accuracy of resampling-based inference scaling even with an infinite compute budget. We find that there is a very strong correlation between the model's single-sample accuracy (i.e. accuracy without unit tests) and its false positive rate on coding benchmarks HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model (Fig. 1a). When we consider that false positives have a negative utility compared to abstaining from producing a solution, it bends the inference scaling curve further downward. Empirically, we find that the optimal number of samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.

Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms

Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance between these three algorithms, by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.

Diminished Diversity-of-Thought in a Standard Large Language Model

We test whether Large Language Models (LLMs) can be used to simulate human participants in social-science studies. To do this, we run replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model, colloquially known as GPT3.5. Based on our pre-registered analyses, we find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results. However, we were unable to analyse the remaining six studies due to an unexpected phenomenon we call the "correct answer" effect. Different runs of GPT3.5 answered nuanced questions probing political orientation, economic preference, judgement, and moral philosophy with zero or near-zero variation in responses: with the supposedly "correct answer." In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that precede the prompt. In another, we found that most but not all "correct answers" were robust to changing the order of answer choices. One of our most striking findings occurred in our replication of the Moral Foundations Theory survey results, where we found GPT3.5 identifying as a political conservative in 99.6% of the cases, and as a liberal in 99.3% of the cases in the reverse-order condition. However, both self-reported 'GPT conservatives' and 'GPT liberals' showed right-leaning moral foundations. Our results cast doubts on the validity of using LLMs as a general replacement for human participants in the social sciences. Our results also raise concerns that a hypothetical AI-led future may be subject to a diminished diversity-of-thought.

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

Large Language Models (LLMs) are often described as being instances of foundation models - that is, models that transfer strongly across various tasks and conditions in few-show or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different functions and tasks rely on measurements taken across various sets of standardized benchmarks showing high scores for such models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function, using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans. The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical "reasoning"-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible. Various standard interventions in an attempt to get the right solution, like various type of enhanced prompting, or urging the models to reconsider the wrong solutions again by multi step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of current generation of LLMs, Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks. Code for reproducing experiments in the paper and raw experiments data can be found at https://github.com/LAION-AI/AIW

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?

Algorithmic approaches to interpreting machine learning models have proliferated in recent years. We carry out human subject tests that are the first of their kind to isolate the effect of algorithmic explanations on a key aspect of model interpretability, simulatability, while avoiding important confounding experimental factors. A model is simulatable when a person can predict its behavior on new inputs. Through two kinds of simulation tests involving text and tabular data, we evaluate five explanations methods: (1) LIME, (2) Anchor, (3) Decision Boundary, (4) a Prototype model, and (5) a Composite approach that combines explanations from each method. Clear evidence of method effectiveness is found in very few cases: LIME improves simulatability in tabular classification, and our Prototype method is effective in counterfactual simulation tests. We also collect subjective ratings of explanations, but we do not find that ratings are predictive of how helpful explanations are. Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability across a variety of explanation methods and data domains. We show that (1) we need to be careful about the metrics we use to evaluate explanation methods, and (2) there is significant room for improvement in current methods. All our supporting code, data, and models are publicly available at: https://github.com/peterbhase/InterpretableNLP-ACL2020

Quantifying Variance in Evaluation Benchmarks

Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale (sim7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.

Flexible Model Aggregation for Quantile Regression

Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions, or to model a diverse population without being overly reductive. For instance, epidemiological forecasts, cost estimates, and revenue predictions all benefit from being able to quantify the range of possible values accurately. As such, many models have been developed for this problem over many years of research in statistics, machine learning, and related fields. Rather than proposing yet another (new) algorithm for quantile regression we adopt a meta viewpoint: we investigate methods for aggregating any number of conditional quantile models, in order to improve accuracy and robustness. We consider weighted ensembles where weights may vary over not only individual models, but also over quantile levels, and feature values. All of the models we consider in this paper can be fit using modern deep learning toolkits, and hence are widely accessible (from an implementation point of view) and scalable. To improve the accuracy of the predicted quantiles (or equivalently, prediction intervals), we develop tools for ensuring that quantiles remain monotonically ordered, and apply conformal calibration methods. These can be used without any modification of the original library of base models. We also review some basic theory surrounding quantile aggregation and related scoring rules, and contribute a few new results to this literature (for example, the fact that post sorting or post isotonic regression can only improve the weighted interval score). Finally, we provide an extensive suite of empirical comparisons across 34 data sets from two different benchmark repositories.