[Review] Automated Program Repair in the Era of Large Pre-trained Language Models
The paper presents the first extensive evaluation of recent LLMs for fixing bugs in real-world projects, studying how effective Automated Program Repair (APR) is in the era of LLMs.
Several conclusions were drawn:
- As model size increases, the number of correct and plausible patches generated also increases.
- Successfully utilizing the code after the buggy lines (the suffix context) is important for fixing bugs (see the infilling sketch after this list).
- While LLMs can perform fault localization and repair in one shot, for real-world software systems it is still more cost-effective to first use traditional fault localization techniques to pinpoint the precise bug locations, and then leverage LLMs for more targeted patch generation.
- By directly applying LLMs to APR without any task-specific modification or fine-tuning, we can already achieve more correct fixes than existing baselines.
- Entropy computed with LLMs can help distinguish correct patches from merely plausible ones (see the ranking sketch after this list).
- Sum entropy performs slightly better than mean entropy.
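
To make the suffix-context point concrete, below is a minimal sketch (not the paper's exact tooling) of how a cloze-style infilling prompt can keep the code *after* the buggy line as conditioning context. Here `generate_infill` is a hypothetical hook standing in for any fill-in-the-middle-capable code LLM, and the buggy line number is assumed to come from a traditional fault localizer.

```python
def build_infill_prompt(source_lines, buggy_line_no, context=10):
    """Split the file into prefix/suffix context around the buggy line."""
    prefix = "\n".join(source_lines[max(0, buggy_line_no - context):buggy_line_no])
    suffix = "\n".join(source_lines[buggy_line_no + 1:buggy_line_no + 1 + context])
    return prefix, suffix

def propose_patches(source_lines, buggy_line_no, generate_infill, n_samples=20):
    """Sample candidate replacement lines for the buggy location.

    `generate_infill(prefix=..., suffix=...)` is a hypothetical callable wrapping
    an infilling-capable code LLM. Each sampled candidate would later be
    validated against the project's test suite to keep only plausible patches.
    """
    prefix, suffix = build_infill_prompt(source_lines, buggy_line_no)
    return [generate_infill(prefix=prefix, suffix=suffix) for _ in range(n_samples)]
```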
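
The entropy-based ranking from the last two points can be sketched as follows, under the assumption that `score_tokens` is a hypothetical hook returning the per-token log-probabilities the LLM assigns to a patch. Sum entropy totals the per-token negative log-probabilities, while mean entropy length-normalizes them; lower entropy (more "natural" code) is ranked higher.

```python
def sum_entropy(token_logprobs):
    """Sum of per-token negative log-probabilities of the patch tokens."""
    return -sum(token_logprobs)

def mean_entropy(token_logprobs):
    """Average per-token negative log-probability (length-normalized)."""
    return -sum(token_logprobs) / len(token_logprobs)

def rank_plausible_patches(patches, score_tokens, use_sum=True):
    """Rank test-passing (plausible) patches so likely-correct ones come first.

    `score_tokens(patch)` is a hypothetical hook returning the per-token
    log-probabilities assigned to the patch by the LLM.
    """
    key = sum_entropy if use_sum else mean_entropy
    return sorted(patches, key=lambda p: key(score_tokens(p)))
```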
![[Review] Automated Program Repair in the Era of Large Pre-trained Language Models](/blog/images/23/cover.jpg)