[Review] Assisting Static Analysis with Large Language Models: A ChatGPT Experiment

Link

The paper demonstrates the effectiveness of LLMs in assisting static analysis.

The most important takeaway from this paper is the task division and the workflow design. First, we need to figure out what the LLM is good at and assign only those tasks to it. Moreover, we need to pay attention to the design of the workflow, which can significantly affect the final result.

Background

Traditional static analysis tools have shortcomings. Embedding an LLM into the toolchain can help the analysis.

In this paper, Use Before Initialization (UBI) bugs are chosen as the example.

UBITect, a static analysis tool for UBI bugs, has shortcomings in detection and may discard some cases. An LLM can help determine whether these discarded warnings are true bugs.
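To make the triage idea concrete, here is a minimal sketch of such a loop, assuming a generic chat-completion client `query_llm` and a simplified prompt of my own; the paper's actual prompts and multi-step workflow are more elaborate.

```python
# Hypothetical sketch: feed each UBITect warning plus the relevant source
# snippet to an LLM and keep only the cases it judges to be real bugs.
# `query_llm` is a placeholder for any chat-completion API.

def build_prompt(func_source: str, variable: str, use_line: int) -> str:
    return (
        f"A static analyzer reports that '{variable}' may be used before "
        f"initialization at line {use_line} of this C function. Is there a "
        "feasible path on which it is truly uninitialized? "
        "Answer 'bug' or 'false alarm'.\n\n" + func_source
    )

def triage(warnings, query_llm):
    confirmed = []
    for w in warnings:  # each w: {'source': ..., 'variable': ..., 'line': ...}
        answer = query_llm(build_prompt(w["source"], w["variable"], w["line"]))
        if answer.strip().lower().startswith("bug"):
            confirmed.append(w)
    return confirmed
```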

[Review] One Simple API Can Cause Hundreds of Bugs: An Analysis of Refcounting Bugs in All Modern Linux Kernels

Link

The paper focuses on reference counting (refcounting) bugs in the Linux kernel.

  1. It analyzes the history of 1,033 refcounting bugs across 753 Linux kernel versions from 2005 to 2022, and distills 9 critical rules for checking refcounting bugs.
  2. It designs a new tool that applies these 9 rules and detects 351 new bugs, of which 240 have been confirmed.

Introduction

Reference counting bugs: a reference count records the number of references to an object (similar to smart pointers in C++), and bugs arise when the count is not updated correctly.

Potential risks: memory leaks and use-after-free (UAF).
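A toy Python sketch of the get/put discipline and the two failure modes (illustrative only; kernel refcounting is C code built on atomics):

```python
class RefCounted:
    """Toy stand-in for a kernel object with a reference count."""

    def __init__(self, payload):
        self.payload = payload
        self.count = 1          # the creator holds the first reference
        self.freed = False

    def get(self):              # acquire, like kref_get()
        self.count += 1
        return self

    def put(self):              # release, like kref_put()
        self.count -= 1
        if self.count == 0:
            self.freed = True   # stand-in for kfree()

obj = RefCounted("data")
alias = obj.get()               # count == 2
obj.put()                       # count == 1, alias still valid
alias.put()                     # count == 0, freed exactly once
# Bug patterns: skipping one put() on an error path leaks the object;
# an extra put() drops the count to zero early, so a holder that still
# dereferences the object triggers a use-after-free.
```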

[Review] PyRTFuzz: Detecting Bugs in Python Runtimes via Two-Level Collaborative Fuzzing

Link here

The paper proposes a new approach to fuzzing the Python runtime, called PyRTFuzz.

PyRTFuzz divides the fuzzing process into two levels (see the sketch after this list):

  1. the generation-based level: generate diverse Python applications.
  2. the mutation-based level: apply mutation-based fuzzing to test the generated Python applications.
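A minimal sketch of the two levels, under heavy assumptions: the app template is hand-written (PyRTFuzz derives apps from API descriptions), the target API is an arbitrary example, and the crash oracle is just the interpreter's exit code.

```python
import random
import string
import subprocess
import sys
import tempfile

# Level 1 (generation): a Python "app" that drives one runtime-library API
# and takes its input from argv. Hand-written here for illustration.
APP_TEMPLATE = """import sys, base64
base64.b64decode(sys.argv[1])   # runtime-library API under test
"""

def mutate(seed: str) -> str:
    # Level 2 (mutation): byte-level mutation of the concrete input value.
    pos = random.randrange(len(seed))
    return seed[:pos] + random.choice(string.printable) + seed[pos + 1:]

def crashed(app: str, data: str) -> bool:
    proc = subprocess.run([sys.executable, app, data], capture_output=True)
    return proc.returncode < 0   # negative code: interpreter died on a signal

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(APP_TEMPLATE)

seed = "aGVsbG8="
for _ in range(200):
    seed = mutate(seed)
    if crashed(f.name, seed):
        print("interpreter crash with input:", seed)
```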

Background:

Three existing problems for Python fuzzing:

  1. testing the Python runtime requires testing both the interpreter core and the language’s runtime libraries.
  2. diverse and valid (syntactically and semantically correct) Python applications are needed.
  3. static type information is not available in Python, so type-aware input generation is difficult.

[Review] DynSQL: Stateful Fuzzing for Database Management Systems with Complex and Valid SQL Query Generation

Link here

The paper designs a stateful DBMS fuzzer called DynSQL. DynSQL adopts two new methods: Dynamic Query Interaction and Error Feedback.

Instead of generating a whole SQL query before executing it, Dynamic Query Interaction lets the fuzzer test the DBMS step by step: it dynamically determines the next statement after executing each prior statement.

Error Feedback, meanwhile, guides seed generation toward more valid SQL statements and queries.
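A minimal sketch of both ideas against SQLite (illustrative only; DynSQL targets many DBMSs with a much richer statement grammar and real feedback):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
query = []

def next_statement() -> str:
    # Dynamic Query Interaction: read the live schema after every statement
    # and pick the next statement based on the current DBMS state.
    tables = [r[0] for r in
              cur.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    if not tables or random.random() < 0.3:
        return f"CREATE TABLE t{len(tables)} (a INTEGER, b TEXT)"
    t = random.choice(tables)
    return random.choice([
        f"INSERT INTO {t} VALUES ({random.randint(0, 9)}, 'x')",
        f"SELECT a, COUNT(*) FROM {t} GROUP BY a",
        f"ALTER TABLE {t} ADD COLUMN c{random.randint(0, 99)} REAL",
    ])

for _ in range(20):
    stmt = next_statement()
    try:
        cur.execute(stmt)
        query.append(stmt)      # keep only statements that executed cleanly
    except sqlite3.Error:
        pass                    # Error Feedback: drop statements that error

print(";\n".join(query))
```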

Background:

Earlier DBMS testing tools include SQLsmith, SQUIRREL, and SQLancer.

Existing DBMS fuzzers are still limited in generating complex and valid queries to find deep bugs in DBMSs.

SQLsmith generates only one statement per query, while SQUIRREL produces over 50% invalid queries and tends to generate simple statements.

SQLancer aims at finding logic bugs in DBMSs rather than general bugs.

[Review] Examining Zero-Shot Vulnerability Repair with Large Language Models

Link here

The paper evaluates the performance of LLMs for program repair, the same topic as Automated Program Repair in the Era of Large Pre-trained Language Models. Unlike that work, this paper focuses more on the details, and its program repair setting is much more complicated.

Some conclusions were drawn:

  • LLMs can generate fixes for bugs.
  • In real-world settings, however, their performance is not yet sufficient.

Background:

  • Security bugs are significant.
  • LLMs are popular and have shown outstanding performance.

Implementation:

RQ1: Can off-the-shelf LLMs generate safe and functional code to fix security vulnerabilities?

RQ2: Does varying the amount of context in the comments of a prompt affect the LLM’s ability to suggest fixes? (A prompt-construction sketch follows the RQs.)

RQ3: What are the challenges when using LLMs to fix vulnerabilities in the real world?

RQ4: How reliable are LLMs at generating repairs?
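For RQ2, prompt construction could look like the following sketch; the context levels and the C snippet are my own illustration, not the paper's exact templates.

```python
# Vary how much the comment preceding the fix location tells the model.
CONTEXT_LEVELS = {
    "none":    "",
    "short":   "/* BUG: fix the following line */\n",
    "verbose": "/* BUG: stack buffer overflow (CWE-121).\n"
               "   The copy below can write past the end of `buf`;\n"
               "   bound it by sizeof(buf). */\n",
}

def build_repair_prompt(prefix: str, level: str) -> str:
    """prefix: source code up to (but excluding) the vulnerable line;
    the LLM is asked to complete the code from here."""
    return prefix + CONTEXT_LEVELS[level]

c_prefix = (
    "void greet(const char *name) {\n"
    "    char buf[16];\n"
)
for level in CONTEXT_LEVELS:
    print(f"--- {level} ---\n{build_repair_prompt(c_prefix, level)}")
```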

[Review] Automated Program Repair in the Era of Large Pre-trained Language Models

Link here

The paper presents the first extensive evaluation of recent LLMs for fixing real-world projects. It evaluates the effectiveness of Automated Program Repair (APR) in the era of LLMs.

Several conclusions were drawn:

  • As model size increases, the number of correct and plausible patches generated also increases.
  • Successfully utilizing the code after the buggy lines is important for fixing bugs.
  • While LLMs can perform fault localization and repair in one shot, for real-world software systems it is still more cost-effective to first use traditional fault localization techniques to pinpoint the precise bug locations and then leverage LLMs for more targeted patch generation.
  • By directly applying LLMs for APR without any task-specific change/fine-tuning, we can already achieve the highest number of correct fixes compared to existing baselines.
  • Entropy computation via LLMs can help distinguish correct patches from merely plausible ones.
  • Sum entropy performs slightly better than mean entropy (see the sketch after this list).
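A small sketch of the entropy ranking, assuming the per-token log-probabilities of each generated patch are available from the model (the numbers below are made up):

```python
def sum_entropy(token_logprobs):
    # total negative log-likelihood of the patch under the model
    return -sum(token_logprobs)

def mean_entropy(token_logprobs):
    # the same quantity normalized by patch length
    return -sum(token_logprobs) / len(token_logprobs)

patches = {
    "patch_a": [-0.1, -0.3, -0.2],        # short and "natural" to the model
    "patch_b": [-0.5, -1.2, -0.9, -2.1],  # longer and less likely
}
# lower entropy first: more likely correct, not merely plausible
ranked = sorted(patches, key=lambda p: sum_entropy(patches[p]))
print(ranked)
```
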
[Review] Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models

Link here

The paper proposes a new approach to leveraging LLMs to generate input programs for fuzzing DL libraries. More specifically, it applies LLMs (Codex and InCoder) to fuzz DL libraries (PyTorch and TensorFlow).

Background:

  • Previous work on fuzzing DL libraries mainly falls into two categories: API-level fuzzing and model-level fuzzing. Both still have limitations.
  • Model-level fuzzers attempt to leverage complete DL models (which cover various sets of DL library APIs) as test inputs. But due to the input/output constraints of DL APIs, model-level mutation/generation is hard to perform, leading to a limited number of unique APIs covered.
  • API-level fuzzing focuses on finding bugs within a single API at a time. But API-level fuzzers cannot detect bugs that arise from interactions within a complex API sequence (a minimal execution harness is sketched after this list).
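A minimal sketch of the execution-harness side of such an approach, assuming PyTorch is installed; the generation side (prompting Codex/InCoder to produce and mutate candidate programs) is omitted.

```python
import subprocess
import sys

# An LLM-generated candidate program (hard-coded here for illustration).
candidate = """import torch
x = torch.rand(3, 3)
print(torch.nn.functional.relu(x).sum())
"""

def run_candidate(program: str) -> str:
    # Run in a subprocess so a native crash cannot take down the harness.
    try:
        proc = subprocess.run([sys.executable, "-c", program],
                              capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return "hang"
    if proc.returncode < 0:
        return "crash"              # killed by a signal: likely a library bug
    if proc.returncode != 0:
        return "python-exception"   # usually an invalid program, not a bug
    return "ok"

print(run_candidate(candidate))
```
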
[Review] autofz: Automated Fuzzer Composition at Runtime

Link here

This paper proposes a new fuzzing mechanism that integrates several fuzzers into a single fuzzing process. For every workload, an optimal mixture of one or several fuzzers is employed. Unlike earlier work, autofz:

  1. Needs no presetting or human effort.
  2. Allocates fuzzers per workload rather than per program (see the scheduling sketch after this list).
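A minimal sketch of this style of runtime scheduling (not autofz's exact algorithm): briefly run every fuzzer, measure each one's coverage growth, then split the next time window in proportion to that growth, and repeat so the mixture keeps adapting.

```python
def allocate(trends: dict, window: float) -> dict:
    """trends: new coverage (e.g., branches) each fuzzer found while probed."""
    total = sum(trends.values())
    if total == 0:                  # nobody progressed: split the time evenly
        return {f: window / len(trends) for f in trends}
    return {f: window * g / total for f, g in trends.items()}

# growth observed during a 60-second exploration phase (made-up numbers)
trends = {"afl": 120, "libfuzzer": 45, "honggfuzz": 0}
print(allocate(trends, window=300.0))  # seconds in the exploitation phase
```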

Background:

  • A large number of fuzzers have been created, which makes it difficult to choose a proper fuzzer for a specific fuzzing task.
  • No universal fuzzer perpetually outperforms the others, so choosing an optimal fuzzer is difficult.
  • A fuzzer’s effectiveness may not last for the whole fuzzing process.
  • Fuzzing is a random process, so a fuzzer that was optimal once may not stay optimal.

[Review] How IoT Re-using Threatens Your Sensitive Data: Exploring the User-Data Disposal in Used IoT Devices

Link here

This paper performs the first in-depth investigation of user-data disposal in used IoT devices, and finds that:

  1. Most users lack awareness of the need to dispose of their data when discarding used IoT devices.
  2. IoT devices collect more sensitive data than users expect, and current data protections on used IoT devices are inadequate.
  3. The disposal methods for used IoT devices are often ineffective.

Implementation:

RQ1: Which kinds of sensitive data reside in used IoT devices?

RQ2: Which methods can be used to dispose of sensitive data?

RQ3: Are existing disposal methods effective in erasing the sensitive data?

[Review] MINER: A Hybrid Data-Driven Approach for REST API Fuzzing

Link here

This paper proposes a new approach for REST API fuzzing, which:

  1. Focuses more on long request sequences.
  2. Introduces a customized attention model to support the fuzzing process.
  3. Implements a new data-driven security rule checker to capture a new kind of error caused by undefined parameters.

[1]: The REST standard usually includes GET, POST, PUT, and DELETE requests.

Motivation:

Cloud service testing is important, but earlier works (like RESTler) fail to generate long request sequences, which are needed to detect deep errors hidden in hard-to-reach states of cloud services. MINER applies length-oriented mechanisms to generate long request sequences and an attention model to help pass semantic checking. Furthermore, it applies a data-driven security rule checker to capture a new kind of error caused by undefined parameters.
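A minimal sketch of a length-oriented sequence generator (the endpoints are hypothetical and the validity check is elided; MINER additionally uses its attention model to fill in semantically valid parameter values):

```python
import random

REQUESTS = ["GET /users", "POST /users", "PUT /users/1", "DELETE /users/1"]

def pick_weighted_by_length(pool):
    # longer sequences are more likely to be chosen for extension
    return random.choices(pool, weights=[len(s) for s in pool], k=1)[0]

pool = [["GET /users"]]
for _ in range(50):
    seq = pick_weighted_by_length(pool) + [random.choice(REQUESTS)]
    # in the real tool the sequence is replayed against the cloud service and
    # kept only if the responses are valid; here every sequence is accepted
    pool.append(seq)

print("longest sequence:", " -> ".join(max(pool, key=len)))
```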
