[Review] CryptoGuard: High Precision Detection of Cryptographic Vulnerabilities in Massive-sized Java Projects

Link here

The paper designs a new tool called CryptoGuard to detect cryptographic API misuse in massive-sized Java projects.

It uses 16 rules to identify misuses and 5 refinement methods to reduce false positives, resulting in a precision of 98.61%.

It also creates a benchmark named CryptoApi-Bench with 112 unit test cases. CryptoApi-Bench contains basic intraprocedural instances, interprocedural cases, field-sensitive cases, false-positive tests, and correct API uses.
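
For a concrete sense of what such a test case looks like (my own illustration, not code taken from CryptoApi-Bench or the paper), a basic intraprocedural misuse pairs a broken cipher with a hardcoded key, while the correct use relies on a strong algorithm and a freshly generated key:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

public class CryptoMisuseExample {

    // Misuse: hardcoded key material and a broken cipher ("DES" defaults to ECB mode),
    // the kind of basic intraprocedural case a rule-based checker flags.
    static byte[] insecureEncrypt(byte[] plaintext) throws Exception {
        SecretKeySpec key = new SecretKeySpec("12345678".getBytes(), "DES");
        Cipher cipher = Cipher.getInstance("DES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }

    // Correct use: a strong algorithm (AES-256 in GCM mode) with a freshly generated key.
    static byte[] secureEncrypt(byte[] plaintext) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }
}
```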

Introduction:

For cryptographic API misuse detection, both static and dynamic analyses have their respective pros and cons.

Static methods do not require the execution of programs. They scale up to a large number of programs, cover a wide range of security rules, and are unlikely to have false negatives.

Dynamic methods require one to trigger and detect specific misuse symptoms at runtime. They tend to produce fewer false positives than static analysis.

API misuses mainly involve the following problems:

Read more
[Review] Automatic Detection of Java Cryptographic API Misuses: Are We There Yet?

Link here

A large study of Java cryptographic API misuse.

Two main contributions are made:

  1. evaluate the effectiveness of existing cryptographic API misuse detection tools.
  2. conduct a study with developers, measuring the real-world performance of the detectors.

Introduction:

JCA (Java Cryptography Architecture), JSSE (Java Secure Socket Extension).

Java cryptographic API misuses are common and may cause extensive security problems.

13 Java types are frequently mentioned in the API-misuse patterns.
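
As one illustration of the kind of pattern these types appear in (my own example, not taken from the paper), a frequently cited JSSE misuse is an X509TrustManager that accepts every certificate, which silently disables server authentication:

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllExample {

    static SSLContext insecureContext() throws Exception {
        TrustManager trustAll = new X509TrustManager() {
            // Misuse: empty checks accept any certificate chain, enabling man-in-the-middle attacks.
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { trustAll }, new SecureRandom());
        return ctx;
    }
}
```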

Read more
[Review] PyRTFuzz: Detecting Bugs in Python Runtimes via Two-Level Collaborative Fuzzing

Link here

The paper proposes a new approach to Python fuzzing, called PyRTFuzz.

PyRTFuzz divides the fuzzing process into two levels:

  1. the generation-based level: generates Python applications.
  2. the mutation-based level: applies mutation-based fuzzing to test the generated Python applications (see the sketch after this list).
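
A toy sketch of how the two levels collaborate, written in Java purely for illustration (PyRTFuzz itself is not implemented this way): level 1 emits a small Python application, and level 2 repeatedly mutates its input and watches for abnormal interpreter exits.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public class TwoLevelFuzzSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Level 1 (generation-based): emit a small, valid Python application that
        // exercises one runtime-library API (here: json, reading from stdin).
        String app = "import sys, json\n"
                   + "json.loads(sys.stdin.buffer.read())\n";
        Path script = Files.createTempFile("generated_app", ".py");
        Files.writeString(script, app);

        // Level 2 (mutation-based): repeatedly mutate a seed input and feed it to the
        // generated application, watching for abnormal interpreter exits (e.g., crashes).
        byte[] seed = "{\"key\": [1, 2, 3]}".getBytes(StandardCharsets.UTF_8);
        Random rnd = new Random();
        for (int i = 0; i < 100; i++) {
            byte[] input = seed.clone();
            input[rnd.nextInt(input.length)] ^= (byte) (1 << rnd.nextInt(8)); // flip one bit
            Process p = new ProcessBuilder("python3", script.toString())
                    .redirectErrorStream(true)
                    .start();
            p.getOutputStream().write(input);
            p.getOutputStream().close();
            int exit = p.waitFor();
            // Exit code 1 is just a Python exception; anything else (e.g., a signal) is suspicious.
            if (exit != 0 && exit != 1) {
                System.out.println("Suspicious exit code " + exit + " at iteration " + i);
            }
        }
    }
}
```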

Background:

Three existing problems for Python fuzzing:

  1. testing the Python runtime requires testing both the interpreter core and the language’s runtime libraries.
  2. diverse and valid (syntactically and semantically correct) Python applications are needed.
  3. explicit type information is often unavailable in Python (a dynamically typed language), so type-aware input generation is difficult.
Read more
[Review] DynSQL: Stateful Fuzzing for Database Management Systems with Complex and Valid SQL Query Generation

Link here

The paper designs a stateful DBMS fuzzer called DynSQL. DynSQL adopts two new methods: Dynamic Query Interaction and Error Feedback.

Instead of generating an entire SQL query before executing it, Dynamic Query Interaction lets the fuzzer test the DBMS “step by step”: it dynamically determines the next statement after executing each prior statement.

In addition, Error Feedback guides seed generation toward producing more valid SQL statements and queries.
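
A rough Java/JDBC sketch of the step-by-step idea (my own illustration, not DynSQL's implementation; the H2 in-memory JDBC URL and the statement picker are hypothetical): execute one statement, observe whether the DBMS accepts it, and only then decide what to send next.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StepByStepFuzzSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL; any DBMS with a JDBC driver on the classpath would do.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:fuzz");
             Statement stmt = conn.createStatement()) {
            List<String> accepted = new ArrayList<>();
            String next = "CREATE TABLE t0 (c0 INT, c1 VARCHAR(20))";
            for (int step = 0; step < 10; step++) {
                try {
                    stmt.execute(next);             // execute one statement at a time
                    accepted.add(next);             // keep it only if the DBMS accepted it
                } catch (SQLException e) {
                    // Error feedback: the DBMS rejected the statement; drop it and try another.
                }
                next = pickNextStatement(accepted); // decide the next statement from the current state
            }
        }
    }

    // Placeholder for state-aware generation: DynSQL consults the DBMS's current state here;
    // this toy version just alternates simple statements over the table created above.
    static String pickNextStatement(List<String> acceptedSoFar) {
        return (acceptedSoFar.size() % 2 == 0)
                ? "INSERT INTO t0 VALUES (" + acceptedSoFar.size() + ", 'x')"
                : "SELECT c0 FROM t0 WHERE c0 > " + acceptedSoFar.size();
    }
}
```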

Background:

Former DBMS testing tools: SQLsmith, SQUIRREL, SQLancer.

Existing DBMS fuzzers are still limited in generating complex and valid queries to find deep bugs in DBMSs.

SQLsmith generates only one statement per query, while SQUIRREL produces over 50% invalid queries and tends to generate simple statements.

SQLancer aims to find logic bugs in DBMSs rather than general bugs.

Read more
[Review] Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

Link here

The paper explores the ability of ChatGPT (specifically ChatGPT, not LLMs in general) to find failure-inducing tests and proposes a new method called Differential Prompting to do so. It achieves a success rate of 75% on QuixBugs programs and 66.7% on Codeforces programs.

This approach may only be useful for small-scale programs (less than 100 LOC).

Background:

Failure-inducing tests are test cases that trigger bugs in a given program. Finding such tests is a major objective in software engineering, but it is challenging in practice.

Recently, applying LLMs (e.g., ChatGPT) to software engineering has become popular, but directly applying ChatGPT to this task performs poorly, because ChatGPT is insensitive to nuances (i.e., subtle differences between two similar sequences of tokens). It is therefore hard for ChatGPT to identify bugs, since a bug is essentially a nuance between a buggy program and its fixed version.
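
To make the goal concrete, here is a minimal Java sketch of the underlying differential idea (my own illustration, not the paper's pipeline; my understanding is that Differential Prompting obtains the reference version by having ChatGPT infer the program's intention and re-implement it): an input is failure-inducing when the buggy version and a reference version disagree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DifferentialTestingSketch {

    // Buggy version: an off-by-one error skips the last element (an illustrative bug).
    static int buggyMax(List<Integer> xs) {
        int max = xs.get(0);
        for (int i = 1; i < xs.size() - 1; i++) max = Math.max(max, xs.get(i));
        return max;
    }

    // Reference version: a correct re-implementation of the same intention.
    static int referenceMax(List<Integer> xs) {
        int max = xs.get(0);
        for (int x : xs) max = Math.max(max, x);
        return max;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int trial = 0; trial < 1000; trial++) {
            List<Integer> input = new ArrayList<>();
            int n = 2 + rnd.nextInt(5);
            for (int i = 0; i < n; i++) input.add(rnd.nextInt(100));
            // An input is failure-inducing when the two versions disagree on it.
            if (buggyMax(input) != referenceMax(input)) {
                System.out.println("Failure-inducing test input: " + input);
                return;
            }
        }
        System.out.println("No failure-inducing input found in 1000 trials.");
    }
}
```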

Read more
[Review] Prompting Is All You Need: Automated Android Bug Replay with Large Language Models

Link here

This paper demonstrates a new approach to replaying Android bugs. More specifically, it creates a new tool called AdbGPT that automatically converts bug reports into reproductions. In the evaluation, AdbGPT is able to reproduce 81.3% of bug reports in 253.6 seconds, outperforming the state-of-the-art baselines and the ablation variants.

Background:

Bug reports often contain the steps to reproduce (S2Rs) the bugs, which help developers replicate and rectify them, albeit with considerable engineering effort.

Read more
[Review] Examining Zero-Shot Vulnerability Repair with Large Language Models

Link here

The paper tests the performance of LLMs for program repair, the same topic as Automated Program Repair in the Era of Large Pre-trained Language Models. The difference is that this paper focuses more on the details, and its program repair setting is much more complicated.

Some conclusions were drawn:

  • LLMs can generate fixes for bugs.
  • But in real-world settings, their performance is not yet sufficient.

Background:

  • Security bugs are significant.
  • LLMs are popular and have outstanding performance.

Implementation:

RQ1: Can off-the-shelf LLMs generate safe and functional code to fix security vulnerabilities?

RQ2: Does varying the amount of context in the comments of a prompt affect the LLM’s ability to suggest fixes?

RQ3: What are the challenges when using LLMs to fix vulnerabilities in the real world?

RQ4: How reliable are LLMs at generating repairs?

Read more
[Review] Automated Program Repair in the Era of Large Pre-trained Language Models

Link here

The paper presents the first extensive evaluation of recent LLMs for fixing real-world projects. It evaluates the effectiveness of Automated Program Repair (APR) in the era of LLMs.

Several conclusions were drawn:

  • As model size increases, the number of correct and plausible patches generated also increases.
  • Successfully utilizing the code after the buggy lines is important for fixing bugs.
  • While LLMs have the capability to perform fault localization and repair in one shot, for real-world software systems it is still more cost-effective to first use traditional fault localization techniques to pinpoint the precise bug locations and then leverage LLMs for more targeted patch generation.
  • By directly applying LLMs for APR without any specific change/finetuning, we can already achieve the highest number of correct fixes compared to existing baselines.
  • Entropy computation via LLMs can help distinguish correct patches from plausible patches.
  • Sum entropy performs slightly better than mean entropy (a rough sketch of both is given after this list).
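
A minimal sketch of what these two entropy measures mean, assuming the model exposes per-token log-probabilities for a generated patch (the numbers below are made up purely for illustration):

```java
public class PatchEntropySketch {
    public static void main(String[] args) {
        // Per-token log-probabilities (natural log) that an LLM assigned to a candidate patch.
        double[] tokenLogProbs = {-0.05, -1.20, -0.30, -0.02, -2.10};

        // Entropy here is the negative log-likelihood of the generated tokens:
        // summed over the patch (sum entropy) or averaged per token (mean entropy).
        double sumEntropy = 0.0;
        for (double lp : tokenLogProbs) {
            sumEntropy -= lp;
        }
        double meanEntropy = sumEntropy / tokenLogProbs.length;

        // Lower entropy means the model considers the patch more natural, which is what
        // makes it useful for separating correct patches from merely plausible ones.
        System.out.printf("sum entropy = %.2f, mean entropy = %.2f%n", sumEntropy, meanEntropy);
    }
}
```
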
Read more
[Review] Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm

Link here

The paper discusses prompt engineering, mainly focusing on GPT-3, and compiles several prompt engineering approaches.

Background:

The recent rise of massive self-supervised language models such as GPT-3 has raised interest in prompt engineering. For such models, 0-shot prompts may significantly outperform few-shot prompts, so the importance of prompt engineering is again being emphasized.

Some facts:

  • 0-shot may outperform few-shot: the examples are not treated as a categorical guide to learn from; instead, the model infers from their semantic meaning what the task is.
  • GPT-3 resembles not a single human author but a superposition of authors.
Read more
[Review] Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models

Link here

The paper proposes a new approach that leverages LLMs to generate input programs for fuzzing DL libraries. More specifically, it applies LLMs (Codex & InCoder) to fuzz DL libraries (PyTorch & TensorFlow).

Background:

  • Previous work on fuzzing DL libraries mainly falls into two categories: API-level fuzzing and model-level fuzzing. Both still have limitations.
  • Model level fuzzers attempt to leverage complete DL models (which cover various sets of DL library APIs) as test inputs. But due to the input/output constraints of DL APIs, model-level mutation/generation is hard to perform, leading to a limited number of unique APIs covered.
  • API-level fuzzing focuses on finding bugs within a single API at a time. But API-level fuzzers cannot detect any bug that arises from interactions within a complex API sequence.
Read more