[Review] CryptoGuard: High Precision Detection of Cryptographic Vulnerabilities in Massive-sized Java Projects

Link here

The paper designs a new tool called CryptoGuard to detect cryptographic API misuse in massive-sized Java projects.

It uses 16 rules to identify misuses and 5 refinement methods to reduce false positives, resulting in a precision of 98.61%.

It also creates a benchmark named CryptoApi-Bench with 112 unit test cases. CryptoApi-Bench contains basic intraprocedural instances, interprocedural cases, field-sensitive cases, false-positive tests, and correct API uses.
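
For a concrete sense of what such a test case looks like (my own illustration, not code taken from CryptoApi-Bench or the paper), a basic intraprocedural misuse pairs a broken cipher with a hardcoded key, while the correct use relies on a strong algorithm and a freshly generated key:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

public class CryptoMisuseExample {

    // Misuse: hardcoded key material and a broken cipher ("DES" defaults to ECB mode),
    // the kind of basic intraprocedural case a rule-based checker flags.
    static byte[] insecureEncrypt(byte[] plaintext) throws Exception {
        SecretKeySpec key = new SecretKeySpec("12345678".getBytes(), "DES");
        Cipher cipher = Cipher.getInstance("DES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }

    // Correct use: a strong algorithm (AES-256 in GCM mode) with a freshly generated key.
    static byte[] secureEncrypt(byte[] plaintext) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }
}
```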

Introduction:

For cryptographic API misuse detection, both static and dynamic analyses have their respective pros and cons.

Static methods do not require the execution of programs. They scale up to a large number of programs, cover a wide range of security rules, and are unlikely to have false negatives.

Dynamic methods require one to trigger and detect specific misuse symptoms at runtime. They tend to produce fewer false positives than static analysis.

API misuses mainly involve the following problems:

Read more
[Review] Automatic Detection of Java Cryptographic API Misuses: Are We There Yet?

Link here

A large study of Java cryptographic API misuse.

Two main contributions are made:

  1. evaluate the effectiveness of existing cryptographic API misuse detection tools.
  2. conduct a study with developers, measuring the real-world performance of the detectors.

Introduction:

JCA (Java Cryptography Architecture), JSSE (Java Secure Socket Extension).

Java cryptographic API misuses are common and may cause extensive security problems.

13 Java types are frequently mentioned in the API-misuse patterns.
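
As one illustration of the kind of pattern these types appear in (my own example, not taken from the paper), a frequently cited JSSE misuse is an X509TrustManager that accepts every certificate, which silently disables server authentication:

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllExample {

    static SSLContext insecureContext() throws Exception {
        TrustManager trustAll = new X509TrustManager() {
            // Misuse: empty checks accept any certificate chain, enabling man-in-the-middle attacks.
            public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { trustAll }, new SecureRandom());
        return ctx;
    }
}
```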

Read more
[Review] PyRTFuzz: Detecting Bugs in Python Runtimes via Two-Level Collaborative Fuzzing

Link here

The paper proposes a new approach to Python fuzzing, called PyRTFuzz.

PyRTFuzz divides the fuzzing process into two levels:

  1. the generation-based level: generates Python applications.
  2. the mutation-based level: applies mutation-based fuzzing to test the generated Python applications (see the sketch after this list).
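
A toy sketch of how the two levels collaborate, written in Java purely for illustration (PyRTFuzz itself is not implemented this way): level 1 emits a small Python application, and level 2 repeatedly mutates its input and watches for abnormal interpreter exits.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public class TwoLevelFuzzSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Level 1 (generation-based): emit a small, valid Python application that
        // exercises one runtime-library API (here: json, reading from stdin).
        String app = "import sys, json\n"
                   + "json.loads(sys.stdin.buffer.read())\n";
        Path script = Files.createTempFile("generated_app", ".py");
        Files.writeString(script, app);

        // Level 2 (mutation-based): repeatedly mutate a seed input and feed it to the
        // generated application, watching for abnormal interpreter exits (e.g., crashes).
        byte[] seed = "{\"key\": [1, 2, 3]}".getBytes(StandardCharsets.UTF_8);
        Random rnd = new Random();
        for (int i = 0; i < 100; i++) {
            byte[] input = seed.clone();
            input[rnd.nextInt(input.length)] ^= (byte) (1 << rnd.nextInt(8)); // flip one bit
            Process p = new ProcessBuilder("python3", script.toString())
                    .redirectErrorStream(true)
                    .start();
            p.getOutputStream().write(input);
            p.getOutputStream().close();
            int exit = p.waitFor();
            // Exit code 1 is just a Python exception; anything else (e.g., a signal) is suspicious.
            if (exit != 0 && exit != 1) {
                System.out.println("Suspicious exit code " + exit + " at iteration " + i);
            }
        }
    }
}
```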

Background:

Three existing problems for Python fuzzing:

  1. testing the Python runtime requires testing both the interpreter core and the language’s runtime libraries.
  2. diverse and valid (syntactically and semantically correct) Python applications are needed.
  3. explicit type information is often unavailable in Python (a dynamically typed language), so type-aware input generation is difficult.
Read more
[Review] DynSQL: Stateful Fuzzing for Database Management Systems with Complex and Valid SQL Query Generation

Link here

The paper designs a stateful DBMS fuzzer called DynSQL. DynSQL adopts two new methods: Dynamic Query Interaction and Error Feedback.

Instead of generating an entire SQL query before executing it, Dynamic Query Interaction lets the fuzzer test the DBMS “step by step”: it dynamically determines the next statement after executing each prior statement.

In addition, Error Feedback guides seed generation toward producing more valid SQL statements and queries.
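
A rough Java/JDBC sketch of the step-by-step idea (my own illustration, not DynSQL's implementation; the H2 in-memory JDBC URL and the statement picker are hypothetical): execute one statement, observe whether the DBMS accepts it, and only then decide what to send next.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StepByStepFuzzSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL; any DBMS with a JDBC driver on the classpath would do.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:fuzz");
             Statement stmt = conn.createStatement()) {
            List<String> accepted = new ArrayList<>();
            String next = "CREATE TABLE t0 (c0 INT, c1 VARCHAR(20))";
            for (int step = 0; step < 10; step++) {
                try {
                    stmt.execute(next);             // execute one statement at a time
                    accepted.add(next);             // keep it only if the DBMS accepted it
                } catch (SQLException e) {
                    // Error feedback: the DBMS rejected the statement; drop it and try another.
                }
                next = pickNextStatement(accepted); // decide the next statement from the current state
            }
        }
    }

    // Placeholder for state-aware generation: DynSQL consults the DBMS's current state here;
    // this toy version just alternates simple statements over the table created above.
    static String pickNextStatement(List<String> acceptedSoFar) {
        return (acceptedSoFar.size() % 2 == 0)
                ? "INSERT INTO t0 VALUES (" + acceptedSoFar.size() + ", 'x')"
                : "SELECT c0 FROM t0 WHERE c0 > " + acceptedSoFar.size();
    }
}
```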

Background:

Former DBMS testing tools: SQLsmith, SQUIRREL, SQLancer.

Existing DBMS fuzzers are still limited in generating complex and valid queries to find deep bugs in DBMSs.

SQLsmith generates only one statement per query, while SQUIRREL produces over 50% invalid queries and tends to generate simple statements.

SQLancer aims to find logic bugs in DBMSs rather than general bugs.

Read more
[Review] Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

Link here

The paper explores the ability of ChatGPT (specifically ChatGPT, not LLMs in general) to find failure-inducing tests and proposes a new method called Differential Prompting to do so. It achieves a success rate of 75% on QuixBugs programs and 66.7% on Codeforces programs.

This approach may only be useful for small-scale programs (less than 100 LOC).

Background:

Failure-inducing tests are test cases that trigger bugs in a given program. Finding such tests is a major objective in software engineering, but it is challenging in practice.

Recently, applying LLMs (e.g., ChatGPT) to software engineering has become popular, but directly applying ChatGPT to this task performs poorly, because ChatGPT is insensitive to nuances (i.e., subtle differences between two similar sequences of tokens). It is therefore hard for ChatGPT to identify bugs, since a bug is essentially a nuance between a buggy program and its fixed version.
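
To make the goal concrete, here is a minimal Java sketch of the underlying differential idea (my own illustration, not the paper's pipeline; my understanding is that Differential Prompting obtains the reference version by having ChatGPT infer the program's intention and re-implement it): an input is failure-inducing when the buggy version and a reference version disagree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DifferentialTestingSketch {

    // Buggy version: an off-by-one error skips the last element (an illustrative bug).
    static int buggyMax(List<Integer> xs) {
        int max = xs.get(0);
        for (int i = 1; i < xs.size() - 1; i++) max = Math.max(max, xs.get(i));
        return max;
    }

    // Reference version: a correct re-implementation of the same intention.
    static int referenceMax(List<Integer> xs) {
        int max = xs.get(0);
        for (int x : xs) max = Math.max(max, x);
        return max;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int trial = 0; trial < 1000; trial++) {
            List<Integer> input = new ArrayList<>();
            int n = 2 + rnd.nextInt(5);
            for (int i = 0; i < n; i++) input.add(rnd.nextInt(100));
            // An input is failure-inducing when the two versions disagree on it.
            if (buggyMax(input) != referenceMax(input)) {
                System.out.println("Failure-inducing test input: " + input);
                return;
            }
        }
        System.out.println("No failure-inducing input found in 1000 trials.");
    }
}
```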

Read more
[Review] Prompting Is All You Need: Automated Android Bug Replay with Large Language Models

Link here

This paper demonstrates a new approach to replaying Android bugs. More specifically, it creates a new tool called AdbGPT that automatically converts bug reports into reproductions. In the evaluation, AdbGPT is able to reproduce 81.3% of bug reports in 253.6 seconds, outperforming the state-of-the-art baselines and the ablation variants.

Background:

Bug reports often contain the steps to reproduce (S2Rs) the bugs, which help developers replicate and rectify them, albeit with considerable engineering effort.

Read more
[Review] Examining Zero-Shot Vulnerability Repair with Large Language Models

Link here

The paper tests the performance of LLMs for program repair, the same topic as Automated Program Repair in the Era of Large Pre-trained Language Models. The difference is that this paper focuses more on the details, and its program repair setting is much more complicated.

Some conclusions were drawn:

  • LLMs can generate fixes for bugs.
  • But in real-world settings, their performance is not yet sufficient.

Background:

  • Security bugs are significant.
  • LLMs are popular and have outstanding performance.

Implementation:

RQ1: Can off-the-shelf LLMs generate safe and functional code to fix security vulnerabilities?

RQ2: Does varying the amount of context in the comments of a prompt affect the LLM’s ability to suggest fixes?

RQ3: What are the challenges when using LLMs to fix vulnerabilities in the real world?

RQ4: How reliable are LLMs at generating repairs?

Read more
[Review] Automated Program Repair in the Era of Large Pre-trained Language Models

Link here

The paper presents the first extensive evaluation of recent LLMs for fixing real-world projects. It evaluates the effectiveness of Automated Program Repair (APR) in the era of LLMs.

Several conclusions were drawn:

  • As model size increases, the number of correct and plausible patches generated also increases.
  • Successfully utilizing the code after the buggy lines is important for fixing bugs.
  • While LLMs have the capability to perform fault localization and repair in one shot, for real-world software systems it is still more cost-effective to first use traditional fault localization techniques to pinpoint the precise bug locations and then leverage LLMs for more targeted patch generation.
  • By directly applying LLMs for APR without any specific change/finetuning, we can already achieve the highest number of correct fixes compared to existing baselines.
  • Entropy computation via LLMs can help distinguish correct patches from plausible patches.
  • Sum entropy performs slightly better than mean entropy (a rough sketch of both is given after this list).
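
A minimal sketch of what these two entropy measures mean, assuming the model exposes per-token log-probabilities for a generated patch (the numbers below are made up purely for illustration):

```java
public class PatchEntropySketch {
    public static void main(String[] args) {
        // Per-token log-probabilities (natural log) that an LLM assigned to a candidate patch.
        double[] tokenLogProbs = {-0.05, -1.20, -0.30, -0.02, -2.10};

        // Entropy here is the negative log-likelihood of the generated tokens:
        // summed over the patch (sum entropy) or averaged per token (mean entropy).
        double sumEntropy = 0.0;
        for (double lp : tokenLogProbs) {
            sumEntropy -= lp;
        }
        double meanEntropy = sumEntropy / tokenLogProbs.length;

        // Lower entropy means the model considers the patch more natural, which is what
        // makes it useful for separating correct patches from merely plausible ones.
        System.out.printf("sum entropy = %.2f, mean entropy = %.2f%n", sumEntropy, meanEntropy);
    }
}
```
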
Read more
[Review] Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm

Link here

The paper discusses prompt engineering, mainly focusing on GPT-3, and compiles several prompt engineering approaches.

Background:

The recent rise of massive self-supervised language models such as GPT-3 has raised interest in prompt engineering. For such models, 0-shot prompts may significantly outperform few-shot prompts, so the importance of prompt engineering is again being emphasized.

Some facts:

  • 0-shot may outperform few-shot: the examples are not treated as a categorical guide to learn from; instead, the model infers from their semantic meaning what the task is.
  • GPT-3 resembles not a single human author but a superposition of authors.
Read more
[Review] Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models

Link here

The paper proposes a new approach that leverages LLMs to generate input programs for fuzzing DL libraries. More specifically, it applies LLMs (Codex & InCoder) to fuzz DL libraries (PyTorch & TensorFlow).

Background:

  • Previous work on fuzzing DL libraries mainly falls into two categories: API-level fuzzing and model-level fuzzing. Both still have limitations.
  • Model level fuzzers attempt to leverage complete DL models (which cover various sets of DL library APIs) as test inputs. But due to the input/output constraints of DL APIs, model-level mutation/generation is hard to perform, leading to a limited number of unique APIs covered.
  • API-level fuzzing focuses on finding bugs within a single API at a time. But API-level fuzzers cannot detect any bug that arises from interactions within a complex API sequence.
Read more