[Review] Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models
The paper proposes a new approach that leverages LLMs to generate input programs for fuzzing DL libraries. More specifically, it applies LLMs (Codex and InCoder) to fuzz two DL libraries (PyTorch and TensorFlow).
Background:
- Previous work on fuzzing DL libraries falls mainly into two categories, model-level fuzzing and API-level fuzzing, and both have limitations.
- Model-level fuzzers leverage complete DL models (which exercise various sets of DL library APIs) as test inputs. However, due to the input/output constraints of DL APIs, model-level mutation/generation is hard to perform, leading to a limited number of unique APIs covered.
- API-level fuzzing focuses on finding bugs in a single API at a time, so API-level fuzzers cannot detect bugs that arise from interactions within a complex API sequence.
Implementation:
TitanFuzz works as follows:
- Use a generative LLM (Codex) with a step-by-step input prompt to produce the initial seed programs for fuzzing.
- Adopt an evolutionary strategy to produce new test programs, using an infilling LLM (InCoder) to automatically mutate the seed programs.
- Collect the generated programs and feed them to the target DL libraries.
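The seed-generation-plus-mutation loop can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate_seeds`, `mutate`, and `fitness` are hypothetical stand-ins for the Codex seed-generation call, the InCoder mutation call, and TitanFuzz's fitness function.

```python
import random

def evolutionary_fuzz(generate_seeds, mutate, fitness, rounds=100):
    """Skeleton of an evolutionary fuzzing loop in the spirit of TitanFuzz.

    LLM-generated seeds are repeatedly mutated; higher-fitness programs
    are preferred as parents for further mutation. All three callables
    are hypothetical stand-ins for the actual LLM calls and scoring.
    """
    population = list(generate_seeds())  # initial seeds from the generative LLM
    for _ in range(rounds):
        # Tournament selection: pick the fittest of a small random sample.
        sample = random.sample(population, k=min(4, len(population)))
        parent = max(sample, key=fitness)
        child = mutate(parent)  # infilling LLM rewrites part of the program
        population.append(child)
    # Return all generated programs, best first.
    return sorted(population, key=fitness, reverse=True)
```

In the paper, the mutated programs are then executed against the target library; here the loop only produces the candidate programs.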
Test oracle: differential testing of results on CPU vs. GPU.
Because CPU and GPU results may reasonably differ slightly, a threshold is set to decide when a difference indicates a bug.
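The CPU-vs-GPU differential oracle can be sketched as below, assuming the two backends' outputs have already been collected; the tolerance value is illustrative, not the paper's actual threshold.

```python
import numpy as np

def differential_oracle(cpu_out, gpu_out, atol=1e-3):
    """Flag a potential bug when CPU and GPU results of the same API
    call diverge beyond a tolerance.

    `atol` is an illustrative threshold; the paper's value may differ.
    Returns True if the pair should be reported as a bug candidate.
    """
    cpu = np.asarray(cpu_out, dtype=np.float64)
    gpu = np.asarray(gpu_out, dtype=np.float64)
    if cpu.shape != gpu.shape:
        return True  # a shape mismatch between backends is always suspicious
    # Small numeric differences between backends are expected, so only
    # deviations beyond the tolerance are reported.
    return not np.allclose(cpu, gpu, atol=atol)
```

Crashes and exceptions raised while running the generated programs serve as an additional, simpler oracle.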
Some points:
- This is especially true for API sequences: combinations of keyword arguments across multiple API calls can trigger previously undiscovered bugs.
- It is common in DL libraries for related APIs to share the same input, and borrowing inputs from one API can help trigger bugs in its related APIs.
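The input-borrowing idea can be illustrated with a minimal sketch: the same input is fed to a family of related APIs, and any crash on that shared input becomes a bug candidate. The harness below is hypothetical; in TitanFuzz the callables would be related DL-library APIs (e.g. different pooling ops that accept the same tensor).

```python
def fuzz_relational_apis(shared_input, apis):
    """Feed one shared input to a list of related API callables,
    recording which calls raise an exception.

    Returns a dict mapping the failing API's name to the exception,
    so a crash on an input that sibling APIs accept stands out.
    """
    failures = {}
    for api in apis:
        try:
            api(shared_input)
        except Exception as exc:
            # A crash on an input accepted by related APIs is a bug candidate.
            failures[api.__name__] = repr(exc)
    return failures
```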
Evaluation:
- RQ1: How does TitanFuzz compare against existing DL library fuzzers?
- RQ2: How do the key components of TitanFuzz contribute to its effectiveness?
- RQ3: Is TitanFuzz able to detect real-world bugs?
Setup: a 64-core workstation with 256 GB RAM, running Ubuntu 20.04.5 LTS, with 4 NVIDIA RTX A6000 GPUs.
For RQ1: compare against prior work.
For RQ2: run an ablation study.
For RQ3: analyze the detected real-world PyTorch and TensorFlow bugs.
Future work:
- Apply LLMs to other areas.
- Improve the related algorithms (e.g., the evolutionary algorithm).
- Try other LLMs to achieve better performance.