[Review] Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models

Link here

The paper proposes a new approach that leverages LLMs to generate input programs for fuzzing DL libraries. More specifically, it applies two LLMs (Codex and InCoder) to fuzz two DL libraries (PyTorch and TensorFlow).

Background:

  • Previous work on fuzzing DL libraries falls mainly into two categories: API-level fuzzing and model-level fuzzing. Both still have limitations.
  • Model-level fuzzers use complete DL models (which cover various sets of DL library APIs) as test inputs. However, due to the input/output constraints of DL APIs, model-level mutation/generation is hard to perform, so only a limited number of unique APIs is covered.
  • API-level fuzzers focus on finding bugs within a single API at a time, so they cannot detect bugs that arise from interactions within a complex API sequence.

Implementation:

The proposed approach, TitanFuzz, works as follows:

  1. Use a generative LLM (Codex) with a step-by-step input prompt to produce the initial seed programs for fuzzing.
  2. Adopt an evolutionary strategy to produce new test programs, using an infilling LLM (InCoder) to automatically mutate the seed programs.
  3. Collect the generated programs and execute them against the target DL libraries (a minimal sketch of this loop is shown after the list).
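A minimal, runnable sketch of this generate-mutate-execute loop. The `llm_complete` stub, the prompt format, and the random selection are my own simplifications: the paper calls Codex for generation and InCoder for infilling, and ranks mutants with a fitness function.

```python
import random

# Hypothetical stand-in for the LLM calls (Codex/InCoder in the paper).
# It fills any mask with a canned line and appends code, so the sketch
# runs without an actual model behind it.
def llm_complete(prompt: str) -> str:
    filled = prompt.replace("<MASK>", "x = torch.rand(3, 3)")
    return filled + "\ny = torch.nn.functional.relu(x)"

def generate_seed(api: str) -> str:
    # Step 1: a step-by-step prompt naming the target API
    # (format is illustrative, not the paper's exact prompt).
    prompt = f'import torch\n"""Example usage of {api}."""'
    return llm_complete(prompt)

def mutate(program: str) -> str:
    # Step 2: mask a random line and let the infilling model complete it.
    lines = program.splitlines()
    lines[random.randrange(len(lines))] = "<MASK>"
    return llm_complete("\n".join(lines))

def fuzz(api: str, rounds: int = 20) -> list[str]:
    # Step 3: evolve a pool of programs; keep only syntactically valid
    # candidates (the paper also scores candidates with a fitness
    # function; plain random selection is used here for brevity).
    pool = [generate_seed(api)]
    for _ in range(rounds):
        candidate = mutate(random.choice(pool))
        try:
            compile(candidate, "<fuzz>", "exec")
            pool.append(candidate)
        except SyntaxError:
            pass
    return pool

if __name__ == "__main__":
    for program in fuzz("torch.nn.functional.relu")[:3]:
        print(program, end="\n---\n")
```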

Test oracle: differential testing on the results computed on CPU vs. GPU.

Because CPU and GPU results can reasonably differ by small amounts (e.g., floating-point rounding), a threshold is set, and only differences that exceed it are flagged as potential bugs.
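A minimal sketch of such a CPU-vs-GPU oracle for PyTorch. The tolerance values are illustrative, not the paper's thresholds, and a CUDA-capable GPU is assumed to be available:

```python
import torch

def cpu_gpu_oracle(fn, *args, rtol=1e-3, atol=1e-4):
    """Run fn on CPU and GPU with the same inputs and flag outputs that
    diverge beyond the tolerance (small differences are expected)."""
    cpu_out = fn(*args)
    gpu_out = fn(*(a.cuda() for a in args)).cpu()
    if not torch.allclose(cpu_out, gpu_out, rtol=rtol, atol=atol):
        print(f"potential bug: CPU/GPU divergence in {fn.__name__}")

# Example: an elementwise op should agree across devices within tolerance.
x = torch.rand(4, 4)
cpu_gpu_oracle(torch.log1p, x)
```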

Key insights:

  • LLMs can generate valid, diverse programs; this is especially valuable for API sequences, as the combination of keywords across multiple API calls can lead to previously undiscovered bugs.
  • It is common in DL libraries for related APIs to share the same kind of input, so borrowing inputs from one API can help trigger bugs in its relational APIs (see the snippet below).
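As an illustration of the second point (the pooling APIs below are my own example, not cases from the paper), a single 4-D tensor can be reused across a whole family of related PyTorch APIs:

```python
import torch
import torch.nn.functional as F

# One shared input exercises several related pooling APIs.
x = torch.rand(1, 3, 8, 8)  # (batch, channels, height, width)

for pool in (F.max_pool2d, F.avg_pool2d, F.lp_pool2d):
    try:
        if pool is F.lp_pool2d:
            out = pool(x, norm_type=2, kernel_size=2)
        else:
            out = pool(x, kernel_size=2)
        print(pool.__name__, tuple(out.shape))
    except Exception as e:  # a crash here is itself a fuzzing finding
        print(pool.__name__, "raised", e)
```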

Evaluation:

  • RQ1: How does TitanFuzz compare against existing DL library fuzzers?
  • RQ2: How do the key components of TitanFuzz contribute to its effectiveness?
  • RQ3: Is TitanFuzz able to detect real-world bugs?

Setup: a 64-core workstation with 256 GB RAM and 4 NVIDIA RTX A6000 GPUs, running Ubuntu 20.04.5 LTS.

For RQ1: TitanFuzz is compared against prior API-level and model-level DL library fuzzers.

For RQ2: An ablation study of TitanFuzz's key components is conducted.

For RQ3: The detected real-world PyTorch and TensorFlow bugs are analyzed.

Future work:

  • Apply LLMs to other areas.
  • Improve the related algorithms (e.g., the evolutionary algorithm).
  • Explore other LLMs to achieve better performance.
