Posted 2023-11-21 | Updated 2023-11-22 | Research

[Review] Examining Zero-Shot Vulnerability Repair with Large Language Models

The paper evaluates how well LLMs perform at program repair, specifically the repair of security vulnerabilities. It covers the same topic as Automated Program Repair in the Era of Large Pre-trained Language Models. The difference is that this paper goes into more detail, and its program repair setting is considerably more complicated.

Some conclusions were drawn:

LLMs can generate fixes for bugs.
But in real-world settings, their performance is not yet good enough.

Background:

Security bugs are significant.
LLMs are popular and have outstanding performance.

Implementation:

RQ1: Can off-the-shelf LLMs generate safe and functional code to fix security vulnerabilities?

RQ2: Does varying the amount of context in the comments of a prompt affect the LLM’s ability to suggest fixes?

RQ3: What are the challenges when using LLMs to fix vulnerabilities in the real world?

RQ4: How reliable are LLMs at generating repairs?
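RQ2 is easier to picture with a concrete prompt. The sketch below is only an illustration, not the paper's actual templates: the function name `build_repair_prompt`, the three context levels, and the `// BUG:` / `// FIXED:` comment wording are all assumptions made here. It shows how the same vulnerable snippet can be handed to a completion model with more or less context in the comments.

```python
def build_repair_prompt(prefix: str, vulnerable: str, level: str, bug_name: str = "") -> str:
    """Assemble a completion prompt with a chosen amount of comment context.

    Hypothetical levels (assumptions for illustration):
      "none"      - only the code before the bug, no hint at all
      "commented" - the vulnerable code is shown, commented out, with a generic hint
      "named"     - same as "commented", but the hint also names the weakness
    """
    if level == "none":
        return prefix
    # Comment out the vulnerable lines so the model can see what to avoid.
    commented = "\n".join("// " + line for line in vulnerable.splitlines())
    if level == "commented":
        return f"{prefix}// BUG:\n{commented}\n// FIXED:\n"
    if level == "named":
        return f"{prefix}// BUG: {bug_name}\n{commented}\n// FIXED:\n"
    raise ValueError(f"unknown context level: {level}")


if __name__ == "__main__":
    prefix = "void read_name(char *buf) {\n"
    print(build_repair_prompt(prefix, "gets(buf);", "named", "out-of-bounds write"))
```

The model is then asked to continue from the end of the prompt, and its completion replaces the vulnerable lines.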

apply different LLMs: code-cushman-001, code-davinci-001, code-davinci-002, j1-large, j1-jumbo, gpt2-csrc (self-trained), polycoder.
synthetic experimentation
- synthesize buggy programs.
  - manually write the beginning of the program
  - apply LLMs to complete the rest of the program
  - each generated program is then checked: it may or may not be valid, compilable, vulnerable, functional, or safe.
- test the influence of different sampling parameters (temperature and top_p).
- apply LLMs to repair the generated programs that turned out to be vulnerable.
- evaluate the performance.
- a more specific prompt does not achieve better performance every time, but on average the more specific prompts perform better.
- The OpenAI Codex models consistently outperform the other models at generating successful patches (which suggests Codex may be quite a good tool for program generation).
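The synthetic experiment above can be sketched as a small grid search. This is a minimal sketch under stated assumptions, not the paper's harness: `generate` and `check` are hypothetical callables standing in for the model call and for the compile/test/security checks, and counting a repair as successful only when the program compiles, is functional, and is safe is the reading assumed here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Outcome:
    compiles: bool    # the candidate program builds
    functional: bool  # it still passes the functional tests
    safe: bool        # the security check no longer flags it

    @property
    def repaired(self) -> bool:
        # Assumption: a repair counts only if all three properties hold.
        return self.compiles and self.functional and self.safe


def sweep(
    generate: Callable[[float, float], str],  # hypothetical: (temperature, top_p) -> program text
    check: Callable[[str], Outcome],          # hypothetical: program text -> Outcome
    temperatures: Iterable[float],
    top_ps: Iterable[float],
    samples: int = 3,
) -> Dict[Tuple[float, float], float]:
    """Return the fraction of successful repairs for each (temperature, top_p) pair."""
    rates = {}
    for t, p in product(list(temperatures), list(top_ps)):
        outcomes = [check(generate(t, p)) for _ in range(samples)]
        rates[(t, p)] = sum(o.repaired for o in outcomes) / samples
    return rates


if __name__ == "__main__":
    # Stub model and checker, only to show the shape of the result.
    fake_generate = lambda t, p: "safe" if t < 0.5 else "unsafe"
    fake_check = lambda code: Outcome(True, True, code == "safe")
    print(sweep(fake_generate, fake_check, [0.0, 1.0], [1.0], samples=2))
```

Keeping the model call and the checker as injected callables means the same loop can be reused for different models and for different prompt templates.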
test on repairing code in hardware description languages (e.g., Verilog)
- LLMs were less proficient at producing Verilog code than they were at C or Python.
real-world bugs
- security patches tend to be more localized, have fewer source code modifications, and tend to affect fewer functions, compared to non-security bugs. (from A Large-Scale Empirical Study of Security Patches, Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security)
- So the code relevant to a security bug tends to be clustered, meaning that focusing on the area near the bug can still be sufficient.
- A whole real-world program is too long for an LLM to take in as one prompt, so some measures are taken to shorten the input.
- The testing process is almost the same as for the synthesized buggy programs.
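Because security fixes tend to be localized, one plausible way to shorten the input is to keep only the code just before the flagged location. The sketch below illustrates that idea and is not necessarily the measure the paper takes: the function name `localized_prompt`, the line-based budget `max_lines`, and the choice to cut the prompt right before the flagged line are all assumptions.

```python
def localized_prompt(source: str, flagged_line: int, max_lines: int = 40) -> str:
    """Keep at most `max_lines` lines that precede the flagged line.

    `flagged_line` is 1-indexed. The flagged line itself is left out, so the
    model is asked to regenerate the code from that point onwards.
    """
    lines = source.splitlines()
    if not 1 <= flagged_line <= len(lines):
        raise ValueError("flagged_line is outside the source file")
    end = flagged_line - 1             # index of the flagged line (excluded)
    start = max(0, end - max_lines)    # do not run past the top of the file
    return "\n".join(lines[start:end]) + "\n"


if __name__ == "__main__":
    source = "\n".join(f"line{i}" for i in range(1, 11))
    print(localized_prompt(source, flagged_line=8, max_lines=3))
```

A real setup would more likely budget by tokens than by lines, and might start the window at the enclosing function rather than at a fixed offset.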

[Review] Examining Zero-Shot Vulnerability Repair with Large Language Models

https://gax-c.github.io/blog/2023/11/21/24_paper_review_14/

Author: Gax

Posted on: 2023-11-21

Updated on: 2023-11-22