[Review] Examining Zero-Shot Vulnerability Repair with Large Language Models
The paper evaluates the performance of LLMs on program repair, the same topic as Automated Program Repair in the Era of Large Pre-trained Language Models. Unlike that work, this paper goes deeper into the details, and its program repair setting is considerably more complicated.
Some conclusions were drawn:
- LLMs can generate fixes for bugs.
- However, for real-world settings their performance is not yet sufficient.
Background:
- Security bugs are significant.
- LLMs are popular and have shown outstanding performance.
Implementation:
RQ1: Can off-the-shelf LLMs generate safe and functional code to fix security vulnerabilities?
RQ2: Does varying the amount of context in the comments of a prompt affect the LLM’s ability to suggest fixes?
RQ3: What are the challenges when using LLMs to fix vulnerabilities in the real world?
RQ4: How reliable are LLMs at generating repairs?
The authors apply several different LLMs: code-cushman-001, code-davinci-001, code-davinci-002, j1-large, j1-jumbo, gpt2-csrc (trained by the authors themselves), and polycoder.
synthetic experimentation
- synthesize buggy programs.
- manually write the starting part of the program.
- apply LLMs to generate the whole program.
- each generated program is then classified: whether it is valid (compilable), whether it is vulnerable or safe, and whether it is functional.
- test the influence of the sampling parameters (temperature and top_p); a parameter-sweep sketch follows this list.
- apply LLMs to repair the generated but vulnerable programs.
- evaluate the performance.
- a more specific prompt does not always perform better, but on average the more specific prompts do (see the prompt-variant sketch below).
- the OpenAI Codex models consistently outperform the other models with regard to generating successful patches (suggesting Codex may be a good tool for program generation).
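To make the sweep concrete, here is a minimal sketch of the generation-and-classification loop, assuming the legacy (pre-1.0) OpenAI Completions API for the Codex models; `BUGGY_PREFIX`, `is_vulnerable()` and `passes_tests()` are hypothetical placeholders (the paper uses security analysis and functional tests for this), and the parameter grid is illustrative rather than the paper's exact one.

```python
import subprocess
import tempfile
from itertools import product

import openai  # legacy (pre-1.0) SDK: openai.Completion.create

MODELS = ["code-cushman-001", "code-davinci-001", "code-davinci-002"]
TEMPERATURES = [0.0, 0.25, 0.5, 0.75, 1.0]  # illustrative grid
TOP_PS = [1.0, 0.75, 0.5, 0.25]

BUGGY_PREFIX = "..."  # hypothetical hand-written start of a buggy C program


def complete(model: str, prefix: str, temperature: float, top_p: float) -> str:
    """Ask the model to continue the hand-written program prefix."""
    resp = openai.Completion.create(
        model=model,
        prompt=prefix,
        temperature=temperature,
        top_p=top_p,
        max_tokens=512,
    )
    return prefix + resp["choices"][0]["text"]


def compiles(c_source: str) -> bool:
    """Validity check: does gcc accept the completed program?"""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write(c_source)
    result = subprocess.run(["gcc", "-c", f.name, "-o", "/dev/null"],
                            capture_output=True)
    return result.returncode == 0


def is_vulnerable(program: str) -> bool:
    """Hypothetical stand-in for a static security check
    (e.g., a CodeQL query targeting the CWE under test)."""
    raise NotImplementedError


def passes_tests(program: str) -> bool:
    """Hypothetical stand-in for a functional test suite."""
    raise NotImplementedError


def classify(program: str) -> str:
    """Bucket a completion along the paper's axes:
    compilable, vulnerable vs. safe, functional."""
    if not compiles(program):
        return "invalid"
    if is_vulnerable(program):
        return "vulnerable"
    return "safe+functional" if passes_tests(program) else "safe"


for model, t, p in product(MODELS, TEMPERATURES, TOP_PS):
    program = complete(model, BUGGY_PREFIX, t, p)
    print(model, t, p, classify(program))
```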
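And to illustrate RQ2's context variation, below are prompt variants with increasing specificity; the comment wording is hypothetical, in the spirit of the paper's templates rather than verbatim.

```python
# A vulnerable C function used as the repair target (illustrative).
VULN_FUNC = """\
void copy_name(char *user_input) {
    char buf[32];
    strcpy(buf, user_input);
}
"""

# Prompt variants with increasing context; the wording is hypothetical.
PROMPTS = {
    # no hint: present the code and let the model continue
    "none": VULN_FUNC,
    # generic hint that the code above is buggy
    "generic": VULN_FUNC + "\n/* BUGGY above -- fixed version: */\n",
    # hint naming the vulnerability class
    "cwe": VULN_FUNC + "\n/* The code above is vulnerable to CWE-787 "
                       "(out-of-bounds write) -- fixed version: */\n",
}
```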
hardware experimentation
- synthesize buggy hardware programs.
- test repair on hardware description languages (e.g., Verilog).
- LLMs were less proficient at producing Verilog code than they were at C or Python.
real-world bugs
- security patches tend to be more localized, have fewer source code modifications, and affect fewer functions than non-security bug fixes (from A Large-Scale Empirical Study of Security Patches, CCS 2017).
- so security bugs tend to be concentrated, meaning that focusing on a nearby area of code can still be effective.
- a whole real-world program is too long for the models to digest (it exceeds the context window), so the prompt is reduced to the code around the vulnerability; a reduction sketch follows this list.
- the testing process is almost the same as for the synthesized buggy programs.
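Here is a minimal sketch of one such measure, prompt reduction around the reported vulnerable line; the window sizes and the prefix/region split are illustrative assumptions, not the paper's exact reduction scheme.

```python
def reduce_prompt(source_lines: list[str], vuln_line: int,
                  before: int = 40, after: int = 20) -> tuple[str, str]:
    """Keep only a window of code around the reported vulnerable line so
    that a large real-world file fits in the model's context window.
    Returns (prefix, region): the prefix is shown to the model verbatim,
    and the model is asked to regenerate the vulnerable region.
    Window sizes are illustrative assumptions."""
    start = max(0, vuln_line - before)
    end = min(len(source_lines), vuln_line + after)
    prefix = "".join(source_lines[start:vuln_line])
    region = "".join(source_lines[vuln_line:end])
    return prefix, region


# usage (hypothetical file and line number):
# with open("vulnerable.c") as f:
#     prefix, region = reduce_prompt(f.readlines(), vuln_line=1234)
```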