[Review] Automated Program Repair in the Era of Large Pre-trained Language Models

Link here

The paper presents the first extensive evaluation of recent LLMs for fixing bugs in real-world projects, assessing the effectiveness of Automated Program Repair (APR) in the era of LLMs.

Several conclusions were drawn:

  • As model size increases, the number of correct and plausible patches generated also increases.
  • Successfully using the code after the buggy lines (the suffix context) is important for fixing bugs.
  • While LLMs can perform fault localization and repair in one shot, for real-world software systems it is still more cost-effective to first use traditional fault localization techniques to pinpoint the precise bug locations and then leverage LLMs for more targeted patch generation.
  • Directly applying LLMs for APR, without any task-specific changes or fine-tuning, already achieves the highest number of correct fixes compared to existing baselines.
  • Entropy computed with LLMs can help distinguish correct patches from merely plausible (test-passing) ones.
  • Sum entropy performs slightly better than mean entropy. A sketch of both computations follows this list.
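
To make the entropy criterion concrete, below is a minimal sketch of computing sum and mean entropy for a candidate patch, assuming a HuggingFace causal LM. The model choice and the `patch_entropy` helper are illustrative, not the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM from the study works the same way; GPT-Neo 125M is used
# here only because it is small.
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
model.eval()

def patch_entropy(context: str, patch: str):
    """Return (sum_entropy, mean_entropy) of the patch tokens, i.e. their
    negative log-likelihood under the model, conditioned on the code
    context that precedes the patch."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    patch_ids = tok(patch, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, patch_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given all tokens before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    # Keep only the negative log-likelihoods of the patch tokens.
    neg_ll = -token_lp[-patch_ids.shape[1]:]
    return neg_ll.sum().item(), neg_ll.mean().item()
```

Lower entropy means the model finds the patch more "natural"; ranking plausible patches by entropy pushes correct ones toward the top, with sum entropy working slightly better than mean entropy.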

Background:

Current APR tools, both template-based and learning-based, have been limited by a shortage of prior repair knowledge (hand-written fix templates or historical bug-fix data). Recent developments in large pre-trained language models offer an alternative that can be applied to program repair without relying on historical bug fixes.

Automated Program Repair (APR) tools are used to generate patched code given the original code and the corresponding buggy location.
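
For context, an LLM-based APR pipeline typically wraps patch generation in a generate-and-validate loop, sketched below. The `run_tests` and `apply_patch` helpers are hypothetical stand-ins for a real harness such as Defects4J's test runner.

```python
import subprocess

def run_tests(project_dir: str) -> bool:
    """Hypothetical stand-in for the project's test harness
    (e.g. `defects4j test` for Defects4J subjects)."""
    result = subprocess.run(["make", "test"], cwd=project_dir,
                            capture_output=True)
    return result.returncode == 0

def generate_and_validate(candidate_patches, apply_patch, project_dir):
    """Keep the *plausible* patches: those that pass the test suite.
    Whether a plausible patch is *correct* still requires comparison
    against the official developer fix."""
    plausible = []
    for patch in candidate_patches:
        apply_patch(patch)           # write the patch into the source tree
        if run_tests(project_dir):
            plausible.append(patch)
    return plausible
```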

Evaluation:

RQ1: How do different types of LLMs perform for different APR settings?

RQ2: How does directly applying LLMs for APR compare against state-of-the-art APR tools?

RQ3: Can LLMs be directly used for patch ranking and correctness checking?

RQ4: Can we further improve the performance of LLMs?

  • Select LLMs with different architectures and parameter counts:
    • GPT-Neo (125M, 1.3B, and 2.7B parameters), GPT-J (6B parameters), GPT-NeoX (20B parameters)
    • Codex (12B parameters)
    • CodeT5 (220M parameters)
    • INCODER (1.3B and 6.7B parameters)
    • Codex insertion variant (accepts a suffix, enabling infilling)
  • Generation methods (see the sketch after this list):
    • Complete function generation: input the whole buggy function and regenerate a fixed version.
    • Correct code infilling: given the bug location, generate the correct replacement code from the prefix and suffix of the buggy function.
    • Single-line generation: given the bug location, generate a replacement for the single buggy line.
  • Evaluate the relationship between entropy and patch correctness.
  • Compare LLM-based repair against prior APR methods.
  • Examine correct patches that fix the same buggy segment differently from the official bug fix.
  • Apply repair templates together with LLMs to further improve performance.
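
To illustrate the three generation settings, here is a minimal sketch on a toy buggy function. The function, the `<infill>` placeholder, and the prompt shapes are assumptions for illustration; real infill tokens and prompt formats are model-specific.

```python
# Toy buggy function, split around the known buggy line.
prefix = '''def middle(x, y, z):
    m = z
    if y < z:
        if x < y:
            m = y
        elif x < z:
'''
buggy_line = "            m = y\n"   # the fault: should assign x, not y
suffix = '''    else:
        if x > y:
            m = y
        elif x > z:
            m = x
    return m
'''
buggy_function = prefix + buggy_line + suffix

# Setting 1 -- complete function generation: the model sees the whole buggy
# function (no bug location) and regenerates a fixed function from scratch.
complete_prompt = (
    "# Buggy function\n" + buggy_function +
    "\n# Fixed function\ndef middle(x, y, z):"
)

# Setting 2 -- correct code infilling: with the bug location known, an
# infilling-capable model (e.g. INCODER or the Codex insertion variant)
# fills the hole between the prefix and the suffix.
infill_prompt = prefix + "<infill>" + suffix

# Setting 3 -- single-line generation: a left-to-right model completes just
# the buggy line from the prefix; sampling stops at the end of the line.
single_line_prompt = prefix
```

The infilling setting is what lets the model use the code after the buggy lines, which the study found important for fix quality.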

Future work:

  • Apply LLMs to other related areas.
  • Combine other forms of repair information (for example, repair templates) with LLMs to achieve better performance.
  • Improve the performance of the LLMs themselves.


