[Review] How Good Are the Specs? A Study of the Bug-Finding Effectiveness of Existing Java API Specifications
The paper is an evaluation study that assesses current runtime verification (RV) technology and, in particular, the bug-finding effectiveness of existing API specifications.
Three conclusions:
- Current RV technology is mature enough to run with tolerable runtime overhead.
- Existing API specifications can find many bugs that developers are willing to fix.
- The false alarm rates are quite high because many of the specifications are ineffective.
Introduction
Specification: a way to use an API, as asserted by the developer or analyst, which encodes information about the behavior of a program when the API is used.
In RV, the execution of a software system is dynamically checked against formal specifications. The program is monitored as it runs, and any violation of a specification is captured when it occurs.
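To make the monitoring idea concrete, here is a minimal sketch in Python rather than in JavaMOP's own specification language. It checks an illustrative property, "hasNext() must return true before each call to next()", over a recorded event trace. The class name `HasNextMonitor`, the event tuple layout, and the source locations are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class HasNextMonitor:
    """Hypothetical monitor: next() is only safe right after hasNext() returned True."""
    safe: bool = False
    violations: list = field(default_factory=list)

    def on_event(self, name, location, result=None):
        if name == "hasNext":
            # Remember whether the program just confirmed another element exists.
            self.safe = bool(result)
        elif name == "next":
            if not self.safe:
                # Property violated: record where it happened.
                self.violations.append(location)
            # Each next() consumes the guarantee given by the last hasNext().
            self.safe = False


def check_trace(trace):
    """Feed every event of a trace to a fresh monitor and return the violations."""
    monitor = HasNextMonitor()
    for event in trace:
        monitor.on_event(*event)
    return monitor.violations


# Illustrative trace: (event name, source location, optional return value).
trace = [
    ("hasNext", "Foo.java:10", True),
    ("next", "Foo.java:11"),
    ("next", "Foo.java:12"),  # no hasNext() since the previous next()
]
print(check_trace(trace))  # ['Foo.java:12']
```

A real RV tool instruments the running program and dispatches events to monitors on the fly; the recorded trace here only stands in for that instrumentation.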
Experiment
Set up:
- 199 specs (182 manually written, 17 automatically mined) and 200 open-source projects (each used Maven, had at least one test, passed all tests without monitoring, and passed all tests when monitored with JavaMOP).
- Environment: Intel i7-3770K CPU @ 3.50GHz processor and 32GB of RAM running Ubuntu 14.04.4 LTS and Java 7 or 8.
- For manually written specs: written by Lee et al., selected from elsewhere.
- For automatically mined specs: Paper Search -> Paper Filtering -> Email Authors => 17 related papers.
- Test generation: Randoop.
The violations are divided into two groups: dynamic violations (DV) and static violations (SV). Because the same violation may be triggered several times during execution, multiple DVs may be grouped into a single SV.
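The grouping of DVs into SVs can be sketched as a simple deduplication. This sketch assumes that an SV is identified by the pair (spec, code location); the notes above only say that repeated violations are grouped, so the exact key is an assumption, and the spec name and locations below are made up for illustration.

```python
from collections import defaultdict


def group_into_static(dynamic_violations):
    """Map each (spec, location) key to the number of dynamic violations it covers.

    Assumption: two dynamic violations belong to the same static violation
    when they involve the same spec at the same code location.
    """
    static = defaultdict(int)
    for spec, location in dynamic_violations:
        static[(spec, location)] += 1
    return dict(static)


# Illustrative data only, not results from the paper.
dvs = [
    ("Iterator_HasNext", "Foo.java:12"),
    ("Iterator_HasNext", "Foo.java:12"),  # same site hit again, e.g. inside a loop
    ("Iterator_HasNext", "Bar.java:40"),
]
svs = group_into_static(dvs)
print(len(dvs), "DVs ->", len(svs), "SVs")  # 3 DVs -> 2 SVs
```

Counting SVs rather than DVs keeps a violation inside a hot loop from dominating the statistics.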
The violations are classified as:
- TrueBug: A potential bug to be confirmed by reporting to the developers or by checking if it was already fixed.
- FalseAlarm: The violation does not indicate a bug in the code but effectively a bug/imprecision in the spec.
- HardToInspect: The violation is hard to classify as a TrueBug or a FalseAlarm, because source code is missing or is particularly hard to reason about.
The false alarm rate (FAR) is then calculated from these classifications.
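A minimal sketch of the FAR computation follows. The notes above do not state the exact denominator, so this is an assumption: by default FAR is taken as FalseAlarm / (FalseAlarm + TrueBug), with a flag to also count HardToInspect violations in the denominator. The function name and the label counts are hypothetical and are not the paper's data.

```python
from collections import Counter


def false_alarm_rate(labels, include_hard_to_inspect=False):
    """Fraction of inspected violations that were labeled FalseAlarm.

    Assumption: HardToInspect violations are excluded from the denominator
    unless include_hard_to_inspect is True.
    """
    counts = Counter(labels)
    denominator = counts["TrueBug"] + counts["FalseAlarm"]
    if include_hard_to_inspect:
        denominator += counts["HardToInspect"]
    if denominator == 0:
        raise ValueError("no classified violations to compute a rate from")
    return counts["FalseAlarm"] / denominator


# Illustrative labels only.
labels = ["FalseAlarm"] * 8 + ["TrueBug"] * 2 + ["HardToInspect"] * 2
print(false_alarm_rate(labels))                                # 0.8
print(round(false_alarm_rate(labels, include_hard_to_inspect=True), 3))  # 0.667
```

The choice of denominator changes the reported rate noticeably, so it is worth checking the paper's definition before comparing numbers.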
Results
- The FARs are high, reaching more than 80%.
- The similar FARs across all these dimensions suggest that the FARs are mostly due to the inherent (in)effectiveness of the specs and less due to specific code-related factors.
- Violations in libraries are somewhat more likely to be false alarms, as one would expect, since libraries tend to be better tested and have fewer bugs than the project code.
- Existing specs are rather ineffective for finding bugs, because they raise too many false alarms.
Future work
- Better technologies for specification generation. Maybe try LLMs.
- Better automatic specification mining technologies.
- Automated filtering of specs and false alarms. Alternatively, try to reduce false alarms from the code side (e.g., a plugin that flags code patterns likely to trigger false alarms).