[Review] PyRTFuzz: Detecting Bugs in Python Runtimes via Two-Level Collaborative Fuzzing
The paper proposes a new approach to Python fuzzing, called PyRTFuzz.
PyRTFuzz divides the fuzzing process into two levels:
- the generation-based level: generate the Python applications.
- the mutation-based level: apply mutation-based fuzzing to test the generated Python applications.
Background:
Three existing problems for Python fuzzing:
- testing the Python runtime requires testing both the interpreter core and the language’s runtime libraries.
- diverse and valid (syntactically and semantically correct) Python applications are needed.
- Python is dynamically typed, so runtime APIs do not declare the data types of their parameters, which makes type-aware input generation difficult.
Implementation:
Runtime API Description Extraction
Goal: extract descriptions of the runtime APIs from Python’s official documentation.
Static Extraction: use the standard AST parser of Python to extract API descriptions.
Dynamic Refinement: given the untyped description of a runtime API, run the unit tests and use them to refine it into a typed API description.
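The two extraction steps above can be sketched roughly as follows. This is a minimal illustration, not PyRTFuzz's actual implementation: the `SOURCE` snippet, the `repeat` function, the helper names, and the output format of the descriptions are all assumptions made for the example. The static step uses Python's `ast` module to collect parameter names, and the dynamic step runs a few sample calls (standing in for unit tests) to record the argument types that were observed.

```python
import ast
from collections import defaultdict

# Hypothetical library source standing in for a runtime module (assumption).
SOURCE = '''
def repeat(text, times=2):
    return text * times
'''

def extract_untyped(source):
    """Static step: collect function names and parameter names from the AST."""
    descriptions = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            descriptions[node.name] = [arg.arg for arg in node.args.args]
    return descriptions

def refine_with_tests(source, untyped, test_calls):
    """Dynamic step: run sample calls and record the observed argument types."""
    namespace = {}
    exec(source, namespace)  # load the functions under test
    observed = {name: defaultdict(set) for name in untyped}
    for name, args in test_calls:
        namespace[name](*args)  # execute the "unit test" call
        for param, value in zip(untyped[name], args):
            observed[name][param].add(type(value).__name__)
    return {
        name: {param: sorted(observed[name][param]) for param in params}
        for name, params in untyped.items()
    }

untyped = extract_untyped(SOURCE)
# Hypothetical test calls; a real setup would replay the library's unit tests.
typed = refine_with_tests(
    SOURCE, untyped, [("repeat", ("ab", 3)), ("repeat", ([1], 2))]
)
print(untyped)
print(typed)
```

In this sketch a parameter may end up with several observed types, which is one plausible way to represent what the tests reveal; how the paper encodes typed descriptions is not covered in these notes.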
Level-1 Fuzzing
- generation-based fuzzing
- for a single API, generate a Python application for testing.
- repeat application generation to produce more diverse applications that exercise this API.
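As a rough illustration of level-1 generation, the sketch below builds several small Python applications around one API. It is an assumption-laden toy, not the paper's generator: the `API` dictionary format, the `TEMPLATES` list, and the `generate_app` helper are hypothetical, and `base64.b64encode` is picked only as a convenient standard-library API. Each template wraps the same call in a different code shape, and `ast.parse` is used to check that the generated source is syntactically valid.

```python
import ast
import random

# Hypothetical typed API description; the field names are assumptions.
API = {"module": "base64", "name": "b64encode", "params": [("s", "bytes")]}

# Hypothetical code templates that give the generated APPs different shapes.
TEMPLATES = [
    "    return {call}\n",
    "    try:\n        return {call}\n    except Exception:\n        return None\n",
    "    result = None\n    for _ in range(2):\n        result = {call}\n    return result\n",
]

def generate_app(api, rng):
    """Generate the source of one application that exercises a single API."""
    args = ", ".join(name for name, _ in api["params"])
    call = f"{api['module']}.{api['name']}({args})"
    body = rng.choice(TEMPLATES).format(call=call)
    source = f"import {api['module']}\n\ndef app({args}):\n{body}"
    ast.parse(source)  # raises SyntaxError if the generated code is invalid
    return source

rng = random.Random(0)
apps = [generate_app(API, rng) for _ in range(3)]
print(apps[0])
```

A fixed template list is the simplest way to get structural variety here; the paper instead drives generation from APP specifications, whose size is studied in RQ2 and RQ3.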
Level-2 Fuzzing
- given an application generated in level-1, perform mutation-based fuzzing for testing.
- mutate the input data according to its data type.
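A minimal sketch of type-guided mutation is shown below. The concrete mutation operators, the `mutate` and `fuzz` helper names, and the crash-recording format are all assumptions for illustration and are not taken from the paper. The idea it demonstrates is only that the mutator inspects the type of the seed input and produces a new value of the same type, so the generated application still receives well-formed input.

```python
import random

def mutate(value, rng):
    """Return a mutated value of the same type as the input."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return rng.choice([0, -1, value + 1, -value, 2**63])
    if isinstance(value, float):
        return rng.choice([0.0, -value, value * 1e10, float("inf")])
    if isinstance(value, (bytes, bytearray)):
        data = bytearray(value) or bytearray(b"\x00")
        # Flip one random bit in one random byte.
        data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
        return type(value)(data)
    if isinstance(value, str):
        # Insert one printable ASCII character at a random position.
        pos = rng.randrange(len(value) + 1)
        return value[:pos] + chr(rng.randrange(32, 127)) + value[pos:]
    if isinstance(value, list):
        return [mutate(item, rng) for item in value] or [0]
    return value  # unknown types are passed through unchanged

def fuzz(app, seed_input, rounds, rng):
    """Run the app on mutated inputs and record any raised exceptions."""
    crashes = []
    for _ in range(rounds):
        candidate = mutate(seed_input, rng)
        try:
            app(candidate)
        except Exception as exc:
            crashes.append((candidate, type(exc).__name__))
    return crashes

rng = random.Random(1)
# Hypothetical target: integer division fails when the mutated input is 0.
crashes = fuzz(lambda x: 100 // x, 7, rounds=50, rng=rng)
print(crashes[:3])
```

A real fuzzer would also use coverage feedback to decide which mutated inputs to keep; this sketch omits that and only shows the type-aware part.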
Evaluation:
Limitation: each Python APP generated by PyRTFuzz uses only a single API, so potential dependencies among APIs are not considered.
RQ1: How effective is PyRTFuzz on fuzzing Python runtime?
RQ2: How scalable is Python APP generation in PyRTFuzz?
RQ3: What are the factors affecting PyRTFuzz’s effectiveness?
Benchmarks: Python 3.9.15, Python 3.8.15, and Python 3.7.15.
For RQ1:
- report the code coverage achieved.
- show the bug-triggering ability.
For RQ2:
- show how the APP specification size affects the time cost of generation.
- increasing the APP specification size can generally help generate more complex Python APPs.
For RQ3:
Evaluate how the following three factors influence effectiveness:
- APP Specification Size.
- Level-2 Time Budget.
- Typed versus Untyped API Descriptions.
Two case studies present examples of the bugs triggered.
Future work:
- apply combined fuzzing (both generation-based and mutation-based fuzzing) to other areas.
- introduce a new method which can test multiple APIs together.