
Reverse Engineering Requirements from C++ Code Using Language Models

ELEC 898 Report

1    Summary

Reverse engineering requirements (in the form of user stories) from code is a critical aspect of ensuring that software development aligns with intended requirements, especially within Agile methodologies. Traditionally, this task relies heavily on manual input from stakeholders and developers, which can be time-consuming and error-prone. Recent advances in large language models (LLMs) such as LLaMA3-70B offer an opportunity to automate the generation of user stories directly from code. In this work, we investigate the effectiveness of various prompting methods, including zero-shot, one-shot, and few-shot approaches, to elicit user stories from C++ code of varying lengths. Additionally, we explore the effect of applying the Chain-of-Thought technique in the prompts. We evaluate performance using BERTScore metrics (precision, recall, and F1) to measure the alignment between generated and ground truth user stories. We aim to understand the effect of code length and prompting techniques on the quality of user stories generated by LLMs.

2    Introduction

2.1    Background

The most critical phase in Software Engineering (SE) is requirements gathering, which lays the groundwork for identifying a project’s goals and functions [1]. Accurately capturing stakeholders’ needs is vital because it directly influences the subsequent design, development, and testing phases of SE projects. A clear and coherent set of requirements aligns the objectives of a project with its business goals and, at the very least, reduces the chance that anyone involved misunderstands what is supposed to happen. Conversely, poorly defined requirements can cause the project to lose direction, leading to confusion, misaligned efforts, and ultimately project failure.

Agile software development methodologies such as Scrum, which are among the most widely used in industry today, are designed to foster flexibility, collaboration, and the incremental delivery of products [2]. Members of a Scrum team play different roles: the Product Owner, who is responsible for setting and maintaining project priorities; the Scrum Master, who facilitates the team and keeps the process running so that the team can function effectively; and the Development Team, which builds the product incrementally [3]. Scrum helps teams stay aligned, adapt to changes, and continuously improve while delivering functional software frequently. It breaks software development projects down into small increments called sprints, which usually last from two to four weeks and during which teams work to complete specific tasks. At the end of each sprint, the team holds a sprint review, led by the Product Owner, to demonstrate the sprint increment to the stakeholders. Stakeholders then provide feedback, which is used to adjust the product backlog for subsequent sprints. The team discusses successes, problems, and any incomplete tasks, which keeps progress aligned with stakeholder expectations and allows the product to evolve based on real-time feedback.

In Scrum, both the product backlog and the sprint backlog are often expressed as user stories. The origin of user stories can be traced to stakeholders, but they usually morph into something that better represents the end user [4][5]. Stakeholders, a broad group that includes product owners, business analysts, customers, and in some instances even developers, contribute to the creation of user stories by outlining the business proposition. In sum, user stories are brief, straightforward descriptions of a feature or functionality, articulated from the perspective of an end user. A common format is: “As a user, I want [some goal], so that [reason or benefit].”
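The template above can be checked mechanically. The following is a minimal sketch, not part of the original study: the regular expression and the function name `parse_user_story` are illustrative assumptions, and the pattern accepts any role rather than only "user".

```python
import re

# Hypothetical pattern for "As a <role>, I want <goal>, so that <benefit>."
# It is an illustration of the template, not a format mandated by the report.
USER_STORY_PATTERN = re.compile(
    r"^As an? (?P<role>.+?), I want (?P<goal>.+?),? so that (?P<benefit>.+?)\.?$",
    re.IGNORECASE,
)


def parse_user_story(text):
    """Return the role, goal, and benefit of a story, or None if it does not fit the template."""
    match = USER_STORY_PATTERN.match(text.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


story = "As a user, I want to sort a list of integers, so that I can read them in ascending order."
parsed = parse_user_story(story)
```

A check of this kind could be used to flag generated stories that drift away from the expected format before they are compared with reference stories.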

The challenging task of automatically generating user stories can be seen as an attempt to cut down the labor-intensive work of capturing functional requirements. The usual way of writing user stories depends heavily on input from stakeholders and users, which makes it a drawn-out process that is prone to missing crucial details. Several studies have addressed the automation of user story generation. For instance, Rahman and Zhu use the large language model GPT-4 to automatically condense requirements documentation into unit user stories, thereby seeking to optimize the process for Agile project management [6]. Other researchers, such as Peña Veitía et al., have employed natural language processing (NLP) techniques to automate the identification of user stories from software issue records [7]. This line of work seeks to reduce, with the help of language models, the amount of human involvement needed to form user stories from requirements.

Requirements and user stories differ fundamentally in three areas: format, scope, and purpose. Requirements are formal, detailed, and system-wide descriptions of a system’s functionality and constraints. They are written mostly for developers and stakeholders and appear as part of contractual and technical documents. They describe both functional and non-functional aspects of the system and focus on technical, system-wide concerns. By comparison, user stories are informal and user-centric. They are usually written in simple terms and serve agile development by steering iterative implementation and prioritizing features based on user value. User stories can be considered a type of requirement expressed in a more lightweight, user-centric way. They focus on what the user needs or wants from the system and the value or benefit of that functionality, but they are a specific, simplified subset of requirements rather than a full specification.

In our work, we focus on reverse engineering functional requirements by eliciting user stories directly from the code itself. Given the flexible nature of Scrum and Agile methodologies, where requirements often change, we want to ensure that we can trace those changes and understand how the code reflects the original goals. By generating user stories, we create a user-centric view that helps us evaluate how closely the current implementation aligns with what was initially intended. To illustrate, during a retrospective the team can generate user stories from the code and compare them with the original ones in the product backlog to check for alignment with project goals. Retrieving user stories from code also allows teams to rapidly comprehend the current functionality of a system in the absence of detailed documentation, which can be vital for effective project management. For example, newly onboarded developers can understand the project by starting directly from the code. It also aids program comprehension and computer science education by associating real software implementations with their requirements, thereby making them easier to explain.

We evaluate the ability of large language models (LLMs) to generate user stories from code of varying complexity using different prompting methods. By exploring the performance of an LLM, we aim to understand how different prompting strategies and code example lengths affect the quality of the generated user stories.

Our work complements prior work on automating user story generation at the requirement elicitation phase by focusing on the other side of the coin: reverse engineering requirements (in the form of user stories) from code. This gives us a quality assurance mechanism to evaluate how the requirements have shifted from the start of the project to the end. It also lets us see whether the initial requirements were accurate and complete or whether they changed significantly during development. By contrasting the user stories generated from the code with the originals, we can evaluate the requirement elicitation process and probe any missed or changed key functionality. This approach is beneficial for OSS projects and legacy systems with limited or outdated documentation. It enables us to monitor what is actually implemented and to verify that the code aligns with the expected functionality.

2.2    Objectives and Scope

In this project, we aim to automate the reverse engineering of requirements (in the form of user stories) from code. To this end, we answer two main research questions: (RQ1) To what extent can large language models automate the generation of requirements from code of varying complexity? (RQ2) How accurate are different prompting techniques in generating user stories from code of a given complexity?
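To make the prompting techniques compared in RQ2 concrete, the sketch below shows one way zero-shot, one-shot, few-shot, and Chain-of-Thought prompts could be assembled. The instruction wording and the function name `build_prompt` are assumptions made for illustration; they are not the prompts used in this study.

```python
def build_prompt(code, examples=(), chain_of_thought=False):
    """Assemble a prompt for user story generation.

    Passing 0, 1, or several (code, story) pairs in `examples` yields a
    zero-shot, one-shot, or few-shot prompt. All wording is hypothetical.
    """
    parts = ["Write one user story that describes what the following C++ code does."]
    if chain_of_thought:
        # Chain-of-Thought: ask for intermediate reasoning before the answer.
        parts.append("First explain step by step what the code does, then give the user story.")
    for example_code, example_story in examples:
        parts.append(f"Code:\n{example_code}\nUser story: {example_story}")
    # The target code always comes last, with the answer slot left open.
    parts.append(f"Code:\n{code}\nUser story:")
    return "\n\n".join(parts)


zero_shot = build_prompt("int main() { return 0; }")
few_shot = build_prompt(
    "int main() { return 0; }",
    examples=[("/* example code A */", "story A"), ("/* example code B */", "story B")],
    chain_of_thought=True,
)
```

Keeping the instruction and the answer slot identical across variants means that only the number of examples and the reasoning instruction differ between conditions.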

This work contributes to the current body of knowledge by exploring the feasibility of automating the reverse engineering of requirements from code using large language models (LLMs), focusing on the broader question of how well this process can capture functional requirements.   It has the potential to impact areas such as program comprehension, Agile development, and computer science education.

We choose C++ as the testing language because it is the third most popular programming language according to the TIOBE Index. It also has a large, active community on platforms like Stack Overflow and GitHub, where developers contribute to numerous open-source projects and collaborate on development issues [8][9].

We use LLMs to reverse engineer user stories from C++ code because LLMs have impressive abilities in understanding and generating natural language text. This makes them well suited to transforming technical programming code into human-readable requirements. Research shows that LLMs not only generate human-like text but also simulate understanding through pattern recognition, making them capable of converting technical logic into natural language descriptions and vice versa [10][11].

2.3    Literature Review

In 2002, Rees proposed a software tool called DotStories as an alternative to the traditional pen-and-card approach to writing user stories [12]. Today’s popular agile project management tools, such as Jira and Trello, are built around the skeuomorphic index card metaphor introduced by DotStories. Since 2015, there has been a resurgence of academic interest in user stories, which has sparked a number of research activities. For example, Trkman et al. proposed an approach to link user stories to business process modeling activities [13], and they found that the execution sequence and integration dependencies of user stories can be better understood when business process models are available. Landhäußer et al. [14] present a tool that employs natural language processing to build user stories and links them to API components to enable functional testing. These links were found to be effective in recommending reusable test steps for new user stories.

A comprehensive systematic literature review conducted by Carlos Alberto dos Santos et al. [15] has made an important contribution to the field of automated user story generation. It identifies key techniques and challenges and uncovers a significant coverage issue: public user story datasets, a prerequisite for training and testing any automated generation system aimed at serving the agile community, are nearly nonexistent. The study points out the necessity of aligning tools for the automatic generation of software artifacts with agile development practices. As future work, the authors suggest creating more usable datasets, which can enhance reproducibility and validation in this domain.

The applicability of large language models to software engineering tasks, such as program repair and test generation, was investigated by Shin et al. [16]. They find that structured inputs and few-shot prompting are key to using LLMs accurately and successfully. Their main thesis appears to be that LLMs, if used correctly, can produce bug fixes as accurate as those of any contemporary tool, and they support it with a series of demonstrations. They also look ahead, arguing both that LLMs are applicable to high-level engineering tasks and that the models produced accurate low-level fixes relative to the contemporary tools used in their study.

An approach developed by Ozkaya et al. [17] utilizes large language models to improve software development processes. The researchers applied prompt engineering and fine-tuning techniques to use the models for generating code, localizing bugs, and generating test cases. They acknowledge that the tasks they are interested in are not necessarily new, but assert that the use of LLMs for these tasks provides a “fresh perspective” on longstanding problems. They found that while fine-tuning a model on a specific dataset does yield decent performance, advanced prompting techniques, especially those using in-context learning, can yield comparably good results. The authors additionally examine the influence that the choice and sequence of examples have on model performance. They show that subtle differences in prompts can make a significant, positive difference in the ability of LLMs to solve SE problems.

3    Methodology

3.1    Model Selection: LLaMA3-70B

LLaMA3-70B is an open-source model from the newest generation of large language models from Meta AI and contains 70 billion parameters. The model is based on the familiar Transformer architecture that underpins many of today’s largest language models. This architecture allows the model to capture the intricate relationships and patterns found in input sequences. As a result, LLaMA3-70B can generate fluent, high-quality text and perform many NLP tasks efficiently [18].

A key advantage of LLaMA3-70B lies in its ability to produce human-like text, which makes it well suited to content creation, automated writing, and code generation with little setup. A certain amount of user oversight is always necessary, but the model’s large parameter count allows it to capture the subtle details that distinguish precise, correct phrasing from muddled or flat phrasing in its outputs. Reviewers in fields like law, technical documentation, and programming can therefore work with a human-like writing assistant while better ensuring the accuracy and reliability of their outputs.

LLaMA3-70B performs strongly and offers an advantage in context length, which makes it suitable for user story extraction in the diverse scenarios designed in this project [19].

3.2    Dataset

3.2.1    Data Collection

Data are downloaded from IBM’s Project CodeNet, a collection of large datasets with approximately 14 million code samples. Specifically, we use Project CodeNet C++1400 [20], which includes approximately 500,000 C++ code files.

3.2.2    Code Categorization

Table 1 shows the code categories used in the experiments. Using a filter script based on code length, we randomly selected 50 code files for each category from the Project CodeNet C++1400 dataset. We limit the total to 450 (9 × 50) code files because of the shortage of available reference user story datasets, which means the dataset must be built manually. The purpose is to test the accuracy and relevance of the user stories generated from code of different complexities. We use lines of code (LOC) as a proxy for complexity, because LOC is a metric often employed in software engineering to gauge the complexity of a program. A statistical study published in 2014 showed that LOC is a significant indicator in software projects, which supports its use as a reliable proxy for code complexity [21].

Code Length (lines)    Number of Code Files
1-10                   50
11-20                  50
21-30                  50
31-40                  50
41-50                  50
51-60                  50
61-70                  50
71-80                  50
81-90                  50

Table 1: Code Categories Used for Generating User Stories
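The categorization in Table 1 can be sketched as follows. This is a minimal illustration rather than the filter script used in the study: the function names are hypothetical, and it assumes that code length is the total number of lines in a file, since the report does not state how blank lines or comments are counted.

```python
import random

BUCKET_SIZE = 10       # each category in Table 1 spans 10 lines of code
MAX_LOC = 90           # the longest category in Table 1 is 81-90
FILES_PER_BUCKET = 50  # files drawn per category


def loc_bucket(line_count):
    """Map a line count to its category label, e.g. 37 -> '31-40'; None if out of range."""
    if line_count < 1 or line_count > MAX_LOC:
        return None
    low = ((line_count - 1) // BUCKET_SIZE) * BUCKET_SIZE + 1
    return f"{low}-{low + BUCKET_SIZE - 1}"


def sample_by_bucket(line_counts, per_bucket=FILES_PER_BUCKET, seed=0):
    """Group files by category and randomly draw up to `per_bucket` files from each.

    `line_counts` maps a file path to its number of lines (assumed precomputed).
    """
    buckets = {}
    for path, count in sorted(line_counts.items()):
        label = loc_bucket(count)
        if label is not None:
            buckets.setdefault(label, []).append(path)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        label: rng.sample(paths, min(per_bucket, len(paths)))
        for label, paths in buckets.items()
    }


# Toy input: 60 short files and one file of 15 lines.
toy_counts = {f"file_{i}.cpp": 5 for i in range(60)}
toy_counts["longer.cpp"] = 15
selected = sample_by_bucket(toy_counts)
```

Fixing the random seed is a design choice that lets the same 450 files be recovered later, for example when additional prompting methods are evaluated on the same sample.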

3.2.3    Manually Build Dataset for Ground Truth User Stories

Initially, 20% of the C++ files were randomly selected, and user stories were manually created from their code. Inter-Rater Reliability (IRR) [22] was assessed using Cohen’s Kappa coefficient [23], which was calculated for each pair of generated and benchmark stories. Since 95% of the pairs showed a Cohen’s Kappa coefficient higher than 0.8 [24], indicating almost perfect agreement between the two raters, it was concluded that there would likely be similar agreement on the manually created stories for the remaining 80% of the C++ files. Consequently, 450 user stories were manually created from the code and used as benchmark data. Figure 3.1 shows part of the benchmark data. In this CSV file, the first column is the code file path, and the second column is its corresponding reference story.

Figure 3.1: Reference User Story CSV File
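For reference, Cohen’s Kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below is a generic implementation for categorical labels; the report does not state how the raters’ judgments were encoded, so the label values shown here are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: does each story capture the code's functionality?
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "yes", "no", "yes"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy example the raters agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, giving a kappa of 0.5, well below the 0.8 threshold cited above.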