Software engineering is a complex and dynamic field that requires constant evaluation and improvement. One way to assess the effectiveness and efficiency of software development methods and practices is to conduct empirical studies that collect and analyze data from real-world projects.
However, designing and conducting such studies is not a trivial task. It requires careful planning, execution, and validation of the data collection and analysis process.
In this blog post, we will introduce a goal-oriented methodology for conducting data-driven software engineering studies. This methodology is based on the work of Basili and Rombach, who proposed a framework for software engineering experimentation.
The methodology is suitable for both academic and industrial settings, and can be applied to various types of software engineering studies, such as evaluating software quality, productivity, maintainability, usability, etc.
The methodology is structured as follows:
- Establish the goals of the data collection.
- Develop a list of questions of interest.
- Establish data categories.
- Design and test data collection forms.
- Collect and validate data.
- Analyze the data.
We will explain each step in detail and provide some examples to illustrate the methodology.
Step 1: Establish the goals of the data collection
The first step of the methodology is to define the goals of the data collection. These goals should be derived from the objectives of the software engineering study, and should be specific, measurable, achievable, relevant, and time-bound (SMART). The goals should also be related to the claims or hypotheses that are being tested by the study.
For example, suppose we want to conduct a study to compare two software development methodologies: agile and waterfall. One of our goals could be to evaluate the impact of the methodologies on software quality.
A possible claim or hypothesis could be that agile methodology leads to higher software quality than waterfall methodology. Therefore, our data collection goal could be stated as follows:
- To measure and compare the software quality of projects developed using agile and waterfall methodologies.
Step 2: Develop a list of questions of interest
The next step of the methodology is to develop a list of questions of interest that are related to the goals of the data collection. These questions should be formulated in a way that can be answered by the data that will be collected. The questions should also be relevant, clear, and concise.
For example, based on our data collection goal of measuring and comparing software quality, we could develop the following questions of interest:
- What are the defect rates of projects developed using agile and waterfall methodologies?
- What are the defect types and severities of projects developed using agile and waterfall methodologies?
- How much rework is required for projects developed using agile and waterfall methodologies?
- How satisfied are the customers and users with the software products developed using agile and waterfall methodologies?
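To keep the goal and its questions traceable through the later steps, it can help to write them down in a simple structured form. The following Python snippet is a minimal sketch of one way to do that; the structure and field names are our own illustration, not something prescribed by the methodology.

```python
# Illustrative only: record the data collection goal and its questions of interest
# so each later data category can be traced back to the question it answers.
study = {
    "goal": ("Measure and compare the software quality of projects "
             "developed using agile and waterfall methodologies."),
    "questions": [
        "What are the defect rates of agile vs. waterfall projects?",
        "What are the defect types and severities of agile vs. waterfall projects?",
        "How much rework is required for agile vs. waterfall projects?",
        "How satisfied are customers and users with the delivered products?",
    ],
}

for number, question in enumerate(study["questions"], start=1):
    print(f"Q{number}: {question}")
```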
Step 3: Establish data categories
The third step of the methodology is to establish data categories that correspond to the questions of interest. Data categories are the attributes or variables that will be measured or observed in the data collection process.
Each question of interest should induce one or more data categories. The data categories should be defined in terms of their name, description, type, scale, and possible values.
For example, for the question of interest “What are the defect rates of projects developed using agile and waterfall methodologies?”, we could define the following data categories:
- Project ID: A unique identifier for each project. Type: Categorical. Scale: Nominal. Possible values: Any alphanumeric string.
- Methodology: The software development methodology used for the project. Type: Categorical. Scale: Nominal. Possible values: Agile or Waterfall.
- Defects: The number of defects found in the project. Type: Numerical. Scale: Ratio. Possible values: Any non-negative integer.
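Note that to answer the first question as a true defect rate, a size measure (for example, lines of code) would also have to be collected so that defect counts can be normalized; for brevity, the examples that follow keep to the three categories above. As a further illustration, the categories can be written down as a small machine-checkable schema. This is only a sketch; the class and field names are assumptions made for this post, not part of the methodology.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch: each data category from the list above, with a simple
# validity check attached so collected values can be verified later.
@dataclass
class DataCategory:
    name: str
    description: str
    dtype: str                      # "categorical" or "numerical"
    scale: str                      # nominal, ordinal, interval, or ratio
    is_valid: Callable[[str], bool]

categories = [
    DataCategory("project_id", "Unique identifier for each project",
                 "categorical", "nominal", lambda v: v.isalnum()),
    DataCategory("methodology", "Development methodology used for the project",
                 "categorical", "nominal", lambda v: v in {"Agile", "Waterfall"}),
    DataCategory("defects", "Number of defects found in the project",
                 "numerical", "ratio", lambda v: v.isdigit()),
]
```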
Step 4: Design and test data collection forms
The fourth step of the methodology is to design and test data collection forms that will be used to record the data for each data category. Data collection forms are the instruments or tools that facilitate the data collection process.
They can be paper-based or electronic, depending on the context and convenience of the study. The data collection forms should be designed in a way that ensures the validity, reliability, and usability of the data. They should also be tested and refined before the actual data collection begins.
For example, for the data categories defined in the previous step, we could design a data collection form as follows:
| Project ID | Methodology | Defects |
|------------|-------------|---------|
| P001       | Agile       | 12      |
| P002       | Waterfall   | 18      |
| P003       | Agile       | 9       |
| P004       | Waterfall   | 15      |
We could test the data collection form by applying it to a sample of projects and checking for any errors, inconsistencies, or ambiguities in the data.
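If the form is electronic, it can be as simple as a CSV template together with a small script that checks a sample of entries, which is one way of carrying out the testing described above. The file layout and checks below are a sketch under our own assumptions, not a prescribed format.

```python
import csv
import io

# Illustrative CSV version of the data collection form, filled in with the
# sample projects from the table above.
form_sample = """project_id,methodology,defects
P001,Agile,12
P002,Waterfall,18
P003,Agile,9
P004,Waterfall,15
"""

# "Testing" the form: read the sample back and flag malformed entries.
for row in csv.DictReader(io.StringIO(form_sample)):
    problems = []
    if not row["project_id"].strip():
        problems.append("missing project ID")
    if row["methodology"] not in {"Agile", "Waterfall"}:
        problems.append("unknown methodology")
    if not row["defects"].isdigit():
        problems.append("defect count is not a non-negative integer")
    if problems:
        print(f"{row['project_id'] or '<unknown>'}: {', '.join(problems)}")
```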
Step 5: Collect and validate data
The fifth step of the methodology is to collect and validate the data using the data collection forms. Data collection is the process of obtaining the data from the sources, such as projects, developers, customers, etc. Data validation is the process of verifying the accuracy, completeness, and consistency of the data.
Both data collection and validation should be performed concurrently with the software development process, and should involve the participation of the stakeholders, such as developers, managers, customers, etc.
For example, for the data collection form designed in the previous step, we could collect the data by asking the developers to fill out the form for each project they work on, and validate the data by cross-checking the defect counts with the bug tracking system, and interviewing the developers for clarification or confirmation.
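Part of this validation can be automated. The sketch below compares the defect counts reported on the forms with counts taken from the bug tracking system; the `bug_tracker_counts` data is a hypothetical export used only for illustration, and any discrepancies it reveals would still be followed up with the developers.

```python
# Illustrative validation step: cross-check form data against a (hypothetical)
# export from the bug tracking system.
form_data = {"P001": 12, "P002": 18, "P003": 9, "P004": 15}
bug_tracker_counts = {"P001": 12, "P002": 17, "P003": 9, "P004": 15}  # assumed export

for project_id, reported in form_data.items():
    tracked = bug_tracker_counts.get(project_id)
    if tracked is None:
        print(f"{project_id}: no matching record in the bug tracker")
    elif tracked != reported:
        # Discrepancies like this are resolved by interviewing the developers.
        print(f"{project_id}: form reports {reported} defects, tracker shows {tracked}")
```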
Step 6: Analyze the data
The final step of the methodology is to analyze the data using appropriate statistical or qualitative methods. Data analysis is the process of transforming the data into meaningful information that can answer the questions of interest.
Data analysis should be performed in a systematic and rigorous manner, and should follow the principles of validity, reliability, and replicability. The results of the data analysis should be reported and interpreted in a clear and concise way, and should support or reject the claims or hypotheses of the study.
For example, for the data collected and validated in the previous step, we could analyze the data by calculating the mean and standard deviation of the defect rates for each methodology, and performing a t-test to compare the means and test the hypothesis that agile methodology leads to lower defect rates than waterfall methodology. We could report and interpret the results as follows:
- The mean defect rate for agile projects was 10.5, with a standard deviation of 1.5. The mean defect rate for waterfall projects was 16.5, with a standard deviation of 1.5.
- The t-test showed that the difference between the means was statistically significant, with a p-value of 0.01. In other words, if there were truly no difference between the methodologies, a difference this large would be expected in only about 1% of samples, so the observed difference is unlikely to be due to random variation alone.
- Therefore, we can reject the null hypothesis that there is no difference between the defect rates of agile and waterfall projects, in favor of the alternative hypothesis that agile projects have lower defect rates. This supports the claim that agile methodology leads to higher software quality than waterfall methodology.
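For readers who want to reproduce this kind of analysis, the following Python sketch computes the descriptive statistics and runs a two-sample t-test with SciPy (assumed to be installed). It uses only the four sample projects from the form above, so its output will not match the figures quoted in the example, which assume a larger data set; treat it as a template rather than the analysis itself.

```python
from statistics import mean, stdev
from scipy import stats  # assumes SciPy is available

# Defect counts from the validated data; here, just the four sample projects.
agile = [12, 9]
waterfall = [18, 15]

print(f"Agile:     mean={mean(agile):.1f}, sd={stdev(agile):.2f}")
print(f"Waterfall: mean={mean(waterfall):.1f}, sd={stdev(waterfall):.2f}")

# One-sided two-sample t-test of the hypothesis that agile projects
# have fewer defects than waterfall projects.
t_stat, p_value = stats.ttest_ind(agile, waterfall, alternative="less")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```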
Conclusion
In this blog post, we have presented a goal-oriented methodology for conducting data-driven software engineering studies. The methodology consists of six steps:
- Establish the goals of the data collection.
- Develop a list of questions of interest.
- Establish data categories.
- Design and test data collection forms.
- Collect and validate data.
- Analyze the data.
The methodology is flexible and adaptable, and can be applied to various types of software engineering studies and contexts. It can help researchers and practitioners design and conduct rigorous, reliable studies and generate useful, actionable insights from the data.