Yury A Izrailevsky


Applying Neural Networks to PSP

Predictions and Evaluations

CS573 - Machine Learning

Professor Ellen Riloff


Department of Computer Science

University of Utah

Spring Quarter, 1997

Introduction

PSP stands for Personal Software Process, one of the most crucial elements of software engineering. PSP covers the major phases involved in the development of medium- and large-scale software systems, namely Requirements Analysis, Design, Coding, and Testing.

The requirements analysis phase involves understanding and planning out the software system's specifications, requirements, and constraints. To an extent, it also incorporates decisions on the development and execution environments, and analyzes the use cases (simulating system interaction). The design phase outlines specific implementation details, such as the objects (classes and their functionality) and how they interact with each other. The coding phase involves the actual implementation of all this functionality. The testing phase checks this implementation at both the "white-box" level (checking all the branches and paths of execution) and the "black-box" level (making sure the system does not crash when run).

Accurate estimates of how long each development phase will take, and how these times relate to one another, are extremely important when trying to produce quality software on time. In recent years, the software industry as well as academic and research groups have paid a great deal of attention to applying software engineering practices, and it has paid off. Unfortunately, PSP tracking often turns into a tedious and unintuitive task, presenting an obvious need for an automated tool to do PSP predictions and logging. Although there have been several (not very successful) attempts to create a logging tool, the actual prediction of the times is left entirely up to the programmers, which has been a major problem for people who do not have much experience making such predictions. The goal of this project was to create a neural network system, trained on actual PSP data, that would be able to predict the times required for each development phase: requirements analysis, design, coding, and testing.


Requirements, Specifications and Constraints

In order to make a PSP prediction system, I had to adopt the following assumption about the nature of the problem: it is possible to estimate the ratios among the times needed for each development phase, given the following parameters:

  1. Familiarity with the development environment
  2. Complexity of the project
  3. Level of programming background
  4. LOC (the estimated number of lines of code)

These input parameters were carefully selected, based on my experience, from many other possible ones as the factors most relevant to the time each development phase takes. It should be noted, however, that other important metrics might have improved the performance of the system had they been taken into consideration. Nonetheless, as will be shown later, the given selection proved adequate and produced reasonably accurate ratios among the development phase times.

Following this assumption, it is evident that if one of the development phase times can be estimated statically, neural networks can be applied to predict the times needed for the other three phases. Since Coding Time is the most consistent with LOC, I chose it to be evaluated statically. Thus, three separate neural networks were trained to predict the (Analysis/Coding), (Design/Coding), and (Testing/Coding) time ratios.

Since the available training data consisted entirely of student project records, the size of the projects never exceeded 2500 LOC, putting an upper limit on the range of projects the system can predict. I also decided to limit the familiarity, complexity, and level input parameters to a scale from 0 to 7.


Input Representation

Since it was not obvious how important the parameters were relative to each other, I decided to allocate an equal number of input units of the neural network to each of them. To remedy the fact that LOC is an unscaled number, I split the range of all possible LOC into eight subranges: {(<400), (400..700), (700..1000), … (1900..2200), (2200..2500)}. This division allows LOC to be represented in the same manner as the other three input parameters.
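This subrange assignment can be sketched as follows (a minimal sketch; the original conversion program is not shown here, and the treatment of values falling exactly on a boundary, such as 700, is my assumption):

```python
def loc_bucket(loc):
    """Map a LOC count to one of the eight subranges (indices 0..7).

    Bucket 0 is everything below 400; the remaining seven buckets are
    300-LOC-wide bands up to the 2500 LOC ceiling of the training data.
    Boundary values are assumed to fall into the higher bucket.
    """
    if loc < 400:
        return 0
    # (400..700) -> 1, (700..1000) -> 2, ..., (2200..2500) -> 7
    return min((loc - 400) // 300 + 1, 7)
```

A project of, say, 1050 LOC then lands in the (1000..1300) subrange, which is encoded like any other 0..7 input value.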

The next question that has to be answered is how exactly the input parameters are to be represented. Naturally, in order to keep the network manageable, one would like to limit the total number of input units. Since there are eight possible values for each of the inputs, the minimum number of binary nodes required to represent this range is three, using some kind of an encoding scheme.

In order to decide on the best encoding scheme, I constructed a simple neural network that attempts to learn ranges given encoded binary inputs. The scheme that performed best is shown in the rightmost column of Table 1 (below). Intuitively, it can be explained by the fact that the number of 1's in the encoded sequence never decreases as the number itself grows.

Table 1: Binary number encodings

Decimal   Standard binary   Gray code   Best one I found
0         000               000         000
1         001               001         001
2         010               011         010
3         011               010         100
4         100               110         011
5         101               111         101
6         110               101         110
7         111               100         111
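The rightmost column of Table 1 can in fact be reconstructed mechanically: sorting the eight 3-bit codes first by their number of 1's and then by numeric value reproduces it exactly. A small sketch (this sorting rule is my reading of the table, not something stated in the text):

```python
def best_encoding():
    """Order the eight 3-bit codes so that the popcount (number of 1s)
    never decreases as the encoded decimal value grows; ties are broken
    by numeric value.  This reproduces the rightmost column of Table 1."""
    codes = sorted(range(8), key=lambda c: (bin(c).count("1"), c))
    return [format(c, "03b") for c in codes]
```

The monotone popcount is exactly the property argued for above: as the decimal value grows, 1-bits are only ever added, never taken away.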


Output Representation/Evaluation

Outputs were also chosen to be represented as ranges of ratios on the scale [1..10], using 10 binary digits (e.g., {0 0 0 1 0 0 0 0 0 0} represents the fourth ratio range). This gives us the flexibility of assigning different ratio ranges to each of the three neural networks to be trained, based on the statistical information available for the training sets.

Another possibility was to use a similar kind of encoding scheme for the outputs as well. However, the random success probability in this case could be as high as 12.5% (0.5^3), introducing a lot of instability into the predictions. The random success rate of the chosen format is only (0.5)^10 = 0.1%, allowing for a more accurate evaluation of system performance. This format also allows for an alternative, more accurate method of output evaluation described below.

Intuitively, we would evaluate an output sequence by matching its binary terms against the ten legal possibilities (a '1' in one of the ten places, and '0's everywhere else). If an error occurs, we are most likely to receive an invalid output (all zeros, or more than one 1), since there are only 10 legal outputs out of 1024 possible ones. We could then either report the error to the user, or apply the following formula to calculate an alternative answer:

R = Sum(Zi * Ri) / Sum(Zi), for i = 1..10,

where R is the estimated ratio (PhaseTime/CodeTime, in our case), Z1..Z10 are the activation values produced by the ten output units, and R1..R10 are representative values of their respective ratio ranges.
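This evaluation procedure can be sketched in code (a minimal sketch; the 0.5 firing threshold and the use of a single representative value per range Ri are my assumptions):

```python
def decode_output(z, ranges):
    """Decode a 10-unit network output vector z.

    If exactly one unit fires (activation >= 0.5, an assumed threshold),
    return the representative value of its range directly.  Otherwise the
    output is invalid (all zeros, or more than one 1), and we fall back
    on the weighted average R = Sum(Zi*Ri) / Sum(Zi) from the text.
    """
    fired = [i for i, zi in enumerate(z) if zi >= 0.5]
    if len(fired) == 1:
        return ranges[fired[0]]
    total = sum(z)
    if total == 0:
        raise ValueError("all-zero output: no estimate possible")
    return sum(zi * ri for zi, ri in zip(z, ranges)) / total
```

For example, with range midpoints and two units both half-firing on adjacent ranges, the fallback returns the midpoint between them rather than rejecting the output outright.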

In the case of using an encoding scheme for the output, almost any combination of output unit values would be valid, and we would have no way of knowing whether an error occurred.


Neural Network Architecture

When applying the momentum term, I used the following formula, which differs slightly from the traditional one but seemed to work much better for this project:

w(t+1) = ALPHA*DELTA*z + NU*[w(t) - w(t-1)] + w(t)

Momentum was applied only to the regular (non-biased) weights; biased and non-biased weights have different dynamics, and applying the same momentum rate to both kills the learning.
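The update rule above can be sketched as follows (a minimal sketch; the default ALPHA and NU values are taken from the rates used elsewhere in this report, and delta/z stand for the backpropagated error and the unit's input):

```python
def update_weight(w_t, w_prev, delta, z, alpha=0.2, nu=0.9):
    """One weight update with the modified momentum term:

        w(t+1) = w(t) + alpha*delta*z + nu*(w(t) - w(t-1))

    i.e. the momentum is computed from the difference between the two
    most recent weight values rather than a stored delta-w term.
    """
    return w_t + alpha * delta * z + nu * (w_t - w_prev)
```

The caller keeps the previous weight value around instead of a separate momentum accumulator, which is the only bookkeeping this variant needs.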

Varying these parameters, I learned quite a bit about the architecture of backpropagation neural networks and about how each parameter affects performance. I will describe the experiments I conducted to test the performance of the network in later sections.


Training and Testing Data

Initially, I used a data set kindly provided by Laurie Williams, a graduate Computer Science student here at the University of Utah. The set contains PSP data collected from CS202 students during the Winter quarter of 1997. It came in the following format:

Table 2, Original data sample
(columns: ID, Analys, Design, Code, Review, Comp, Test, Total, LOC, Grade;
many entries are blank in the source, so the values below cannot be
reliably assigned to columns)

ID 1: 40 180 75 60 355 276 100
ID 2: 50 180 5 40 275 275 93
ID 3: 45 90 55 190 215 90
ID 4: 25 30 135 15 45 5 255 372 93
ID 5: 510 383 1001

A lot of the entries were missing. In order to process this data into a neural-network-friendly format, I wrote a program that retrieves the applicable entries for each network, calculates statistical information about each selected field, assigns ranges to the LOC and output values, and converts the data to the following Training/Testing file format:

Table 3, sample neural network input

Familiarity   Complexity   Level   LOC     Output Range
0 0 1         0 1 0        1 0 1   0 0 0   0 1 0 0 0 0 0 0 0 0
0 1 1         0 1 0        1 1 1   1 0 0   0 0 0 0 0 0 0 0 1 0
1 0 1         1 1 0        1 0 0   0 1 1   0 0 0 0 0 1 0 0 0 0
0 1 1         0 1 0        1 1 1   1 0 0   0 0 0 0 0 0 0 0 1 0
1 0 1         1 1 0        1 0 0   0 1 1   0 0 0 0 0 1 0 0 0 0
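Putting the encodings together, a single line of this Training/Testing format can be assembled as follows (a sketch; the exact spacing of the original file is my assumption, and ENCODE is the rightmost column of Table 1):

```python
# Encoding from the rightmost column of Table 1, for values 0..7.
ENCODE = ["000", "001", "010", "100", "011", "101", "110", "111"]

def training_line(familiarity, complexity, level, loc_bucket, out_bucket):
    """Assemble one line of the Training/Testing file as in Table 3:
    four 3-bit encoded input values followed by a 10-bit one-hot output,
    with every bit separated by a space (an assumed layout)."""
    bits = "".join(ENCODE[v] for v in (familiarity, complexity, level,
                                       loc_bucket))
    bits += "".join("1" if i == out_bucket else "0" for i in range(10))
    return " ".join(bits)
```

With inputs (1, 2, 5, 0) and output range 1, this reproduces the first data row of Table 3.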

After processing this data set, I got the following statistics on the data:

654 records total, representing 109 students in CS202 after completing six projects.

564 records represented completed projects (Grade > 60). However, since many entries are missing, the actual data sample for each phase differs.

Table 4, Distribution of times for phases for data set #1 (times in minutes)

Phase         Number   Min Time   Max Time   Aver Time   St Dev Time
Req. Analys   367      5          660        69.05       87.40
Design        412      5          830        114.59      119.15
Coding        484      10         1770       267.53      245.93
Testing       437      5          2580       151.08      215.43

Ratio of CodeTime/LOC ranges: (0.046 - 8.808), average: 1.080 ± 1.133 (1 st. dev)

Table 5, Distribution of phase/coding time ratios for data set #1

Phase         Min Ratio   Max Ratio   Aver Ratio   St Dev Ratio
Req. Analys   0.00917     5.00000     0.390        0.538
Design        0.01408     11.50000    0.711        1.087
Testing       0.02105     8.16667     0.844        1.061

The data is obviously very inconsistent. All the metrics have standard deviations greater than the magnitude of the average. Even for the ratios, which are supposed to be more consistent than the specific times (because projects differ in complexity), we see huge variations.

Training and Testing the network on data set #1:

The LOC parameter could be taken directly from the data. I assigned Familiarity with the environment on a per-project basis (for each of the six projects): {1, 2, 3, 4, 5, 6}; Complexity for each project, as suggested by the instructor: {2, 3, 4, 5, 7, 4}; and the Level of programming background based on Grade, representing one of eight ranges: {[<66], [66..70], [71..75] … [91..95], [>96]}.

Since the ratios between each of the development phase times and the coding time mostly fell between zero and two, I assigned the following Output Ranges: {[<0.2], [0.2..0.4], [0.4..0.6] … [1.4..1.6], [1.6..1.8], [>1.8]}.
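Under these ranges, a ratio maps to its output range index as follows (a sketch; the assignment of values falling exactly on a range boundary is my assumption):

```python
def ratio_bucket(ratio):
    """Map a phase/coding time ratio to one of the ten output ranges
    {[<0.2], [0.2..0.4], ..., [1.6..1.8], [>1.8]} used for data set #1.
    Each band is 0.2 wide; everything at or above 1.8 goes to range 9."""
    return min(int(ratio / 0.2), 9)
```

The returned index is then the position of the single '1' in the 10-bit output vector.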

After generating the network input format, I ran the data set with the following neural network characteristics: 1 hidden layer, 25 hidden units, momentum rate = 0.9, learning rate = 0.2, with biased nodes, and got the following results:

Table 6, results of training and testing on data set #1

Phase         # Epochs   Training Size   % Train   Testing Size   % Test
Req. Analys   1000       296             28.71     71             16.91
Design        1000       338             17.46     74             1.27
Test          1000       358             13.68     79             1.26

As we can see, the results are simply terrible. The fact that the network never reached even 30% on the training set shows how inconsistent the data was, so it is little surprise how low the testing percentages were. We definitely see some learning (a completely random guess would have produced about 0.1% accuracy), but the system is far from usable for making any kind of reasonable estimate.

Here are some possible reasons for such terrible performance:

  1. Students were cheating or careless with their PSP logs. There were cases where students reported writing 300 LOC in 10 minutes… You can't really use data like that.
  2. Phase times were calculated inconsistently: what some people considered coding time, others might have considered testing, and so on.
  3. Skill levels, familiarity with the environment, and the relative complexities of the projects were too different, while being assumed the same. I had no way of making valid estimates of these input parameters without knowing each student, so I had to base my assumptions on external characteristics such as the project number, and this does not seem to have worked very well.

Keeping all this in mind, I decided to train the neural networks on a different data set. Since students in CS453 (Senior Software Lab) have been tracking their PSP for almost a year now, I decided to use their data instead. I expected it to be more accurate, since I could assign familiarity, complexity, and programming level better: I was personally involved with the projects and know all the people who did them (including myself).

In the short period of time available, I was able to collect information on 72 projects done by CS453 students. Most of the data came from myself and the members of my team in that class. All records represented completed projects (75 < Grade < 100). The following are the statistics on the times for the development phases and on their ratios to the respective times spent on coding.

Table 7, Distribution of times for phases for data set #2 (times in minutes)

Phase         Min Time   Max Time   Aver Time   St Dev Time
Req. Analys   174        2104       1087.08     597.52
Design        254        3071       1571.92     865.55
Coding        412        7461       3344.10     2278.37
Testing       99         1198       612.60      338.226

Ratio of CodeTime/LOC ranges: (1.987 - 3.015), average: 2.417 ± 0.496 (1 st. dev)

Table 8, Distribution of phase/coding time ratios for data set #2

Phase         Min Ratio   Max Ratio   Aver Ratio   St Dev Ratio
Req. Analys   0.1869      0.7520      0.3695       0.0934
Design        0.2339      1.0840      0.5340       0.1306
Testing       0.0945      0.4160      0.2079       0.0506

If you compare the data in Tables 7 and 8 to those of Tables 4 and 5, you will notice a huge difference in the variance of both the actual times and the ratios between sets 1 and 2. This can be explained by the CS453 students being more careful and more experienced in tracking and logging their PSP data. Also, the second set represented a much smaller number of individuals with very similar skill levels, which further reduced the variation.

Before doing the actual training and testing, I assigned the Familiarity and Complexity parameters on a per-project basis. This was possible because I was personally involved in all these projects, so I was able to be consistent in my estimations. I assigned Level parameter values based on Grade: {[<79], [79..82], [83..85] … [95..97], [>98]}. The following are the ranges I chose for the outputs of each phase network (based on the statistics for each phase):

Analysis/Code Range: {[<0.1], [0.1..0.2], [0.2..0.3], … [0.8..0.9], [>0.9]}

Design/Code Range: {[<0.1], [0.1..0.2], [0.2..0.3], … [0.8..0.9], [>0.9]}

Test/Code Range: {[<0.05], [0.05..0.1], [0.1..0.15], … [0.40..0.45], [>0.45]}

I trained the networks for 100 epochs with both one and two layers of hidden units, 25 hidden units in each layer, momentum rate = 0.9, and a learning rate of 0.2 for one layer and 0.02 for two layers. The following are the results of the experiments:

Table 9, results of training and testing on data set #2

Phase         Hidden Layers   % Train         % Test
Req. Analys   1               98.25 (56/57)   53.33 (8/15)
Design        1               96.49 (55/57)   53.33 (8/15)
Test          1               98.25 (56/57)   60.00 (9/15)
Req. Analys   2               98.25 (56/57)   73.33 (11/15)
Design        2               96.49 (55/57)   80.00 (12/15)
Test          2               96.49 (55/57)   73.33 (11/15)

The performance improvement over data set #1 was very significant! Especially striking is the difference when going from one to two layers of hidden units. Perhaps the relationships between the phases are better described by non-linear functions that can only be approximated with two layers of hidden units.


Applications of the Project results

In fact, the performance of the networks was good enough to include them in a real application. For the final project in Senior Software Lab, our team developed a PSP logging tool that allows PSP tracking over the network (using a web browser) through a centralized PSP server. The pre-trained neural networks were included in the server to estimate the times required for each development phase and to evaluate final results.

In order to estimate the Coding Time based on the number of lines of code, I used an empirically constructed function (for LOC in [100, 2500]):

CodeTime = LOC * (1.85 + 0.0004*LOC + 0.04*Familiarity + 0.11*Complexity - 0.02*Level)

For the other development phases (Requirements Analysis, Design, and Testing), the system first finds the corresponding ratios and then multiplies them by the statically evaluated Coding Time.
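The complete estimation step can be sketched as follows (a sketch; phase_times and its ratios argument are hypothetical names, with the ratios dict standing in for the three trained networks' predictions):

```python
def code_time(loc, familiarity, complexity, level):
    """Empirical coding-time estimate for LOC in [100, 2500]:
    CodeTime = LOC*(1.85 + 0.0004*LOC + 0.04*Familiarity
                    + 0.11*Complexity - 0.02*Level)"""
    return loc * (1.85 + 0.0004 * loc
                  + 0.04 * familiarity + 0.11 * complexity - 0.02 * level)

def phase_times(loc, familiarity, complexity, level, ratios):
    """Multiply Phase/Coding ratios (here a plain dict, standing in for
    the network outputs) by the statically evaluated coding time."""
    ct = code_time(loc, familiarity, complexity, level)
    estimates = {"Coding": ct}
    for phase, r in ratios.items():
        estimates[phase] = r * ct
    return estimates
```

For instance, a 1000-LOC project with all qualitative inputs at zero would get a 2250-minute coding estimate, and each phase time would be that figure scaled by its predicted ratio.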

Overall, the program seems to give reasonable time estimates for each development phase, although it is tailored to the training data set and to academic projects of limited size. It is currently being considered for student PSP logging and estimation for next year.


Experiments With Neural Network Architecture

In order to find the optimum configuration for the neural network, I experimented with the number of hidden units in hidden layers, momentum rate, learning rate, and presence of biased nodes.

Momentum rate:

This parameter proved to be crucial for the learning speed of the network. Consider the following graph (the top curve represents a 1-layer network, the bottom one a 2-layer network):

The data were collected after training networks with 25 hidden units in each layer, a 0.1 learning rate, and biased nodes present. The X-axis shows the momentum rate, ranging from 0 to 0.9; the Y-axis shows the number of epochs needed to reach 96% accuracy. As is obvious from the graph, high momentum rates reduce learning time significantly.

Learning rate:

This parameter is also very important for the learning speed. However, I noticed that with very high learning rates (above 0.05 for two layers of hidden units, and 0.2 for one layer), the network becomes very unstable and takes much longer to converge than with lower learning rates. Sometimes it enters a permanent oscillating state and never converges.

Consider another chart (the top curve represents a 1-layer network, the bottom one a 2-layer network):

The data were collected after training networks with 25 hidden units in each layer, a 0.8 momentum rate, and biased nodes present. The X-axis shows the learning rates, specifically {0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}; the Y-axis shows the number of epochs needed to reach 96% accuracy. As Table 9 and the charts suggest, moderate learning rates reduce learning time, while rates that are too high destabilize training.

Hidden Units and Layers:

I experimented with various numbers of hidden units and layers, and they also affected the speed of training significantly. With unit counts ranging from 5 to 50, test performance peaked at 80% for 2 layers with 25 units in each layer.

Biased Nodes: The addition of biased nodes seems to give roughly a 75% improvement in the speed of learning.


Concluding Remarks:

Neural network-based PSP prediction systems can be created, but they are heavily biased towards a particular type of user who is consistent in the type of projects being logged, their complexity, and their size. However, if such a user can be identified, neural networks can perform reasonably well in predicting the development phase times.

In the future, I would like to learn more about different types of neural networks and their potential applications. In particular, I am interested in medical applications of neural nets; it would be interesting to see how well they perform compared to other machine learning methods used in medical expert systems. In general, I think this project gave me very valuable experience for more advanced research in the field of artificial intelligence.