Abstract— Early prediction of software risk is essential so that risks can be recognized, categorized, and prioritized for the success of a project. Since requirement gathering is the most important and challenging stage of the Software Development Life Cycle (SDLC), risks should be tackled at this stage and then stored to facilitate future projects. Early risk prediction improves the quality and productivity of a project by reducing time, budget, and human resources. Software requirement risks can be predicted using classification techniques of data mining. A model is proposed that takes software requirements as input through a Software Requirement Specification (SRS), classifies them against a risk dataset, and outputs a ranked risk list next to those requirements. The research comprises three main parts: a requirement risk prediction model, risk-oriented dataset formation, and dataset and classifier validation.

Keywords—Software Risk, Software Development Life Cycle (SDLC), Data Mining, Software Requirement Specification (SRS), Dataset

I. Introduction

There is always a chance of uncertain events occurring in the software development life cycle that may lead to potential loss for the software or the organization; such events are called software risks.

It is essential to identify risks as early as possible so they can be monitored and managed throughout the software development life cycle. Late detection of risk may affect the quality, budget, and time of the project [1, 2, 3], may increase its schedule and cost, and may even lead to project failure. Requirement gathering is the initial step of the SDLC, so assessing risks at this stage is most beneficial: if risks are identified and managed properly, the quality and efficiency of the software improve and the chances of project failure decrease. Numerous methods for software risk assessment at several stages of the SDLC are available so far; unfortunately, few techniques exist to assess risks at the requirements stage [2, 3]. Traditionally, risk assessment involves the three core phases listed below.

· Identify the hazards that may disrupt the time, resources, or costs of the project.

· Convert the identified risks into decision-making information, generally called risk analysis. Through risk analysis, the probability and significance of each identified risk are assessed [4].

· After organizing the risk table, the team prioritizes and ranks the risks. The team uses categorical values for probability (e.g. very high, high, low, or frequent) and/or impact (e.g. small, uncertain, serious, or disastrous); classification techniques may then help with risk ranking [1, 5].

Software project development typically encounters risks. These risks stem from different risk factors rooted in a variety of activities across the project development life cycle; if not identified properly, these factors can determine the success or failure of the project [6]. They need to be detected and mitigated through risk assessment in the initial stages of the software development life cycle in order to minimize software cost and schedule.

Previously in the literature, Purandare [6] proposed an entropy-based approach for the analysis of risk factors in software projects. Logistic regression has been performed on software development projects to predict risks [7]. AHP has been used by Fang and Marle [8] to identify risks and risk interactions of a project. Salih and Ammar [3] used machine learning techniques for software performance risk prediction.

However, no machine learning technique has yet been applied to software requirement specifications (SRS) for risk prediction. Classification techniques can be implemented using different simulation tools, such as MATLAB and the Waikato Environment for Knowledge Analysis (WEKA). WEKA is free software containing a collection of machine learning algorithms for data mining tasks; the algorithms can be applied directly to a dataset [3, 16]. A risk prediction model using classification techniques of data mining is proposed here to predict risks on the basis of the software requirement specifications (SRS) of a project.

The research has been divided into three main parts: a software requirement risk prediction model, risk-oriented dataset formation, and dataset and classifier validation. The rest of this paper is organized as follows. Section II presents the research methodology. Section III consists of the evaluation and analysis of the results. Section IV concludes the research.

II. Research Methodology

The research is divided into three main parts, as discussed above. These parts are explained in detail below.

A. Software Requirement Risk Prediction Model

In the first part of the research, the basic model of risk prediction using classification techniques is introduced. This model contains the four main components described below.

1) Risk Identification

The very first stage of the software risk prediction model is risk identification, where the risk manager or project manager identifies the requirements. Traditionally, this is performed using a checklist: the requirements from the SRS that carry a risk threat are marked for further analysis.

After the checklist is completed, the process moves to the next stage [4, 9].

2) Risk Analysis

In this stage, the marked requirements are analyzed and tested by a K-Nearest Neighbor (KNN) classifier on the basis of the risk-oriented dataset. KNN has been recognized as the most suitable classifier for a risk-related environment consisting of nominal and textual data [3, 11]. The reason for adopting the KNN classifier in the model is its superior accuracy compared to the other classifiers, as discussed in Section III.

3) Risk Prioritization

This is the output stage of the model, where the analyzed risks are prioritized: high-probability, high-impact risks move to the top of the table and low-probability, low-impact risks drop to the bottom [9].

4) Risk-Oriented Dataset

The dataset contains risk measures against requirements from several SRS documents. A risk-oriented dataset is needed to properly train the classifier.

Figure 1: Risk Prediction Model
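To make the KNN-based risk analysis stage above concrete, the nearest-neighbour rule over nominal attributes can be sketched as follows. This is a minimal illustration with a simple overlap distance and hypothetical attribute values, not the WEKA IBk implementation used in the experiments:

```python
from collections import Counter

def overlap_distance(a, b):
    """Count mismatching nominal attribute values (simple overlap metric)."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_predict(train, query, k=3):
    """Majority vote among the k training instances nearest to `query`.

    `train` is a list of (attributes, risk_label) pairs.
    """
    neighbours = sorted(train, key=lambda inst: overlap_distance(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical requirement-risk instances: (project category,
# requirement category, impact) -> risk level. Values are illustrative only.
train = [
    (("MIS", "functional", "serious"), "high"),
    (("MIS", "functional", "small"), "low"),
    (("TPS", "security", "disastrous"), "high"),
    (("TPS", "usability", "small"), "low"),
    (("ES", "security", "serious"), "high"),
]

print(knn_predict(train, ("MIS", "security", "serious")))  # -> high
```

The classifier ranks a new requirement by the labels of the k most similar requirements already scored in the dataset, which is why the technique suits the nominal/textual attribute types described above.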

B. Risk-Oriented Dataset Formation

In the second part of the research, the risk dataset was formed by applying risk attributes and measures to the requirements of open-source software projects. IT industry experts with more than five years of experience in the field filled in the measures for those risk attributes. The risk attributes collected from the literature are "Project Category", "Requirement Category", "Risk Target Category", "Probability", "Impact", "Dimension of Risk", and "Priority of Risk" [9]. Some other attributes commonly used by the IT experts in the risk assessment process were also included: "Affecting No. of Modules", "Cost of Risk", and "Fixing Duration". The attributes were assigned sets of nominal values to better support evaluation of the classifiers (KNN, Naïve Bayes, Decision Tree, Decision Table). Finally, the percentile and numeric values were normalized into the range from 0 (minimum) to 1 (maximum) for homogeneity of the data. The proposed dataset consists of 299 instances (requirements) drawn from the SRS of different types of open-source software projects: a transaction processing system, a management information system, an enterprise system, and a safety-critical system.

Figure 2: Risk-Oriented Dataset Formation
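The normalization step described above, which maps percentile and numeric attribute values into the 0–1 range, corresponds to standard min–max scaling. A small sketch, using made-up values for a hypothetical "Cost of Risk" column:

```python
def min_max_normalise(values):
    """Rescale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative "Cost of Risk" column (arbitrary units)
costs = [20, 35, 50, 80]
print(min_max_normalise(costs))  # -> [0.0, 0.25, 0.5, 1.0]
```

Scaling every numeric attribute this way keeps attributes with large raw ranges from dominating the distance computations used by KNN.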

C. Dataset and Classifier Validation

The last part of the research contains two tasks that were necessary to validate the proposed risk prediction model. These tasks are described below.

1) The classifiers (KNN, Naïve Bayes, Decision Tree, and Decision Table) were selected on the basis of the literature [3, 6]. Results were compared using mean absolute error, root mean squared error, and correct vs. incorrect class identification.

· KNN: It determines the classification of an unidentified instance on the basis of its nearest neighbours, whose classes are already known [3, 8]. It assigns the class of a given instance based not only on the single nearest neighbour in the feature space but on the categories of the k neighbours nearest to it [11].

· Naïve Bayes: It calculates a probable output based on the input.

It is generally used in text classification because of its good results in multi-class problems and its independence assumption [26]. The equation of Naïve Bayes is as follows:

P(Cj | X) = P(X | Cj) · P(Cj) / P(X)   [3, 10]

where P(Cj | X) is the probability of instance X being in class Cj, P(X | Cj) is the probability of generating instance X given class Cj, P(Cj) is the probability of occurrence of class Cj, and P(X) is the probability of instance X occurring [10, 12].

· Decision Table: A decision table based on a cause–symptom matrix is used as a probabilistic method for identifying irregular tremor. Mathematically, it is an information system of the form A = (U, A ∪ {d}), where d ∉ A is the decision attribute and the attributes a ∈ A \ {d} are conditional attributes. Decision attributes can take multiple values, but generally they are binary, for instance True or False [13, 15].

· Decision Tree: Decision trees are generally used for grouping and are stated as statistical classifiers. The algorithm creates decision trees from a set of training data. Being a supervised learning algorithm, it requires a set of training examples, each a pair of an input object and a required output [9].

2) The last task was the comparison of the risk dataset with another dataset from the tera-PROMISE repository [15], which was used by Pradnya Purandare [6] for risk factor analysis.

Figure 3: Dataset and Classifier Validation

For the validation of the dataset and classifiers, we used WEKA, a free tool developed at the University of Waikato, New Zealand.

It includes a large library of data mining tools for pre-processing, classification, clustering, and visualization [16].

III. Evaluation and Analysis

In this section, four classification techniques are evaluated on two different datasets.
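The naïve Bayes rule given in Section II reduces, in practice, to comparing the class scores P(Cj) · Π P(xi | Cj), since P(X) is the same for every class. A toy sketch with hypothetical counts and plain relative-frequency estimates (no smoothing, for brevity):

```python
def naive_bayes_predict(instances, query):
    """Pick the class maximising P(C) * prod(P(x_i | C)) from labelled nominal data.

    `instances` is a list of (attributes, label) pairs; probabilities are
    estimated by simple relative frequencies.
    """
    labels = set(lbl for _, lbl in instances)
    best, best_score = None, -1.0
    for c in labels:
        in_class = [attrs for attrs, lbl in instances if lbl == c]
        score = len(in_class) / len(instances)   # prior P(C)
        for i, value in enumerate(query):        # likelihood terms P(x_i | C)
            score *= sum(1 for attrs in in_class if attrs[i] == value) / len(in_class)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical (requirement category, impact) -> risk level instances
data = [
    (("security", "serious"), "high"),
    (("security", "small"), "high"),
    (("usability", "small"), "low"),
    (("usability", "small"), "low"),
]
print(naive_bayes_predict(data, ("security", "serious")))  # -> high
```

The per-attribute factorisation is exactly the independence assumption the technique is named for.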

The results of both scenarios are compared in order to recommend the most suitable classification technique for software requirement risk prediction. In both scenarios, the dataset was split into 60% to train the classifier, with the remaining 40% used as a supplied test set. The two scenarios are:

A. Risk Prediction on the Risk Dataset

In the first scenario, KNN, Naïve Bayes, Decision Tree, and Decision Table were evaluated on the risk dataset and the results are presented below.

1) KNN accuracy results, with 96.67% correctly classified instances, are presented in Table 1.

Table 1: KNN (IBk) Classification Accuracy, Risk Dataset
  Correctly classified instances    96.67%
  Incorrectly classified instances  3.33%
  Mean absolute error               0.0218
  Root mean squared error           0.1144
  Total number of instances         116/120

2) Naïve Bayes accuracy results, with 93.33% correctly classified instances, are presented in Table 2.

Table 2: Naïve Bayes Accuracy, Risk Dataset
  Correctly classified instances    93.33%
  Incorrectly classified instances  6.67%
  Mean absolute error               0.0767
  Root mean squared error           0.1628
  Total number of instances         112/120

3) Decision Table accuracy results, with 76.67% correctly classified instances, are presented in Table 3.

Table 3: Decision Table Accuracy, Risk Dataset
  Correctly classified instances    76.67%
  Incorrectly classified instances  23.33%
  Mean absolute error               0.2268
  Root mean squared error           0.2991
  Total number of instances         92/120

4) Decision Tree accuracy results, with 90.83% correctly classified instances, are presented in Table 4.

Table 4: Decision Tree (J48) Accuracy, Risk Dataset
  Correctly classified instances    90.83%
  Incorrectly classified instances  9.16%
  Mean absolute error               0.0458
  Root mean squared error           0.1591
  Total number of instances         109/120

B. Risk Prediction on the COCOMO Effort Dataset

In the second scenario, KNN, Naïve Bayes, Decision Tree, and Decision Table were again evaluated, this time on the Cocomosdr dataset [30, 37], and the results are presented below.

1) KNN accuracy results, with 100% correctly classified instances, are presented in Table 5.

Table 5: KNN (IBk) Classification Accuracy, Cocomosdr
  Correctly classified instances    100%
  Incorrectly classified instances  0%
  Mean absolute error               0.926
  Root mean squared error           0.1242
  Total number of instances         5/5

2) Naïve Bayes accuracy results, with 100% correctly classified instances, are presented in Table 6.

Table 6: Naïve Bayes Accuracy, Cocomosdr
  Correctly classified instances    100%
  Incorrectly classified instances  0%
  Mean absolute error               0.0008
  Root mean squared error           0.002
  Total number of instances         5/5

3) Decision Table accuracy results, with 60% correctly classified instances, are presented in Table 7.

Table 7: Decision Table Accuracy, Cocomosdr
  Correctly classified instances    60.00%
  Incorrectly classified instances  40%
  Mean absolute error               0.2259
  Root mean squared error           0.3132
  Total number of instances         3/5

4) Decision Tree accuracy results, with 80% correctly classified instances, are presented in Table 8.

Table 8: Decision Tree (J48) Accuracy, Cocomosdr
  Correctly classified instances    80%
  Incorrectly classified instances  20%
  Mean absolute error               0.0833
  Root mean squared error           0.2141
  Total number of instances         4/5

The results of correct class identification from both datasets are presented in Table 9.

Table 9: Comparison of Classifiers on Both Datasets
  Classifier       Risk Dataset    Cocomosdr [15]
  KNN              96.67%          100%
  Naïve Bayes      93.33%          100%
  Decision Table   76.67%          60%
  Decision Tree    90.83%          80%

According to these results, KNN identified 96.67% of instances correctly in the risk dataset and 100% in the Cocomosdr dataset. Naïve Bayes also achieved 100% accuracy on the Cocomosdr dataset, where the number of instances was small, but identified 93.33% of instances correctly in the risk dataset; it can therefore be considered the second-best classification technique after KNN.
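The summary statistics reported in Tables 1–9 follow WEKA's standard evaluation output. For clarity, the way such figures are derived from a hold-out test set can be sketched as follows, using hypothetical predicted class probabilities rather than the actual experiment output:

```python
import math

def summary_stats(actual, predicted_probs):
    """Compute correctly-classified %, MAE and RMSE from true labels and
    predicted probabilities of the positive ("high") class, mirroring the
    shape of a classifier evaluation summary."""
    n = len(actual)
    errors = [abs((1.0 if a == "high" else 0.0) - p)
              for a, p in zip(actual, predicted_probs)]
    correct = sum(1 for a, p in zip(actual, predicted_probs)
                  if (p >= 0.5) == (a == "high"))
    return {
        "accuracy_pct": 100.0 * correct / n,
        "mae": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }

# Hypothetical hold-out predictions (the 40% supplied test set in this setup)
actual = ["high", "high", "low", "low", "high"]
probs = [0.9, 0.8, 0.2, 0.4, 0.3]  # predicted P(class = "high")
print(summary_stats(actual, probs))
```

On these made-up predictions the sketch yields 80% correctly classified instances with a mean absolute error of 0.32, illustrating how the accuracy and error columns of the tables relate to each other.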

The Decision Tree and Decision Table classifiers show lower accuracy on both datasets. From these results, KNN proves to be the most appropriate classification technique for software requirement risk prediction.

IV. Conclusion

According to the literature, a project is more prone to failure if it does not meet user needs, budget, or schedule, and the quality of the product is reduced, since a product must be developed within budget and time to reduce effort and the chances of failure. Late detection of risk has a greater influence on project failure. A risk prediction model has been proposed, evaluated, and validated to test and compare the results of candidate classifiers: KNN, Naïve Bayes, Decision Table, and Decision Tree. The results reveal that KNN is the most suitable classifier in a software-risk environment because of its handling of textual and nominal attribute types.