https://doi.org/10.31449/inf.v47i6.3712 Informatica 47 (2023) 145–158

The Effect of Topic Modelling on Prediction of Criticality Levels of Software Vulnerabilities

Prarna Mehta 1, Shubhangi Aggarwal 2, Abhishek Tandon 1*
E-mail: pmmphilscholar@gmail.com, shubhagg1206@gmail.com, atandon@or.du.ac.in
1 Department of Operational Research, University of Delhi, Delhi, India.
2 Bhagwan Parshuram Institute of Technology, Delhi.
*Corresponding author

Keywords: software vulnerabilities, topic modelling, machine learning, supervised learning, CVSS, text mining

Received: August 30, 2021

In this day and age, software is an indispensable part of our daily endeavours, so keeping a check on exploitable vulnerabilities has become a vital function of a software firm. The motivation of this paper is to gain a better understanding of vulnerabilities and to create a tool with which industry practitioners can identify a critical vulnerability that could be detrimental to the firm's assets. In this article, 1999 vulnerabilities related to Google Chrome were analysed to understand the behaviour of vulnerabilities. The identification of trends and patterns using topic modelling techniques led to the extraction of topics. The extracted topics were then fed to 10 classifiers to predict the criticality of a vulnerability. The resulting performances were also compared with those of the classifiers without topic modelling. A 10-fold validation was conducted on the suggested prediction model.

Povzetek (translated from Slovenian): A method for determining the criticality of software vulnerabilities is built with the help of topics.

1 Introduction

Dependence on software has been intensifying by leaps and bounds in the present era; consequently, reliable software systems have become the need of the hour. The snowballing complexity needed to meet user demands and to survive in the industry often escalates the number of vulnerabilities in the software. Any software employed in a project/application is subject to some inadvertent shortcomings, in other words vulnerabilities, that might turn out to be a liability. Such exposure enables an attacker/hacker to disturb the software project/application, hampering the security of the system. A secure system, guaranteeing smooth working even under attack, is thus a highly demanded pursuit for developers as well as consumers. Nevertheless, to avoid such attacks, these vulnerabilities have to be deeply analysed by the software development team in order to fortify the system. Vulnerabilities in a software project/application allow an attacker to squander vital data as well as interfere with security. Countless episodes of losses due to vulnerability attacks have been reported, costing a company not only money but also its reputation. For instance, the Code Red worm caused a reported loss of $2.6 billion (Telang & Wattal, 2007). The National Vulnerability Database (NVD) aims at amassing statistics on software vulnerabilities and has a record of 152,780 vulnerabilities to date¹. Incidents due to vulnerabilities are reported to Computer Emergency Response Teams (CERT); around 53,117 security incidents were handled by the Indian CERT team in the year 2017, the number rose to 208,456 in 2018, and it reached 394,499 in 2019².

¹ Source: https://nvd.nist.gov/vuln/search/results?form_type=Basic&results_type=overview&search_type=all
² https://www.cert-in.org.in
Looking at the alarming rate at which vulnerability records proliferate, researchers are drawn to examine the scenario for the betterment of the industry. Given that a gigantic amount of classified data accrues on a daily basis, the risk attached to these software vulnerabilities can lead to serious repercussions if corrective measures are not taken; at the same time, the mammoth volume of textual vulnerability data accumulating each year needs to be tamed for better analysis and research in the field of software vulnerabilities (Malhotra, 2021). Moreover, this directs a software maintenance team to concentrate on the highly vulnerable parts of the software project/application while curtailing false positives as well (Stuckman et al., 2016). This brings the focus to developing an efficient algorithm that condenses the corpus as well as converging the limited resources on the highly vulnerable parts. In this paper, topic modelling, a state-of-the-art technique, is deployed to reduce the textual descriptions into meaningful clusters called topics. Three different topic modelling algorithms were considered for this study, namely Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDiA) and, lastly, Non-Negative Matrix Factorization (NMF), to assess each of their performances when combined with the prediction model. The colossal quantity of vulnerability data can be reduced by labelling the records as critical and non-critical. Predicting the criticality of vulnerabilities helps a software maintenance team drive its limited resources towards the critical ones. The vulnerability prediction model, however, has two aspects to it, namely the features of the vulnerability data and the classifier. For this study, the following classifiers were considered: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbours (KNN), Decision Tree (DT), Artificial Neural Network (ANN), Naïve Bayes (NB), Linear Support Vector Machine (LSVM), Support Vector Machine (SVM), Random Forest (RF) and, lastly, Gaussian Naïve Bayes (GNB). This study is noteworthy for the fact that it helps in mathematically modelling vulnerability text data, thereby furnishing meaningful results obtained empirically. The core objective of this study is to build a highly accurate vulnerability prediction model that categorizes vulnerability data into meaningful topics and trains state-of-the-art classifiers to render an enriched prediction model. To achieve this goal, vulnerability data on Google Chrome, mined from the National Vulnerability Database (NVD), was pre-processed to configure topics using three topic modelling techniques. The deduced topics served as the training set for the learning algorithms to predict the criticality of the vulnerabilities identified in Google Chrome. The models are validated using the k-fold validation technique and compared with prediction models that do not use the procured topics as a feature reduction scheme. The objectives of the research study are fulfilled by resolving the following research questions (RQs), which were investigated in this study:
RQ1. What is the performance of topic modelling when combined with classifiers?
RQ2. What is the performance of the classifiers without incorporating any of the topic modelling techniques?
RQ3. Which of the Machine Learning (ML) classifiers shows improvement in performance?
To our knowledge, there has been no previous work based on the integration and comparative study of topic modelling techniques and machine learning classifiers, and the dataset used to perform this study has not been employed in any previous literature. The key contributions manifested in this research article are: (1) developing vulnerability prediction models using different topic modelling techniques and machine learning classifiers; (2) examining the performance of the developed models; (3) exploring the effect of not incorporating topic modelling; and (4) adding new aspects to the literature for experts to benefit from. The paper is spread over six sections: past literature is discussed elaborately in Section 2; to overcome the research gap, a methodology is proposed in Section 3; the proposed model is illustrated and validated in Section 4; threats to internal as well as external validity are examined in Section 5; and Section 6 concludes the study.

2 Related work

There is plenty of literature on vulnerability prediction in software projects/applications using machine learning techniques as well as feature reduction tools, establishing suitable results. (Walden et al., 2014) compared the effect of software metrics with that of bag-of-words features on the vulnerability prediction model. A lot of work has also been done in other areas of vulnerability research, such as developing conventional models, optimisation models, release plans and cost models. (Kansal et al., 2016) developed a mathematical model for vulnerability detection and a cost model for patching after detection. (Zerkane, 2018) examined the effect of vulnerabilities in software defined networking using the CVSS score. (Kansal et al., 2018) made an effort to optimise the cost of after-release maintenance by combining vulnerability fixing and fault fixing into a single patch. Many mathematical optimization techniques have been used to optimally prioritise vulnerabilities. (Sharma et al., 2019) used the MCDM techniques VIKOR and TOPSIS to prioritise vulnerabilities. A novel optimization tool, VULCON, was developed by (Farris et al., 2018) to manage vulnerabilities with respect to exposure and remediation. A comparative study between the best-worst method and AHP was conducted by (Anjum, Agarwal, et al., 2020), following which (Anjum, Kapur, et al., 2020) integrated MCDM and ML techniques to develop a bi-objective optimization problem prioritising the most critical vulnerabilities. (Narang et al., 2018) incorporated the interdependency attribute of software vulnerabilities when prioritising them according to their criticality levels with the help of DEMATEL. Some researchers have examined different feature reduction schemes to enhance the performance of vulnerability prediction models. (Stuckman et al., 2016) examined the influence of dimension reduction techniques such as PCA, feature synthesis and their respective variants on foreseeing vulnerabilities located in open source applications in PHP. (Ji et al., 2018) briefly describe the different technologies implemented, along with pioneering work in the areas of automatic vulnerability detection, exploitation and patching. (Theisen & Williams, 2020) used different software metrics along with features obtained through text mining and analysed the performance of Random Forest, Decision Trees, Logistic Regression and Naive Bayes.
(Kalouptsoglou et al., 2020) developed a model using deep learning and software metrics, with promising results, taking multiple projects into consideration to obtain generalised results. A vulnerability prediction model was developed by (Filus et al., 2020) using RNN and CNN. An inter-comparative study was performed by (Wu et al., 2017) to assess deep learning techniques such as LSTM and CNN, as well as reviewing conventional machine learning techniques. (Shahriar & Haddad, 2016) implemented LSI to obtain the smaller pieces of source code causing object injection vulnerabilities in a system. (Kudjo et al., 2020) framed a model using bellwether analysis to select subsets for vulnerability prediction. (Rehurek & Sojka, 2010) discuss the importance of applying topic modelling techniques. A framework called SySeVR was developed by (Li et al., 2021) using deep learning techniques to identify semantic and syntactic characteristics in order to spot vulnerabilities in C/C++ source code. A correlation between software metrics and prevailing vulnerabilities was established by (Alves et al., 2016), determining answers to multiple research questions. A complete structure was suggested by (Kumar & Sharma, 2017) to manage vulnerabilities in an optimal manner.

3 Proposed methodology

In this section, the framework of the study is explained step by step, as depicted in figure 1. In figure 1, orange depicts phase 1, the extraction of the vulnerability dataset; green represents phase 2, in which the extracted dataset is pre-processed and prepared for further analysis; blue represents phase 3, feature mining and the training of classifiers using the tokenised data as well as the generated topics as features; and, lastly, black represents phase 4, where the performance of the prediction model is evaluated.

Figure 1: A general framework of the proposed study.

Exploring huge text data manually is a taxing and arduous job with greater chances of errors and discrepancies, whereas shrinking the data into relevant topics with the help of topic modelling can be considered a solution. Topic modelling is considered to be among the most efficient unsupervised data-mining algorithms for discovering relationships within text data (Vanamala et al., 2020). It condenses the dimensionality of the data by removing superfluous features that do not weigh in further analysis. For our analysis, we have considered three topic modelling algorithms: LSI, LDiA and NMF. The rudimentary concept behind topic modelling is to convert a large corpus into vectors with the help of term frequency or term frequency-inverse document frequency; these vectors are then divided, by a probability model or matrix factorization, into topics, each of which is an array of words or tokens. LSI is a robust topic mining technique, having a knack for noise resistance, that transforms large-dimensional vector spaces into smaller-dimensional vector spaces with the help of singular value decomposition. (Papadimitriou et al., 2000) studied the mathematics behind LSI's performance and its ability to capture the statistical properties of a corpus. LDiA takes a probabilistic approach, whereas LSI and NMF are matrix factorization paradigms; NMF in particular decomposes a high-dimensional array into non-negative, low-dimensional factors.
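As a compact sketch in standard textbook notation (not notation taken from the paper), let $X \in \mathbb{R}^{n \times m}$ be the document-term matrix of $n$ vulnerability descriptions over $m$ tokens. The two factorization-based models can then be written as

$$X \approx U_k \Sigma_k V_k^{\top} \quad \text{(LSI: truncated SVD with } k \text{ topics)}, \qquad X \approx W H, \quad W \ge 0,\ H \ge 0 \quad \text{(NMF)},$$

where the rows of $V_k^{\top}$ (respectively of $H \in \mathbb{R}^{k \times m}$) hold each topic's token weights, and the rows of $U_k \Sigma_k$ (respectively of $W \in \mathbb{R}^{n \times k}$) give each document's loading on the $k$ topics; LDiA instead treats each document as a Dirichlet-distributed mixture over $k$ latent topics.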
With non-negativity of the input being the only criterion, NMF uses term frequency-inverse document frequency (TF-IDF), whereas LDiA and LSI use bag-of-words frequencies, i.e. term frequency (TF), for feature extraction, since the LDiA paradigm reads only positive integer frequencies and not real numbers. The topics thus generated lead to ameliorated results when read as input by the machine learning classifiers. For the experiment we have selected 10 classifiers that are most commonly used to assess any suggested model, namely Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbours (KNN), Decision Tree (DT), Artificial Neural Network (ANN), Naïve Bayes (NB), Linear Support Vector Machine (LSVM), Support Vector Machine (SVM), Random Forest (RF) and, lastly, Gaussian Naïve Bayes (GNB). There are many evaluation markers for assessing machine learning tools; the traditional ones are the counts of true positives, true negatives, false negatives and false positives that form the confusion matrix. In this study, the performance of the 10 classifiers was assessed with the help of accuracy, F-measure, recall and precision. These measures are well explained by (Bulut et al., 2019) in their corresponding study and are widely used in past literature (Dam et al., 2016; Theisen & Williams, 2020).

4 Numerical analysis

4.1 Data collection

For the numerical analysis, a dataset of 1999 vulnerabilities captured in Google's product "Chrome" was collected manually from the National Vulnerability Database (NVD). The Google Chrome web application was chosen because it is an abundantly utilized web browser in the market (http://www.netmarketshare.com) for e-banking, social media and information sharing, consequently making it a highly exploitable application for accessing a user's sensitive data. Additionally, many researchers have used Google Chrome to conduct their respective experiments, for instance (Kudjo et al., 2020; Nguyen et al., 2016; Roumani et al., 2015). The data consist of the vulnerability IDs, summaries of the vulnerabilities and the CVSS severity for all the versions (more than 20 available). The CVSS is computed in two ways, CVSS 3.0 and CVSS 2.0. For our analysis, the latter score has been considered, since the score value and the severity level were available for all listed vulnerabilities. The CVSS score quantifies the criticality of a vulnerability numerically between 0 and 10. The criticality level of the stated vulnerabilities was tagged into three categories, namely High, Medium and Low.
For easy computation and binary classification, the Medium severity level was merged with the Low severity level into the non-critical vulnerabilities, whereas the High severity level was termed critical. Table 1 describes the dataset.

Table 1: Description of vulnerability dataset.

  Project: Google Chrome
  No. of vulnerabilities: 1999
  Range of years: 2011-2021
  Versions: >20
  No. of critical vulnerabilities: 510
  No. of non-critical vulnerabilities: 1489

4.2 Data pre-processing

Subsequently, the vulnerability descriptions were mined to extract useful information with the help of pre-processing methods, thereby optimising the results. Special characters, punctuation and blank spaces occupy memory as well as hampering the result of the experiment, hence removing such irrelevant information acts as a corrective measure (Vijayarani et al., 2015). Next, with the help of the Python packages Natural Language Toolkit (NLTK) and pandas, the words of more than three letters in the vulnerability description column were retrieved in lowercase, other special characters were replaced by blank spaces, the stop words were eliminated and each document was tokenized into a list of words for the further experiment. The other packages put to use were 'numpy' and 'matplotlib' for data management and visualization, whereas the 'sklearn' library provided the TF-IDF and count vectorizers for feature extraction, LDiA, LSI and NMF for topic modelling and, lastly, the machine learning classifiers used to determine the criticality of the vulnerabilities.

4.3 Topic modelling

The list of words, or tokens, obtained after pre-processing the vulnerability data is considered the feature set of this study. The description of each vulnerability is converted into its respective feature vector, which stores the frequency of each token in that particular vulnerability document. Following this, the count vectorizer and TF-IDF screen these features further by assigning importance weights. This not only resolves the tedious job of handling a large corpus but also cuts down the expense involved and the computation time. To improve the performance of the prediction models, all the vulnerability documents are iterated over to yield the unique tokens as a dictionary or bag-of-words. A subset of 100 words was considered as input for the LDiA, LSI and NMF topic models, with the number of topics set to 10. These numbers of words and topics were chosen because they have been observed to work well in past literature (Dam et al., 2016; Mounika et al., 2019; Vanamala et al., 2020). Each topic created using the topic modelling techniques is a linear combination of unique words and their respective weights. For example, Topic 0 obtained from the LSI topic modelling technique is represented as:

[('remote' * 0.4358473796441817) + ('crafted' * 0.3319362678157508) + ('attacker' * 0.3159864058882791) + ('allowed' * 0.31502006087883194) + ('prior' * 0.31419760562296395) + ('html' * 0.26029878325415207) + ('page' * 0.25179755400065057) + ('potentially' * 0.14941933391958886) + ('attackers' * 0.14537824144767378) + ('allows' * 0.12949807741439195)]

From the above expression, it can be noted that the tokens "remote, crafted, attacker, allowed, prior, html, page, potentially, attackers, allows" conjointly form Topic 0 as conferred by the LSI topic modelling result. The numerical part attached to each token in Topic 0 signifies the weightage of the word in the respective topic.
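As an illustration, the following is a minimal sketch (not the authors' released script) of how such topics can be generated with the 'sklearn' components named above. The file name "chrome_nvd.csv" and its "summary" column are assumptions for this example, while the 100-word vocabulary and 10 topics follow the paper.

# Minimal sketch, assuming a CSV of pre-processed NVD summaries.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation, NMF

docs = pd.read_csv("chrome_nvd.csv")["summary"]  # hypothetical input file

N_WORDS, N_TOPICS = 100, 10

# LSI and LDiA read plain term frequencies (integer counts)...
tf_vec = CountVectorizer(max_features=N_WORDS)
X_tf = tf_vec.fit_transform(docs)
# ...whereas NMF reads real-valued TF-IDF weights.
tfidf_vec = TfidfVectorizer(max_features=N_WORDS)
X_tfidf = tfidf_vec.fit_transform(docs)

models = [
    ("LSI", TruncatedSVD(n_components=N_TOPICS), X_tf, tf_vec),
    ("LDiA", LatentDirichletAllocation(n_components=N_TOPICS, random_state=0), X_tf, tf_vec),
    ("NMF", NMF(n_components=N_TOPICS, random_state=0), X_tfidf, tfidf_vec),
]
for name, model, X, vec in models:
    model.fit(X)
    vocab = vec.get_feature_names_out()
    for k, row in enumerate(model.components_):
        top = row.argsort()[::-1][:10]  # the ten heaviest tokens per topic
        print(name, f"Topic {k}:",
              " + ".join(f"('{vocab[i]}' * {row[i]})" for i in top))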
The topics created by LSI, LDiA and NMF, and each token's relative importance in its respective topic, are depicted in figures 2, 3 and 4.

Figure 2: Relative importance of tokens in respective topics when LSI was performed.

Figure 3: Relative importance of tokens in respective topics when LDiA was performed.
Figure 4: Relative importance of tokens in respective topics when NMF was performed.

From figure 2 (a)-(j), the tokens "remote, attackers, policy, allows, bounds, extension, free, process, windows, vectors" have the highest weightage in their respective topics, ranging between 0.285 and 0.526, when LSI was performed. The most influential tokens in the respective topics obtained from LDiA, as depicted in figure 3 (a)-(j), were "attacker, audio, file, bounds, attackers, corruption, overflow, remote, extension, windows". Lastly, from the NMF analysis, the tokens "heap, vulnerability, policy, bounds, multiple, omnibox, process, windows, extension, used" weighed the most in their corresponding topic clusters.

4.4 Evaluations

RQ1. What is the performance of topic modelling when combined with classifiers?

The results for the topics extracted and used as input for the 10 classifiers are given in table 2. From the table, it can be observed that when LR, KNN, DT, ANN, NB, RF and GNB are combined with LSI, they give accuracy levels of 0.8175, 0.82, 0.8025, 0.8175, 0.7975, 0.8175 and 0.7975 respectively. NMF combined with LDA, LSVM and SVM gives accuracy levels of 0.8275, 0.82 and 0.785. Lastly, it can be observed that LDiA has the poorest performance when combined with the 10 classifiers, with accuracy levels ranging between 0.7175 and 0.7525. A pictorial representation of the accuracy levels of all classifiers with respect to each topic modelling technique is given in figure 5. From figure 6 and table 2, the F1-measure can be analysed for the different classifiers subject to a given topic modelling technique. The classifiers with LDiA have overall the same level of F1-measure, except for KNN, which shows the highest F1-measure at 0.7269. On the other hand, LSI's performance with the classifiers yields an average F1-measure of around 0.8, the highest being 0.8206 for the ANN classifier and the lowest 0.7415 for SVM. Last of all, NMF with the classifiers depicts a mixed F1-measure performance, ranging from the lowermost value of 0.6462 for NB to a peak of 0.8225 for LDA. Lastly, the tabular results for the performance measures recall and precision are given in table 2, with line diagrams in figures 7 and 8. The classifier GNB with the topic modelling technique NMF results in the lowest recall value at 0.6925, whereas the lowest recall value with LSI, 0.7225, belongs to the classifier SVM, and with LDiA it is 0.7175 for the classifier DT. For precision, NB with NMF has the lowest value at 0.5662, DT has the lowest value with LSI at 0.8005, and multiple classifiers share a poor precision value of 0.5663 with LDiA. A low recall combined with a high precision value implies that the positive predictions the model returns are accurate. For all the low recall values recorded, it was observed that they were accompanied by more or less high precision values, implying that when the suggested model labels a vulnerability as critical, it does so correctly; however, the low recall means the number of false negatives is high, indicating that the model sometimes misses critical vulnerabilities. In general, one cannot help but notice the opposite behaviour of the F1-measure and precision, whereas accuracy moves in parallel with recall, implying the goodness of fit of the proposed model. From the results it was also noted that many classifiers had both high recall and high precision values, signifying that the model was accurately labelling critical vulnerabilities.
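For reference, the four measures used throughout this section follow their standard confusion-matrix definitions (with TP, TN, FP and FN the counts of true positives, true negatives, false positives and false negatives):

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad \text{Precision} = \frac{TP}{TP+FP},$$
$$\text{Recall} = \frac{TP}{TP+FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$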
Table 2: Output of classifiers' performance measures (Precision / Recall / F1-score / Accuracy).

Classifier   NMF                               LSI                                LDiA
LR           0.7451 / 0.77 / 0.7419 / 0.772    0.8194 / 0.817 / 0.8184 / 0.8175   0.5663 / 0.752 / 0.646 / 0.7525
LDA          0.8204 / 0.82 / 0.8225 / 0.827    0.8173 / 0.812 / 0.8146 / 0.8125   0.5663 / 0.752 / 0.646 / 0.7525
KNN          0.7982 / 0.79 / 0.7978 / 0.797    0.8213 / 0.82 / 0.8206 / 0.82      0.7193 / 0.745 / 0.726 / 0.745
DT           0.7729 / 0.78 / 0.7759 / 0.78     0.8005 / 0.802 / 0.8015 / 0.8025   0.6593 / 0.717 / 0.674 / 0.7175
ANN          0.8005 / 0.80 / 0.8015 / 0.802    0.8251 / 0.817 / 0.8206 / 0.8175   0.5663 / 0.752 / 0.646 / 0.7525
NB           0.5662 / 0.75 / 0.6462 / 0.752    0.8257 / 0.797 / 0.8061 / 0.7975   0.5663 / 0.752 / 0.646 / 0.7525
LSVM         0.8128 / 0.82 / 0.8152 / 0.82     0.8194 / 0.817 / 0.8184 / 0.8175   0.5663 / 0.752 / 0.646 / 0.7525
SVM          0.7636 / 0.78 / 0.7535 / 0.785    0.8326 / 0.722 / 0.7415 / 0.7225   0.5663 / 0.752 / 0.646 / 0.7525
RF           0.7584 / 0.78 / 0.7380 / 0.78     0.8125 / 0.817 / 0.8145 / 0.8175   0.5663 / 0.752 / 0.646 / 0.7525
GNB          0.8271 / 0.692 / 0.7133 / 0.69    0.8257 / 0.797 / 0.8061 / 0.7975   0.6682 / 0.747 / 0.660 / 0.7475

RQ2. What is the performance of the classifiers without incorporating any of the topic modelling techniques?

The line graphs in figures 5, 6, 7 and 8 represent the accuracy, F1-measure, recall and precision levels of the classifiers when using topic modelling techniques and without them. Even though the accuracy level oscillates between 0.71 and 0.89 and the F-measure fluctuates between 0.7203 and 0.8904, it can be observed that without topic modelling the classifiers mostly show high precision and high recall values, except for the classifier LDA. A high recall indicates that the model rarely labels a critical vulnerability as non-critical. High precision together with high recall is considered the perfect combination, since the model then returns a high number of true positives, implying that the critical vulnerabilities are predicted correctly.

Figure 5: Comparative study of TP vs without TP using accuracy.
Figure 6: Comparative study of TP vs without TP using F1 score.
Figure 7: Comparative study of TP vs without TP using recall.
Figure 8: Comparative study of TP vs without TP using precision.
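To make the RQ1/RQ2 comparison concrete, the following is a hedged sketch of the evaluation loop: the same classifier is trained once on the full token features and once on 10 LSI topic features. The file "chrome_nvd.csv" and its "summary" and "critical" columns (1 = critical, 0 = non-critical) are assumed placeholders, and LogisticRegression stands in for any of the 10 classifiers.

# Hedged sketch of the with/without-topics comparison.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("chrome_nvd.csv")  # assumed columns: "summary", "critical"
X_tokens = CountVectorizer(max_features=100).fit_transform(df["summary"])
y = df["critical"]  # assumed encoding: 1 = critical, 0 = non-critical

# LSI topic features: project the token counts onto 10 latent topics.
X_topics = TruncatedSVD(n_components=10).fit_transform(X_tokens)

for tag, X in [("all tokens", X_tokens), ("LSI topics", X_topics)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(tag)
    print(classification_report(y_te, clf.predict(X_te), digits=4))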
RQ3. Which of the Machine Learning (ML) classifiers shows improvement in performance?

Looking at figures 5, 6, 7 and 8 and table 2, the machine learning classifier GNB performs best when combined with the topic modelling technique LSI, and the classifier LDA performs best when combined with the topic modelling technique NMF. The other classifiers show no sign of improvement for the given dataset when the features are reduced and combined into topics. The reason behind this lack of improvement is simply the overestimation introduced while using topic modelling techniques.

4.5 Model validation

In order to study the impact of the features extracted automatically by the topic modelling techniques on the 10 classifiers while developing the vulnerability prediction model, a 10-fold cross-validation experiment was conducted. The vulnerability dataset was divided into 10 folds: 9 parts served as the training set while 1 part was used to test the model. Hence, for each topic modelling technique and each classifier, 10 different performance results were obtained. Figures 9, 10 and 11 depict the averaged-out performance measure for each classifier under the three topic modelling techniques. Accuracy was used as the performance measure for this validation experiment.

Figure 9: 10-fold validation for the LSI model.
Figure 10: 10-fold validation for the NMF model.
Figure 11: 10-fold validation for the LDiA model.

From figure 9, it can be noted that the vulnerability prediction model using LSI and ANN outperforms the rest, with an accuracy of 0.8555, whereas LSI fused with the classifier NB performs the worst, with an accuracy level of 0.7429, while the other classifiers with LSI performed within this range. Analysing figure 10, one cannot help but notice the poor performance of the classifier GNB, with accuracy at 0.7898; on the other hand, LDA has the highest accuracy at 0.8499. Lastly, figure 11 shows accuracy levels ranging between 0.7098 (GNB) and 0.7456 (KNN) for LDiA; the classifiers ANN, NB, LSVM, SVM and RF have almost the same accuracy level of 0.7429. Overall, after the 10-fold validation, LSI is the most impactful feature reduction tool when performed conjointly with the machine learning tool ANN.
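The validation loop itself can be sketched as follows, again as a hedged illustration rather than the authors' code, reusing X_topics and y from the previous snippet; cross_val_score handles the 10-fold split and returns per-fold accuracies that are then averaged, and the three classifiers shown stand in for the full set of ten.

# Hedged sketch of the 10-fold cross-validation on the topic features.
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "ANN": MLPClassifier(max_iter=1000),
    "GNB": GaussianNB(),
    "RF": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    # 10 folds: each classifier is trained on 9 parts and tested on the 10th
    scores = cross_val_score(clf, X_topics, y, cv=10, scoring="accuracy")
    print(f"{name}: mean 10-fold accuracy = {scores.mean():.4f}")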
5 Threats to validity

An empirical study can be threatened by a number of limitations, internal as well as external, which makes them worth discussing. A threat to internal validity describes the elements that might have an impact on the study's output, whereas a threat to external validity concerns generalizing the output. In this study, the vulnerability descriptions were mined to extract features, and the CVSS score was used to determine the criticality of the respective vulnerabilities for the prediction model; however, other factors, such as the CVSS metrics, were not taken into consideration, which might have an impact on the performance of the prediction model. Another threat to internal validity is that vulnerability records disclosed during the period of this study were not documented. A statistical test was not conducted to verify the statistical significance of the results, which gives a direction for future work. Subsequently, the threat to external validity in this study is that the dataset was limited to one project, which cannot yield generalized results for other datasets, adding bias to the output; the reason is that a vulnerability of a high criticality level is not inevitably of the same criticality in a different project dataset. In this study we have worked on a web application's vulnerability dataset, but the results may differ for applications written in different languages or for Android applications. The performance measures used to assess the learning algorithms for the prediction model were accuracy, F-measure, recall and precision; nonetheless, there are other measures as well, for instance the area under the receiver operating characteristic curve, Welch's t-test, Cliff's delta effect size, etc. For the empirical results, 10 machine learning algorithms were deployed, but there are many more algorithms to be validated for a universal result.

6 Conclusion

This study focuses on the impact of topic modelling techniques on the performance of classifiers labelling vulnerabilities as critical or non-critical. The topics extracted from the vulnerability descriptions condense the textual data, thereby capturing the significant portion and eradicating the irrelevant text. To perform the analysis, we extracted a vulnerability dataset for the most used web application, Google Chrome. The topics were generated with the help of three topic modelling techniques, namely LSI, NMF and LDiA. These generated topics were used as input to the 10 most commonly used classifiers. The results of the suggested methodology were compared with those of the classifiers without topic modelling inputs. All in all, one can conclude from the experiment that most of the classifiers perform best when not combined with topic modelling techniques, except for GNB and LDA: classifier GNB with LSI has an accuracy of 0.7975, whereas LDA with NMF has an accuracy of 0.8275. However, considering the classifiers' performance with the topic modelling techniques individually, one can state that the performances are on par.

Future work can be directed along three courses. Firstly, the proposed methodology can be validated on software application databases such as PHP applications, web applications, mobile applications and applications from various fields like finance, education, banking, energy utilities, etc. The second direction is incorporating techniques to balance the datasets; an imbalanced dataset does not result in high accuracy and performance of the prediction model, hence incorporating sampling techniques can enhance the results. The third direction is that, for this study, the vulnerability description alone was used to extract features, but there are multiple other factors that could improve and deliver a generalised result.

References

[1] Alves, H., Fonseca, B., & Antunes, N. (2016). Software metrics and security vulnerabilities: dataset and exploratory study. 2016 12th European Dependable Computing Conference (EDCC).
[2] Anjum, M., Agarwal, V., Kapur, P., & Khatri, S. K. (2020). Two-phase methodology for prioritization and utility assessment of software vulnerabilities. International Journal of System Assurance Engineering and Management, 11(2), 289-300.
[3] Anjum, M., Kapur, P., Agarwal, V., & Khatri, S. K. (2020). Evaluation and selection of software vulnerabilities. International Journal of Reliability, Quality and Safety Engineering, 27(05), 2040014.
[4] Bulut, F.
G., Altunel, H., & Tosun, A. (2019). Predicting software vulnerabilities using topic modeling with issues. 2019 4th International Conference on Computer Science and Engineering (UBMK).
[5] Dam, H. K., Tran, T., & Pham, T. (2016). A deep language model for software code. Workshop on Naturalness of Software (NL+SE), co-located with the 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE).
[6] Farris, K. A., Shah, A., Cybenko, G., Ganesan, R., & Jajodia, S. (2018). VULCON: A system for vulnerability prioritization, mitigation, and management. ACM Transactions on Privacy and Security (TOPS), 21(4), 1-28.
[7] Filus, K., Siavvas, M., Domańska, J., & Gelenbe, E. (2020). The random neural network as a bonding model for software vulnerability prediction. Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.
[8] Ji, T., Wu, Y., Wang, C., Zhang, X., & Wang, Z. (2018). The coming era of alphahacking?: A survey of automatic software vulnerability detection, exploitation and patching techniques. 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC).
[9] Kalouptsoglou, I., Siavvas, M., Tsoukalas, D., & Kehagias, D. (2020). Cross-project vulnerability prediction based on software metrics and deep learning. International Conference on Computational Science and Its Applications.
[10] Kansal, Y., Kapur, P., & Kumar, D. (2016). Assessing optimal patch release time for vulnerable software systems. 2016 International Conference on Innovation and Challenges in Cyber Security (ICICCS-INBUSH).
[11] Kansal, Y., Kumar, U., Kumar, D., & Kapur, P. K. (2018). Fixing of faults and vulnerabilities via single patch. In Quality, IT and Business Operations (pp. 175-190). Springer.
[12] Kudjo, P. K., Chen, J., Mensah, S., Amankwah, R., & Kudjo, C. (2020). The effect of Bellwether analysis on software vulnerability severity prediction models. Software Quality Journal, 1-34.
[13] Khuat, T. T., & Le, M. H. (2016). Optimizing parameters of software effort estimation models using directed artificial bee colony algorithm. Informatica, 40, 427-436.
[14] Kumar, M., & Sharma, A. (2017). An integrated framework for software vulnerability detection, analysis and mitigation: an autonomic system. Sādhanā, 42(9), 1481-1493.
[15] Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., & Chen, Z. (2021). SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing.
[16] Malhotra, R. (2021). Severity prediction of software vulnerabilities using textual data. Proceedings of International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications.
[17] Mounika, V., Yuan, X., & Bandaru, K. (2019). Analyzing CVE database using unsupervised topic modelling. 2019 International Conference on Computational Science and Computational Intelligence (CSCI).
[18] Narang, S., Kapur, P., Damodaran, D., & Majumdar, R. (2018). Prioritizing types of vulnerability on the basis of their severity in multi-version software systems using DEMATEL technique. 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO).
[19] Nguyen, V. H., Dashevskyi, S., & Massacci, F. (2016). An automatic method for assessing the versions affected by a vulnerability. Empirical Software Engineering, 21(6), 2268-2297.
[20] Karna, H., Gotovac, S., & Vicković, L. (2020).
Data mining approach to effort modeling on agile software projects. Informatica, 44, 231-239.
[21] Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (2000). Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2), 217-235.
[22] Tandon, A., Neha, & Aggarwal, A. G. (2020). Testing coverage-based reliability modelling for multi-release open-source software incorporating fault reduction factor. Life Cycle Reliability and Safety Engineering, 9, 425-435. https://doi.org/10.1007/s41872-020-00148-7
[23] Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
[24] Roumani, Y., Nwankpa, J. K., & Roumani, Y. F. (2015). Time series modeling of vulnerabilities. Computers & Security, 51, 32-40.
[25] Aissaoui, O., Amirat, A., & Atil, F. (2014). A model-based framework for building self-adaptive distributed software. Informatica, 38, 289-306.
[26] Kapur, P. K., Aggarwal, A. G., & Tandon, A. (2012). A unified approach for developing two-dimensional software reliability model. International Journal of Operational Research, 13(3), 318-337.
[27] Sharma, R., Sibal, R., & Sabharwal, S. (2019). Software vulnerability prioritization: A comparative study using TOPSIS and VIKOR techniques. In System Performance and Management Analytics (pp. 405-418). Springer.
[28] Stuckman, J., Walden, J., & Scandariato, R. (2016). The effect of dimensionality reduction on software vulnerability prediction models. IEEE Transactions on Reliability, 66(1), 17-37.
[29] Telang, R., & Wattal, S. (2007). An empirical analysis of the impact of software vulnerability announcements on firm stock price. IEEE Transactions on Software Engineering, 33(8), 544-557.
[30] Tandon, A., Aggarwal, A. G., & Nijhawan, N. (2016). An NHPP SRGM with change point and multiple releases. International Journal of Information Systems in the Service Sector (IJISSS), 8(4), 56-68.
[31] Theisen, C., & Williams, L. (2020). Better together: Comparing vulnerability prediction models. Information and Software Technology, 119, 106204.
[32] Vanamala, M., Yuan, X., & Roy, K. (2020). Topic modeling and classification of Common Vulnerabilities and Exposures database. 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD).
[33] Vijayarani, S., Ilamathi, M. J., & Nithya, M. (2015). Preprocessing techniques for text mining - an overview. International Journal of Computer Science & Communication Networks, 5(1), 7-16.
[34] Kapur, P. K., Garg, R. B., Chanda, U., & Tandon, A. (2010). Development of software reliability growth model incorporating enhancement of features and related release policy. International Journal of Systems Assurance Engineering and Management, 1, 52-58.
[35] Walden, J., Stuckman, J., & Scandariato, R. (2014). Predicting vulnerable components: Software metrics vs text mining. 2014 IEEE 25th International Symposium on Software Reliability Engineering.
[36] Wu, F., Wang, J., Liu, J., & Wang, W. (2017). Vulnerability detection with deep learning. 2017 3rd IEEE International Conference on Computer and Communications (ICCC).
[37] Zerkane, S. (2018). Security analysis and access control enforcement through software defined networks (Doctoral dissertation, Brest).
[38] Tsarev, R. Y., Chernigovskiy, A. S., Shtarik, E. N., & Shtarik, A. V. (2016).
Modular integrated probabilistic model of software reliability estimation. Informatica, 40, 125-132.
[39] Kapur, P., Tandon, A., & Kaur, G. (2010). Multi up-gradation software reliability model. 2010 2nd International Conference on Reliability, Safety and Hazard-Risk-Based Technologies and Physics-of-Failure Methods (ICRESH).
[40] Aggarwal, A. G., Gandhi, N., Verma, V., & Tandon, A. (2019). Multi-release software reliability growth assessment: an approach incorporating fault reduction factor and imperfect debugging. International Journal of Mathematics in Operational Research, 15(4), 446-463.