https://doi.org/10.31449/inf.47i9.3772 Informatica 47 (2023) 35-50 35 Retrieval of Interactive Requirements of Data Intensive Applications using Random Forest Classifier Renita Raymond 1 , S Margret Anouncia 2,* 1,2 School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India. E-mail: renita.r@vit.ac.in 1 , smargretanouncia@vit.ac.in 2,* * Corresponding Author Keywords: requirement classification, interaction-based requirements, data-intensive applications, random forest classifier Received: January 4, 2021 Classifying requirements in data-intensive systems based on their interactions can assist the requirements engineering process in becoming more systematic and transparent, resulting in higher requirement compliance and software project completion. However, understanding the requirements centred on interactions with the system is particularly tough due to the increased complexity of big data. In most cases, awareness of interaction-based requirements is critical in moving forward with prediction and decision-making. As a result, the classification of interactive requirements plays a critical role in removing the difficulties from unclear requirements. Various approaches to effective requirement classification are being devised. However, due to inadequate requirement management reflecting the fast- changing organizational change, classification accuracy does not achieve its maximum potential. The best approach for reducing misclassification rate and retrieving interactive requirements for data-intensive systems would be to use Word Embedding and Random Forest Classifier retrieval mechanism, as none of the studies to date have emphasized it. It also assessed the impact by comparing the results to metrics derived from the Random Forest classifier's training on word count characteristics. The data set used to experiment with the classification, particularly for interaction-based needs, is unique to our work and has not been covered by any other studies to date. The researchers will benefit from this study as they will better understand the requirement classification process. With an F1 score of 0.91, precision of 0.89, and recall of 0.93, statistical analysis showed that Word Embedding followed by Random Forest Classifier produced a relatively high classification result to differentiate interactive requirements for data-intensive systems. Povzetek: Z uporabo algoritma naključnih gozdov je izboljšana klasifikacija interaktivnih zahtev v podatkovno intenzivnih aplikacijah. 1 Introduction Generally, Big Data is described as a massive chunk of unstructured and structured data making it difficult to process using traditional methods [1]. Data-Intensive Applications (DIA) help business organizations to drive predictive and informative decisions by analyzing this massive chunk of data. Big data software requirements discover business values. Therefore, to elicit software requirements for the DIA, it is necessary to outline the implication of the projects at an earlier stage [2]. Requirement Engineering (RE) assist in collaborating with various stakeholders and business analyst expertise in analytical thinking to perceive and comply with the value and priority of each requirement. According to statistics, RE was the source of 60% of software development faults. As a result, soliciting relevant requirements minimizes the risk of software-intensive projects and consequently improves quality [3]. Requirements are also iterative, dynamic, interactive, and never complete [4]. As most of the requirements written are in natural language, developers, analysts, and software architects always find it difficult to classify the requirements as it is time-consuming and error-prone manually. These tasks require expertise, training, experience, and domain knowledge [5]. By utilizing Natural Language Processing (NLP), developers can organize and structure the requirements to perform feature extraction, classification, speech recognition, etc. Appropriate classification of requirements from Software Requirement Specification (SRS) improves the quality of software-intensive products [6]. Nevertheless, the requirements engineering process for traditional and big data business intelligence systems share many commonalities, and it also differs in many aspects. A very clear sympathetic is necessary to understand and classify the interactive requirements for end-user applications [7]. In DIA, interactive requirements must be processed separately and classified accurately to improve the quality of requirements and reduce budget over-run. Techniques for automatically classifying the elicited interactive-based requirements into different classes are required [8]. According to Manal et al. [9], Machine Learning (ML) approaches for classifying requirements in requirement documents have produced better results than traditional 36 Informatica 47 (2023) 35-50 S.M. Anouncia et al. natural language processing approaches. However, the systematic level of understanding is still lacking. Similarly, various approaches are used to classify functional and non -functional requirements [10]. However, there was no automated tool to support the analysis and management of interactive-based requirements, leading to various consequences like budget overrun, quality and security issues, and customer dissatisfaction in DIA. Furthermore, as the vast amount of data generated is increasing significantly on the internet, it is formidable for the developers to categorize and extract meaningful information especially textual requirements from the SRS, due to their complex semantic meaning. A supervised machine learning technique is used for the classification of requirements. Based on the acquired knowledge from training, it is possible to categorize analogous documents into various classes. Nevertheless, it is a more challenging task when designing DIA as the corpus to be classified increases to million petabytes every day on the internet. As mentioned earlier, word embedding and an improved random forest algorithm help catalogue the interactive requirements from the SRS. In the proposed framework, requirement feature extraction and requirement document classification are the two significant steps. In the first step, text features extracted are from the SRS documents using pre- processing. The extracted text features are represented as real-valued feature vectors in a predefined vector space using word embedding. In the next phase of classification, the converted feature vectors are categorized into four types, namely Input Requirement (IReq), Output Requirement (OReq), Transaction Requirement (TSReq), and Transformation Requirement (TFReq). IReq is the set of requirements from the environment required to produce a given level of outputs, OReq is the set requirements provisioned for the environment, TSReq is the set of requirements that are stable and filtered out. TFReq is the computation performed based on the requirement. Then, a query set is created with the help of keywords seen in the SRS. Non Metric Space Library (NMSLIB) creates indexing and retrieves the most similar documents according to the query set with a similarity score. Also, the performance is measured by training the corpus using the improved random forest classifier algorithm. The remainder of the paper is structured as follows. Section 2 and 3 consist of related work and motivation. Section 4 explains the design and implementation of our retrieval of IREq, OReq, TFReq, and TSReq using similarity search and some background work associated with it, and Section 5 presents results. Finally, Section 6 consists of some conclusions along with the future scope. 2 Related work RE is one of the essential aspects of research in the field of software engineering. Studies proclaim that failure to understand and classify requirements are the root cause for exceeding the allocated budget and time, leading to software system failure. They were manually classifying the requirements accordingly as FR and NFRs are difficult. Several researchers have stated that the requirements can be extracted and classified as FR and NFRs automatically from the natural language documents using various machine learning approaches and fuzzy techniques. Many techniques, especially for the classification of NFRs, have been devised and applied to various applications. Nevertheless, none of the methods addressed the classification of interaction-based requirements for data-intensive applications like banking, e-commerce, etc. This section outlines the various methods involved classification of requirements generally. A software system's success depends significantly upon adherence to non-functional requirements because when it is being missed or ignored, significant issues arise. To address this issue, Slank et al. [13] proposed a tool-based approach, namely the NFR locator. This tool classifies and extracts the sentences in natural language texts into their respective NFR categories. Though the NFR locator helps the analyst effectively extract NFRs in available natural language documents through automated NLP, it works well only with texts. It cannot process images and tables in the unconstrained document present. Similarly, security-related issues must be considered with caution for completing software that meets the customer's needs. Text mining techniques and prediction models have been used to classify the security requirements [14]. In 2017, Liang et al. [15, 16] combined feature extraction and machine learning algorithms to classify user review requirements automatically and concluded that AUR -BoW with Bagging provides the best classification results. Requirements can also be classified as FR and NFRs accurately using semi-supervised and unsupervised machine learning algorithms. A semi-Supervised classification technique can also be used to extract the FR and NFRs from the SRS automatically. Compared to supervised techniques, Semi- Supervised techniques provide better results because, in the latter one, only a minor amount of data needs to be labelled. In the former one, all the data set need to be labelled for classification. One such example is the app store, where the requirements present in the review from the app store are classified as functional and non- functional requirements using a self-labelling algorithm which is a part of the semi-supervised classification technique [17]. Semi-supervised classification methods help in classifying the requirements accordingly. Also, it will be enhanced with unsupervised learning techniques in the future. 2.1 Requirement pre-processing SRS consists of incredibly massive data of all sorts, and they are heterogeneous by nature with inconsistent values. Pre-processing is a very crucial task that must be completed before the data is used for model training. Authors of [47, 49] alleged the main pre-processing stages as tokenization, stop words removal, error correction, Retrieval of Interactive Requirements of Data Intensive Applications… Informatica 47 (2023) 35-50 37 normalization, and vectorization. Uysal et al. [48] evaluated the combination of pre-processing methods on two domains, namely e-mail and news, in two different languages. Results showed that choosing appropriate combinations of pre-processing tasks significantly improves classification accuracy depending on the domain and language studied. It is evident that pre- processing leads to better data sets that are clean and more manageable and must for any business organization to get meaningful insights. 2.2 Feature extraction An SRS, modelled after business requirement specification, consists of all the requirements categorized into four types: IReq, OReq, TSReq, and TFReq. It is represented in vectors after pre-processing so that the machine learning algorithms train the corpus and classify it accordingly. The feature extraction process extracts the text features from the SRS documents using NLP pre- processing techniques by converting text into feature vectors. Feature Extraction improves the accuracy of the learning algorithm as well as shortens the time. Selecting features from some effective ways like the vector space model reduces feature space dimensions [18]. Feature extraction algorithms like Term Frequency – Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and Word2Vec calculate the weights of the words in the text by initiating a feature vector of the text using a predefined keyword set [19]. This section includes various feature extraction techniques used to extract the features and their limitations. One hot encoding is the first count-based embedding technique that converts the text into a vector by constructing a vocabulary. However, it cannot capture any contextual information due to its inefficient memory requirement [28]. BoW is one of the most common and effective features extraction techniques because of its simplicity and performance. In BoW, assuming words are independent of each other, texts are represented as a bag of words by recording the number of occurrences of each instance or word in a bag irrespective of their order or grammar. However, it leads to a high sparse and dimensional feature vector due to its non-zero dimensions and large vocabulary size [21][22]. Using Bow, all the features will have a value, and it gives equal weightage to all the features in the documents. Additionally, recurrently appearing features direct the model rather than the importance of the features in the document which TFIDF is solving. Qaisier et al. [23] say that TFIDF is calculated by multiplying both the term frequency and Inverse document frequencies. Terms with high TF-IDF weight are considered to be more important rather than terms with lesser TFIDF scores. However, TF-IDF is the most well-known and used formula to produce a vectors descriptor that developed to have several normalized forms it has certain limitations. TFIDF does not care about the position of a term in the text, its semantics and co-occurrences with other texts in the documents. In 2019, an extended form of Fuzzy based TF-IDF (FTF-IDF) is introduced to overcome the limitations of TF-IDF. FTF-IDF is a vector representation, where the components of the TF-IDF are presented as inputs to the Fuzzy Inference System (FIS). Weight terms are generated as crisp outputs after the defuzzification step. FTF-IDF provides semantic meanings to the words in the documents [24]. It does not look into the co-occurrences of other texts in the documents. Later on in the same year, Lakshmi et al. [25] proposed term weighting schemes to represent text documents using Term Frequency - Ranking of Term Frequency (TF- RTF) and Term Frequency - Ranking of fuzzy logic with the semantic relationship of terms (TF-RFST). It provides better clustering performance in terms of accuracy, recall, and F1 measure compared to word count, Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency-Inverse Corpus Frequency (TF- ICF), Multi- Aspect TF (MATF), BM25, and BM25F. Yet, it does not focus on the syntactic of the sentences in the documents. Also, Ricardo et al. [20] initiated YAKE depending only on statistical text features and not on a trained large corpus. It is adapted to different languages and scalable to documents of any length. However, it cannot tackle manually assigned keywords when not found in the text. Okapi BM25 is a ranking function used to estimate the relevance of documents to a given search query regardless of their proximity within the document. The authors of [27] made a comparative analysis using the Twitter data set and proved that TF-IDF is the best feature extraction technique compared to BM25 with an F1 measure of 89.77. BM25 is not suitable for large corpus. The authors of [26] state that the selection of the weighting technique is not essential because the weighting process is just a linear transformation of feature vectors. Therefore, researchers can use any one of the text feature extraction techniques or the combination of various techniques based on their project requirement, as every method has its pros and cons. 2.3 Requirement classification Requirements need to be defined, organized, and clearly understood by the stakeholders and the project members involved in developing the system. Classifying the requirements helps us define and organize the work because sometimes, compared to functional requirements, designing a system concerning non-functional requirements should be focused on a lot. It takes up large portions of the schedule and is filled with knotty problems. A part of requirement engineering, i.e., classification of requirements appropriately, is essential because it is the base for any software to be developed. Requirement classification done manually is a time- consuming task, and it is error-prone. Henceforth, an automatic classification of requirements must minimize rework and make the software easier to use and understand. This section consists of various classification 38 Informatica 47 (2023) 35-50 S.M. Anouncia et al. techniques suggested by the researchers to classify the requirements automatically. In 2019, Rahman et al. [30] extracted NFR from the SRS document using various machine learning techniques to meet customer expectations completely. Based on the statistical analysis, it is revealed that the SVM classifier achieves the best results with a precision of 0.66, recall of 0.61, and accuracy of 0.76. The experiments were conducted with the well-known PROMISE dataset, which has the characteristics of being unbalanced in FRs and NFRs. Lima et al. [31] expanded the PROMISE dataset, forming the PROMISE_exp repository. Again, Edna et al. [29] showed a comparative analysis of various machine learning algorithms like Support Vector Machine (SVM), KNN (K Nearest Neighbour), Decision Tree, Multinomial Naive Bayes (MNB), and Logistic Regression (LR) to determine which algorithm fits better to classify the requirements automatically using PROMISE_exp. The results reveal that the combination of TF-IDF and LR has the best performance measures with an F-measure of 91% on the binary classification, 74% in 11 granularity classification, and 78% on the 12- granularity classification. Before conducting any experimental analysis, researchers must verify whether the dataset being used is balanced or unbalanced. Studies have shown that an unbalanced dataset leads to poor automatic classification of requirements. Fuzzy Rough Set (FRS) is a powerful mathematical tool to deal with uncertain data. So, Behera et al. [33] proposed a Fuzzy Rough Set based on Robust Nearest Neighbor (FRS-RNN) to document classification. A modified CNN is used to extract the features from the documents, and later on, using FRS-RNN, documents are classified. It outperforms all the classification models like SVM, Naive Bayes, DNN, and CNN. However, the hyperparameter tuning of FRS-RNN consumes more time than conventional machine learning algorithms. An NFR sentence can be classified into more than one class. In 2019, Fuzzy Similarity KNN (FSKNN) was suggested for multi-label classification of requirements based on ISO/lEe 25010. In this paper, the fuzzy similarity measure approach is used to calculate the similarity between the terms, documents and a training pattern is obtained. The search set obtained from the training data is used to find the K nearest neighbor. A test document will be labelled into a specific category using a maximum a posteriori (MAP) estimate [35]. Similarly, to classify the FR and NFR contained in the reviews within the APP store, a semi-supervised classification technique was used. The self-labelling algorithm appropriately assigns labels to the collected unlabelled data and also classifies unseen future reviews. However, the results are not empirically evaluated [36]. Semantic information plays a significant role in the area of RE. Software developers use effective requirement classification techniques to produce semantic-based SRS of higher quality. A Requirement Classification Ontology (RCO) is initiated for sharing and describing the different classifications of requirements. It is used as a tool to confirm the RE process's semantic correctness, thereby ensuring consistency between the requirements [38]. Various studies [32][34][37] reveal that machine learning techniques play a significant role in classifying the requirements as FRs, NFRs, quality requirements, security requirements, legal requirements, etc., compared to fuzzy rule mechanisms. However, from the related work, it is evident that no research has been carried out to address the challenges faced in extracting the interaction-based requirements nor sets the standards for categorizing the requirements based on their interactions for designing DIAs. 3 Motivation It is inferred that categorizing the requirements according to their type of interactions will create transparency in the RE process, thereby promoting requirement fulfilment and completing software-intensive projects based on the study carried out. Considering the usefulness of the technology in software requirement classification, a new framework is designed to classify the interactive-based requirements. Limiting the requirements to interactions, in particular, can focus on what the DIA developers care about while allowing the engineers to bring all their knowledge and creativity to bear on the means for achieving it. Distinguishing interactive requirements from other requirements is very important because there are usually much more difficult challenges to design and test DIAs. Manuel et al. [46] conducted a survey in 2020, which reveals that the most recurrent classification algorithms featured on the identified studies are Naive Bayes, K Nearest Neighbor, J48, and Natural Language Processing algorithms. Also, the most used training datasets are academic databases and collected user reviews. Finally, it was concluded that most of the studies focus on classifying FRs and NFRs. None of the studies revealed the interest in classifying interactive requirements, especially for software-intensive projects. 4 Proposed methodology Given the extraction of interactive requirements as a prime focus, the framework is designed with the following phases, ➢ Requirement Elicitation ➢ Requirements (Text) Pre-Processing ➢ Features Extraction ➢ Requirement Discovery ➢ Requirement Classification Retrieval of Interactive Requirements of Data Intensive Applications… Informatica 47 (2023) 35-50 39 Figure 1: Framework for extracting interactive requirements 4.1 Requirement elicitation In the RE phase, requirement elicitation discovers the requirements for developing software-intensive projects from the users, customers, and other stakeholders. The requirements of DIAs should be discovered in the initial stage of the software life cycle itself. Conventional RE processes are incapable of fulfilling the needs of the organization mainly for two reasons. Firstly, it focuses primarily on generic user requirements, and it does not provide any meaningful insights about the features generated from big data's leading to a better business intelligence solution. Secondly, the vast amount of data generated daily by various systems leads to increased demand for consumption at various levels. Therefore, in the process of requirement elicitation in DIAs, even business analysts are also involved in the discussion to provide business intelligence solutions to the organizations. In the first phase of the framework, a form has been designed to gather the requirements from various stakeholders. The dataset created for this paper is based on the banking application. The stakeholders of the banking domain are customers, bankers, investors, regulators, RBI, etc. Requirements are gathered from the stakeholders and documented initially. The stakeholder form created for gathering the requirements is shown in figure 2. The form includes various details of a requirement like the name of the stakeholder, the role of the stakeholder (i.e., customer, staff, BoD, Investors, Regulators, etc.), purpose, data required for the particular requirement, the status of the stakeholder, either primary or secondary stakeholder, mode of interaction when entering the requirement, locality and the description of the requirement. Figure 2: Stakeholder form 40 Informatica 47 (2023) 35-50 S.M. Anouncia et al. Figure 3: Stakeholder data Requirement Analyst analyses the difference between what the customers need, validates, and documents the need of the project stakeholders. During the analysis phase, the analyst identifies the gathered requirements type documented using stakeholder form as either stable or volatile requirement concerning their priority and feasibility. Requirement types can be divided into two type’s stable and volatile requirements [39]. Stable Requirement – otherwise called enduring requirements are the requirements derived from the organization's core activity and directly related to the system's domain. Here, in the banking domain, requirements concerned with customers, bankers who do not change on time are considered. For example, 'The system shall have provision for the customers to deposit amount in the account', 'The system shall have provision for the staff to get the customer details when opening an account.’ Volatile Requirements – requirements that are likely to change after the system becomes operational are considered volatile requirements. Requirements related to policies framed by the Board of Directors, Investors, RBI are included in it. Such type of requirements falls into four categories as follows. Mutable – change in requirements concerning changes triggered in the organization's environment is included in it. E.g., 'The system shall have provision for the staffs to initiate the customers to set transaction limit for the transactions’. Emergent – requirements that emerge when the system is being developed and implemented are included in it. For example, 'The system shall have the staff's provision to collect the debt loan from the customers when it is not being repaid after giving prior notice'. Consequential – requirements that result from the introduction of the computer system are known as consequential. For example, 'The system shall have provision for the staff to link the customers' account details with aadhar card'. Compatibility – requirements that depend on other equipment or processes are included in it. E.g., 'The bank will have many ATMs, and the new software shall provide all the ATMs’ functionality’. Figure 4: Requirement types A separate keyword list is created and catalogued, as shown in Table 1. The Interaction Type column represents the four interaction types as Input, Output, Transformation, and Transaction. Various keywords related to the interaction types are listed in the Keywords column. Keywords present in the description column of the stakeholder data depicted in Figure 3 are matched with the Interactive Requirement Keyword Catalogue. The requirements are classified as Input, Output, Transaction, and Transformation automatically concerning their requirement type, priority, and feasibility. Any specific requirements needed for the corresponding requirements are also recorded and finally documented, as depicted in figure 6. Retrieval of Interactive Requirements of Data Intensive Applications… Informatica 47 (2023) 35-50 41 Table 1: Interactive requirement keyword catalogue S. No Interaction Type Keywords Total No of Keywo rds 1 Input (IReq) Get, Login, set transaction limit, check, Request, Raise, Write, Complete, set, enter, receive, open, verify, ensure, submit, evaluate, select, monitor, maintain, maintain Debt 20 2 Output (OReq) view, display, print, provide, canvassing, conduct, sanction, respond, issue, appoint, take, review, limit, observe, obtain 15 3 Transformati on (TFReq) Deposit, invest, pay, recharge, withdraw, transfer, add, accept, update, exchange, set policy, set priorities, link account 13 4 Transaction (TSReq) Calculate EMI, Packaging and rolling, quarterly, Filter, year, lock, authorization, evaluate 8 Figure 5: Distribution of interactive requirement type keywords catalogue Table 1 and Figure 5 show the distribution of keywords concerning their interaction types. Out of 56 keywords, Input consists of 20, the output consists of 15, Transformation consists of 13, and transaction consists of 8 keywords. Figure 7: Distribution of requirements per category The corpus created consists of 2812 requirement instances finally after the approval of the requirement analyst. The distribution of the requirement instances is shown in figure 7. IReq consists of 747 instances, OReq consists of 860 instances, TFReq consists of 647 instances, and TSReq consists of 558 instances. 4.2 Requirement pre-processing Requirement Pre-Processing is the second stage of the classification process. It directly improves the model's performance by removing the noise or unclear data extracted from different sources. Series of steps are followed to standardize textual data into a form that would be taken up as an input to analytics systems and applications. To categorize the requirement documents, there are various pre-processing techniques like stop words removal, tokenization, stemming, lemmatization, etc. Text from the SRS is broken into meaningful tokens. After converting into meaningful tokens, predefined stop words are removed. Occasionally, even the stop words can be user-defined based on their respective applications. Removing such words from the corpus reduces the dimensionality of the term space, thereby increasing the model's performance. Later on, stemming is done to identify the root of a token in the corpus. This process removes the various suffixes, reducing the corpus tokens even more to save time and memory space. Finally, lemmatization considers the morphological analysis of the tokens or words, thereby decreasing the noise and speeding up the user's task [40, 41]. 42 Informatica 47 (2023) 35-50 S.M. Anouncia et al. Table 2: Corpus before pre-processing RID Description Interaction Type 1 The system shall have provision for the users to login with authentication Input 2 The system shall have provision to accept the deposit money of the customers Transformation 3 The system shall have provision to request customers to maintain sufficient balance Input 4 The system shall have provision to open an account for the customers Transformation 5 The system shall have provision to submit customers KYC forms Input 6 The system shall have provision to submit income statement of the customers Input 7 The system shall have provision to set transaction limit for the transactions by the customers Input 8 The system shall have provision for the customers to invest shares Transformation 9 The system shall have provision for the users to pay automated bill payments Transformation 10 The system shall have provision for the users to pay taxes Transformation 11 The system shall have provision for the users to recharge the data card Transformation 12 The system shall have provision for the customers to pay for travel through UPI Transformation 13 The system shall have provision for the users to pay due (loan) Transformation 14 The system shall have provision for the users to pay service charges Transformation 15 The system shall have provision to for the users to set the ATM, Mobile Pin, Net Banking transaction pin Input 16 The system shall have provision for the customers to calculate EMI for loan Transaction 17 The system shall have provision for the customers to check the account balance of their account Input RID Description Interaction Type 18 The system shall have provision for the users to withdraw the amount from their account Transformation 19 The system shall have provision for the customers to view their weekly, monthly transaction details Output 20 The system shall have provision for the customers to submit their personal details Input All requirements in the corpus have gone through a pre- processing step. Table 2 shows the requirements in the corpus before the pre-processing steps. In this paper, Spacy, a free, open-source library for NLP is being used to process and understand large volume of text. It performs the pre-processing steps and provides the fastest and more accurate syntactic analysis of any NLP released to date [42]. For example, Table 2 RID 1: "The system shall have provision for the users to login with authentication" has been changed to "['user', 'login', 'authentication']" as shown in RID 1 of Table 3. Table 3: Corpus after Text pre-processing RI D Description Interaction Type Tokens 1 The system shall have provision for the users to login with authentication Input ['user', 'login', 'authenticatio n'] 2 The system shall have provision to accept the deposit money of the customers Transformat ion ['accept', 'deposit', 'money', 'customer'] 3 The system shall have provision to request customers to maintain sufficient balance Input ['request', 'customer', 'maintain', 'sufficient', 'balance'] 4 The system shall have provision to open an account for the customers Transformat ion ['open', 'account', 'customer'] 5 The system shall have provision to submit customers KYC forms Input ['submit', 'customer', 'kyc', 'form'] 6 The system shall have provision to submit Input ['submit', 'income', Retrieval of Interactive Requirements of Data Intensive Applications… Informatica 47 (2023) 35-50 43 RI D Description Interaction Type Tokens income statement of the customers 'statement', 'customer'] 7 The system shall have provision to set transaction limit for the transactions by the customers Input ['set', 'transaction', 'limit', 'transaction', 'customer'] 8 The system shall have provision for the customers to invest shares Transformat ion ['customer', 'invest', 'share'] 9 The system shall have provision for the users to pay automated bill payments Transformat ion ['user', 'pay', 'automated', 'bill', 'payment'] 10 The system shall have provision for the users to pay taxes Transformat ion ['user', 'pay', 'tax'] 11 The system shall have provision for the users to recharge the data card Transformat ion ['user', 'recharge', 'data', 'card'] 12 The system shall have provision for the customers to pay for travel through UPI Transformat ion ['customer', 'pay', 'travel', 'upi'] 13 The system shall have provision for the users to pay due (loan) Transformat ion ['user', 'pay', 'due', 'loan'] 14 The system shall have provision for the users to pay service charges Transformat ion ['user', 'pay', 'service', 'charge'] 15 The system shall have provision to for the users to set the ATM, Mobile Pin, Net Banking transaction pin Input ['user', 'set', 'atm', 'mobile', 'pin', 'net', 'banking', 'transaction', 'pin'] 16 The system shall have provision for the customers to calculate EMI for loan Transaction ['customer', 'calculate', 'emi', 'loan'] 17 The system shall have provision Input ['customer', 'check', RI D Description Interaction Type Tokens for the customers to check the account balance of their account 'account', 'balance', 'account'] 18 The system shall have provision for the users to withdraw the amount from their account Transformat ion ['user', 'withdraw', 'amount', 'account'] 19 The system shall have provision for the customers to view their weekly, monthly transaction details Output ['customer', 'view', 'weekly', 'monthly', 'transaction', 'detail'] 20 The system shall have provision for the customers to submit their personal details Input ['customer', 'submit', 'personal', 'detail'] The above table shows the corpus after wrangling, cleaning up, and standardizing the textual requirements into a form (i.e., tokens) taken up as an input for the feature extraction process. 4.3 Feature extraction In this stage, the pre-processed corpus is converted into numerical features representing the information contained in the requirements usable for machine learning. As the actual text is highly dimensional and unstructured, every unique word or token is seen as a separate dimension, making it challenging to apply classification algorithms. Word2Vec [43], developed by Tomas et al., takes as its input a large corpus of tokens obtained from the normalization process producing a vector space for unique tokens. Words in the vector space that share familiar contexts in the corpus are located close to one another in the space. The word vectors obtained for the corpus is shown in figure 8. 44 Informatica 47 (2023) 35-50 S.M. Anouncia et al. Figure 8: Sample word vectors created using Spacy toolkit. In the above figure, Spacy [42] parses entire blocks of text and seamlessly assigns word vectors from the loaded models. Word2vec improves the quality of features by considering contextual semantics of words in a text, hence improving machine learning and requirement classification accuracy. 4.4 Requirement discovery Requirement discovery is the process of identifying the interactive requirements IReq, OReq, TFReq, and TSReq needed to design software-intensive projects respectively based on the query set created. It is the understanding of how such interactive requirements are formed internally and externally. Query set created consisting of keywords as shown in Figure 9 should be meaningful to the humans, and it should provide enough diverse results in retrieving the documents. These keywords generalize the features of the corresponding requirements, and many diverse compositions can be found by retrieving them. Figure 9: Word Cloud of the query set (keywords) Features extracted from the Word2Vec are also passed as an input to the requirement discovery phase. A similarity measure is a metric used to measure the similarity between the features present in the corpus, irrespective of their sizes. This paper considers metric spaces and non- metric spaces because the non-metric similarity provides robustness, locality, and comfort in modelling. A non- metric is a function that does not satisfy some or all the properties of a metric. It includes context-dependent similarity functions and dynamic similarity functions as well. The non-Metric Space Library (NSMLIB) [44] is an efficient and extendable cross-platform similarity search library and a toolkit to evaluate similarity search methods. It is a library for fast similarity K Nearest Neighbour (k- NN) search. In this phase of extraction of interactive requirements based on the keywords present in the query set, NMSLIB is used as it is the first tool to support non- metric space searching. The principal concern is to provide a solution to a query by retrieving a subset of requirements from the corpus sufficiently similar to the query q. Figure 10: Top 10 similar requirements retrieved for the query 'set transaction limit.' Retrieval of Interactive Requirements of Data Intensive Applications… Informatica 47 (2023) 35-50 45 Figure 11: Retrieval of transformation requirements based on the query set For example, the above figure shows that the top (k =10) nearest neighbours with a similarity score is displayed for the query' Set Transaction Limit'. NMSLIB uses the k- Nearest Neighbors (k-NN) algorithm for performing similarity search as it is prevalent, and the elements in the corpus are represented as vectors. With the help of NMSLIB, k-NN enables high scale, low latency nearest neighbor search on billions of documents across thousands of dimensions with the same ease. The above figure illustrates the retrieval of interactive requirements, especially transformation requirements based on the query with a similarity score. A high degree of similarity score implies a high probability of retrieving the documents concerning the query accurately. Therefore, requirements concerning their interactions are retrieved accurately and efficiently with the help of a fast similarity search (k-NN) NMSLIB. Table 4: Sample of requirements retrieved for specific queries S. No Query Keyword Total Requirements Retrieved 1 Get 15 2 Deposit 25 3 Update 18 4 Check 30 5 Display 8 46 Informatica 47 (2023) 35-50 S.M. Anouncia et al. Figure 12: Sample of requirements retrieved concerning specific keywords Table 4 and figure 12 depicts the sample of requirements retrieved with respect to specific keywords. 4.5 Requirement classification Word embedding produced using Word2Vec is used to train the machine learning and classification algorithm to improve the interactive requirement classification accuracy based on the context and semantic relationship between words. Our approach uses a Random Forest as the training set fed into the algorithm entails labels. It contains several decision trees on various subsets of the given dataset and takes the majority votes to improve the predictive accuracy of that dataset [45]. The experimental analysis in section V reveals that it requires less training time than other algorithms and produces a high accuracy output even for the large data sets efficiently. Most industries consider the usage of Random Forest as it combines multiple classifiers to solve a complex problem, thereby maintaining accuracy even when the dataset is imbalanced. The corpus is split into a training dataset and testing data set in the ratio of 70:30. 70 % of the dataset goes into the training set, and the remaining 30% goes into the testing dataset. After splitting, the training set is trained using the RF model, and predictions are performed on the testing set. First, N random records with features from the training set are chosen, and secondly, a decision tree is build based on the N records. The parameter n_estimators decide the number of trees the RF needs, and the steps are repeated. A Decision Tree (DT) has low bias and high variance, prone to many errors when new test data arrives. Therefore, RF uses multiple DTs and row sampling and feature sampling concerning majority votes in the DTs. This way, high variance gets converted into low variance because using row sampling and feature sampling records to DT gets well trained concerning specific records. Evaluation metrics like precision, recall, and F1 measures are used to evaluate the classifier's performance. 4.6 Proposed algorithm for extraction of interactive requirements The flow of the proposed methodology is as follows. Algorithm: Extraction of Interactive Requirements Input: let f represent the stakeholder form, SRS be the Software Requirement Specification, i be the i th requirement in SRS Output: let IReq, OReq, TFReq, and TSReq represent the Input Requirement, Output Requirement, Transformation Requirement, Transaction Requirement, respectively. Data: Testing set (x) Begin Generate a stakeholder form f foreach f in the sequence do Get the requirements r i from s € S where S= {Primary Stakeholder, Secondary Stakeholder} Requirement Analyst Form  Save r i RID  Assign r i // RID stands for Requirement ID if r i is feasible and approved add r i to SRS else revert back to stakeholders endif endfor Function Preprocessing (SRS, Feature Vectors) Parse all the input requirements r i where i = 1,2,3…..n foreach requirement r i do Tokenize  r i Store the Tokens as array Create a customized stopword list foreach T from r i compare T and customized stopword list if T = customized stopword list remove T from r i else store the Tokens Remove suffixes from the tokens S i  Store tokens endfor endfor Function FeatureExtraction (S i, SimS) Let S i be the tokens in corpus word2vec model() Set the parameters size =300, window = 2, min_count = 20, negative = 20, alpha = 0.03 foreach S i in the corpus do Build the vocabulary table Train the model Find the similarity score (SimS) for s i Return SimS for the vectors S i in the corpus endfor Function Query Processing (QS, ExD) Let QS i,be the Query Set where i = 1, 2…n, RD represent the requirement documents from SRS, ExD represent the extracted requirement documents Create Query_Set (QS) if QS i = S i in Corpus Retrieve the documents (RD i) with SimS else Return no match Assign ExD  RD i Retrieval of Interactive Requirements of Data Intensive Applications… Informatica 47 (2023) 35-50 47 Function Classification To generate k classifiers Split the ExD in the ratio of 80:20 as 80 % training data and 20% testing data foreach i= 1 to k do Sample the training data ExD ExD i  ExD Create a root node RN i containing ExD i BuildTree (RN i) endfor BuildTree (RN) if RN consists of only one instance, then Return else Select the features F randomly in RN Select F with the highest information gain to split on Create f child nodes of RN, for i=1 to f do Set RN i to D i, where D i € RN D i = f i BuildTree(RN i) endfor elseif end 5 Experimental results The experiments have been carried out on intel core i5, 32GB RAM, and Windows 10. Pandas, NumPy, nltk, sklearn, matplotlib packages, spacy, NMSLIB were used for loading the data pre-processing and results in the evaluation. The most popular PROMISE and PROMISE_exp software requirement datasets are not suitable for our research. It is small in size, consisting of only 625 requirement instances, and the class distribution is also imbalanced. A novel dataset has been created regarding banking applications comprising 2812 requirement instances focusing on IReq, OReq, TFReq, and TSReq categories. The sample of requirement instances is illustrated in Figure 3 and 6 correspondingly. The prepared dataset is pre-processed, features extracted, requirements discovered, and classified using a python programming language. Spacy, a free, open-source library for NLP and NMSLIB, an efficient similarity search library, and a toolkit for evaluating search methods, which is the first principled support for non-metric space searching, is a significant part of programming. The performance of the Random Forest algorithm is compared with other supervised machine learning algorithms like Naïve Bayes, Support Vector Machine, Logistic Regression, KNN, etc. The evaluation metrics like Precision, Recall and F1 scores of 0.89, 0.93, and 0.91 respectively proves that RF is the best classification algorithm. Evaluation metrics are primarily used to evaluate the performance of a classifier by comparing the predictions obtained by a model with the actual values in the corpus. The essential components for the metrics are True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). According to Hitesh et al. [45], Precision = TP/TP+FP Recall = TP/TP+FN F1 Score = 2(Recall Precision) / (Recall + Precision) Figure 13: Comparison of various algorithms with respect to metrics Figure 13 shows the comparison of various supervised algorithms, out of which Random Forest records a higher value of precision with 0.89, recall of 0.93 and f1 measure of 0.91. Table 5 shows the performance results of each retrieved interactive type requirement. Table 5: Results of random forest classification using word2vec Random Forest Classification using Word2Vec Requirement Type Precision Recall F1 measure IReq 0.91 0.96 0.95 OReq 0.9 0.94 0.92 TFReq 0.89 0.91 0.9 TSReq 0.86 0.9 0.88 6 Conclusion Based on the research results, it can be concluded that the appropriate identification of interactive requirements is vital for the successful development of software-intensive projects. The paper's novelty is the retrieval of interactive requirements, especially for DIAs. The retrieval of pertinent data will provide meaningful insights into business intelligence problems. Vectorizing the requirements documents with word embedding’s using spacy is done to explore the documents with semantic features. As a result, it retrieved the interactive requirements separately as IReq, OReq, TFReq, and TSReq using a fast similarity (k -NN) search and NMSLIB. Also, it measured the impact of the extracted documents by comparing the performance with metrics 48 Informatica 47 (2023) 35-50 S.M. Anouncia et al. acquired from training the Random Forest classifier on word count features. The result of precision, recall, and F1 are 0.89, 0.93, and 0.91, respectively. Therefore, retrieval of interactive requirements like IReq, OReq, TFReq, and TSReq help the developers to document their projects more effectively by minimizing the rework. However, studies have shown that in an unbalanced data set, automatic classification performs worse when the size of requirements of some labels is smaller. As future work, we plan to increase the requirements dataset and look for ways to mitigate the unbalance of the base, being able to improve the classification with little training data. References [1] P. Wang, K. Tao, C. Gao, X. Ning, S. Gu, and B. Deng, “Eliciting big data requirement from big data itself: A task-directed approach,” 2017 6th International Workshop on Software Mining (SoftwareMining), Nov. 2017. [Online]. Available: 10.1109/softwaremining.2017.8100849. [2] C. Palomares, C. Quer, and X. Franch, “Requirements reuse and requirement patterns: a state of the practice survey,” Empirical Software Engineering, vol. 22, no. 6, pp. 2719–2762, Dec. 2016. [Online]. Available: 10.1007/s10664-016- 9485-x. [3] W. N. Robinson, S. D. Pawlowski, and V. Volkov, “Requirements interaction management,” ACM Computing Surveys, vol. 35, no. 2, pp. 132–190, Jun. 2003. [Online]. Available: 10.1145/857076.857079. [4] H. Meth, M. Brhel, and A. Maedche, “The state of the art in automated requirements elicitation,” Information and Software Technology, vol. 55, no. 10, pp. 1695–1709, Oct. 2013. [Online]. Available: 10.1016/j.infsof.2013.03.008. [5] Pohl K. Requirement’s engineering fundamentals: a study guide for the certified professional for requirements engineering exam-foundation level- IREB compliant. Rocky Nook, Inc.; 2016 Apr 30. [6] C. Li, L. Huang, J. Ge, B. Luo, and V. Ng, “Automatically classifying user requests in crowdsourcing requirements engineering,” Journal of Systems and Software, vol. 138, pp. 108–123, Apr. 2018. [Online]. Available: 10.1016/j.jss.2017.12.028. [7] N. H. Madhavji, A. Miranskyy, and K. Kontogiannis, “Big Picture of Big Data Software Engineering: With Example Research Challenges,” 2015 IEEE/ACM 1st International Workshop on Big Data Software Engineering, May 2015. [Online]. Available: 10.1109/bigdse.2015.10. [8] E. Sodagari and M. Keyvanpour, “Challenges Classification of Software Requirements Interaction Management Using Search-Based Methods,” 2019 5th International Conference on Web Research (ICWR), Apr. 2019. [Online]. Available: 10.1109/icwr.2019.8765253. [9] M. Binkhonain and L. Zhao, “A review of machine learning algorithms for identification and classification of non-functional requirements,” Expert Systems with Applications: X, vol. 1, p. 100001, Apr. 2019. [Online]. Available: 10.1016/j.eswax.2019.100001. [10] R. R. R. Merugu and S. R. Chinnam, “Automated cloud service based quality requirement classification for software requirement specification,” Evolutionary Intelligence, vol. 14, no. 2, pp. 389–394, May 2019. [Online]. Available: 10.1007/s12065-019-00241-6. [11] C. SenthilMurugan and S. Prakasam, “A Literal Review of Software Quality Assurance,” International Journal of Computer Applications, vol. 78, no. 8, pp. 25–30, Sep. 2013. [Online]. Available: 10.5120/13511-1279. [12] W. A. Qader, M. M. Ameen, and B. I. Ahmed, “An Overview of Bag of Words; Importance, Implementation, Applications, and Challenges,” 2019 International Engineering Conference (IEC), Jun. 2019. [Online]. Available: 10.1109/iec47844.2019.8950616. [13] J. Slankas and L. Williams, “Automated extraction of non-functional requirements in available documentation,” 2013 1st International Workshop on Natural Language Analysis in Software Engineering (NaturaLiSE), May 2013. [Online]. Available: 10.1109/naturalise.2013.6611715. [14] R. Jindal, R. Malhotra, and A. Jain, “Automated classification of security requirements,” 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Sep. 2016. [Online]. Available: 10.1109/icacci.2016.7732349. [15] M. Lu and P. Liang, “Automatic Classification of Non-Functional Requirements from Augmented App User Reviews,” Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, Jun. 2017. [Online]. Available: 10.1145/3084226.3084241. [16] Z. Kurtanovic and W. Maalej, “Automatically Classifying Functional and Non-functional Requirements Using Supervised Machine Learning,” 2017 IEEE 25th International Requirements Engineering Conference (RE), Sep. 2017. [Online]. Available: 10.1109/re.2017.82. [17] R. Deocadez, R. Harrison, and D. Rodriguez, “Automatically Classifying Requirements from App Stores: A Preliminary Study,” 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), Sep. 2017. [Online]. Available: 10.1109/rew.2017.58. [18] H. Liang, X. Sun, Y. Sun, and Y. Gao, “Text feature extraction based on deep learning: a review,” EURASIP Journal on Wireless Communications and Networking, vol. 2017, no. 1, Dec. 2017. [Online]. Available: 10.1186/s13638-017-0993-1. [19] R. Dzisevic and D. Sesok, “Text Classification using Different Feature Extraction Approaches,” 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), Apr. 2019. [Online]. Available: 10.1109/estream.2019.8732167. Retrieval of Interactive Requirements of Data Intensive Applications… Informatica 47 (2023) 35-50 49 [20] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt, “YAKE! Keyword extraction from single documents using multiple local features,” Information Sciences, vol. 509, pp. 257–289, Jan. 2020. [Online]. Available: 10.1016/j.ins.2019.09.013. [21] W. A. Qader, M. M. Ameen, and B. I. Ahmed, “An Overview of Bag of Words; Importance, Implementation, Applications, and Challenges,” 2019 International Engineering Conference (IEC), Jun. 2019. [Online]. Available: 10.1109/iec47844.2019.8950616. [22] M. Lu and P. Liang, “Automatic Classification of Non-Functional Requirements from Augmented App User Reviews,” Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, Jun. 2017. [Online]. Available: 10.1145/3084226.3084241. [23] S. Qaiser and R. Ali, “Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents,” International Journal of Computer Applications, vol. 181, no. 1, pp. 25–29, Jul. 2018. [Online]. Available: 10.5120/ijca2018917395. [24] M. Bounabi, K. El Moutaouakil, and K. Satori, “Text classification using Fuzzy TF-IDF and Machine Learning Models,” Proceedings of the 4th International Conference on Big Data and Internet of Things, Oct. 2019. [Online]. Available: 10.1145/3372938.3372956. [25] R. Lakshmi and S. Baskar, “Novel term weighting schemes for document representation based on ranking of terms and Fuzzy logic with semantic relationship of terms,” Expert Systems with Applications, vol. 137, pp. 493–503, Dec. 2019. [Online]. Available: 10.1016/j.eswa.2019.07.022. [26] T. Walkowiak, S. Datko, and H. Maciejewski, “Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study,” Advances in Intelligent Systems and Computing, pp. 526–535, May 2018. [Online]. Available: 10.1007/978-3- 319-91446-6_49. [27] A. I. Kadhim, “Term Weighting for Feature Extraction on Twitter: A Comparison Between BM25 and TF-IDF,” 2019 International Conference on Advanced Science and Engineering (ICOASE), Apr. 2019. [Online]. Available: 10.1109/icoase.2019.8723825. [28] K. S. Kalaivani, S. Uma, and C. S. Kanimozhiselvi, “A Review on Feature Extraction Techniques for Sentiment Classification,” 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), Mar. 2020. [Online]. Available: 10.1109/iccmc48092.2020.iccmc-000126. [29] E. Dias Canedo and B. Cordeiro Mendes, “Software Requirements Classification Using Machine Learning Algorithms,” Entropy, vol. 22, no. 9, p. 1057, Sep. 2020. [Online]. Available: 10.3390/e22091057. [30] Md. A. Haque, Md. Abdur Rahman, and M. S. Siddik, “Non-Functional Requirements Classification with Feature Extraction and Machine Learning: An Empirical Study,” 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), May 2019. [Online]. Available: 10.1109/icasert.2019.8934499. [31] M. Lima, V. Valle, E. Costa, F. Lira, and B. Gadelha, “Software Engineering Repositories,” Proceedings of the XXXIII Brazilian Symposium on Software Engineering, Sep. 2019. [Online]. Available: 10.1145/3350768.3350776. [32] R. Deocadez, R. Harrison, and D. Rodriguez, “Automatically Classifying Requirements from App Stores: A Preliminary Study,” 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), Sep. 2017. [Online]. Available: 10.1109/rew.2017.58. [33] B. Behera and G. Kumaravelan, “Text document classification using fuzzy rough set based on robust nearest neighbor (FRS-RNN),” Soft Computing, vol. 25, no. 15, pp. 9915–9923, Nov. 2020. [Online]. Available: 10.1007/s00500-020- 05410-9. [34] A. Sainani, P. R. Anish, V. Joshi, and S. Ghaisas, “Extracting and Classifying Requirements from Software Engineering Contracts,” 2020 IEEE 28th International Requirements Engineering Conference (RE), Aug. 2020. [Online]. Available: 10.1109/re48521.2020.00026. [35] I. M. S. Raharja and D. O. Siahaan, “Classification of Non-Functional Requirements Using Fuzzy Similarity KNN Based on ISO / IEC 25010,” 2019 12th International Conference on Information & Communication Technology and System (ICTS), Jul. 2019. [Online]. Available: 10.1109/icts.2019.8850944. [36] R. Deocadez, R. Harrison, and D. Rodriguez, “Automatically Classifying Requirements from App Stores: A Preliminary Study,” 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), Sep. 2017. [Online]. Available: 10.1109/rew.2017.58. [37] R. Jindal, R. Malhotra, and A. Jain, “Automated classification of security requirements,” 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Sep. 2016. [Online]. Available: 10.1109/icacci.2016.7732349. [38] H. Alrumaih, A. Mirza, and H. Alsalamah, “Domain Ontology for Requirements Classification in Requirements Engineering Context,” IEEE Access, vol. 8, pp. 89899–89908, 2020. [Online]. Available: 10.1109/access.2020.2993838. [39] S. L. Lim and A. Finkelstein, “Anticipating Change in Requirements Engineering,” Relating Software Requirements and Architectures, pp. 17– 34, 2011. [Online]. Available: 10.1007/978-3-642- 21001-3_3. 50 Informatica 47 (2023) 35-50 S.M. Anouncia et al. [40] D. Virmani and S. Taneja, “A Text Preprocessing Approach for Efficacious Information Retrieval,” Advances in Intelligent Systems and Computing, pp. 13–22, Jun. 2018. [Online]. Available: 10.1007/978-981-10-8968-8_2. [41] D. Sarkar, “Text Analytics with Python,” 2016. [Online]. Available: 10.1007/978-1-4842-2388-8. [42] D. Sarkar, “Natural Language Processing Basics,” Text Analytics with Python, pp. 1–68, 2019. [Online]. Available: 10.1007/978-1-4842-4354- 1_1. [43] M. Bokan, “Negative-Sampling Word-Embedding Method,” Scientific Journal of Astana IT University, vol. 10, pp. 15–21, Jun. 2022. [Online]. Available: 10.37943/elgd6408. [44] L. Boytsov and B. Naidan, “Engineering Efficient and Effective Non-metric Space Library,” Lecture Notes in Computer Science, pp. 280–293, 2013. [Online]. Available: 10.1007/978-3-642-41062- 8_28. [45] M. Hitesh, V. Vaibhav, Y. J. A. Kalki, S. H. Kamtam, and S. Kumari, “Real-Time Sentiment Analysis of 2019 Election Tweets using Word2vec and Random Forest Model,” 2019 2nd International Conference on Intelligent Communication and Computational Techniques (ICCT), Sep. 2019. [Online]. Available: 10.1109/icct46177.2019.8969049. [46] J. M. Perez-Verdejo, A. J. Sanchez-Garcia, and J. O. Ocharan-Hernandez, “A Systematic Literature Review on Machine Learning for Automated Requirements Classification,” 2020 8th International Conference in Software Engineering Research and Innovation (CONISOFT), Nov. 2020. [Online]. Available: 10.1109/conisoft50191.2020.00014. [47] M. Kashina, I. D. Lenivtceva, and G. D. Kopanitsa, “Preprocessing of unstructured medical data: the impact of each preprocessing stage on classification,” Procedia Computer Science, vol. 178, pp. 284–290, 2020. [Online]. Available: 10.1016/j.procs.2020.11.030. [48] A. K. Uysal and S. Gunal, “The impact of preprocessing on text classification,” Information Processing & Management, vol. 50, no. 1, pp. 104–112, Jan. 2014. [Online]. Available: 10.1016/j.ipm.2013.08.006. [49] M. Anandarajan, C. Hill, and T. Nolan, “Planning for Text Analytics,” Advances in Analytics and Data Science, pp. 27–41, Oct. 2018. [Online]. Available: 10.1007/978-3-319-95663-3_3.