https://doi.org/10.31449/inf.v47i6.4668
Informatica 47 (2023) 55–64

Detection of IoT Botnet Cyber Attacks Using Machine Learning

Alaa Dhahi Khaleefah 1, Haider M. Al-Mashhadi 2
1, 2 College of Computer Science and Information Technology, Computer Information Systems Department, University of Basrah, Basrah, Iraq
Email: alaa.dahy.2021.2022@gmail.com 1, mashhad01@gmail.com 2

Keywords: IoT, botnet, malware, machine learning, intrusion detection, anomaly detection, classification

Received: February 8, 2023

As of 2018, the number of online devices has outpaced the global human population, a trend expected to surge towards an estimated 80 billion devices by 2024. With the growing ubiquity of Internet of Things (IoT) devices, securing these systems and the data they exchange has become increasingly complex, especially with the escalating frequency of IoT botnet attacks (IBA). The extensive data quantity and pervasive availability provided by these devices present a lucrative prospect for potential hackers, further escalating cybersecurity risks. Hence, one of the paramount challenges concerning IoT is ensuring its security. The primary objective of this research project is the development of a robust, machine learning algorithm-based model capable of detecting and mitigating botnet-based intrusions within IoT networks. The proposed model tackles the prevalent security issue posed by malicious bot activities. To optimize the model's performance, it was trained using the BoT-IoT dataset, employing a diverse range of machine learning methodologies, including linear regression, logistic regression, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM) models. The efficacy of these models was evaluated using the F-measure, yielding results of 98.0%, 99.0%, 99.0%, and 99.0% respectively. These outcomes substantiate the models' capacity to accurately distinguish between normal and malicious network activities.
Summary (Povzetek): A machine learning model was developed for detecting and mitigating IoT botnet attacks; it performed well on the BoT-IoT domain.

1 Introduction

Large quantities of sensitive user data are vulnerable to various internal and external threats. As technology has developed, cyberattacks have grown along with the complexity of algorithms [1]. Cyberattacks primarily target computers that process or store crucial data, or services that rely on those systems [2]. A dedicated intrusion detection system (IDS) is needed to identify malicious cyberattacks that pose a security risk. An IDS automatically detects and categorizes intrusions, security policy violations, and attacks on host and network infrastructures [1]. The constantly changing nature of threats has made it necessary to tune and improve IDS performance significantly by adding Machine Learning (ML) [3]. ML is a branch of artificial intelligence (AI) that enables computers to learn without being explicitly programmed [4]. ML techniques learn from historical data and produce predictions. The ultimate objective of ML is to create an effective technique that takes incoming data and produces a prediction using statistical analysis [5]. Two main classes of ML techniques are recognized: supervised learning and unsupervised learning. Supervised learning requires a well-labeled training dataset containing both normal and attack samples. In this kind of learning, the model is given both the input and the target output so that it can predict future outcomes [6]. The datasets used to train these ML models directly affect the amount of training necessary [7]. Biases in data or algorithms that are ignored or concealed may produce skewed predictions and impair the effectiveness of AI applications [8]. In this situation, ML is among the most effective computational methods
to provide embedded smartness in the Internet of Things context. Machine learning algorithms have been used for a variety of network security tasks, including network traffic analysis [9-12], intrusion detection [12], and botnet identification [13]. Figure 1 shows an IDS using ML in a network and IoT environment.

The number of Internet of Things (IoT) devices has been multiplying worldwide in recent years. By the year 2030, there could be 125 billion connected IoT devices. Managing IoT networks has become increasingly difficult because these IoT systems embed numerous alternative architectures, services, and protocols. As a result, the internet is exposed to significant risks and cyberattacks that could put users of such devices in danger [14].

The UNSW-NB15 network security dataset became available in 2015 [15]. It contains 2,540,044 records of both normal and abnormal (attack) network behavior. The IXIA traffic generator used three virtual servers to produce this data. Two servers were set up to distribute standard network traffic. Table 1 summarizes the related works.

Table 1: Summary of the related works.

| Ref | Methodology | Performance/Results |
|---|---|---|
| [23] | Genetic Algorithm | The work combined a Genetic Algorithm (GA), used to remove unimportant features, with a Self-Organizing Map (SOM) classifier optimized on the GA-selected features, in order to reach high detection rates. |
| [24] | Support Vector Machine | An IDS using a reduced feature set outperformed competitors using all features. The Support Vector Machine (SVM) classifier was used as a multiclass detection approach, and Mutual Information with Linear Correlation Coefficient (MI-LCC) was used to identify the best features. |
| [22] | UNSW-NB15 | The authors presented a hybrid IDS based on a Genetic Algorithm (GA) and a Support Vector Machine (SVM) for each attack in the UNSW-NB15 dataset. They encoded the features as chromosomes and selected those with the best accuracy. They proposed the Least Squares Support Vector Machine (LSSVM) as the detection technique. The results were evaluated by accuracy, true-positive rate, and false-positive rate. |
| [25] | Logistic Regression (LR), Support Vector Machines (SVM), and Random Forest (RF) | The authors use the dataset to categorize botnet traffic in the IoT infrastructure. The dataset contains genuine network traffic collected from nine operational IoT devices that were attacked by the Mirai and BASHLITE botnets. Three classification techniques, LR, SVM, and RF, are used to examine the data and classify it by botnet, attack, and device. |
| [15] | UWSNs | The authors suggest a novel routing protocol for the ocean floor that integrates two-dimensional UWSNs with sleep-scheduling routing to detect and report oil traces to the sink as soon as possible. |
| [25] | K-NN | By combining the K-NN algorithm with a clustering technique, the authors suggest a new routing strategy that may significantly cut both latency and power consumption throughout the whole network. The proposal shows how to create clusters using node classifications and the shortest possible distances between them. |
| [16] | Network Performance | To learn about network performance by identifying lost and transmitted packets, while keeping the cost of monitoring and communications infrastructure to a minimum, the system places packet probes in passive monitoring devices on strategic links within the network. The work includes a user-friendly graphical user interface (GUI) and various data, metrics, and statistics related to network outcomes, and can serve as a helpful manual for network researchers or other programmers wishing to analyze their networks and understand how to calculate network performance. |
| [26] | Extreme Learning Machine | A multi-step approach based on supervised ML was suggested. It first uses the Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance in the dataset, then uses the Extremely Randomized Trees (Extra Trees) Classifier to choose the crucial features for each class in the dataset according to the Gini impurity criterion. Each attack is then detected independently by a pretrained Extreme Learning Machine (ELM) model used as a "One-Versus-All" binary classifier. |

Figure 1: Integrated intrusion detection using ML.

The Argus and Bro-IDS tools were used to break down the original network packets into a total of 49 attributes, including both flow-based and packet-based features. Packet-based features are mined from the payload and header of each packet. Flow-based features are produced by sequencing packets as they travel from source to destination over the network. The direction, inter-packet length, and inter-arrival times are the most important properties in the flow-based feature formulation. Two examples of flow-based characteristics are total duration (dur) and destination-to-source time-to-live (dttl). The features are divided into three groups: basic (features 6 to 18), content (19 to 26), and time (27 to 35). Features 36 through 40 are the general-purpose features, and features 41 through 47 are the connection features.
General-purpose features are intended to characterize the purpose of an individual record, whereas connection features describe the characteristics of the interaction across 100 consecutive records. The final two features are the label and the attack category. The attack types are Analysis, Backdoor, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, and Worms. Normal traffic is represented by 2,218,761 records, whereas fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, and worms are represented by 24,246, 2,677, 2,329, 16,535, 44,525, 215,481, 13,987, 1,511, and 174 records, respectively. As a result, the dataset is highly imbalanced: Normal records make up 87% of it, while Worms records make up just 0.007%. The dataset's creators additionally subsampled it and divided it into training and testing subsets, which other researchers have used [16-18]. The testing set has 82,332 records and the training set has 175,341 records, each drawn from both the attack and normal classes, as shown in Table 2. In contrast to existing benchmark datasets such as DARPA98 [19], KDDCUP 99 [16, 20], and NSL-KDD [21], the UNSW-NB15 dataset has a more complex structure. As a result, UNSW-NB15 supports a more thorough assessment of current network intrusion detection techniques [16].

We applied several preprocessing steps to prepare the data for visual analysis. We first check whether UNSW-NB15 has any redundant features, then convert the nominal input features to numerical ones, rescale them, and finally select the pertinent input features.

Nine different attack types have been identified in the UNSW-NB15 dataset:
1. Fuzzers: attempts by the attacker to exploit security flaws in the operating system, network, or program in order to temporarily halt or even crash these resources.
2. Analysis: a category of intrusions that target online applications by means such as port scanning and spam emails.
3. Backdoor: a method by which an attacker can gain remote access to a system without being authenticated.
4. DoS: an intrusion in which the attacker tries to overload computational resources in order to prevent authorized access to them.
5. Exploit: intrusions that take advantage of bugs, mistakes, or malfunctions in software or operating systems (OS).
6. Generic: an attack on a cryptographic system that aims to recover the security system's key.
7. Reconnaissance: also known as a probe, this type of attack gathers details about the victim computer system in order to bypass its protection measures.
8. Shellcode: a malware attack in which the attacker takes control of the compromised machine by injecting a small piece of code that starts a shell.
9. Worms: malicious programs that replicate themselves and spread to other computers over a network, relying on security flaws in the target computer.

To project the preprocessed data into a low-dimensional space, the research uses binary classification, and the classes of the dataset are finally visualized using multi-class classification. After that, data normalization, label encoding, and correlation analysis between the features of the dataset are performed. Machine learning techniques such as linear regression and logistic regression are then used to predict the attacks.

Table 2: Number of records for each group in the training and testing subsets.

2 Machine learning

Passive modern security techniques rely heavily on mathematical analysis models, which frequently do not reflect the systems accurately.
Suitable defense in wireless environments requires heavy mathematical solutions that take a long time to compute and add complexity [27]. Because machine learning algorithms are effective at modeling behavior that cannot be expressed by mathematical formulas, they will play a vital role in IoT security solutions. Machine learning is the area of computer science that allows machines to learn from previous instances and experience. A new anomaly detection methodology based on machine learning allows the discovery of anomalous traffic that may point to attempted network breaches [28]. Machine learning algorithms (MLAs) give computers the capability to make decisions without being explicitly programmed. Each MLA is built from sample data. Based on the kind of supervision provided during training [29], MLAs fall into four groups: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [30].

A. Supervised learning: In its simplest form, supervised learning describes learning methods that involve a supervisor. It comprises learning and prediction, and it uses sample data with known outcomes that make it easier for the algorithm to map input to output [31]. Examples of supervised learning include classification techniques such as KNN, SVM, Naive Bayes, Decision Tree, and Random Forest [32].

B. Unsupervised learning: Unsupervised learning is the process of analyzing data without labels; clustering is a typical example. It resembles a self-directed learning method, and its aim is to find unexpected data points [31].

C. Semi-supervised learning: Semi-supervised approaches are machine learning techniques that combine a large sample of unlabeled data with a smaller amount of labeled data [31]. They fall between learning from labeled training data and learning from unlabeled training data.
These algorithms work well when unlabeled data is plentiful and labeled data is scarce [32].

D. Reinforcement learning: Reinforcement learning is the area of machine learning centered on the agent, action, state, reward, and environment [32]. It does not presuppose knowledge of a precise mathematical model; instead, it trains an agent, composed of learning algorithms and a policy, through trial and error in an unsupervised setting.

3 Supervised ML algorithms

A. Linear regression
Linear regression models the linear dependence of a variable y on one or more independent variables x (factors, regressors). A straightforward forerunner of the non-linear techniques used to train neural networks, linear regression is the process of identifying the "best fit line" through a collection of data points. The technique minimizes the Euclidean distance between two vectors: the vector of the dependent variable's predicted values and the vector of its actual values. The premise of linear regression is that the function f depends linearly on its parameters; the dependence on the free variable x, however, need not be linear [33]. The linear model is given in Eq. (2).

$$p(y \mid x) = \alpha(W \cdot x + b) \tag{1}$$

The logistic function in Eq. (1) produces probabilistic labels y for input data x. The function first linearly transforms the input data x with the model's learned weights (W) and bias (b) parameters, as in Eq. (2). It then applies the nonlinear sigmoid (α) transformation to the linear result to produce the probability labels y.

$$p(y \mid x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b \tag{2}$$

B. Logistic regression
LR is an established statistical method for classifying data [34]. The logistic, or sigmoid, function provides the basis for the model (Eq. 3), and the training objective is to fit the function so that it optimally divides the training data. The resulting curve has an S-shape in 2D space, as shown in Figure 2.
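
As a rough illustration of the logistic model described in this section, the following minimal Python sketch applies the linear transform W · x + b and then the sigmoid of Eq. (3). The function names, weights, bias, and sample values are hypothetical and chosen only for illustration; they are not taken from the trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z).
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """p(y | x) = sigmoid(W . x + b): linear transform, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights, bias, and a two-feature sample (illustrative only).
weights = [0.8, -0.4]
bias = 0.1
sample = [1.0, 2.0]
probability = predict_proba(weights, bias, sample)
```

With these illustrative values the linear score is 0.1, so the resulting probability is slightly above 0.5.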
$$f(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

LR can be (i) binary, where the dependent variable (i.e., the output) is a category of two possible choices (for example, benign and anomaly), (ii) multinomial, where the dependent variable can be selected from a number of categories (for example, benign, attack 1, and attack 2), or (iii) ordinal, which is multinomial with classes that have an ordinal relation (for example, attack severity) [35]. LR's output is determined by a threshold and a decision boundary. According to Eq. (4), in the binary case, for instance, if the output is at least 0.5, the sample belongs to class A; otherwise, it belongs to class B.

$$Y = \begin{cases} A, & f(x) \geq 0.5 \\ B, & \text{otherwise} \end{cases} \tag{4}$$

Figure 2: LR sigmoid function.

C. KNN
K-Nearest Neighbor algorithms classify an object by the majority vote of its K nearest neighbors, measured by Euclidean distance, among entities belonging to several classes [36]. K is a positive, typically small integer. The number of selected neighbors, K, determines how accurate the KNN algorithm is. For binary classification, K is usually odd to avoid ties between the two class labels. K should be chosen carefully: if it is too small or too large, the model may not fit the data well. Figure 3 shows the KNN model. In Euclidean n-space, the Euclidean distance between two points X and Y is written as:

$$d(X, Y) = \sqrt{\sum_{a=1}^{n} (Y_a - X_a)^2} \tag{5}$$

D. SVM
SVM separates the two data classes based on the margin on either side of the hyperplane. Figure 4 illustrates the SVM. The margin, the separation between the bounding hyperplanes, can be increased to improve classification accuracy. Support vectors are the data points located on the edge of the margin. SVM is divided into two main groups.
Depending on the kernel function, it can be either linear or non-linear. Based on the type of detection, it may also be single-class or multi-class [37]. Both memory and time are important considerations when using SVM. To achieve better outcomes, SVM needs to be retrained at various time intervals to learn the user's dynamic behavior. Eq. (6) represents the SVM [38]:

$$\min \|w\|^2 + C \sum_{i}^{N} \max(0, 1 - y_i f(x_i)) \tag{6}$$

where C is a regularization parameter that controls the trade-off between keeping each x_i on the correct side of the plane and increasing the margin. In a two-dimensional space, the hyperplane is a line; in three dimensions it is a plane, and in higher-dimensional spaces it generalizes to a hyperplane of one dimension less than the space.

Figure 3: KNN model.
Figure 4: SVM model.

4 The proposed system

Figure 5: The structure of the IDS system.

This section outlines the steps taken to create the botnet detection model, along with the datasets employed, the preprocessing phase, the experimental environment, the outcomes, and their justifications. To choose the optimum approach for our model, various supervised ML algorithms were applied to various combinations of the botnet dataset, and the results were benchmarked. First, we looked at the packet data to examine the botnet behavior. The dataset was split into two parts, one with regular traffic and the other with botnet traffic, in order to study the behavior of the botnet. This analysis helped in choosing features with more trustworthy data. Figure 5 shows the structure of the IDS system. This procedure is frequently called "data engineering", and it is essential for the learning process to be successful. Data processing has three steps: cleaning, normalization, and feature selection.

A. Cleaning
The first phase in the cleaning process is to look for null values.
The dataset contains numerous fields with null values, which need to be replaced with appropriate values. Another step in the cleaning process is removing unnecessary fields. The first feature to be removed is "id", which is not descriptive; it is an index. The second feature to be removed is "attack_cat". Since this feature is an extension of the target feature, using it would result in 100% accurate predictions but not a generalizable model. The other features that should be removed are those with excessive correlation. Since the model is first assessed to determine how effectively it can function, none of them were eliminated in the present version. The next step is to look for any incorrect values in any of the fields and to address them if there are any. Despite being a binary column in this dataset, "is_ftp_login" has values other than 0 and 1; any values other than 0 and 1 are removed.

B. Normalization
The learning process of ML techniques such as linear regression and logistic regression is affected by the large numerical values of many attributes. Additionally, learning from high-dimensional datasets requires a lot of computing resources. To address these difficulties, data is frequently scaled using techniques such as Z-score standardization, decimal scaling, max normalization, and Min-Max scaling [39]. The choice of method usually depends on the application. In the data processing step, we apply Min-Max scaling (Eq. 7).

$$F_{norm} = \frac{F - F_{min}}{F_{max} - F_{min}} \tag{7}$$

The standardization calculation occurs as specified in Algorithm 1 with a dataset with an input sequence (feature domain) defined by U(f_1, ..., f_n), where 1