https://doi.org/10.31449/inf.v43i1.2319
Informatica 43 (2019) 77–86

An Empirical Study for Detecting Fake Facebook Profiles Using Supervised Mining Techniques

Mohammed Basil Albayati and Ahmad Mousa Altamimi
Applied Science Private University, Amman, Jordan
E-mail: mohammed.sabri@asu.edu.jo, a_altamimi@asu.edu.jo

Keywords: data mining, online social networks, Facebook, fake profiles, data science

Received: April 29, 2018

Social media technologies greatly affect our social life and the way people communicate. The variety of stand-alone and built-in social media services, such as Facebook, Twitter, LinkedIn, and the like, enables users to create highly interactive platforms. However, these overwhelming technologies have left us sunk in an enormous amount of information. Recently, Facebook exposed data on 50 million unaware Facebook users for analytical purposes. Fake profiles are also used by scammers to infiltrate networks of friends and wreak all sorts of havoc, such as stealing valuable information, committing financial fraud, or entering other users' social graphs. In this paper, we turn our focus to fake Facebook profiles and propose a smart system (FBChecker) that enables users to check whether a Facebook profile is fake. To achieve that, FBChecker uses a data mining approach to analyze and classify a set of behavioral and informational attributes provided in personal profiles. Specifically, we empirically examine these attributes using four supervised data mining algorithms (k-NN, decision tree, SVM, and naïve Bayes) to determine how successfully we can recognize fake profiles. To demonstrate the validity of our conceptual work, the selected classifiers have been implemented using the RapidMiner data science platform with a data set of 200 profiles collected from the authors' profile and a honeypot page.
Two experiments were conducted: in the first, the k-NN scheme is applied as an estimator model to impute the missing data with substituted values, whereas in the second a filtering operator is applied to exclude the profiles with missing values. The results showed a high accuracy rate for all classifiers; however, SVM outperforms the other classifiers with an accuracy rate of 98.0%, followed by Naïve Bayes.

Povzetek (Summary): A method for finding fake Facebook profiles using machine learning is described.

1 Introduction

In recent years, social media technologies (e.g., Facebook, Twitter, LinkedIn) have become a vital part of our lives [1]. They are designed and maintained by social media organizations that present a portal for facilitating communication, interaction, information sharing, and entertainment via virtual communities and networks. Users typically use these services by creating their own profiles and then connecting them with others' profiles through various technologies that offer social media functionality [2]. By using such services, users can create digital content, such as text posts, comments, digital photos, videos, or data generated through all online interactions [3].

Social media sites offer a variety of services with high levels of quality, consistency, and availability. This results in a huge number of registered users [4]. Some of the most popular social media websites are Facebook (and its associated Facebook Messenger), Gab, Google+, MySpace, Instagram, LinkedIn, and others.
Statistics and surveys, for example the one conducted by the American Academy of Pediatrics, show that about 84% of adolescents in America are registered on Facebook [5], and that the average user spends more than two hours on social networks, and more time on social sites such as Facebook and Twitter than on any other site or platform [6]. The benefits of participating in online social sites have gone beyond simple social activities, information sharing, or communication to building careers, creating business opportunities, generating income, and so on [7].

Historically, according to Mark Zuckerberg, a co-founder of Facebook, which is the largest social network site, Facebook had more than 175 million registered active users in 2009, after just five years. As of the last official announcement on March 31st, 2017, Facebook has 1.94 billion users according to the Facebook newsroom [8], which exceeds the population of some large countries.

With the rapid growth of this technology, a number of negative consequences and risks have arisen, such as security risks, privacy violations, cloning, hacking, spamming, and others [3]. For example, spam on social media repeatedly posts the same thing over and over or causes a sudden spike in messaging activity [9]. Fake profiles, on the other hand, allow scammers to infiltrate networks of friends and wreak all sorts of havoc, such as stealing valuable information, committing financial fraud, or entering other users' social graphs [10].

It is important to mention here that, according to Facebook's Statement of Rights and Responsibilities, users should provide their real and legitimate information when they create their profiles. Facebook urges its users to commit to these policies and terms in order to have an experience in an environment of safety, security, and privacy [11].
In this work, we focus on the problem of detecting fake profiles on Facebook and present a smart detection system (FBChecker) that handles this problem using the prediction and classification techniques of data mining. Our work is motivated by the works presented in [12-15], where researchers employed a data mining approach for extracting hidden knowledge within social media. For example, the authors of [12] used data mining tools to accurately capture the behavior of intrusions and normal activities in an anomaly detection approach. One can also consider Web mining, which applies data mining tools to web resources to further developments in World Wide Web mining [15].

In our model, supervised mining techniques are applied to classify Facebook profiles into fake and real profiles based on a set of behavioral and informational attributes. These attributes are provided in the personal profiles and are used to establish whether a user's identity is real, such as the person's legal name, location, workplace, age, education, and others.

The data set required for training and testing has been collected from the authors' personal profiles, considered as a source of real profiles, and from a honeypot page that we created, which is a fake Facebook page used for data harvesting [10] in order to attract and collect fake profiles. As a collecting tool, we wrote our own script to develop a special CRAWLER for gathering the required data set.

To underscore the practical viability of our approach, the selected classifiers (SVM, Naïve Bayes, k-NN, and Decision Tree) have been implemented using the RapidMiner data science platform for the mining tasks. These classifiers were evaluated on the collected data set using the 10-fold cross-validation method. It is important to mention here that 33 records have missing values for some of their attributes.
To solve this problem, two empirical studies were conducted. In the first, the k-NN scheme was used as an estimator model to impute the missing data with substituted values. The results showed that the classifiers (SVM, Naïve Bayes, k-NN, and Decision Tree) achieved accuracies of 0.9850, 0.9700, 0.8400, and 0.9650, respectively. In the second experiment, a filtering operator was applied to exclude the profiles with missing values. Here, the classifiers achieved 0.9880, 0.9641, 0.8443, and 0.9461, which are close to the results of the first experiment. The numbers and the ROC (Receiver Operating Characteristic) graph, a graphical plot used to assess classifier performance, showed that in both experiments the SVM classifier achieved the highest accuracy while k-NN showed the lowest detection accuracy among the classifiers. These experiments are discussed in more detail in Section 7.

The remainder of this paper is structured as follows: Section 2 reviews the works related to the proposed approach and to fake profiles in online social networks, specifically Facebook. Section 3 describes the background material of the research work along with a brief description of the employed supervised algorithms. Section 4 explains the research methodology, while the proposed system along with its main components is presented in Section 5. Section 6 discusses the implementation of the FBChecker system, and the evaluation and experimental results are given in Section 7. Finally, Section 8 offers the conclusion and possible future work.

2 Related work

Many studies have focused on the phenomenon of fake profiles on online social networks, each trying to come up with a new way to detect and handle this problem. Studies in this field differ according to the perspective from which they look at the problem.
Each study addresses a particular problem and faces its own challenges and difficulties. In this regard, many approaches for handling fake profiles have been presented in the literature.

One can consider, for example, the work in [16]. Here, the authors present a machine learning pipeline framework consisting of three components for detecting clusters of duplicate accounts (cluster-level detection) rather than making a prediction for an individual account. The pipeline uses simple information that is provided at registration time, so a profile is detected before it is activated. Moreover, the classifier determines whether the clusters of accounts were created by the same actor, showing strong evaluation results on sample grouping based on simple text information such as name, email, and company, together with the IP address. In practical use, the system captured more than 250,000 fake accounts.

In contrast, [17] proposed a behavioral approach for detecting fake accounts on Facebook. It is designed using information about a user profile's activities and interactions with other users. The authors characterized these activities through an extensive set of 17 features (such as likes, comments, shares, tags, and app usage on Facebook). To ground their idea, these features were applied to a total of 12 supervised machine learning techniques. The system achieved an accuracy of 79%, which may not be an impressive result, but the authors considered it a first step or baseline for further improvements.

Detecting spam profiles, which are one type of fake profile [10], has also been considered in the literature. The authors of [18] proposed a statistical analysis model with 14 generic features from Facebook and Twitter data sets covering 4 basic kinds of social interaction (profile interaction features, posts/tweets, URLs, and tags and mentions).
The model identifies spam profiles on Facebook and Twitter based on information collected manually by scanning these networks for both normal and spam profiles, using three different supervised classification algorithms (naive Bayes, Jrip, and J48). Two different experiments were then performed: the first examined the role of the whole feature set and calculated the accuracy of the proposed system; the second removed each feature in turn and analyzed the results to discover the impact of each feature and find out which one plays the key role in the classification model.

Detecting spam profiles is also addressed in [19], whose authors presented the Social Privacy Protector (SPP) software for detecting fake profiles on Facebook. The SPP consists of three protection layers. The first layer identifies a user's friends who might pose a threat and then restricts each such "friend's" exposure to the user's personal information (the Friends Analyzer Facebook application). The second layer is an expansion of Facebook's basic privacy settings based on different types of social network usage profiles (the Social Privacy Protector Firefox add-on). The third layer alerts users about the number of installed applications on their Facebook profile that have access to their private information (the HTTP server). The software presents a convenient method for restricting users who may be suspected of being fake profiles without removing them from the user's friends list. The Friends Analyzer application on Facebook scans the user's friends list and returns a credibility score. Each friend is analyzed by machine learning algorithms that take into account the strength of the connection between the user and his friends.
The strength of each connection is based on a set of fifteen connection features drawn from three types of collected data sets, such as the number of common friends between the user and his friend and the number of pictures and videos in which the user and his friend were tagged together. Eight supervised algorithms were applied (Naive Bayes, Bagging, Random Forest, J48, and others). The Social Privacy Protector add-on for the Firefox browser helps improve user privacy with simple steps. Finally, the HTTP server is responsible for connecting the SPP Firefox add-on to the SPP Facebook application.

The authors of [20] proposed a framework for detecting spammers and fake profiles on online social networks, using Facebook as a test case, in a machine learning approach that exploits behavioral and community-based features (attributes), including the structure of the nodes and some topological features of the network. The framework was implemented using the WEKA tool as the mining environment, with ten discriminative topological attributes (total out-degree, total in/out ratio, total reciprocity, core node, community memberships, foreign out-degree, foreign in/out ratio, foreign out-link probability, foreign reciprocity, and foreign out-link grouping) related to the social interactivity of the profiles, such as the number of posts and the number of sent/received messages. Four experiments were conducted using two data sets: a Facebook data set and the Enron network (an email message data set). Four supervised classifiers were employed in that work (Naïve Bayes, J48, k-NN, and Decision Tree).

Finally, the authors of [21] proposed a machine learning approach for detecting spam bots on the Twitter online social network by exploiting two main kinds of spam features. The first kind is graph-based features, which include the number of friends, the number of followers, and the follower ratio (the ratio of the number of people following you to the number of people you follow).
The second kind is content-based features: the number of duplicated tweets, the number of HTTP links, and the number of replies/mentions. For the detection process, the approach applied different classification methods, such as decision tree, neural network, support vector machine, and k-nearest neighbors, to identify spam bots on Twitter. The evaluation results showed that the Bayesian classifier has the best overall performance.

3 Background material

Data mining is the process of extracting knowledge from a huge amount of data by looking for patterns, identifying and validating them, making predictions, and summarizing the outcome into useful information. The data mining process goes through a sequence of procedures and applies a set of techniques, combining several disciplines and fields such as statistics, machine learning, databases, algorithms, visualization methods, pattern recognition, and others [22].

3.1 Machine learning techniques in data mining

Machine learning is a branch of computer science that deals with algorithms that have the ability to learn and adapt in order to make decisions [22]. Classification and prediction are among the most common tasks that data mining offers, and both fall under machine learning techniques. In machine learning, there are two main techniques: supervised learning, where the training data set has a class label, and unsupervised learning, where the data are grouped together based on observable behavior or features. In other words, in supervised learning a labeled set of training data is used to estimate or map the input data to the desired output. In contrast, under unsupervised methods no labeled examples are provided and there is no notion of the output during the process; instead, data with similar attributes or similar behavior are grouped together (clustered) [22, 23].
In this work, only supervised techniques have been employed, namely SVM, Decision Tree, k-NN, and Naïve Bayes. A brief description of these classifiers is presented in the next subsection.

3.1.1 Supervised learning

Supervised machine learning is a heuristic process of mapping inputs to a specific output, estimating unknowns based on labeled samples. The objective of a supervised learning technique is to build a model from distinguishing features and predefined labels with a known class, and then to use this model to classify or predict new data with an unknown class. The process of classification and prediction in supervised machine learning involves two major steps:

• The learning step: the model is constructed, analyzed, and trained with a data set of known labels, called the "training set"; then the classification and prediction rules are generated.
• The classification and prediction step: the model (classifier) is used for classifying or predicting given data based on the experience gained from the training set.

The model's results are evaluated through a testing and evaluation process to estimate its accuracy. The test metrics are used to assess how good or how accurate the classifier is. If it reaches a level of accuracy that is acceptable based on specific standards, then the model can be deployed on new data with unknown labels; otherwise it is modified [24].

In our work, we chose to employ the most common supervised machine learning algorithms, which are:

1- Decision Tree is a predictive model with a tree structure that generates the classification rules by breaking down the data set into smaller and smaller subsets until a decision node (class label) is met.
Each internal node in the tree represents an attribute of the training set, while leaf nodes hold the class label (final outcome). The root node represents the attribute with the highest information gain, which determines the tree branches, each branch representing one of the outcomes of the model.

2- k-NN is one of the simplest similarity-based algorithms: it stores all cases with a known label and classifies new data based on a similarity measure or distance function. k-NN classifies new data by using the value k to find the nearest cases in the data set. For example, if k = 1, the new case is simply assigned to the class of its nearest neighbor; if k = 3, k-NN finds the three nearest cases by distance and applies a majority vote over their classes to decide the class of the new data. The distance used to find the nearest neighbors is the Euclidean distance for numerical data and the Hamming distance for categorical data.

3- The support vector machine algorithm is a classification technique designed to define a hyperplane that separates the training data vectors into classes; the best choice is the hyperplane with the widest margin between the data classes. The support vectors are the data points that are closest to the hyperplane.

4- Finally, Naïve Bayes, or the simple Bayesian classifier, is also considered in the mining process as a supervised classification technique, as it is simple and has proved its effectiveness. Naïve Bayes is a probabilistic algorithm that applies Bayes' theorem with the naïve assumption that the occurrence of each attribute (predictor) is independent of the occurrence of the other attributes, regardless of any correlation between these attributes in the classification process. The Bayes rule adopted in this algorithm states the conditional probability of a certain event based on previous knowledge about that event [22, 23].
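The k-NN description above (nearest cases, majority vote, Euclidean distance for numerical attributes and Hamming distance for categorical ones) can be sketched as follows. This is a minimal illustration, not the RapidMiner operator used in this work: the function names, the toy profiles, and the choice to simply add the Euclidean and Hamming parts into one distance are all assumptions made for the example.

```python
import math
from collections import Counter


def is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly:
    # True/False attributes are treated as categorical here.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def mixed_distance(a, b):
    """Euclidean distance over numeric attributes plus Hamming distance over the rest.

    Adding the two parts is an assumption of this sketch; the paper only
    names the two measures.
    """
    squared = 0.0    # sum of squared differences of numeric attributes
    mismatches = 0   # number of categorical attributes that differ
    for x, y in zip(a, b):
        if is_numeric(x) and is_numeric(y):
            squared += (x - y) ** 2
        elif x != y:
            mismatches += 1
    return math.sqrt(squared) + mismatches


def knn_predict(train, query, k=3):
    """Return the majority label among the k training cases nearest to query."""
    nearest = sorted(train, key=lambda case: mixed_distance(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy profiles: (has_profile_picture, has_education, mutual_friends)
toy_profiles = [
    ((True, True, 12), "real"),
    ((True, False, 9), "real"),
    ((True, True, 15), "real"),
    ((False, False, 0), "fake"),
    ((False, False, 1), "fake"),
    ((True, False, 0), "fake"),
]

print(knn_predict(toy_profiles, (False, False, 2), k=3))
print(knn_predict(toy_profiles, (True, True, 11), k=1))
```

In this sketch the raw count of mutual friends dominates the distance; in practice numeric attributes would normally be rescaled before being combined with categorical mismatches.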
4 Research methodology

In this work, we developed a smart system (FBChecker) that enables users to detect fake Facebook profiles by using supervised data mining techniques. To do so, the system first collects data for a set of behavioral and informational attributes derived from the profiles of the user's friends (listed in Table 1). To achieve this, a special purpose module (called CRAWLER) was developed to collect the required attributes from the user's friends list. CRAWLER runs at the user level to collect this data. Secondly, the collected data is validated to increase the accuracy of the detection process. Specifically, the problem of missing values has been handled using two methods: the k-NN scheme and a special operator that excludes them. Finally, a set of supervised mining algorithms is implemented using the RapidMiner data science platform to detect the fake profiles.

The main objective of using supervised machine learning techniques is to build a model from distinguishing features and predefined labels with a known class, and then to use this model to classify or predict new data with unknown labels. This process involves two major steps. The first is the learning step, which includes constructing, analyzing, and training the model with a data set of known labels (the training set), after which the classification and prediction rules are generated. The second is the classification and prediction step, in which the learned model (classifier) classifies or predicts given data based on the experience gained from the training set.

5 The FBChecker smart system

Figure 1 illustrates the main components of the proposed FBChecker system. In this section, we discuss the steps that were followed to build the system, along with its main components.

Fig. 1: FBChecker System Components.

1- Collecting the required data: the first thing to consider in building a machine learning system is collecting the data required for training and testing.
In this regard, a special purpose module (CRAWLER) was developed, written in JavaScript, to collect the required attributes from the user's friends list. The considered attributes are listed in Table 1 along with their descriptions and the justifications for using them.

2- Preparing the data: the raw data needs to be prepared and validated to increase the data quality and to make it eligible for the mining techniques. Here, the preparation process is done as follows:

• Missing Values: we note that some profiles have missing values, either because of privacy settings or because the users did not fill in these attributes. To solve this problem, two methods have been applied: the k-NN scheme is applied as an estimator model to impute the missing data with substituted values, and a filtering operator is applied to exclude the profiles with missing values.
• Profile Picture: it is recognized by the user himself as a real picture or not.
• Education: it is validated against a multilingual database of about 10,000 records of colleges and universities around the world.
• About "Bio." Section: a textual condition is applied: if the number of words in this section is greater than or equal to 5, the value is true (real); otherwise it is false (fake).
• Other attributes: attributes such as Relationship Status, Life Events, Living Place, and Check Ins do not need to be validated, as Facebook evaluates their values. So, the CRAWLER module retrieves them as is.

3- Training and applying the supervised data mining algorithms: after the data is prepared and ready for mining, a supervised data mining technique is applied (Analyzer module). The classifiers are trained with data of known class, that is, (Fake, Real) profiles. At this step, the system gains the experience and the ability to classify and detect fake profiles.
In addition, the classification rules are generated and prepared by applying the supervised algorithms. Finally, the selected supervised data mining algorithms are applied to the prepared data for detecting fake profiles.

6 The FBChecker implementation

6.1 Data set description

We note that there is no standard data set available with the required information. Thus, we chose to prepare our own. The CRAWLER was run on the author's profile to gather real profiles and returned 151 friends' profiles out of 151. However, 18 profiles were excluded as they were fake, underage, or duplicated. This ends up with 133 real profiles. Regarding the fake profiles, a honeypot page was created and used as a source for collecting fake profiles. The inspection of the fake profiles ended with the selection of 83 fake profiles, because some of the collected profiles were not stable in their liking activity: they dropped their likes from our page after a few days. As a result, 200 profiles were collected, 117 real and 83 fake, as summarized in Figure 2.

6.2 Building the FBChecker system

After collecting the data set of 200 profiles, we are ready to generate the classification and prediction rules. In this regard, the RapidMiner 8.0.1 platform was used as the mining tool; it makes various machine learning algorithms easy to use and provides a flexible environment designed specifically for data science and data mining purposes. For the training and testing processes, k-fold cross-validation with 10 folds was applied to evaluate the accuracy of the results, as it is considered one of the most effective methods for evaluating predictive models on a relatively small data set.

| Attribute | Description | Justification |
|---|---|---|
| Profile Picture | Visual identification of the user | Real users use their real pictures more often than fake users |
| Work place | Workplace or job title information | Real users use their real workplace information more often than fake users |
| Education | Information on the school, college, university, etc. attended | Real users mention their education information in their Facebook profiles more often than fake users |
| Living Place | Living place address (city, town, state, etc.) information | Real users use their real living place information more often than fake users |
| Relationship Status | Social relation status (married, single, engaged, etc.) information | Real users share their real social relation status more often than fake users |
| Check In | Information for announcing the user's location | Real users check into places in their Facebook profiles more often than fake users |
| Life Events | Information for the users to tell their stories | Real users share their life events more often than fake users |
| Introduction "Bio." | Introductory information about Facebook users | Real users write something about themselves more often than fake users |
| No. of Mutual Friends | Number of people who are Facebook friends with both the user and the target profile | Real users have more mutual friends with the target profile than fake users, which gives the profile more credibility |
| No. of Pages Liked | Number of pages liked | Real users usually like more pages than fake users |
| No. of Groups Joined | Number of groups joined by the target profile | Real users usually join more groups than fake users |

Table 1: Attributes used by FBChecker.

7 Evaluation process

To evaluate the performance of FBChecker, the selected classifiers were tested in two experiments. In the first experiment, the k-NN scheme was used to substitute the missing values, while in the second experiment, profiles with missing values were excluded. These experiments are discussed in detail in the following subsections. Finally, metrics for the validation process were calculated and proper justifications were provided.
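The 10-fold cross-validation used for training and testing can be sketched as follows. This is only an illustration of the general technique: the function names are hypothetical, the split is a plain shuffled split, and RapidMiner's own sampling options are not reproduced here.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding deals the shuffled indices round-robin into folds,
    # so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, fit_and_score, n_folds=10):
    """Average the scores returned by fit_and_score(train_idx, test_idx) over all folds.

    fit_and_score is a caller-supplied (hypothetical) function that trains a
    classifier on the training indices and returns its score on the test indices.
    """
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


# With the 200 collected profiles, each fold trains on 180 and tests on 20.
splits = list(k_fold_indices(200))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because each profile appears in exactly one test fold, the averaged score uses all of the data for testing while never scoring a profile with a model that was trained on it, which is why the method suits a relatively small data set.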
7.1 Performance metrics

A group of common metrics is applied in the validation process. In this work the following metrics are used: recall, precision, accuracy, F-measure, and specificity [25]. Next, we give a brief description of each one:

1) Recall: the true positive rate (total number of true positives divided by the total number of actual positives).
2) Precision: measures the probability that a positive prediction is correct (total number of true positives divided by the total number of predicted positives).
3) Accuracy: measures the overall performance of the classification model (total number of correctly classified examples divided by the total number of examples in the set).
4) Specificity: the true negative rate (total number of true negatives divided by the total number of actual negatives).
5) F-measure: an overall measure of a model's accuracy that combines precision and recall.

7.2 The experimental results

Four supervised algorithms were applied to the collected data set in the following cases:

7.2.1 Estimating the missing values using k-NN schema

In this case, the k-NN scheme is used for handling the missing values. After that, the four supervised algorithms (Decision Tree, k-NN, SVM, and Naïve Bayes) are tested. In addition, the cross-validation technique with 10 folds is used to assess the performance of these classifiers. The results showed that while Decision Tree and Naïve Bayes exhibit close results, with accuracies of 0.9650 and 0.9700 respectively, the SVM classifier registered a higher accuracy of 0.9850. On the other hand, the k-NN algorithm with k = 1 showed an accuracy of 0.8400. Table 2 shows the complete results along with the validation metrics of these algorithms. Also, Figure 3 shows the accuracy of the classifiers and Figure 4 shows the ROC graph comparison of these classifiers. Moreover, Figures 5, 6, 7, and 8 illustrate the ROC graph of each classifier's performance in this experiment.
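The validation metrics listed in Section 7.1 can all be computed from the four cells of a confusion matrix, as in the minimal sketch below. Two things here are assumptions rather than statements from the paper: the F-measure is taken to be the usual harmonic mean of precision and recall (F1), and the example counts are illustrative values chosen to match the class sizes of the data set (117 real profiles treated as the positive class, 83 fake).

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the Section 7.1 metrics from confusion-matrix counts.

    tp/fn: actual positives predicted as positive/negative
    fp/tn: actual negatives predicted as positive/negative
    """
    recall = tp / (tp + fn)                  # true positive rate
    precision = tp / (tp + fp)               # share of positive predictions that are correct
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    specificity = tn / (tn + fp)             # true negative rate
    # Assumed definition: F1, the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Illustrative counts (not reported in the paper): all 117 real profiles
# recognized, 3 of the 83 fake profiles mistaken for real ones.
metrics = classification_metrics(tp=117, fn=0, fp=3, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these particular counts the rounded values coincide with the rates the paper reports for SVM in the first experiment (accuracy 0.9850, recall 1.0000, precision 0.9750, F-measure 0.9873, specificity 0.9639), which suggests that "real" is the positive class in the reported tables, although the paper does not state this explicitly.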
A ROC graph is a graphical plot that diagnoses a classifier's performance by plotting its true positive rate against its false positive rate [26].

| Validation metrics | Decision Tree | k-NN | SVM | Naïve Bayes |
|---|---|---|---|---|
| Accuracy | 0.9650 | 0.8400 | 0.9850 | 0.9750 |
| Recall | 0.9658 | 0.8291 | 1.0000 | 1.0000 |
| Precision | 0.9741 | 0.8899 | 0.9750 | 0.9590 |
| F-measure | 0.9700 | 0.8584 | 0.9873 | 0.9791 |
| Specificity | 0.9639 | 0.8554 | 0.9639 | 0.9398 |

Table 2. Supervised performance with k-NN estimator.

Figure 2: Collecting Training Data Set.
Figure 3: Supervised accuracy with k-NN estimator.
Figure 4: ROC graph comparison of all the classifiers with k-NN estimator.
Figure 5: ROC graph of the k-NN performance with k-NN estimator.
Figure 6: ROC graph of the Naïve Bayes performance with k-NN estimator.
Figure 7: ROC graph of the SVM performance with k-NN estimator.
Figure 8: ROC graph of the Decision Tree performance with k-NN estimator.

Although all the classifiers achieve a high accuracy rate, SVM outperforms the other classifiers, as it employs the "Nominal to Numerical" operator to map the different types of data to a numerical type, so that SVM can calculate the distance of these attributes to the hyperplane that separates the concept classes. Specifically, SVM proved efficient for this two-class problem because it finds the optimal decision boundary (hyperplane) that separates the two classes (fake and real) and calculates the distance of each case (profile) to that boundary for the classification process.

7.2.2 Excluding the missing values using filtering operator

In the second case, the profiles with missing attributes are excluded by employing a special filtering operator provided by RapidMiner, which filters the profiles based on specific conditions to keep or remove the profiles that meet these conditions.
Practically, the filter conditions are set to remove any profile with a missing value in any of its attributes. Applying this operator removed a total of 33 profiles from the collected data set, leaving 167 profiles to be considered in this experiment. The main purpose of this experiment is to eliminate any factor that could affect the model's classification process or its accuracy, since the missing values were estimated in the first experiment.

Table 3: Supervised performance with filtering operator.

Validation metrics   Decision Tree   k-NN     SVM      Naïve Bayes
Accuracy             0.9461          0.8443   0.9880   0.9641
Recall               0.9406          0.8317   1.0000   1.0000
Precision            0.9694          0.9032   0.9806   0.9439
F-measure            0.9548          0.8660   0.9902   0.9712
Specificity          0.9545          0.8636   0.9697   0.9091

Figure 9: Supervised accuracy with filtering operator.

After that, the supervised algorithms are applied to the data. The results show accuracy rates of 0.9461, 0.8443, 0.9880, and 0.9641 for Decision Tree, k-NN, SVM, and Naïve Bayes, respectively. Again, SVM exhibits the highest detection performance with an accuracy of 0.9880, while k-NN has the lowest with an accuracy of 0.8443. Other performance indicators for these supervised algorithms are shown in Table 3. As in the previous experiment, Figure 9 illustrates the accuracy of the employed classifiers, Figure 10 compares the ROC graphs of all classifiers employed in this experiment, and Figures 11, 12, 13, and 14 show the ROC graph for each one.

However, although our results are stable and good, one limitation that affects the validity of our study is that the dataset used is relatively small. Therefore, further validation on larger datasets is required.

Figure 10: ROC curve of the supervised algorithms with filtering operator.
Figure 11: ROC graph of the k-NN performance with filtering operator.
Figure 12: ROC graph of the Naïve Bayes performance with filtering operator.
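As a consistency check on Table 3, the five validation metrics can be recomputed from confusion-matrix counts. The sketch below is not the authors' RapidMiner process. The (TP, FN, FP, TN) counts are an assumption: they are not reported in the paper and were inferred here as the integer counts that reproduce Table 3 for 167 profiles. This section also does not state which class (fake or real) was treated as positive.

```python
from typing import Dict, Tuple


def validation_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute the five metrics listed in Table 3 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# ASSUMPTION: (TP, FN, FP, TN) counts inferred from Table 3, not taken from the paper.
INFERRED_COUNTS: Dict[str, Tuple[int, int, int, int]] = {
    "Decision Tree": (95, 6, 3, 63),
    "k-NN": (84, 17, 9, 57),
    "SVM": (101, 0, 2, 64),
    "Naive Bayes": (101, 0, 6, 60),
}

for name, counts in INFERRED_COUNTS.items():
    # Round to four decimals to match the precision used in Table 3.
    rounded = {k: round(v, 4) for k, v in validation_metrics(*counts).items()}
    print(name, rounded)
```

With these inferred counts, the rounded values agree with the entries of Table 3, including the k-NN accuracy of 0.8443 and the SVM accuracy of 0.9880.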
Figure 13: ROC graph of the SVM performance with filtering operator.
Figure 14: ROC graph of the Decision Tree performance with filtering operator.

8 Conclusion and future work

In this work, a smart system, FBChecker, is presented that has been designed specifically for detecting fake Facebook profiles. FBChecker consists of several components that collect, prepare, validate, and mine users’ profiles using four supervised data mining techniques. These supervised techniques were implemented using the open-source RapidMiner data science platform. The proposed system detects fake profiles efficiently, with accuracy rates reaching 98%, which represents a successful and promising result. As future work, we aim to use a larger data set, include more attributes that may serve as discriminative features in the detection model, and apply further data mining techniques (unsupervised/clustering algorithms), then evaluate which technique performs best.

Acknowledgment

The authors are grateful to the Applied Science Private University, Amman-Jordan, for the full financial support granted to cover the publication fee of this research article.

References

[1] Romero, Daniel M., Wojciech Galuba, Sitaram Asur, and Bernardo A. Huberman. "Influence and passivity in social media." In Proceedings of the 20th international conference companion on World Wide Web, 2011, ACM, pp. 113-114. https://doi.org/10.1145/1963192.1963250
[2] Ngai, E. W., Tao, S. S., & Moon, K. K. (2015). Social media research: Theories, constructs, and conceptual frameworks. International Journal of Information Management, 35(1), 33-44. https://doi.org/10.1016/j.ijinfomgt.2014.09.004
[3] Kaplan, Andreas M., and Michael Haenlein. "Users of the world, unite!
The challenges and opportunities of Social Media." Business Horizons 53, no. 1: 59-68, 2010. https://doi.org/10.1016/j.bushor.2009.09.003
[4] Agichtein, Eugene, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. "Finding high-quality content in social media." In Proceedings of the 2008 international conference on web search and data mining, 2008, ACM, pp. 183-194. https://doi.org/10.1145/1341531.1341557
[5] O'Keeffe, Gwenn Schurgin, and Kathleen Clarke-Pearson. "The impact of social media on children, adolescents, and families." Pediatrics 127, no. 4: 800-804, 2011. https://doi.org/10.1542/peds.2011-0054
[6] Hajirnis, Aditi. "Social media networking: Parent guidance required." The Brown University Child and Adolescent Behavior Letter 31, no. 12: 1-7, 2015. https://doi.org/10.1002/cbl.30086
[7] Tang, Qian, Bin Gu, and Andrew B. Whinston. "Content contribution for revenue sharing and reputation in social media: A dynamic structural model." Journal of Management Information Systems 29, no. 2: 41-76, 2012. https://doi.org/10.2753/MIS0742-1222290203
[8] Facebook Newsroom. https://newsroom.fb.com/company-info/ (24th July 2017)
[9] Aswani, R., Kar, A. K., & Ilavarasan, P. V. (2018). Detection of spammers in twitter marketing: a hybrid approach using social media analytics and bio inspired computing. Information Systems Frontiers, 1-16. https://doi.org/10.1007/s10796-017-9805-8
[10] Singh, N., Sharma, T., Thakral, A., & Choudhury, T. (2018, June). Detection of Fake Profile in Online Social Networks Using Machine Learning. In 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE) (pp. 231-234), IEEE. https://doi.org/10.1109/ICACCE.2018.8441713
[11] Facebook Terms of Service. https://www.facebook.com/legal/terms (18th August 2017).
[12] Moustafa, N., & Slay, J. (2016). The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set. Information Security Journal: A Global Perspective, 25(1-3), 18-31. https://doi.org/10.1080/19393555.2015.1125974
[13] Fürnkranz, Johannes. "Separate-and-conquer rule learning." Artificial Intelligence Review 13, no. 1: 3-54, 1999. https://doi.org/10.1023/A:1006524209794
[14] Buczak, A. L., & Guven, E. (2016). A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2), 1153-1176. https://doi.org/10.1109/COMST.2015.2494502
[15] Büchner, Alex G., and Maurice D. Mulvenna. "Discovering internet marketing intelligence through online analytical web usage mining." ACM Sigmod Record 27, no. 4: 54-61, 1998. https://doi.org/10.1145/306101.306124
[16] Xiao, Cao, David Mandell Freeman, and Theodore Hwa. "Detecting clusters of fake accounts in online social networks." In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, pp. 91-101. ACM, 2015. https://doi.org/10.1145/2808769.2808779
[17] Gupta, Aditi, and Rishabh Kaushal. "Towards detecting fake user accounts in Facebook." In Asia Security and Privacy (ISEASP), 2017 ISEA, pp. 1-6. IEEE, 2017. https://doi.org/10.1109/ISEASP.2017.7976996
[18] Ahmed, Faraz, and Muhammad Abulaish. "A generic statistical approach for spam detection in Online Social Networks." Computer Communications 36, no. 10: 1120-1129, 2013. https://doi.org/10.1016/j.comcom.2013.04.004
[19] Fire, Michael, Dima Kagan, Aviad Elyashar, and Yuval Elovici. "Friend or foe? Fake profile identification in online social networks." Social Network Analysis and Mining 4, no. 1 (2014): 194. https://doi.org/10.1007/s13278-014-0194-4
[20] Bhat, Sajid Yousuf, and Muhammad Abulaish. "Community-based features for identifying spammers in online social networks." In Advances in Social Networks Analysis and Mining (ASONAM), 2013 IEEE/ACM International Conference on, pp. 100-107. IEEE, 2013.
https://doi.org/10.1145/2492517.2492567
[21] Wang, Alex Hai. "Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach." DBSec 10: 335-342, 2010. https://doi.org/10.1007/978-3-642-13739-6_25
[22] Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.
[23] Cook, Diane J., and Lawrence B. Holder, eds. Mining graph data. John Wiley & Sons, 2006. https://doi.org/10.1002/0470073047
[24] Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. "Supervised machine learning: A review of classification techniques." Emerging artificial intelligence applications in computer engineering 160: 3-24, 2007.
[25] Powers, David Martin. "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation." 2011.
[26] Hanley, James A., and Barbara J. McNeil. "The meaning and use of the area under a receiver operating characteristic (ROC) curve." Radiology 143, no. 1: 29-36: 198.