https://doi.org/10.31449/inf.v48i6.4749 Informatica 48 (2024) 35–42

Crime Prediction Using Twitter Sentiments and Crime Data

Gbadegesin Adetayo Taiwo, Muhamad Saraee, Jimoh Fatai
School of Science, Engineering and Environment, University of Salford, Manchester, United Kingdom

Keywords: XGBoost, crime, SHAP, machine learning, sentiment analysis

Received:

The incidence of crime is now of great concern globally, and offenders change their tactics regularly. Crime affects individuals, groups, and governments to the extent that substantial budgets are allocated to preventive measures. The aim of this research is to predict crime from hourly Twitter sentiment and crime data records. Existing crime prediction models that use Twitter data have drawbacks in predicting criminal incidents because they lack hourly sentiment polarity and demographic factors. The XGBoost algorithm was used, with hyperparameter tuning, to obtain an optimal model, and the SHAP framework was used for interpretability, ranking the features by their importance. The model achieved an accuracy of 0.81 (81%) and an area under the receiver operating characteristic curve (ROC AUC) of 0.7079. The results indicate that, in contrast to earlier studies on this subject, crime can be predicted in real time. It is therefore recommended that this work be applied to real-world situations.

Povzetek (summary, translated from Slovenian): The study predicts crime through analysis of sentiment on Twitter and crime data, using the XGBoost algorithm and the SHAP framework.

1 Introduction
Crime stands out among the social challenges that affect the lifestyle, economy, and image of a country [1,2]. Crimes have affected people, establishments, and governments.
Crime therefore influences a number of choices that individuals must make on their own or as a group, including moving to a new location, travelling at a suitable time, avoiding unsafe areas, and setting up security and safety policies. It also imposes significant costs on persons, establishments, and governments, such as additional security personnel and devices and court cases, incurred to protect the economy and the country's reputation. Various reports have shown that crime rates are increasing. One report documents a 13% increase in all crime recorded by police across England and Wales, together with a rise in violent crimes, including weapon crime, sexual crimes, and violence against the person [3]. Consequently, persons, establishments, and governments are working towards the reduction and eradication of all forms of crime in society. For this to work, a change of strategy is needed: governments must implement an augmented strategy [1] and a workable information system suitable for this purpose [2].

Crimes can be predicted from criminals’ actions and modes of operation, as there is a high tendency for a crime to be repeated under similar conditions. The occurrence of crime depends on several factors, including the state of security of the neighbourhood and the intellect of the criminals. According to previous research, crime is likely to recur, although this does not apply to all crimes, and this recurrence makes crime predictable [1]. Nevertheless, crime-solving is painstaking work that demands human effort and the intellectual ability to analyze criminal data, so crime remains prevalent. Over the past few years, different kinds of crime data have been gathered [3], mostly for statistical purposes.
Recently, machine learning has been applied to predicting a wide range of human events and natural occurrences [4]. Machine learning is a subfield of artificial intelligence in which computers use algorithms to carry out tasks proficiently without explicit instructions, relying on patterns and inference instead [5]. Because a large amount of data has been collected, machine learning can help with crime identification challenges [6–8]. Various research works have applied machine learning and data mining techniques to crime recognition and prediction [9]. A number of studies have proposed the use of decision trees for crime prediction [10–16]. One study [10] recommended using features such as population and the percentage of unemployed individuals above 16 years of age, among others, to predict the level of violent crime likely to occur in a particular area. The suggested methods did not consider the type of crime that is likely to occur [10,11,14,15]. In addition, these methods used only decision tree classifiers.

In another study [1], the authors used a dataset from the UK police department [13] to visualize and predict crime using various machine learning algorithms. A similar study [2] gathered data by crawling various news archives, such as The News, The Nation, and Dunya News, using a data miner tool. The collected data was analyzed and visualized, and data mining techniques were then used to gain more knowledge from it by clustering the data and applying various algorithms for crime prediction. Previous research [13] also showed that GPS-tagged Twitter data can be used to predict future crimes in Chicago, Illinois, a major US city.
However, current crime prediction models that incorporate Twitter data are limited in predicting criminal occurrences because they lack hourly sentiment polarity and demographic factors. Adding sentiment polarity, crime data, and demographic factors to such models is expected to improve crime prediction. Furthermore, the interpretability of the machine learning or deep learning model is vital in crime prediction: it shows what the model has learnt and boosts the model's reliability, since people, especially law enforcement agencies, cannot depend on a “black box” system to forecast crime and shape their policies. The lack of interpretability therefore undermines the reliability of existing crime prediction systems. Together, the lack of hourly sentiment polarity and demographic factors and the lack of interpretability pose a serious problem that needs to be addressed.

In line with this, the goal of this research is to predict categories of crime. This is done by merging sentiment polarity obtained from lexicon-based sentiment analysis with historical crime data, using the XGBoost machine learning algorithm and the SHAP technique. SHAP is employed to interpret the predictions and so make the system reliable: although machine learning methods have greatly enhanced crime prediction, the inability to interpret the predictions of these sophisticated models remains a limitation. With this approach, crime prediction can be done in real time, which is more reliable.

The second section of this study explains the method used; the third section presents the results, followed by the discussion of the results. Finally, the conclusion is drawn in the fourth section.

2 Methodology
This research entails three modules: data preprocessing, crime prediction, and interpretability.
2.1 Data description and preprocessing
The datasets for this research were obtained from the UK police department website [13] and from Twitter. For the purpose of this research, the UK stop and search crime data was limited to Greater Manchester and to the period between January and June 2019. The dataset consists of crime records with 12 attributes, of which 5 were taken into consideration for this research: crime type, location, date, latitude, and longitude. The tweet dataset consists of GPS-tagged tweets from Manchester posted between 2018 and 2019, collected using the Twitter streaming Application Programming Interface (API). Table 1 and Table 2 describe the two datasets.

Table 1: Description of the UK stop and search dataset
| Attribute | Data type | Description |
|---|---|---|
| Type | Ordinal | Category of search carried out by the police officer: Vehicle Search; Person Search; Person and Vehicle Search |
| Date | DateTime | Date stamp when the search occurred |
| Part of a policing operation | Boolean | Was the action part of police activity? True; False |
| Policing Operation | Ordinal | What part of the police activity occurred? |
| Latitude & Longitude | Float | Latitude and longitude of the location where the search took place |
| Gender | Ordinal | Gender of the individual |
| Age range | Ordinal | Age range of the suspect: Under 10; 10-17; 18-24; 25-34; Over 34 |
| Self-Defined Ethnicity | Ordinal | Ethnicity of the person as stated by the person: Black/African/Caribbean/Black British – African; Asian/Asian British - Any Other Asian background; White - English/Welsh/Scottish/Northern Irish/British; Mixed/Multiple ethnic groups - White and Black Caribbean; Other ethnic group - Not stated |
| Officer-Defined Ethnicity | Ordinal | Ethnicity group of the person as stated by the officer |
| Legislation | Ordinal | Law applied to the person, which includes: Misuse of Drugs Act 1971 (section 23); Police and Criminal Evidence Act 1984 (section 1); Firearms Act 1968 (section 47) |
| Object of search | Ordinal | The reason behind the search, which may include: controlled drugs; offensive weapons; article for use in theft; firearms; stolen goods |
| Outcome | Ordinal | Outcome of the search or the action carried out by the officer: Caution (simple or conditional); A no further action disposal; Arrest; Khat or Cannabis warning; Summons / charged by post; Community resolution |
| Outcome linked to object of search | Boolean | Whether the outcome of the search was linked to the object of search: True; False |
| Removal of more than just outer clothing | Boolean | Does the search method involve the clothing? True; False |

Table 2: Description of the tweet dataset
| Attribute | Data type | Description |
|---|---|---|
| Row ID | Integer | ID for each tweet |
| Date | DateTime | The timestamp of when the tweet was posted |
| Tweet | String | The text of the tweet |

During the preprocessing stage, inconsistent data (including missing values and unnecessary information) were removed, and the data was transformed into the format required for crime prediction in the following modules.

Manchester Crime Data
Records with missing values were removed from the dataset. Then, for the purpose of this research, the outcome of the stop and search crime data was classified into three groups, namely Antisocial, Drugs, and Criminal, which form the target class for prediction, as shown in Table 3. The fourth group in Table 3 (NothingFound) was removed from the dataset because this work aims at predicting crime only. The latitude and longitude features were converted to a specific location.
Table 3: Target/Label classification
| Class | Category |
|---|---|
| Nothing found - no further action | NothingFound |
| A no further action disposal; Caution (simple or conditional); Community resolution; Summons / charged by post; Local resolution; Offender cautioned; Offender given penalty notice; Penalty Notice for Disorder | Antisocial |
| Offender given drugs possession warning; Khat or Cannabis warning | Drugs |
| Suspect summonsed to court; Suspect arrested; Arrest | Criminal |

Tweet Data
Data Cleaning: The collected tweets contain Twitter handles (@user), which identify users on Twitter. These handles carry no useful information and were therefore removed. Additionally, special symbols (such as &^123!), punctuation symbols, and numbers were replaced with blank spaces, so that only letters and hashtags remained in the tweets. Words carrying no meaning or information, such as “oh”, “arg”, and “hmm”, and words of three letters or fewer were also removed. Finally, tokenization, which splits each string into pieces referred to as tokens, was carried out. Stemming, which removes suffixes such as “ness”, “ly”, “ed”, and “s” from words, was then performed, and the tokens were joined again to form sentences.

Feature Extraction: This phase extracts meaningful features from the processed data. Features were generated from the preprocessed data using TF-IDF (Term Frequency - Inverse Document Frequency), which assigns a lower weight to terms that are common across documents and a larger weight to terms that appear in few documents.

$$\mathrm{IDF} = \log\left(\frac{M}{m}\right)$$

where M is the number of documents and m is the number of documents in which a term k is present. TF is the frequency of a particular term k in a document. TF-IDF is obtained by multiplying TF and IDF, that is, TF × IDF.
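The cleaning and TF-IDF steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `clean_tweet` and `tf_idf`, the `FILLERS` set, and the toy tweets are hypothetical, stemming is omitted for brevity, and the weighting is the plain TF × log(M/m) form given in the text.

```python
import math
import re
from collections import Counter

# Hypothetical filler list; the text names "oh", "arg" and "hmm" as examples.
FILLERS = {"oh", "arg", "hmm"}

def clean_tweet(text):
    """Return the tokens of a tweet after the cleaning steps described above."""
    text = re.sub(r"@\w+", " ", text)          # drop @user handles
    text = re.sub(r"[^A-Za-z#\s]", " ", text)  # keep only letters and hashtags
    tokens = text.lower().split()              # simple whitespace tokenization
    # Drop filler words and words of three letters or fewer (stemming omitted).
    return [t for t in tokens if t not in FILLERS and len(t.lstrip("#")) > 3]

def tf_idf(docs):
    """Compute TF * log(M / m) for every term of every tokenized document."""
    M = len(docs)                               # number of documents
    doc_freq = Counter(term for doc in docs for term in set(doc))  # m per term
    weights = []
    for doc in docs:
        tf = Counter(doc)                       # raw term frequency
        weights.append({t: tf[t] * math.log(M / doc_freq[t]) for t in tf})
    return weights

# Toy tweets (invented for illustration only).
tweets = [
    "@user1 Police everywhere near the station tonight!!! #manchester",
    "oh hmm lovely evening near the canal #manchester",
    "@user2 Another robbery reported near the station 123",
]
docs = [clean_tweet(t) for t in tweets]
weights = tf_idf(docs)
```

In this toy corpus the word "near" occurs in every tweet, so its IDF, and hence its TF-IDF weight, is zero, while a word that occurs in a single tweet receives the largest weight. Library implementations may differ from this plain form; scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed IDF by default.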
Sentiment Extraction: In this study, SentiWordNet [14] was used to categorize the sentiment of tweets. It is a lexicon-based approach to opinion mining: it scores the sentiment terms found in a document (a tweet in this study) and assigns the sentiment class with the highest polarity score. It draws on positive and negative term scores and examines the words of each document (tweet) to classify the sentiment of that document. The sentiment scores range from -28 to 23: tweets with a score of zero are neutral, tweets with a positive score are positive, and tweets with a negative score are negative.

Merging of Data and Feature Selection
From the date feature of each of the two datasets, the hour, minute, and second components were extracted so that the data could be merged easily. The two datasets were then merged on these date and time components. Finally, the attributes perceived to be useful for this task, such as location, gender, age range, ethnicity, outcome, and sentiment score, were retained, whereas the other attributes were discarded.

2.2 Crime prediction
To obtain good predictions and high interpretability, XGBoost was used as the algorithm for predicting crime. XGBoost is a widely used algorithm that offers a good balance of accuracy, scalability, and efficiency [15]. In addition, previous works on this subject reported strong performance by XGBoost compared with other algorithms [13,16,17]. XGBoost is a decision-tree-based method that uses an ensemble learning approach: it builds a sequence of models in which each new model attempts to address the shortcomings of the previous ones [18]. Samples are classified using the decision rules of each tree, and the prediction is obtained by summing the scores of the leaves that a sample reaches across the trees.
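To make the input to the classifier concrete, the lexicon scoring and the time-based merge described above can be sketched as follows. This is only an illustration under explicit assumptions: the tiny `LEXICON` stands in for SentiWordNet, the merge key is the timestamp truncated to the hour, tweet scores within an hour are summed, and all function names and records are hypothetical. The paper does not specify these details.

```python
from collections import defaultdict
from datetime import datetime

# Toy polarity lexicon with invented values; the paper uses SentiWordNet.
LEXICON = {"lovely": 1, "safe": 1, "robbery": -1, "scared": -1, "fight": -1}

def sentiment_score(tokens):
    """Sum the polarity of every token; unknown words contribute 0."""
    return sum(LEXICON.get(t, 0) for t in tokens)

def hour_key(timestamp):
    """Truncate an ISO timestamp to the hour (assumed merge granularity)."""
    return datetime.fromisoformat(timestamp).replace(minute=0, second=0)

def hourly_sentiment(tweets):
    """Aggregate tweet scores per hour; summing is an assumption."""
    totals = defaultdict(int)
    for timestamp, tokens in tweets:
        totals[hour_key(timestamp)] += sentiment_score(tokens)
    return dict(totals)

def merge(crimes, tweets):
    """Attach the hourly sentiment score to each crime record of that hour."""
    scores = hourly_sentiment(tweets)
    merged = []
    for record in crimes:
        key = hour_key(record["date"])
        if key in scores:  # inner join: keep records that have tweets
            merged.append({**record, "sentiment": scores[key]})
    return merged

# Invented example records.
tweets = [
    ("2019-01-05T21:10:00", ["robbery", "near", "station"]),
    ("2019-01-05T21:45:30", ["scared", "fight"]),
    ("2019-01-05T22:05:00", ["lovely", "evening"]),
]
crimes = [
    {"date": "2019-01-05T21:30:00", "age_range": "18-24", "outcome": "Criminal"},
    {"date": "2019-01-06T03:00:00", "age_range": "25-34", "outcome": "Drugs"},
]
merged = merge(crimes, tweets)
```

An inner join is used here so that every merged record carries a sentiment score; a left join with a neutral default of zero would be an equally plausible design when some hours have no tweets.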
2.3 SHapley Additive exPlanations (SHAP)
Interpreting tree ensemble methods is vital but tedious to accomplish. With some feature-attribution methods, the importance assigned to an attribute can decrease even when the model relies more heavily on that attribute, which leads to confusion [19]. SHapley Additive exPlanations (SHAP) is a machine learning interpreter that can alleviate this challenge [20]. The aim of SHAP is to quantify the significance of attributes in machine learning models. Hence, the FastSHAP package was used for visualizing feature importance in this research.

3 Results
The experiment was performed as described above, and this section gives the details of its results. The XGBoost algorithm was fitted, and a grid search was used to optimize the model's parameters [21]. Each parameter combination was assessed with cross-validation, and the combination that performed best on the evaluation metrics was chosen as the final model. The evaluation metrics used were accuracy, precision, recall, specificity, sensitivity, and ROC AUC. Details of the model’s performance are given in Table 4.

Table 4: Details of the performance evaluation of the model
| Trees | Mtry | Min_n | Tree_depth | roc_auc | Pr_auc | Accuracy | Precision | Recall | Specificity | Sensitivity |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 12 | 5 | 10 | 0.7049 | 0.733 | 0.8097 | 0. | 0. | 0.9472 | 0.4626 |
| 1000 | 10 | 20 | 20 | 0.7079 | 0.746 | 0.8130 | 0.7893 | 0.4651 | 0.9508 | 0.4651 |
| 2000 | 15 | 20 | 30 | 0.7079 | 0.746 | 0.8130 | 0.7893 | 0.4651 | 0.9508 | 0.4651 |

Mtry: the number of variables randomly selected as candidates at each split.
Trees: the number of trees to grow.
Min_n: the minimum number of data points in a node required for the node to be divided further.
Tree_depth: the depth of each tree in the model.

The configuration with 1000 trees, Mtry of 10, Min_n of 20, and tree depth of 20 reached an accuracy of 0.8130 (Table 4), improving on the 0.8097 obtained with the 500-tree configuration. The hyperparameters were adjusted further in search of better scores, but the last configuration (2000 trees) produced the same metrics as the preceding one, indicating that the results had converged. Figure 1 shows the ROC curve of the model.

Figure 1: ROC curve of the model

3.1 Interpretability
SHAP is generally used to obtain a description of the model. The importance of each feature to the predicted output is expressed by its SHAP value, which is weighted and summed over all possible feature value combinations. Figure 2 ranks the features by mean absolute SHAP value from high to low, so the most significant features appear at the top. The age range (18-24) and the sentiment score are the two most important features.

Figure 2: Ranking of the absolute value of SHAP value of all features

3.2 Discussion
Previous works on crime prediction that used machine learning approaches [16,17] tend to act as a “black box”, in that one cannot determine what really occurs during the process. In essence, one only supplies the data and obtains the outcome. Such models lack interpretability, which may affect people’s trust in them. In this study, the challenge has been alleviated not only by improving the performance of the model but also by providing interpretability through visualizing the features according to their level of importance.
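The importance ranking referred to above (Figure 2) orders features by their mean absolute SHAP value. A minimal sketch of that aggregation is given below, assuming a matrix of per-sample SHAP values has already been computed by some explainer. The function name, the feature names, and the numbers are hypothetical and are not the paper's output.

```python
def rank_by_mean_abs_shap(feature_names, shap_values):
    """Rank features by mean |SHAP value| over all samples, highest first.

    shap_values[i][j] is the contribution of feature j to the prediction
    for sample i (invented numbers in this sketch).
    """
    n = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_abs), key=lambda p: p[1], reverse=True)

# Hypothetical features and SHAP values for three samples.
features = ["age_range_18_24", "sentiment_score", "location", "gender"]
shap_values = [
    [0.30, -0.20, 0.05, 0.01],
    [-0.25, 0.15, -0.10, 0.00],
    [0.35, -0.25, 0.00, -0.02],
]
ranking = rank_by_mean_abs_shap(features, shap_values)
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant because its contributions cancel out.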
The results of this research show that crime can be predicted in real time by merging social media data with historical crime data, and that the sentiment polarity of the social media data helps explain the predictions. This study also supplements existing works that used a variety of data, including socioeconomic, spatiotemporal, and criminal data, but whose models performed poorly in real time. In addition, attributes such as age range (especially 18-24), location, and sentiment were found to be the crucial factors in predicting crime. The Office for National Statistics corroborated this through the Telephone-operated Crime Survey for England and Wales (TCSEW), conducted in 2021 (ONS, 2022), which showed that those between the ages of 18 and 34 are the most likely to commit crimes. Furthermore, the attribute importance plot shows that the sentiment score also plays a major role in predicting crime.

4 Conclusion
In this research, historical crime data and Twitter sentiment scores were used together with XGBoost and SHAP. The model's performance during training was improved by adjusting the model hyperparameters. In addition, this work has produced an interpretable model for predicting crime. An area under the receiver operating characteristic curve (ROC AUC) of 0.7079 and an accuracy of 0.81 (81%) were achieved. Future work could improve the sentiment analysis, because social network users frequently write in slang and other languages that a Natural Language Processing (NLP) system may not handle well, which affects the model's effectiveness.

References
[1] ToppiReddy HKR, Saini B, Mahajan G. Crime Prediction & Monitoring Framework Based on Spatial Analysis. Procedia Comput Sci [Internet]. 2018;132(Iccids):696–705.
Available from: https://doi.org/10.1016/j.procs.2018.05.075
[2] Umair A, Sarfraz MS, Ahmad M, Habib U, Ullah MH, Mazzara M. Spatiotemporal Analysis of Web News Archives for Crime Prediction. Appl Sci. 2020;10.
[3] Tompson L, Johnson S, Ashby M, Perkins C, Edwards P. UK open-source crime data: Accuracy and possibilities for research. Cartogr Geogr Inf Sci. 2015;42(2):97–111.
[4] Oladimeji OO, Oladimeji A, Oladimeji O. Classification models for likelihood prediction of diabetes at early stage using feature selection. Appl Comput Informatics. 2021;
[5] Oladimeji OO, Oladimeji O. Predicting Survival of Heart Failure Patients Using Classification Algorithms. JITCE (Journal Inf Technol Comput Eng) [Internet]. 2020 Sep 30;4(02):90–4. Available from: http://jitce.fti.unand.ac.id/index.php/JITCE/article/view/75
[6] Malathi A, Baboo SS. Enhanced Algorithms to Identify Change in Crime Patterns. Int J Comb Optim Probl Informatics. 2011;2(3):32–8.
[7] Brayne S, Christin A. Technologies of Crime Prediction: The Reception of Algorithms in Policing and Criminal Courts. Soc Probl. 2021;68(3):608–24.
[8] Manzanares MCS, Diez JJR, Sánchez RM, Yáñez MJZ, Menéndez RC. Lifelong learning from sustainable education: An analysis with eye tracking and data mining techniques. Sustain. 2020;12(5).
[9] Kotevska O, Kusne AG, Samarov D V., Lbath A, Battou A. Dynamic Network Model for Smart City Data-Loss Resilience Case Study: City-to-City Network for Crime Analytics. IEEE Access. 2017;5:20524–35.
[10] Ahishakiye E, Omulo EO, Taremwa D, Niyonzima I. Crime prediction using Decision Tree (J48) classification algorithm. Int J Comput Inf Technol. 2017;06(03):188–95.
[11] Nasridinov A, Ihm SY, Park YH. A decision tree-based classification model for crime prediction. Lect Notes Electr Eng. 2013;253 LNEE:531–8.
[12] Iqbal R, Murad MAA, Mustapha A, Panahy PHS, Khanahmadliravi N. An experimental study of classification algorithms for crime prediction. Indian J Sci Technol. 2013;6(3):4219–25.
[13] Chen X, Cho Y, Jang SY. Crime prediction using Twitter sentiment and weather. 2015 Syst Inf Eng Des Symp SIEDS 2015. 2015;(c):63–8.
[14] Ohana B, Tierney B. Sentiment classification of reviews using SentiWordNet. 9th IT&T Conf. 2009;
[15] Mousa SR, Bakhit PR, Osman OA, Ishak S. A comparative analysis of tree-based ensemble methods for detecting imminent lane change maneuvers in connected vehicle environments. Transp Res Rec. 2018;2672(42):268–79.
[16] Zhang X, Liu L, Lan M, Song G, Xiao L, Chen J. Interpretable machine learning models for crime prediction. Comput Environ Urban Syst [Internet]. 2022;94(November 2021):101789. Available from: https://doi.org/10.1016/j.compenvurbsys.2022.101789
[17] Qi Z. The Text Classification of Theft Crime Based on TF-IDF and XGBoost Model. Proc 2020 IEEE Int Conf Artif Intell Comput Appl ICAICA 2020. 2020;1241–6.
[18] Mitchell R, Frank E. Accelerating the XGBoost algorithm using GPU computing. PeerJ Comput Sci. 2017;2017(7).
[19] Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng [Internet]. 2018;2(10):749–60. Available from: http://dx.doi.org/10.1038/s41551-018-0304-0
[20] Sayres R, Taly A, Rahimy E, Blumer K, Coz D, Hammel N, et al. Using a Deep Learning Algorithm and Integrated Gradients Explanation to Assist Grading for Diabetic Retinopathy. Ophthalmology. 2019;126(4):552–64.
[21] Putatunda S, Rama K. A comparative analysis of hyperopt as against other approaches for hyper-parameter optimization of XGBoost. ACM Int Conf Proceeding Ser. 2018;6–10.