https://doi.org/10.31449/inf.v47i6.4474
Informatica 47 (2023) 219–232

An Automated Python Script for Data Cleaning and Labeling Using Machine Learning Technique

Matthew Abiola Oladipupo 1, Princewill Chima Obuzor 2, Babatunde Joseph Bamgbade 3, Kazeem M. Olagunju 4, Abidemi Emmanuel Adeniyi 5, Sunday Adeola Ajagbe 6*
1,2 Department of Data Science, School of Science, Engineering and Environment, University of Salford, UK
3 Federal College of Forestry (FRIN), Jericho GRA, Ibadan, Nigeria
4,6 Department of Computer Engineering, Ladoke Akintola University of Technology (LAUTECH), Ogbomoso, Nigeria
5 Department of Computer Sciences, Precious Cornerstone University, Ibadan, Nigeria
6* Department of Computer & Industrial Production Engineering, First Technical University, Ibadan, Nigeria
E-mail: m.a.oladipupo@edu.salford.ac.uk 1, p.obuzor@edu.salford.ac.uk 2, tundebamgbade@yahoo.com 3, kmolagunju@student.lautech.edu.ng 4, adeniyi.emmanuel@lmu.edu.ng 5, sunday.ajagbe@tech-u.edu.ng 6*

Keywords: data cleaning, information science, machine learning, financial dataset, automation, python script

Received: October 31, 2022

Every employee in a company who deals with data needs clean, noise-free data. Since data warehouses store and update enormous amounts of data from several sources, some of those sources may contain inaccurate data. Due to the noise, inefficacy, and poor characterization of the vast amount of accessible data, as well as the resulting imprecision and inefficiency of manual data cleaning and labeling, the presentation of the data has become ambiguous and its assessment difficult. A gap in the development of better data-analysis methods was identified, and this guided the creation of a Python script for automatically cleaning and labeling data. The first step in the strategy used in this study to accomplish its goals and objectives was to obtain a financial dataset from the popular repository "Kaggle". The second was to create a machine learning (ML) approach in Python that automates the cleaning of the financial dataset; this covers ingesting data, addressing incomplete data, addressing anomalies, one-hot encoding and label encoding, extracting date and time values, and data normalization. The third was to implement an unsupervised machine learning method (k-means) that automates the labeling of the financial dataset; using the method involves the elbow rule, k-means clustering, visualization of "age" versus "income", dimensionality reduction, and categorizing the dataset using the resulting groupings. Finally, an empirical assessment of the automatically cleaned and labeled dataset was performed by comparing the cleaned dataset before and after PCA adoption. The results show that the developed ML technique not only improved the performance of the audit data used in this study, but also classified the data after cleaning it and removing the outliers and incomplete data, as shown by the k-means segmentation result and the grouping by PCA.

Povzetek (Slovene abstract): A Python script was developed for the automatic cleaning and labeling of financial data, and the data were used with machine learning to automate the process.

1 Introduction

Data cleaning is carried out to ensure the data are accurate, so that wrong conclusions are not reached. Data cleansing is an essential step in every operation that uses data, and data purification is necessary to enhance the outcomes of data mining.
In a similar manner, data labeling guarantees that the dataset is accurately described. Firms are finding it less difficult to collect and retain enormous volumes of data. These huge datasets may help with better decision-making, greater comprehension, and, in some instances, training data for machine learning. However, data quality continues to be a significant issue, since flawed data can produce incorrect conclusions and unreliable findings. Typical examples include missing values, errors, mismatched formats, multiple captures of the same real-world entity, and violations of business rules. Data cleansing has developed into a crucial area of database research because analysts must assess the effects of dirty data before reaching any conclusions. Databases can get corrupted for a number of reasons, such as missing, incorrect, or inconsistent data. ML techniques are increasingly being applied in current data analytics pipelines, and the effects of dirty data may be difficult to control. Simple sampling approaches are useless for such elevated systems because dirty data is often of poor quality (Krishnan, et al., 2016). In recent years there has been growing interest from both industry and academia in many aspects of data cleansing, such as innovative abstractions (Beskales, et al., 2010; Fan, et al., 2010), interactions (Dallachiesa et al., 2013; Khayyat et al., 2015), robustness techniques, and crowdsourced techniques (Chen & Cafarella, 2014).

Data collection is a major obstacle for learning algorithms and a popular research topic in many domains. The sudden rise in importance of data collection may be attributed mainly to two factors. First, as machine learning becomes more widespread, new applications appear that might not always have enough tagged data (Roh, et al., 2019). Second, deep learning techniques build classification models, as opposed to conventional ML algorithms, reducing feature engineering costs but requiring more labelled data (Adeniyi et al., 2022). Modern data exploration originates not only from machine learning, natural language processing, and object identification but also from the data management field, due to the necessity to process enormous amounts of data (Roh, et al., 2019).

Machine learning has a significant influence on a wide range of applications, including text analysis, image and audio recognition, and healthcare genomics. We live in an exciting time of invention. For instance, deep learning algorithms are known to perform better than ophthalmologists in identifying diabetic eye issues in pictures (Phene, et al., 2019). Large amounts of training data and increased computing resources are largely responsible for this success. Data collection, among other difficulties, has emerged as one of the main bottlenecks in machine learning. The majority of the time required to complete machine learning from start to finish is spent on data preparation, which involves data collection, cleaning, analysis, presentation, and feature extraction. The goal of machine learning is to extract knowledge from data (Kubat, 2017). Supervised learning is the machine learning method most frequently used in stock market forecasting. This research trained the model using a number of machine learning techniques after properly cleaning and labeling the data (Ogunseye et al., 2022). KNN machine learning methods are used in this work to sanitize financial data. The K-Nearest Neighbor (KNN) classifier is a supervised learning method that uses labelled data; in this instance, it was used to clean dirty financial datasets that were downloaded from the Kaggle database. KNN determines the value of a dependent variable based on how similar an instance's independent variables are to those of existing instances.
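As a minimal sketch of how such KNN-based cleaning could be realized (assuming scikit-learn's KNNImputer; the paper does not publish its script verbatim, so the file name and the n_neighbors value below are illustrative assumptions):

```python
# Minimal sketch of KNN-based cleaning with scikit-learn's KNNImputer.
# "financial_dataset.csv" and n_neighbors=5 are illustrative assumptions.
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("financial_dataset.csv")
numeric_cols = df.select_dtypes("number").columns  # KNNImputer needs numeric input

# Each missing value is replaced by the mean of that feature across the
# k rows most similar on the non-missing features.
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```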
The autonomous data cleansing and labeling (ADCL) approach used in this research aims to ensure the precision and accuracy of the user-provided dataset. By offering automatic cleansing and labeling, the unsupervised approach in this research helps reduce the user's labor, energy, and other manual effort. The performance of the cleansed dataset was also evaluated and shown in comparison to the uncleaned customer records used in this experiment, which gives the user confidence in its efficacy. There are differences in the scope, discretization technique, imputed columns, and quantity of incomplete data. For Alqami Quant Data Analysis, an unsupervised clustering program based on client data and character was created. A user profile assessment is a thorough examination of a business's ideal customers. It improves a firm's comprehension of its clients and makes it simpler to customize items to the distinctive demands, habits, and problems of diverse clients.

The remainder of this study is organized as follows: Section 2 reviews the literature, Section 3 describes the materials and methods, Section 4 presents the results and discussion, and Section 5 concludes the study.

2 Literature reviews

The most efficient way to gather, process, and analyze massive volumes of diverse data from many sources is through the use of big data. Information quality is impacted by the volume and pace of data generation and processing. At every level of a Big Data system, Quality of Big Data (QBD) must be enforced to guarantee data quality (Alkatheeri et al., 2020; Taleb et al., 2015; Ajagbe & Adigun, 2023). The pre-processing stage, which comprises sub-processes like cleansing and merging, mainly concentrates on data integrity. Big data platforms are used to process massive volumes of data that are challenging to evaluate with typical data management methods.

Toolan & Carthy (2010) looked at 40 characteristics that frequently occurred in the research. Four factors (body, web address/URL, subject, and script-based) were used to group the traits. After determining the information gained from each attribute throughout their inquiry, models for each feature group were created and evaluated. The article's findings favored traits related to the email body. Advanced phishing detection characteristics were explored by Bergholz et al. in 2008 and in 2010. The researchers found that adding enhanced characteristics significantly improved email phishing categorization, whereas changing the classification method itself yielded numerically insignificant improvements. On the basis of an unsupervised algorithm, two sets of novel features were created to complement the 27 often used criteria in phishing identification. The basic features included spam attributes, word-list attributes, link features, component attributes, and structure characteristics. Among the novel features were the dynamic Markov chain prototype, latent topic model characteristics, and topic phrase groupings based on latent Dirichlet allocation.
To solve the classification task and recognize phishing, deep learning techniques or other techniques like naive Bayes and support vector machines (SVMs) are frequently used. A recurrent convolutional neural network (RCNN)-based text classification model was proposed by Lai et al. (2015). To get context from the text, they created a recurrent framework, and they also used a CNN to develop the text representation. The model was tested on four datasets, and its performance was compared with that of a convolutional neural network (CNN), a recursive neural network, and other traditional models. They discovered that the RCNN model gave better results than all the other models tested.

In the phishing study described by Benenson et al. (2017), 20% of the population responded and visited the fake hyperlink in the messages. Of the individuals who were asked why they visited the hyperlink, 34% said that they were curious. The authors recommended that companies take every precaution to prevent employees from viewing and responding to phishing emails. To combat this growing threat, automated identification of phishing emails from the email body content is necessary. A brand-new text classification method utilizing graph neural networks was developed by Yao et al. in 2019. The main idea was to employ graph neural networks to train word and document embeddings in tandem while representing the entire corpus as a heterogeneous graph. They tested the system on four text datasets and compared the outcomes with those of existing state-of-the-art text classification and embedding techniques. With a 97% accuracy rate, this model was effective.

The THEMIS classification model, created by Fang et al. (2019), operates on the message content and subject. For text, the authors used deep learning rather than manual feature extraction. The word2vec tool was used to represent messages, and char-level and word-level features were extracted from both the email header and the email body. The RCNN deep learning technique was used to build the model. The THEMIS model's accuracy rate of 99% was encouraging, illuminating the value of using NLP for email phishing prediction.

Kulesza et al. (2014) found that annotators regularly revised their working framework of a baseline model and their supporting tags as they encountered more entries in a dataset. The ability to create specific frameworks for unclear items discovered during labeling helped annotators to gradually improve their overall understanding of the data and provide more consistent final descriptions. Kairam & Heer (2016) used label conventions to classify crowdworkers (e.g., different numbers of entities were recognized by liberal and conservative labelers during an entity extraction job). The subjective examination of these groups was then used to enhance future task designs. In contrast to such previous research, crowd disagreements can be used to find and clarify confusing concepts in data in order to provide annotated information for machine learning. Halgas et al. (2020) recommended an RNN-based classifier to distinguish malicious emails from legitimate emails based on the vocabulary they use. The classifier turned out to be reliable and helpful, and it might also be used in combination with existing classifiers.
To increase the likelihood of accurately detecting whether a message is fraudulent, their study developed an efficient phishing email identification classifier employing NLP on email body features and deep learning techniques using a GCN. To provide continuous and recurrent cleansing while keeping convergence assurances in statistical modeling problems, Krishnan et al. (2016) presented ActiveClean. ActiveClean prioritizes cleaning data that are likely to have an impact on the results and supports convex loss methods (such as linear regression and SVMs). They evaluated ActiveClean using five real-world datasets (UCI Adult, UCI EEG, MNIST, IMDB, and Dollars for Docs) with both real and synthetic errors. The results indicate that the suggested changes can increase model accuracy by up to 20% using the same volume of data, cleaned 2.5 times over. Additionally, with a fixed cleansing budget and on all real datasets, ActiveClean builds more accurate models than uniform selection and Active Learning. Table 1 presents the summary of the reviewed literature.

Table 1: Summary of review of literature

| S/N | Author | Title | Methodology | Result |
|---|---|---|---|---|
| 1 | Benenson et al. (2017) | Unpacking spear phishing susceptibility | Audience questionnaire | Automation of phishing email detection will help to address threats. |
| 2 | Bergholz et al. (2008, 2010) | Novel phishing message segmentation techniques | Deep learning methods, dynamic Markov chain model | The upgraded features improved email phishing classification. |
| 3 | Lai et al. (2015) | Text classification using repetitive convolutional neural systems | CNN and RCNN models | RCNN performed better than CNN. |
| 4 | Yao et al. (2019) | Text classification using graph convolutional networks | Graph neural networks | The system obtained a 97% accuracy rate. |
| 5 | Fang et al. (2019) | Detecting phishing emails with an enhanced RCNN method with multilevel vectors and a probabilistic model | THEMIS categorization model | The model obtained 99% accuracy. |
| 6 | Kulesza et al. (2014) | Organized labeling in machine learning to aid idea transformation | Annotator structures for labelling | The annotators were able to progressively gain a global grasp of the data. |
| 7 | Kairam & Heer (2016) | Divergent explanations in crowdsourced tagging tasks: Parting the crowds | Machine learning | The study's analysis improved future problem designs. |
| 8 | Halgas et al. (2020) | Catching the Phish: Using recurrent neural networks (RNNs) to discover malicious scams | RNN-based classifier; deep learning using GCN | The GCN boosts the chance of automatic recognition of potential email phishing. |
| 9 | Krishnan et al. (2016) | ActiveClean: a data cleaning application for data analysis and visualization | Linear regression and SVM supported with ActiveClean | The model improved accuracy by up to 20%. |

3 Materials and methods

This section summarizes the approaches utilized in this research. These include data collection via Kaggle, information retrieval, feature engineering, and assessment, among other things. The conceptual structure of the method employed in this research is depicted in Figure 1. The dataset, obtained through Kaggle, contains the client details from the database of a grocery shop. Each consumer who has visited the business is represented by their demographic information and purchase history. Client age, first purchase date, relationship status, gender, number of children, education level, and other factors are among the variables included in the dataset. Based on the summary statistics, the dataset comprises 26 quantitative columns and about 2240 rows. Figure 2 displays the missing client record data from the study's perspective.
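A minimal sketch of this initial exploration step is shown below (assuming pandas; "customer_records.csv" is an illustrative file name, as the paper does not publish its script verbatim):

```python
# Minimal sketch of the data-exploration step, assuming pandas;
# the file name is an illustrative assumption.
import pandas as pd

df = pd.read_csv("customer_records.csv")

print(df.shape)           # roughly (2240, 26) per the paper's summary statistics
print(df.describe())      # summary statistics for the quantitative columns
print(df.isnull().sum())  # per-column missing-value counts (cf. Figure 2)
```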
Figure 1: Cleansing and labeling of client data theoretical model.

The framework in Figure 1 proceeds in four stages:
• Stage 1 (first goal): obtain the customer record dataset from the Kaggle repository.
• Stage 2 (second goal): create a method in the Python programming language to automate the cleansing of the customer record dataset: ingesting data, managing incomplete data, controlling anomalies, one-hot encoding and label encoding, extracting date/time values, and finally standardizing the data.
• Stage 3 (third goal): apply a method designed to automate unsupervised machine learning labeling of the customer record dataset (k-means). The procedure includes k-means grouping and the elbow rule, data visualisation comparing "age" to "income", dimensionality reduction, and use of the clusters for data visualization and dataset labeling.
• Stage 4 (fourth goal): empirical assessment of the cleansed and labeled dataset via data visualization, comparing the cleansed dataset before and after PCA application.

Figure 2: Highlight of the customer record data exploration analysis.

3.1 Feature engineering

To create more features, some information was manually engineered. As previously stated, this enhances the model's ability to understand the dataset. The newly added parameters were created by hand:
• Age: calculated by subtracting each client's birth date from the current date.
• Spent: obtained by adding the amounts spent on wines, fruits, meat, fish, sweets, and so on. This represents the total amount spent at the supermarket.
• Living conditions: the groups in this variable were merged based on their similarity.
• Toddlers: calculated by adding the number of kids and teenagers in the family.
• Is Caregiver: used to distinguish those who have had at least one child from those who have not.
• Schooling: the classifications in this variable were reorganized based on their resemblance.
Moreover, columns with obscure names like MntWines, Mnths, and others were renamed to more intuitive and comprehensible names.

Data Cleaning Algorithm
Algorithm 1: Customer information data preparation using k-means
Step 1: Start.
Step 2: The user initiates the data records or data sources to be passed to the ML algorithm for cleanup.
Step 3: Fill in all missing values.
Step 4: Apply feature engineering.
Step 5: Deal with anomalies by identifying them with the interquartile range (IQR).
Step 6: Perform categorical encoding using one-hot encoding or label encoding.
Step 7: Conditions for selecting either one-hot or label encoding: if the attribute has up to ten distinct values, it is one-hot encoded; if it has between ten and twenty distinct values, it is label-encoded; if it has more than twenty distinct values, it is not encoded.
Step 8: Extract the datetime features.
Step 9: Use a standard scaler to standardize the client records: scaler = StandardScaler().
Step 10: Encode labels with a label encoder: LE = LabelEncoder().
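The sketch below illustrates how Algorithm 1 could look in Python with pandas and scikit-learn. The file name, the column names (Year_Birth, the Mnt* spending columns, Dt_Customer), the median imputation, and the outlier capping are illustrative assumptions rather than the authors' published script:

```python
# Illustrative sketch of Algorithm 1 using pandas and scikit-learn. The file
# name, column names, and imputation/capping choices are assumptions.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("customer_records.csv")                 # Step 2: ingest records

# Step 3: fill in missing values (median imputation for numeric columns here).
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Step 4: feature engineering, e.g. the hand-crafted "Age" and "Spent" variables
# ("Year_Birth" and the "Mnt*" spending columns are assumed Kaggle names).
df["Age"] = pd.Timestamp.now().year - df["Year_Birth"]
df["Spent"] = df[[c for c in df.columns if c.startswith("Mnt")]].sum(axis=1)

# Step 5: treat outliers with the interquartile range (IQR); values are capped
# here, one common treatment (the paper does not state cap vs. drop).
num_cols = df.select_dtypes("number").columns
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    df[col] = df[col].clip(lo, hi)

# Steps 6-7: categorical encoding -- one-hot up to 10 distinct values,
# label encoding from 10 to 20, columns with more left unencoded.
for col in df.select_dtypes("object").columns:
    n = df[col].nunique()
    if n <= 10:
        df = pd.get_dummies(df, columns=[col])           # one-hot encoding
    elif n <= 20:
        df[col] = LabelEncoder().fit_transform(df[col])  # Step 10: label encoding

# Step 8: extract date/time values ("Dt_Customer" is an assumed column name).
dt = pd.to_datetime(df["Dt_Customer"])
df["Enrol_Year"], df["Enrol_Month"] = dt.dt.year, dt.dt.month
df = df.drop(columns=["Dt_Customer"])

# Step 9: standardize the numeric features with a standard scaler.
num_cols = df.select_dtypes("number").columns
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```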
Data Labelling Algorithm
Algorithm 2: Customer record data labelling based on k-means with PCA
Step 1: Start.
Step 2: Using the elbow rule, determine the optimal number of clusters.
Step 3: Use the k-means technique to group the dataset.
Step 4: To create a labelled dataset, attach the groupings to the original dataset.
Step 5: Use the clusters to label the data.
Step 6: Depict the clustering quality using a scatter graph from the Seaborn library.
Step 7: Use PCA to perform dimensionality reduction.
Step 8: Repeat the preceding steps as shown below:
• using the elbow rule, determine the optimal number of clusters;
• cluster the dataset using the k-means technique;
• label the data using the clusters;
• visualize the grouping effectiveness using a scatter graph from the Seaborn library.
Step 9: End.

4 Results and discussion

The findings of the data parameters and statistics are shown below. After data profiling, verifying constraints, comparing columns, and finally checking for null values, the cleaned (pre-processed) and labeled records were returned to the user. Figure 3 depicts the clients' collected data before the anomalies were handled. The figure includes three distinct visualizations of the customer data cleanup, similar to the visual analysis in Ajagbe et al. (2020). The visualizations consist of:
i. Histogram (number of deals purchased vs. count): any observation that deviates from the distribution is an outlier; thus, anomalies were identified and addressed.
ii. Probability plot (theoretical quantiles vs. ordered values): the red line indicates the probability plot line; any observations that deviate from the line are considered outliers.
iii. Boxplot (number of deals purchased): any data point outside the whiskers is an outlier.
Anomalies are clearly present in the client record dataset obtained.

Figure 3: Before treating the outliers.

Figure 4 depicts the client records after the anomalies were removed using the interquartile range (IQR). This demonstrates how the study's second goal was met. Essentially:
i. Histogram: there are no observations outside the distribution, so the anomalies have been adequately dealt with.
ii. Probability plot: there are no observations substantially removed from the probability plot's red line. As a result, the anomalies have been appropriately handled.
iii. Boxplot: because there are no data points outside the whiskers, the anomalies in the client records used in this study have been efficiently handled, achieving the goal of customer data cleaning.

Figure 4: After treating the outliers.

4.1 Result of the customer record data labelling technique developed

K-means was used in this study, and inertia was utilized to assess how well k-means performed on the study dataset. Its main parameters are the desired number of clusters (n_clusters), the desired number of initializations (n_init), the maximum number of iterations the technique will perform over a sample of observations in order to decrease inertia (max_iter), and the desired tolerance (tol).
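Continuing the earlier cleaning sketch, the following shows how Algorithm 2 and the parameters above could be realized (assuming scikit-learn's KMeans and PCA; the search range, random_state, and the three PCA components are illustrative choices, while the four clusters match the elbow result reported below):

```python
# Illustrative sketch of Algorithm 2: elbow rule, k-means labeling, and PCA.
# Assumes scikit-learn; parameter values are illustrative choices.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = df.select_dtypes("number").values  # cleaned, standardized features (Algorithm 1)

# Step 2: elbow rule -- plot WCSS (inertia) against the number of clusters.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, max_iter=300, tol=1e-4, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)   # WCSS: sum of squared point-to-centroid distances
plt.plot(range(1, 11), wcss, marker="o")  # the elbow fell at k = 4 in this study
plt.show()

# Steps 3-5: cluster with the chosen k and attach the groupings as labels.
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
df["Cluster"] = km.labels_

# Steps 7-8: reduce dimensionality with PCA, then repeat the clustering.
X_pca = PCA(n_components=3).fit_transform(X)
km_pca = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_pca)
df["Cluster_PCA"] = km_pca.labels_
```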
The number of clusters was plotted against the within-cluster sum of squares (WCSS) to demonstrate how the k-means algorithm works. WCSS is the sum of the squared distances between each point and the centroid of its cluster. The optimal point selected by the elbow-rule test, and hence the number of clusters used in the k-means model, is four. Figure 5 shows the elbow rule for the k-means data labeling in this study prior to the application of PCA, and Figure 6 shows the elbow rule for k-means after PCA.

Figure 7 shows that the clusters are of comparable size, indicating that the data are spread nearly evenly among them. It is clear that those in cluster 0 are high spenders with low income, whereas those in cluster 2 are low spenders with low income. Cluster 3 includes high spenders with average earnings, whereas cluster 1 comprises high spenders with large salaries. Figure 8 depicts the breakdown of the client record dataset based on income and expenditure. The dataset was labeled by attaching the groupings to the original dataset. Figure 9 depicts additional details of this study's subcategories. The proposed k-means technique executed satisfactorily, although the areas of overlap were not differentiated enough. Nevertheless, when dimensionality reduction was applied, the groupings were well isolated, as depicted in Figure 10, allowing for a more accurate and effective understanding of the client record dataset using k-means.

Figure 5: WCSS cluster analysis using k-means before application of PCA.
Figure 6: WCSS cluster analysis using k-means after application of PCA.
Figure 7: Distribution of the clusters.
Figure 8: Revenue and expenditure dissemination.
Figure 9: Profiling of the clusters.
Figure 10: K-means clustering segmentation with PCA.
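For completeness, the sketch below shows how the cluster scatter plots in Figures 8 and 10 could be produced with the Seaborn library named in Algorithm 2, continuing the earlier sketches ("Income" and "Spent" are assumed column names):

```python
# Sketch of the cluster visualizations (cf. Figures 8 and 10), assuming seaborn;
# the "Income" and "Spent" column names are illustrative assumptions.
import matplotlib.pyplot as plt
import seaborn as sns

# Income vs. spending, colored by k-means cluster label (cf. Figure 8).
sns.scatterplot(data=df, x="Income", y="Spent", hue="Cluster", palette="tab10")
plt.title("Customer segments by income and spending")
plt.show()

# The same view in the PCA space (cf. Figure 10).
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=df["Cluster_PCA"], palette="tab10")
plt.title("K-means segmentation after PCA")
plt.show()
```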
Table 2 presents a comparison of the existing work with this study.

Table 2: Comparison of the existing work with our study

| S/N | Author | Goals | Contribution |
|---|---|---|---|
| 1 | Lai et al. (2015) | Text classification using repetitive convolutional neural systems | The model was tested on four datasets, and the results show that the RCNN model gave better results than all the other models tested for text classification. |
| 2 | Kairam & Heer (2016) | To use label conventions to classify crowdsourced data | The ability to create specific frameworks for unclear items discovered during labeling helped annotators gradually improve their overall understanding of the data and provide more consistent final descriptions. |
| 3 | Krishnan et al. (2016) | To use ActiveClean as a visualization tool for a data-analysis cleaning application | The study was evaluated using five real-world datasets. The results indicate that the suggested changes can increase model accuracy by up to 20% using the same volume of data, cleaned 2.5 times over. Linear regression and SVM were supported with ActiveClean. |
| 4 | Fang et al. (2019) | Detecting phishing emails with an enhanced RCNN method with multilevel vectors and a probabilistic model | The THEMIS categorization model was created, using the word2vec tool to represent messages. |
| 5 | Halgas et al. (2020) | Catching the Phish: using recurrent neural networks (RNNs) to discover malicious scams | The study developed an efficient phishing email identification classifier employing NLP of email body features and deep learning techniques using a GCN. |
| 6 | Proposed study | To use an automated Python script for data cleaning and labeling with a machine learning technique | The developed ML technique not only improved the performance of the audit data used in this study but also classified the data after cleaning it and removing the outliers and incomplete data, as shown by the k-means segmentation result and the grouping by PCA. |

5 Conclusion

ADCL (autonomous data cleanup and labeling) attempts to ensure the preciseness and accuracy of the dataset supplied by the user. By providing automated cleanup and labeling, the unsupervised method employed in this research assists in reducing the user's time, commitment, and other manual work. There are differences in the amount of omitted variables, the columns imputed, the discretization approach, and the data variety; all of these factors were considered when assessing the experiment's effectiveness and ability to sanitize a dataset. The schemes proffered here are utilized to choose the settings that produce the most effective and optimal findings for the raw data provided. This goal was accomplished because the approach improved the quality of the information provided by clients by utilizing their ideal cleaning solution. The study obtained a customer record dataset from Kaggle, and data exploration revealed that the client log includes information such as customer age, original purchase period, family status, sex, number of dependents, education level, and other variables that demonstrate the dataset's appropriateness. To obtain a better grasp of the data, summary statistics, incomplete-data recognition, and a dataset tally were created. This led to the use of the elbow rule to determine the optimal number of clusters, k-means grouping, visual analytics, and labeling of the dataset using the resulting groupings, which improved the effectiveness of k-means. The study's effectiveness was determined by comparing the clean dataset before and after PCA implementation. According to the analytical outcomes, principal component analysis delivered a reasonable result. Additional research could look into other types of principal component analysis methods, like iterative PCA, sparse PCA, and singular value decomposition.

References

[1] Adeniyi, E. A., Oguns, Y. J., Egbedokun, G. O., Ajagbe, K. D., Obuzor, P. C. & Ajagbe, S. A., 2022. Comparative analysis of machine learning techniques for the prediction of employee performance. ParadigmPlus, 3(3), pp. 1-15. https://doi.org/10.55969/paradigmplus.v3n3a1
[2] Ajagbe, S. A. & Adigun, M. O., 2023. Deep learning techniques for detection and prediction of pandemic diseases: a systematic literature review. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-023-15805-z
[3] Ajagbe, S. A., Oladipupo, M. A. & Balogun, E. O., 2020. Crime belt monitoring via data visualization: a case study of Folium. International Journal of Information Security, Privacy and Digital Forensics, 4(2), pp. 35-44.
[4] Alkatheeri, Y. et al., 2020. The effect of big data on the quality of decision-making in Abu Dhabi Government organisations. In: Data Management, Analytics and Innovation. Singapore: Springer.
[5] Alwert, K., Bornemann, M. & Will, M., 2009. Does intellectual capital reporting matter to financial analysts? Journal of Intellectual Capital, 10, pp. 354-368.
[6] Bansal, S. K., 2014. Towards a semantic extract-transform-load (ETL) framework for big data integration. s.l., IEEE, pp. 522-529.
[7] Bansal, S. K. & Kagemann, S., 2015. Integrating big data: A semantic extract-transform-load framework. Computer, 48(3), pp. 42-50.
[8] Benenson, Z., Gassmann, F. & Landwirth, R., 2017. Unpacking spear phishing susceptibility. Cham: Springer, pp. 610-627.
[9] Bergholz, A. et al., 2010. New filtering approaches for phishing email. Journal of Computer Security, 18(1), pp. 7-35.
[10] Bergholz, A. et al., 2008. Improved phishing detection using model-based features. Mountain View, California, USA, s.n., pp. 1-10.
[11] Beskales, G., Ilyas, I. F. & Golab, L., 2010. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1-2), pp. 197-207.
[12] Chang, J. C., Amershi, S. & Kamar, E., 2017. Revolt: Collaborative crowdsourcing for labeling machine learning datasets. s.l., s.n., pp. 2334-2346.
[13] Chen, Z. & Cafarella, M., 2014. Integrating spreadsheet data via accurate and low-effort extraction. s.l., ACM, pp. 1126-1135.
[14] Chicco, D., 2017. Ten quick tips for machine learning in computational biology. BioData Mining, 10(1), pp. 1-17.
[15] Dallachiesa, M. et al., 2013. NADEEF: a commodity data cleaning system. SIGMOD, pp. 541-552.
[16] Fang, Y. et al., 2019. Phishing email detection using improved RCNN model with multilevel vectors and attention mechanism. IEEE Access, 7, pp. 56329-56340.
[17] Fan, W. et al., 2010. Towards certain fixes with editing rules and master data. PVLDB, 3(1-2), pp. 173-184.
[18] Halgaš, L., Agrafiotis, I. & Nurse, J. R. C., 2020. Catching the Phish: detecting phishing attacks using recurrent neural networks (RNNs). s.l., Springer, pp. 219-233.
[19] Hellerstein, J. M., 2008. Quantitative data cleaning for large databases. s.l.: United Nations Economic Commission for Europe (UNECE).
[20] Johnson, G. M., 2021. Algorithmic bias: on the implicit biases of social technology. Synthese, 198(10), pp. 9941-9961.
[21] Kairam, S. & Heer, J., 2016. Parting crowds: Characterizing divergent interpretations in crowdsourced annotation tasks. s.l., ACM, pp. 1637-1648.
[22] Khayyat, Z. et al., 2015. BigDansing: A system for big data cleansing. s.l., ACM, pp. 1215-1230.
[23] Kostopoulos, G., Kotsiantis, S. & Pintelas, P., 2015. Estimating student dropout in distance higher education using semi-supervised techniques. s.l., s.n., pp. 38-43.
[24] Krishnan, S. et al., 2016. ActiveClean: interactive data cleaning for statistical modeling. s.l., ACM, p. 948.
[25] Kubat, M., 2017. An Introduction to Machine Learning (2nd ed.). s.l.: Springer Publishing Company, Incorporated.
[26] Kulesza, T. et al., 2014. Structured labeling for facilitating concept evolution in machine learning. s.l., ACM, pp. 3075-3084.
[27] Lai, S., Xu, L., Liu, K. & Zhao, J., 2015. Recurrent convolutional neural networks for text classification. s.l., ACM, pp. 2267-2273.
[28] Liebchen, G. A. & Shepperd, M., 2005. Software productivity analysis of a large data set and issues of confidentiality and data quality. 11th IEEE International Software Metrics Symposium (METRICS 2005).
[29] Madanagopal, K., Ragan, E. D. & Benjamin, P., 2019. Analytic provenance in practice: The role of provenance in real-world visualization and data analysis environments. IEEE Computer Graphics and Applications, 39(6), pp. 30-45.
[30] Myklebust, T. et al., 2021. Data safety, sources, and data flow in the offshore industry. ESREL, Angers.
[31] Ogunseye, E. O., Adenusi, C. A., Nwanakwaugwu, A. C., Ajagbe, S. A. & Akinola, S. O., 2022. Predictive analysis of mental health conditions using AdaBoost algorithm. ParadigmPlus, 3(2), pp. 11-26.
[32] Phene, S. et al., 2019. Deep learning and glaucoma specialists: The relative importance of optic disc features to predict glaucoma referral in fundus photographs. Ophthalmology, 126(12), pp. 1627-1639.
[33] Pisani, M., 2020. Chapter 1 – Introduction. In: Machine Learning. s.l.: Rootstrap, pp. 1-10.
[34] Rajasekar, S. P., Philominathan, P. & Chinnathambi, V., 2019. Research methodology. Knowledge Management Techniques for Risk Management in IT Projects, pp. 1-53.
[35] Reddy, U. S., Thota, A. V. & Dharun, A., 2018. Machine learning techniques for stress prediction in working employees. s.l., IEEE, pp. 1-4.
[36] Roh, Y., Heo, G. & Whang, S. E., 2019. A survey on data collection for machine learning: A big data - AI integration perspective. IEEE Transactions on Knowledge and Data Engineering.
[37] Sadique, F., Kaul, R. & Badsha, S. S. S., 2020. An automated framework for real-time phishing URL detection. s.l., IEEE, pp. 335-341.
[38] Sidi, F. et al., 2012. Data quality: A survey of data quality dimensions. s.l., IEEE, pp. 300-304.
[39] Taleb, I., Dssouli, R. & Serhani, M. A., 2015. Big data pre-processing: A quality framework. s.l., IEEE, pp. 191-198.
[40] Tang, N., 2014. Big data cleaning. International Journal of Database Theory and Application, pp. 13-24.
[41] Thadson, K., Visitsattapongse, S. & Pechprasarn, S., 2021. Deep learning-based single-shot phase retrieval algorithm for surface plasmon resonance microscope based refractive index sensing application. Scientific Reports, 11(1), pp. 1-14.
[42] Tomar, D. & Agarwal, S., 2014. A survey on pre-processing and post-processing techniques in data mining. International Journal of Database Theory and Application, 7(4), pp. 99-128.
[43] Toolan, F. & Carthy, J., 2010. Feature selection for spam and phishing detection. s.l., IEEE, pp. 1-12.
[44] Yao, L., Mao, C. & Luo, Y., 2019. Graph convolutional networks for text classification. s.l., ACM, pp. 7370-7377.