https://doi.org/10.31449/inf.v46i6.4081 Informatica46 (2022) 143–158 143 Hybrid-MELAu: AHybridMixingEngineeredLinguisticFeatures FrameworkBasedonAutoencoderforSocialBotDetection Zineb Ferhat Hamida 1 , Allaoua Refoufi 1 , Ahlem Drif 1 and Silvia Giordano 2 E-mail: zineb.ferhat@yahoo.com, allaoua.refoufi@univ.setif.dz, adrif@univ.setif.dz, silvia.giordano@supsi.ch 1 Networks and Distributed Systems Laboratory, Department of Computer Science, University of Sétif 1, Sétif, Algeria 2 Networking Lab, SUPSI University of Applied Sciences of Southern Switzerland Lugano, Switzerland Keywords: social bots detection, natural language processing, autoencoder, recurrent neural network, feature engineering, classification Received: March 17, 2022 Social bots are defined as computer algorithms that generate massive amounts of obnoxious or meaningful information. Most bot detection methods leverage multitudinous characteristics, from network features, temporal dynamics features, activities features, and sentiment features. However, there has been fairly lower work exploring lexicon measurement and linguistic indicators to detect bots. The main purpose of this research is to recognize the social bots through their writing style. Thus, we carried out an exploratory study on the effectiveness of only a set of linguistic features (17 features) exploitable for bot detection, without the need to resort to other types of features. And we develop a novel framework in a hybrid fashion of Mixing Engineered Linguistic features based on Autoencoders (Hybrid-MELAu). The semi- supervised Hybrid-MELAu framework is composed of two essential constituents: the features learner and the predictors. We establish the features learner innovated on two powerful structures: a) the first is a Deep dense Autoencoder fed by the Lexical and the Syntactic content (DALS) that represents the high order lexical and syntactic features in latent space, b) the second one is a Glove-BiLSTM autoencoder, which sculpts the semantic features; subsequently, we generate elite elements from the pre-trained encoder part from each latent space with transfer learning. We consider a sample of 1 Million from Cresci datasets to conduct our linguistic analysis comparison between the writing style of humans and bots. With this dataset, we observe that the bot’s textual lexical diversity median is greater than the human one and the syntactic analysis based on speech-tagging shows a creative behavior in human writing style. Finally, we test the model’s robustness on several public dataset (celebrity, pronbots-2019, and political bots). The proposed framework achieves a good accuracy of 92.22%. Overall, the results shown in this paper, and the related discussion, argue that it is possible to discern the differences between humans’ and bots’ writing styles based on an efficient linguistic deep framework. Povzetek: V prispevku je opisana metoda za detekcijo pogovornih asistentov (botov) na osnovi jezkovnega stila. 1 Introduction As a result of the invention of social media, many users are performing various acts that can produce incorrect in- formation that propagates easily through the internet for different purposes [1]. Some try to deceive the reader or sway his perspective on a topic. Others are created from scratch with a tempting caption to enhance website traffic and visits. Recently, there have been several works girding fake news features analysis [2, 3, 4, 5, 6, 7]. A veritably complex task consists to supervise and investigate the dif- fusion’s information sources and the nature of profit users. This is owing to users’ aversion to disclosing their genuine identities, which is regarded illegal, or even these users can be nonhuman (social bots and cyborg use). We focus in this work on social bots detection as it became an efficient mechanism for fake news propagation that can get a nega- tive impact on individuals and society. When working with social bots detection, one of great challenges is features engineering. Multitudinous ex- ploration pinpoints spambots through multi-feature ap- proaches. Kosmajac et al [8] extracted a fingerprint of user behavior from the users’ features to realize automated users. They applied machine learning algorithms to dis- cover the social bots. However, the textual content itself is probable to be a crucial characteristic for social bots detec- tion. It’s therefore important to reach a way of representing the text that can capture the information necessary for ac- knowledging if the account is either human or bot. Wei et al [9] concentrated on distinguish between twitter ac- counts of both human and spambot, by building a BiLSTM network to efficiently capture the content features across tweets. Different from these works, our work focuses on developing a novel framework based on linguistic features to recognize the social bots through their writing styles. In other words, linguistics features can add significant value 144 Informatica46 (2022) 143–158 Z.F. Hamida et al. to the retrieval information process. Hence, we design a linguistically oriented framework that combines the embedding-based strength with the ad- vantage offered by Autoencoder (AEs) in dimensionality reduction. The proposed architecture is separated into two segments: the features learner and a deep neural networks classifier. The feature learner aims at performing the fea- ture extraction task due to a deep autoencoder based on dense layers and a BiLSTM autoencoder. We enhance the feature extractor: (i) by feeding the lexical and syntactic features to the first autoencoder to represent the high or- der features in latent space; (ii) constituting the semantic and the context features using the BiLSTM autoencoder; (iii) the merging of the two previous trained encoder blocks would generate a compacted data to get rid of any tan- gential information, and just concentrate on the highly es- sential characteristics which would discover human writing style patterns accurately. We summarized our contribution as follows: – Extracting various feature sets that indicate the writing style for both humans and bots, to be able to compare and evaluate the performance of the social bot detec- tion model enhanced by the distinct writing patterns in estimating the human and bot classes. The different writing style features are lexical features relying on text richness and diversity, syntactic features based on Pos-tagging, and semantic features that are extracted with word embeddings techniques. – We develop Hybrid-MELAu: novel semi-supervised framework in a hybrid fashion mixing engineered lin- guistic features based on autoencoder. Therefore, we trained two autoencoders. The first is a deep dense au- toencoder fed by the lexical and the syntactic features (DALS). The second one is a Glove word embedding BiLSTM autoencoder (GloVe-BiLSTM autoencoder), which effectively captures the semantic or the con- textual features across tweets. Then, we stapled the trained encoder building blocks to generate elite char- acteristics from both latent spaces. The idea behind this combination is to complement one another; they successfully model the lexical, syntactic, and seman- tic knowledge. In low-dimensional spaces, this repre- sentation will become very efficient. – We benefit from the profound features attained from encoders for transfer learning to discern differences in the writing styles of both humans and bots. The initialization of the classifiers with transferred features has improved the performance when modeling the bot detection task. – We chain the Hybrid-MELAu output with six Re- current Neural Network classifiers: SimpleRNN, BiRNN, LSTM, BiLSTM, GRU (Gate Recurrent Units), and BiGRU. – Experiments were carried out with a real-world data set. Trials show that the introduced framework signif- icantly achieves an accurate social bot detection. The paper rest is categorized following this order: section 2 explain the objectives of this research, the third section delves into related works. Section 4 is devoted to the fea- tures extraction study and the most prominent measures for natural language texts. Section 5 presents the proposed semi-supervised framework for building high-performance bot detection models. Section 6 details our architectures results applied on real-world datasets. Then, a beneficial discussion is provided in section 7. Eventually, the conclu- sion is provided in Section 8. 2 Researchobjectives – Our focus was on investigating the bot writing style to confirm whether it is possible or not to obtain a com- petitive detection performance using just a set of rel- evant linguistic features, unlike the majority of work on bot detection that have investigated a bigger set of features without taking into account the specific type- token ratio and the vocabulary knowledge. – Since a successful bot can use a linguistic approach based on the linguistic structure, we aim at digging deeply to show that the linguistics features can add significant value to differentiate human accounts from bot accounts. Our exploratory study address the fol- lowing writing style features analysis: lexical features relying on text richness and diversity, syntactic fea- tures based on Pos-tagging, and word embeddings ap- proaches are used to extract semantic features. – Bot detection has been implemented using a variety of deep learning and machine learning methods [10, 11], but there is still much work to be done. In fact, exploiting only the bots automatic writing style be- haviors is a challenging task because their combina- tion produce incomplete, unstructured, and noisy data. Therefore, we will develop a hybrid deep learning ap- proach based only on linguistic features that can im- prove the detection performance. 3 Relatedwork The bots are hefty threats on social media credibility. Un- derstanding of bot accounts behaviors in social platforms is of pivotal importance to enable its detection. El-Mawass at al [12] created a Markov Random Field on a graph of similar users and utilize state-of-the-art classifiers to infer previous beliefs, they used Loopy Belief Propagation to get later predictions about the user. Yang et al [13] displayed that most common political messages on Twitter issued by a small group of hyperactivity accounts. Authors noted that overactive users are more probably to spread low-integrity Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 145 information for quotidian users, and they tend to show sus- picious behaviors often associated with false accounts au- tomation. Yang et al [14] published a web application named bot electioneering volume (BEV) that notifies the level of bot practices and presents the subjects addressed by them on twitter per diem. Many studies have focalized on collective and not genuine activities for harmful accounts to detect consistent campaigns and disseminate suspected content. Bot2Vec is an innovative approach to identifying bots/spammers offered by Pham et al [15]. The approach is based on the local neighborhood relationships and the community internal structure of human nodes. Nizzoli et al [16] demonstrated that invite link exchange is a proxy for homophily and habitual goals between the involved agents and characteristic patterns related to deceptive schemes. In the work of Giglietto et al [17], The researchers devised a mechanism for recognizing coordinated link-sharing be- havior (CLSB). By scanning URLs published by public groups, pages, and validated profiles on Facebook, this ap- proach generates and maintains lists of sources that could be troublesome. Luceri et al [18] studied bots and humans virtual attitude and compared their communicating activ- ities. Based on the nature and the quarrel within the on- line discussion, Luceri et al [19] identified several accounts classes. The researchers concluded that in the political de- bate the hyperactive bots played a consequential role in the news diffusion. Bots accounts are a problem on social media since they can inveigle information, disseminate misinformation,and inflate unverified news, which can alter the social media analyses results. Generally Social bot detection approaches are supervised. Ferrera et al [20] use an vast range of characteristics (Timing of tweets, network of tweet inter- actions, meaning, language, and emotions) and create a k- nearest neighbor with dynamic time warping (KNN-DTW) method for online bot detection. Cresci et al [21] work founded on DNA inspired fingerprinting coding to inves- tigate social media user behavior in temporal dimension. In the work of Kudugunta et al [22], The authors pulled contextual information extracted from user metadata and provided as additional entry to LSTM deep neural network that analyse tweet text in order to identify bots. Yang et al [23] suggested a framework that utilizes minimum meta- data account while focusing just on user profile informa- tion. In Heidari [24] work,the authors proposed a Bidi- rectional Encoder model for tweets sentiment classification to determine features from topic-independent for bot de- tection model. Sayyadiharikandeh et al [25] suggested a supervised learning approach that uses the maximum rule to combine the decisions of specialized classifiers. The suggested method is included in Botometer’s latest version (v4), a commonly used tool for detecting social bots. To categorize tweets as bot tweets or not, a neural network en- semble of CNN and LSTM models with BERT embedding was developed by Kumar et al [26] which is based on the tweets’ textual content. Gaurav et al [27] pinpointed ac- count patterns types using machine learning mechanisms and provides intelligent clues that may be utilized as a ro- bustness gauge for several systems. Several machine learn- ing approaches for detecting malicious users have been suggested on Praveena [28] work based on glow worm op- timization technique to in order to deal with a small set of features. The authors employed generalized regression neural network to train these features. Also, Chakraborty et al [29] proposed an innovative method for categoriz- ing Twitter user accounts as valid or illegitimate, by us- ing a combination of features engineered from user Meta- data.The authors integrated graph centrality values and graph embedding generating from the followers-followings graph. In addition to these works based on supervised and unsupervised approaches, present studies utilize a semi- supervised method to identify social bots. Zhao et al [30] present a semi-supervised model founded on a attention mechanism-based graph CNN, which spots spam bots by integrating many user characteristics and relational struc- tures. To detect counterfeit accounts from a vast vol- ume of Twitter data, BalaAnand et al [31] presented an enhanced graph-based semi-supervised learning algorithm (EGSLA). Another work of Shaabani et al [32] present a semi-supervised self-training architecture capable of cap- turing Pathogenic Social Media users. To identify single and batches of spam accounts, Alharthy et al [33] use two semi-supervised techniques plus a set of specified features. A recent work of Guo [34] symmetrically involved BERT and GCN (Graph Convolutional Network, GCN),and a new architecture for bot identification that merged large-scale pre-training and transductive learning was proposed. Numerous studies have considered the bot detection problem as a binary classification. However, only binary classifiers will be capable to differentiate bots and gen- uine users when bots are of the identical category as the ones used when training the model. To detect the bots, Rodriguez et al [35] used a one-class classification strat- egy. This strategy has the advantage of not necessitating examples of anomalous activity. When the goal is to de- tect deviations from predicted behavior, one-class catego- rization is usually applied. The researchers select the ac- count features (retweet, replies, inter-time, number of listed tweets, and friends-to-follower ratio) and illustrated that the one-class classifier distinguishes the bots and the legit- imate users consistently. Building on this idea, we suggest a one-class classification approach to extract the linguis- tic features that can effectively separate bots and legitimate accounts. Moreover, the proposed Hybrid-MELAu gives significant control of how to model latent spaces. Eventually, the previous semi-supervised techniques are summarized and briefly compared in Table 1 146 Informatica46 (2022) 143–158 Z.F. Hamida et al. Table 1: Brief description of prior surveyed semi-supervised methodes for bot detection. Methode Datasetused Features se- lected Accuracy Precision Recall F1-score Zhao et al[30] Twitter 1KS- 10KN dataset user and net- work features _ 0.93 0.88 0.91 Guo [34] cresci-rtbust, botometer- feedbak, gilani, cresci- stock-2018 and midterm dataset tweet text 0.9026 (The best result achieved on midterm dataset) 0.8842 (The best result achieved on cresci-rtbust dataset) 0.7884 (The best result achieved on midterm dataset) 0.8089 (The best result achieved on midterm dataset) BalaAnand et al[31] automated data col- lection by Python web- scraping Fraction of retweets, Standard tweet length, Fraction of URLs, Av- erage time between tweets 0.903 0.923 0.908 _ Rodriguez et al[35] Cresci-2017 dataset account features 0.921 _ _ _ Shaabani et al [32] ISIS dataset _ 0.82 0.90 _ _ Alharthy et al [33] automated data col- lection by Twitter API Tweet meta- data and Account metadata 0.91 0.88 _ _ 4 Thefeaturesextractionstudy 4.1 Lexical,syntacticandsemanticfeatures extraction We will examine the text content in the first part of this project to delineate bot behavior, as the language and phrase composition of bots and genuine people may dif- fer. Although the techniques of Natural Language Pro- cessing have a redoubtable function in ensuring that bots grasp the language and more human-like, it seems that the bots be surrounded by dissension due to their restrictions to communicate with people who speak the same language [36]. To find insights into this issue, we focus on three NLP steps: – The first step is lexical analysis. The batch of sen- tences and words is a language lexicon. We will first analyze the text and separate it into sentences and words. Every word and punctuation mark is a sepa- rate unit. – The second phase is syntactic analysis. We will ex- plore the grammatical role of every word in a sentence by tagging each of it to indicate what type of token it is, for example, is a verb (in past, present, or future tense), a pronoun, an article, a stop word, adjective . . . , and identifies the words relationship. – In the third step, we perform the semantic analysis. To do this, we have to deploy the Word embedding techniques that mainly take words or phrases from the vocabulary to map them to real number vectors. There are many word embedding initiatives. For ex- ample, Word2vec was created for computing continu- ous words vector representations from huge data sets [37], 2) The GloVe stands for Word Representation Global Vectors [38]. Glove model is a log bilinear model where the possibility of the next word is calcu- lated when the previous words are given which means the word appearances statistics in a corpus is the main source of information obtainable to all unsupervised approaches for learning word representations. Machine learning algorithms can be used in these NLP phases to dynamically learn the rules by exploring a corpus. We start our approach by extracting features based on the previous three main analysis levels. The first process phase is the lexical analysis: 1. Divide tweets into sentences and words. 2. Elicit emojis, hashtags, both emoticons happy and emoticons sad. Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 147 3. Identify upper letters, numeric and blank spaces. 4. Then, we calculate all these features number, besides determining the whole number of characters and the average-word in both human and bots tweets. The next step is to analyze the tweets words in terms of syntax where an ensemble of syntactic features was ex- tracted: 1. The frequency of punctuations (commas, question mark and exclamation mark). 2. The frequency of stop words and URLs. 3. We also focus in identifying the grammatical role of each word in a sentence via speech tagging. For the semantic approach, we realize it based on the GloVe embedding technique. 4.2 Featuresextractionbasedonlexical diversity We study a key linguistic feature called “Lexical Diversity” (LD), which aims to indicate the complexity and the diffi- culty to read a text. There are many LD measures and the type-token ratio (TTR) [39, 40] is the most popular of them. It is a quantitative relation between the unparalleled words number (V) in a text and the total number of items (N) [41]. TTR=V/N (1) We take these tweets as an example: "To live with untreated PTSD is to feel as if you might die any moment. Again and again. Help cost money." The token size in this sentence is 20 contains 18 types ( to, live, with, untreated, PTSD, is, feel, like, you, might, die, any, moment, again, and, help, costs, money). In this example, the TTR is 0.90 (i.e., 18/20). Moreover, numerous researches have demonstrated that TTR strongly relies on text length [42, 43]. The TTR value gradually decreases as the text becomes longer [44]. Con- sequently, some conversions of TTR raw have been sug- gested, this to relieve or avert this text length subordina- tion. The Measure of Textual Lexical Diversity (MTLD) [45, 46] creates factors from the textual sample based on the TTR values. Every factor closes when it accomplishes 0.72, which often known as the default TTR size value, and its tokens number is greater or equal to 10 tokens. Finally, the whole TTRs mean is calculated. The final MTLD result is the number of words (N) split by the number of factors reached 0.72 of TTR [41]. MTLD =N/factors (2) Then, the same process will be repeated after reversing the text and the final MTLD appreciation is calculated by averaging the two obtained MTLD values. 4.3 Featuresselection In machine learning models training phase, the data charac- teristics get a big influence on its attainments. A bad choose of these features can injuriously influence model perfor- mance and decrease accuracy. There are many advantages of applying feature selection before shaping the data such as reducing overfitting, improving accuracy, reducing algo- rithm complexity, and algorithms’ training faster. For selecting features task, we used Extremely Random- ized Trees Classifier(Extra Trees Classifier) [47] which is an updated Tree-Based Classifiers that extracts the most relevant features. It attach a score for every feature; if this score is high it indicates that this feature is pertinent for the model performance. 5 Theproposedarchitecture We introduce Hybrid-MELAu: novel semi-supervised deep framework oriented mixing engineered linguistic fea- tures based on autoencoder to improve the Twitter bot de- tection performance. Thus, we implement a feature ex- tractor based on DALS (Deep Dense Autoencoder based on Lexical and Syntactic features) and GloVe-BiLSTM au- toencoder (GloVe Word Embedding Bidirectional-LSTM Autoencoder) to learn better latent representations of the human linguistic behaviors. Also, there is a growing in- terest in bot detection to utilize one-class classifiers based solely on examples from a single class to learn its represen- tations and determine whether a new example belongs to that class or not. Hence, the proposed architecture includes two parts: The feature learner, which relies on two autoen- coders components [48, 49, 50] that pre-train the layers of the model. The feature learner associates the extracted lex- ical and syntactic features with their corresponding seman- tic features using the transfer learning technique. It consists in building the latent spaces from the pre-trained encoder part of the two autoencoders. These latent spaces present the most robust representation of the dataset. Whereas, we need to hold the concatenated encoders of the two autoen- coders and fix their weights, to gain the advantage of their experience. Hence, the weights values of the encoders are frozen while we learn feed-forward deep learning network weights, following the architecture illustrated in Figure 1. Then the second part is the classification model, where we chain the features, that have been extracted, with several deep neural networks classifiers. 5.1 Featureslearner To perform extracting features process, we used one of the well-known deep Representation Learning Algorithms; Autoencoders. It’s a form of feedforward neural network that trains itself to match input and output. Let’s assume that only a group of unlabeled training exam- ples exist:{ x 1 ,x 2 ,x 3 ,... } , wherex i ∈ℜ n . The autoen- coder compresses the input into a lower dimension then 148 Informatica46 (2022) 143–158 Z.F. Hamida et al. Figure 1: The Hybrid-MELAu for Twitter bots detection. Part (A) shows the freezing weights process for the two pre-trained encoders parts form DALS and Glove-BiLSTM autoencoder and their concatenation. Part (B) exhibits the classifiers. reconstructs the output from this representation. It’s an unsupervised algorithm that employs backpropagation for matching both the target and the input value:y i =x i . The autoencoder’s network is consisting of two sections: The encoder, which encodes the inputs into a hidden repre- sentation (h) or a latent space, that is capturing everything that must reconstruct the original input. h=g((w∗ x i )+b) (3) From the latent space (h), the decoder extracts the input again. b x i =f((w ′ ∗ h)+c) (4) Autoencoders are constrained to only copy but rather to construct and deconstruct the input. Because it’s con- strained by this reduction, it is forced to make priorities on which features of the input to learn, which are very useful to discriminate the humans and bots’ writing style. The proposed approach comprises of two different autoen- coders: the first one is based on a deep-stacked autoencoder that reconstructs the input features, and the second one is a sequence-to-sequence LSTM autoencoder that learns vec- tor representations of any unstructured text. The first autoencoder DALS is composed of six layers to input the lexical and syntactic features. Its architecture is provided in Figure 2.The first three layers are setting up the encoder with 15, 10, and 6 neurons respectively, while the third layer is the latent space. The last three layers perform the decoder such as the initial two layers have 10 and 15 neurons successively, and the last layer is the output layer (it outputs the same neurons numbers as the inputs). The training procedure of this architecture is summarized in Al- gorithm 1. At the semantic level, the sequence prediction remains a Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 149 Figure 2: DALS Architecture Algorithm1: Deep Dense Autoencoder based on Lexical and Syntactic content. Input :X: vector of unlabeled features λ : hyper-parameters T : the maximum number of iteration Output: ˆ X: reconstructed representation of the input begin // Preparing data to be passed to the network Sett to1 Initializew,w ′ ,b,c repeat Encode the inputX into the latent space h according to the equation.3. Decompress the original input from the latent spaceh via the equation.4. E(X, ˆ X)=|| X− ˆ X|| 2 (the error rate). t=t+1. Update (w,w ′ ,b,c). untilt>T ; return ˆ X; end complex issue, not only because the input sequence length can vary, but the notes temporal scheduling can make it dif- ficult to extract the appropriate features for use as input to supervised learning models. To capture the temporal struc- ture, we develop a GloVe-BiLSTM autoencoder model. In another word, the encoder part of the model can be used to compress tweets text that in turn may be used as a fea- ture vector input to a supervised learning model. For a bet- ter understanding, let’s visualize the architecture in Figure 3. This figure shows the tweets flow across the GloVe- BiLSTM autoencoder network layers for one ensample of data. The encoder is accountable for the source tweet read- ing and encoding it to an inner representation by captur- ing the meaning of these tweets. A simple model creation Figure 3: GloVe-BiLSTM autoencoder Flow Diagram includes an embedding input ensued by a Bidirectional- LSTM hidden layer that generates a fixed-length represen- tation. First, we input the tweet texts to the embedding layer, where each word is transformed into a distributed representation [51]. This layer is a matrix of sizem× v , wherev is the vector length and it’s equal to 300, in which we learned word embeddings from text using a pre-trained 300-dimensional Google News Vectors approach (GloVe) [38], and m is the tokens number in the tweets which is fixed on 50. The Bidirectional-LSTM layer [52] used the hidden states to maintain the inputs information and fed it in a for- ward way from past to future and backward from future to past. Moreover, Bidirectional LSTMs have the capac- ity to better understand the context [53]. After that, we add the decoder which is a Bidirectional-LSTM layer. It assumes a three dimensional input for creating a decoded sequence of various lengths determined by the problem. So we configure first the RepeatVector layer to create a three dimensional BiLSTM output. Then, like the encoder, one Bidirectional-LSTM layer with the same number of cells was utilized in the decoder model implementation. Finally, the dense layer generates the autoencoder output which is also a matrix of size m× n which n is the tweet corpus (50.000). The training procedure of the GloVe-BiLSTM autoen- coder is summarized in Algorithm 2. 150 Informatica46 (2022) 143–158 Z.F. Hamida et al. Algorithm2: GloVe-BiLSTM autoencoder Input :S: set of tweets K: set of tokens in one tweet: sizem C: The tweet corpus: sizen Batch: the number of training examples utilized in one iteration: sizez θ : hyper-parameters Output: ˆ P : reconstructed matrix: size(m× n) begin // Preparing data to be passed to the stack foreachs∈ S do s← nlp.prepossessing (s) end repeat foreach Batchdo // Calculating embeddings for each token foreachk∈ K do emb(k)← glove(k) end Encoder= Build- Model(LSTM_Bidirectional.input ([m× Embedding_size],θ )) Encoder_output ← [1× Thedoublenumberofcells] // repeat Encoder_output m times to create 3D vector repeat(Encoder_output,m) output← [m× Number of cells∗ 2] Decoder= Build- Model(LSTM_Bidirectional.input (output,θ )) Decoder_output← [m× Number of cells∗ 2] // Generating the output using fully connected layer with size n ˆ P =[m× n] end until Untill convergence; return ˆ P ; end 5.2 Predictormodel The key idea of our proposed framework (Hybrid-MELAu) is using transfer learning [54], by copying both the pre- trained encoder part of DALS and GloVe—BiLSTM first n layers to the n first layers of the deep learning classifiers. The implemented classifiers are 1)—a Recurrent Neural Network (RNN) [55] classifier which is a universal ap- proximation of dynamical systems, 2)—a Long short-term memory networks (LSTMs) [56] predictor which consid- ered as an update of RNN that used on several works for example in Kalyoncu [57] research, 3)—a Gated recurrent units (GRU) [58] classifier, 4)- a three bidirectional architectures (BiRNN[59],BiLSTM and BiGRU [60]). During the predictor models training, we set the mean squared error (MSE) as a cost function. It is defined below: L(Y,f(X,s))=L(Y, ˆY)= 1 N N X i=1 (Y, ˆY) 2 (5) Where N is the feature dimensionality, X is the features vectors and s is a set of tweets, Y is the output ground truth, and ˆY is the predicted output (Human or Bot). Using a pre-trained network that is trained on data with one class only ensures that the bot detection task is per- formed based on the most frequent characteristics of non- intrusion samples. 6 Experimentsresults 6.1 Dataset Several tweeter real-world datasets are used in our re- search. The first is defined in [61, 62]. According to [61]), genuine accounts consists of 3,474 real users accounts with 8,377,522 tweets. The bots accounts separated on three datasets. During the 2014 Romanian Mayoral election, the social spambots1 dataset was scraped from Twitter, it is composed of 991 accounts and 1,610,176 tweets. Spam- bots 2 dataset is a group of3,457 bots accounts who passes many months promulgating the #TALNTS hashtag through 428,542 tweets. Where this last concerns a mobile phone application for contacting and recruiting artists working in several fields. The immense generality of tweets were in- nocuous statements, sporadically scattered by tweets nam- ing a specific human account and recommending that he purchase the VIP edition of the software from a Web store. The dataset of Spambots 3 is a set of 464 accounts and 1,418,626, this dataset announced products for selling on Amazon.com. The delusive activity is executed by spam- ming URLs referring to the publicized products. The second one is the celebrity dataset which contains celebs’ accounts [63]. The Center for Complex Networks and Systems Research at Indiana University (CNetS team) collected 5,918 celebrity human accounts. We also add two other datasets: pronbots-2019 and political-bots [63]. Pronbots-2019 is a set of 21,963 bot accounts distributed by Andy Patel. Political-bots is a set of 62 Automated po- litical accounts. 6.2 Exploratorystudyresults To extract syntactic and lexical characteristics, we have ap- plied NLP analysis approaches as explicated in section 4. Hither, we consider a sample of 1 000 000 data containing an equal number of human and bots tweets and compare the writing style of both at the lexical level. The findings of this comparison are illustrated in Figure.4. We observe that Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 151 humans use a greater number of hashtags than bots. Also, they use different number of emotions type, numbers, sen- tences, words, blank space, and upper letters. It is due to the fact that the human can easily diversify their lexical context. Then, we compare the syntactic features analysis results based on URLs, punctuation, and stop words. As we can see in Figure.5, the number of different syntactic tokens could be different, especially since a successful bot can use a linguistic approach based on the linguistic structure. For syntactic analysis based on speech-tagging, there are many tags, so, we have just focused on recognizing some tags, that essentially help in the interpretation of the given sen- tence. Besides, we compare the different POS tagging fea- tures in the bots and human writing styles (Figure.6). From Figure 6, we notice that bots used to write their tweets, the following features: proper plural noun, proper singular noun, plural noun, singular noun, prepositions, coordinating conjunctions, determiners, modal, verbs 3sg pres, verbs base form, adjective, comparative adjective, superlative adjective, superlative adverbs and comparative adverbs much more than humans. Because these character- istics are considered as basic units (tokens) in the construc- tion of the sentences and aren’t difficult to simulate. Al- though this is a good bot imitation, they haven’t been able to outperform humans in terms of features shown on the right side (personal pronoun, adverb, verb past tense, verb present participle, verb past participle, verb non-3sg pres, interjections, and foreign words from other languages). The key idea is that exploiting this feature set is more com- plicated and requires special conditions. For example, hu- mans make the use of various interjections with rich con- text. Therefore, we can conclude that human vary their tone in writing depending on their feelings, the reader, and the events by using empathy, encouragement and astonishing events. For the lexical richness task, We chose the MTLD metric due to the fact it is a robust lexical diversity indicator that is unaffected by sample length. [64]. First, we compute all the POS (part-of-speech) tagged to rich inflectional languages. After that, we compute the MTLD. Figure.7 shows how the MTLD metric varies be- tween the human and bot tweets. As we can observe from Figure 7 and Table 2, the range of bot’s MTLD values is bigger than the human ranges values. The maximum value of the bots’ MTLD is higher than the humans’ MTLD while both minimum values are equals. It can be seen that MTLD is a metric of analyzing the number of consecutive words supported by a specific type-token ra- tio. We observe that a well automated bot rely on the NLP rules to generate a rich lexicon but human are using an odd approach as their writing skills outperform a simple NLP rules. 6.3 Hybrid-MELAuevaluationresults For this evaluation phase, we have selected the 17 highest linguistic features that have a great impact on predictability (see Figure.8). We split Cresci datasets into approximately 80% training and 20% testing set. As mentioned in the 4.2 subsection, the two autoencoders will be trained on data with one class only to ensure that the prediction task is per- formed based on the most frequent characteristics of human samples. So, the training group is divided again based on the dataset label (Human and Bot). The human and bot la- bel rates are respectively 54% and 46% of the training set. Then, we rely on the training set of the human class. After dividing dataset into 75% training set and 25% validation set, and for retrieving the best hyper-parameters of the two autoencoders, we used one of the optimization approaches that are provided in scikit-learn: “GridSearchCV” [65]. It evaluates all potential values of parameter composition and retains the best one. Table 3 shows the autoencoders hyper- parameters after using GridsearchCV . The top hyper-parameters are: – Utilization 256 as a batch size for the both autoen- coders. – The usage of “Adam” and “Nadam” separately as op- timizer functions for the DALS and GloVe-BiLSTM autoencoder. – The DALS loss function is MSE and for GloVe- BiLSTM autoencoder is sparse-categorical- crossentropy. – The learning rate values for the first autoencoder and the second one are 0.0001 and 0.001. – For the hidden activation function the choice fell on “relu” for the DALS and “tanh” for the GloVe- BiLSTM autoencoder. And for the output activation function, linear and softmax functions were selected respectively for the two autoencoders. First, the two autoencoders were trained on the dataset based on human class only using the selected linguistic fea- tures and the best hyper-parameters. Then, the two encoder parts are frozen and cemented to make one features vec- tor. After this phase, six recurrent neural networks models were built as follows: the feature vector was repeated once to create a 3D output utilizing RepeatVector, and it’s fed to the next layer of six classifiers: (1 — SimpleRNN clas- sifier, 2 — BiRNN classifier, 3 — LSTM classifier, 4 — BiLSTM classifier, 5 — GRU classifier, 6 — BiGRU clas- sifier). The units number of cells in each one is fixed at 300. Afterward, to make a one-dimensional vector a flat- ten layer was added. For rending the model more powerful, the output vector is passed to a fully connected layer. Then, the last layer transforms its input into a one result using the sigmoid function [66]. The different recurrent neural classifiers are imple- mented on the whole labeled dataset with 55% of data for 152 Informatica46 (2022) 143–158 Z.F. Hamida et al. Figure 4: Comparison between the writing style of both human and bots at the Lexical level. Table 2: Summary values of Measure of Textual Lexical Diversity distribution across the two lables. MeasureofTextualLexicalDiversity(MTLD) Min Max Mean 25th percentile Median 75th percentile IQR Human 1.0 69.91 25.23 6.0 11.0 31.5 25.5 Bot 1.0 81.55 27.38 8.0 14.0 37.33 29.33 Figure 5: Comparison between the writing style of both human and bots at the syntactic level. the training set, the previous preserved testing set (20% of data), and 25% of data for the validation set using Google Colab environment. The runtime had configured to use Keras [67] API v2.4.3, Tensorflow v2.4.0, Python 3.6.9 - 64bit-, a GPU Hardware accelerator. The classifiers were trained for 150 epochs with256 as a batch size using Adam as optimization function and mse as loss function. For the fully connected layer and output layer we used ReLu and segmoid activation function respectively. We employ dif- ferent metrics : Precision, Recall, F-Measure, Accuracy and Matthew Correlation Coefficient (MCC) [68] to com- pare the classifiers performance. Experiments on the Cresci dataset show that it is possible to forecast with a high degree of accuracy. As we can see from Table 4, the Hybrid-MELAu+BiRNN classifier shows high performance for bot detection, and it is better than the other recurrent classifiers when the overall accuracy is 92.22%. All recurrent classifiers had closely comparable performance. After that, because our Hybrid-MELAu (with BiRNN) model falls under the semi-supervised techniques, we choose to compare its performance with the methods men- tioned in Table 1, the results are presented in Figure 9. As we can see from Figure.9, the Hybrid-MELAu out- performed the other models in terms of the different met- rics. In fact, in this work, we emphasize linguistic features without taking into account the users’ features to discrim- inate the human’s and bots writing style behavior. Hence, this result illustrates the ability of the feature learner based on autoencoders with transfer learning to generate elite fea- tures from latent spaces from the pre-trained encoder part. Certainly that the linguistic features capture sentence level and word level complexity using different lexical and syntactic indexes influence bot identification and show bet- ter results. It also showed that, when compared to hu- man accounts, Bot accounts have a high non-homogenize in their discriminatory behavioral characteristics [25], de- signing a deep linguistic framework with transfer learning founded only on collections of linguistic characteristics is able to define if a single tweet is being written by a hu- man or a bot with good accuracy. Moreover, the generative ability of the part of the pre-trained encoder enhances the predictor to discern differences in the writing styles of both humans and bots. 7 Discussion In this section, we will discuss the main findings of the manuscript and address its implications. Moreover, we will Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 153 Figure 6: Comparison of different POS tagging features. the left part finds out the most characteristics employed by bots in comparison to humans, while the right side shows the features for which humans surpassed the bots. Table 3: GridsearchCV for the best hyper-parameters optimization. Optimizer Hidden Activation Function Output Activation Function Loss batch size learning rate 1 SGD softmax softmax mse 16 0.00001 2 RMSprop softplus softplus sparse-categorical-crossentropy 32 0.0001 3 Adagrad softsign softsign msle 64 0.001 4 Adadelta relu relu categorical-crossentropy 128 0.01 5 Adam tanh tanh kullback-leibler-divergence 256 0.1 6 Adamax sigmoid sigmoid mae 512 - 7 Nadam hard-sigmoid hard-sigmoid binary-crossentropy - - 8 - linear linear hinge - - 9 - elu elu squared-hinge - - 10 - selu selu - - - DALS Adam relu linear mse 256 0.0001 GloVe-BiLSTM autoencoder Nadam tanh softmax sparse-categorical-crossentropy 256 0.001 Figure 7: Variation of MTLD metric between the human and bot tweets Figure 8: Top 17 most important features in the data using Extra Trees Classifier 154 Informatica46 (2022) 143–158 Z.F. Hamida et al. Table 4: Comparison among the various presented approaches in terms of performance. Precision Recall F1-score Accuracy Loss MCC Classifiers: Hybrid-MELAu+SimpleRNN 0.92455 0.9102 0.91375 0.9154 0.0718 0.8347 Hybrid-MELAu+BiRNN 0.9318 0.9169 0.92065 0.9222 0.0654 0.8486 Hybrid-MELAu+LSTM 0.9231 0.908 0.91165 0.9134 0.0728 0.8310 Hybrid-MELAu+BiLSTM 0.93085 0.9168 0.92035 0.9219 0.0657 0.8476 Hybrid-MELAu+GRU 0.92195 0.90705 0.91065 0.9124 0.0740 0.8289 Hybrid-MELAu+BiGRU 0.9311 0.9166 0.92025 0.9218 0.0658 0.8476 Figure 9: Experiments Results. Figure 10: The prediction accuracy of Hybrid- MELAU+BiRNN classifier on an unseen data (Celebrity, pronbots-2019 and political bots dataset) test the proposed model robustness by introducing further experiment. Whilst the majority of work on bot detection has focused on investigating various sets of features, our first concern in this present research is the analysis of the bot writing style through using the Natural Language Pro- cessing (NLP) to find insights about how the linguistic fea- tures helps in bots detection. Our research reveals that cer- tain lexical and syntactic measures are the most significant signs that contribute to distinguishing the writing style of both bots and humans. In fact, the exploratory analysis showed that humans could infer the relationship between different contexts by employing a context-related lexical level (as discussed in section 6.2). Unlike bots, humans in- tend to express and argue their ideas using numeric (digit, date, real numbers). In addition, humans use more phrases in one tweet than the bots. According to the syntactic analysis based on speech- tagging (see Figure 6), we find that although humans mas- ter the language’s syntax, they show creative behavior in their writing style. Therefore, humans make the use of in- terjections in a specific sentence related to their psycho- logical state and their feelings. They might also express a position whether the latter is personal or related to an- other person. For example, in this human tweet “hmm fishy!!” an exclamation sentence existed, conveying that the person expresses arousing feelings of doubt or suspi- cion. As we can note there is no grammatical structure in this sentence, it’s just composed of two words, an interjec- tion (hmm) and an adverb (fishy). Furthermore, the human explains a specific statement taking profit from a variety of adverbs and personal pronouns. It means that they tend to diversify their writing styles according to a somewhat odd approach to tweeting their ideas, such as using words in foreign languages. They don’t also focus on one tense to conjugate verbs. These results represent a strong conclu- sion to discern the difference between humans’ and bots’ writing styles. Furthermore, computing the lexical diversity measures (see Figure.7) would further disseminate the writing style from humans and bots, which can be seen differently in a text depending on specific type-token ratio and vocabu- lary knowledge. We conclude that a successful (well auto- mated) bot includes the NLP approaches to generate tweets and get a rich lexicon. Meanwhile, the human includes their skills with the language to write in an intelligent way Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 155 ("To live with untreated PTSD is to feel like you might die any moment. Again and again. Help costs money."). De- spite the fact that the machine learning techniques used in bots through NLP have improved their ability to generate content with high lexical diversity, as we can see from this bot tweets: "Today’s Inspirational Quote Climb the moun- tains and get their good tidings. Nature’s peace will flow into you as...", there is still a lot to do to imitate the smart human writing. Our second concern was how to develop a hybrid deep learning approach based only on linguistic features that can improve the detection performance. This can be achieved by building a framework in a hybrid fashion Mixing Engi- neered Linguistic features based on Autoencoders (Hybrid- MELAu). In fact, deep neural networks’ versatility allows them to integrate numerous neural building blocks to con- struct a more powerful hybrid model by complementing one another. The autoencoder has shown to be a useful model for modeling latent distributions since it allows you a lot of control. To demonstrate the model’s sturdiness, we tested our framework’s prediction performance on a new unseen dataset that combined three datasets: celebrity, pronbots- 2019, and political bots. The capacity of a predictive model to perform well over a variety of data sets determines its robustness. Therefore, the resilience of the models built in this study was tested on this new dataset after they had been trained with the Cresci dataset. Figure 10 illustrates a good prediction result when applied to an unseen dataset. Our framework ensures efficient detection because once the autoencoder model is trained, its results will be used di- rectly for transfer learning without the need to resort to the two features of learners’ training. The fact that our frame- work is semi-supervised with one-class authorize benefits from the myriad of unlabeled training data for learning task performance amelioration because the amount of unlabeled samples is generally greater and more accessible than the number of labeled samples. Finally, the findings of this work also show that pre-trained models based on transfer learning are able to improve the accuracy of the bots detec- tion. Surprisingly, a set of linguistic features, such as those obtained from our exploratory study, are effective in distin- guishing social bots. In future works, since we have found that the linguistic deep framework with transfer learning model is discernible of the bots writing style, we are go- ing to incorporate the different set of features in our frame- work. This could help for social bots detection accuracy improving. 8 Conclusion We develop the Hybrid-MELAu: a semi-supervised frame- work to model different mixing engineered linguistic fea- tures based on autoencoder, and use the transfer learning to take profit from its strong ability to generalize to un- seen samples, which improve the social bots detection. The framework is composed of two essential parts: the features learner and the predictor. The features learner combine two encoder part from the following two components: i) the DALS and ii) The GloVe-BiLSTM. The DALS maps the content features to higher-order features, which enables the lexical richness to be encompassed. The GloVe-BiLSTM trains two LSTMs instead of one on the input sequences. This can provide reliable semantic features and result in accurate learning on the detection. The proposed approach captures different lexical and syntactic indexes that influ- ence bot detection and shows significant results. Our new mechanism for detecting bot based on a mining writing style effectively detects bots with a 92.22% accuracy rate. Finally, to confirm the gained results and implement a more until study, we plan to apply our approach to data- sets with long corpus, length to provide deep insights about the text diversity impact on the detection process. Fur- thermore, highlighting human behavioral trends might be a fruitful direction for future research, such as their activity and their dynamics, which can be associated with linguistic features. References [1] E. Kajan, N. Faci, Z. Maamar, M. Sellami, E. Ugl- janin, H. Kheddouci, D. Stojanovic, and D. Bensli- mane, “Real-time tracking and mining of users’ ac- tions over social media,” Computer Science and Infor- mation Systems, vol. 17, pp. 403–426, 2020. [Online]. Available: https://doi.org/10.2298/CSIS190822002K [2] X. Zhou and R. Zafarani, “A survey of fake news: Fundamental theories, detection methods, and oppor- tunities,” ACM Comput. Surv., vol. 53, no. 5, 2020. [Online]. Available: https://doi.org/10.1145/3395046 [3] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on social media: A data mining perspective,” SIGKDD Explor. Newsl., vol. 19, no. 1, p. 22–36, 2017. [Online]. Available: https://doi.org/10.1145/3137597.3137600 [4] B. M. Amine, A. Drif, and S. Giordano, “Merging deep learning model for fake news detection,” in 2019 International Confer- ence on Advanced Electrical Engineering (ICAEE), 2019, pp. 1–4. [Online]. Available: https://doi.org/10.1109/ICAEE47123.2019.9015097 [5] L. Azevedo, M. d’Aquin, B. Davis, and M. Zarrouk, “Lux (linguistic aspects under ex- amination): Discourse analysis for automatic fake news classification,” in ACL/IJCNLP (Find- ings), 2021, pp. 41–56. [Online]. Available: https://doi.org/10.18653/v1/2021.findings-acl.4 [6] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón- Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, 156 Informatica46 (2022) 143–158 Z.F. Hamida et al. M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, and T. Mandl, “The clef-2021 check- that! lab on detecting check-worthy claims, previ- ously fact-checked claims, and fake news,” in Ad- vances in Information Retrieval. Cham: Springer In- ternational Publishing, 2021, pp. 639–649. [Online]. Available: https://doi.org/10.1007/978-3-030-72240- 1_75" [7] Z. Ferhat Hamida, A. Refoufi, and A. Drif, “Fake news detection methods: A survey and new perspec- tives,” in Advanced Intelligent Systems for Sustain- able Development (AI2SD’2020). Cham: Springer International Publishing, 2022, pp. 123–141. [On- line]. Available: https://doi.org/10.1007/978-3-030- 90639-9_11 [8] D. Kosmajac and V . Keselj, “Twitter bot detection using diversity measures,” in Proceedings of the 3rd International Conference on Natural Language and Speech Processing. Trento, Italy: Association for Computational Linguistics, 2019, pp. 1–8. [9] F. Wei and U. T. Nguyen, “Twitter bot detection using bidirectional long short-term memory neu- ral networks and word embeddings,” in 2019 First IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), 2019, pp. 101–109. [Online]. Available: https://doi.org/10.1109/TPS-ISA48467.2019.00021 [10] R. De Nicola, M. Petrocchi, and M. Pratelli, “On the efficacy of old features for the detection of new bots,” Information Processing Management, vol. 58, no. 6, p. 102685, 2021. [Online]. Available: https://doi.org/10.1016/j.ipm.2021.102685 [11] I. Alsmadi and M. J. O’Brien, “How many bots in rus- sian troll tweets?” Information Processing Manage- ment, vol. 57, no. 6, p. 102303, 2020. [Online]. Avail- able: https://doi.org/10.1016/j.ipm.2020.102303 [12] N. El-Mawass, P. Honeine, and L. Vercouter, “Sim- ilCatch: Enhanced social spammers detection on Twitter using Markov Random Fields,” Information processing management, vol. 57, p. 102317, 2020. [Online]. Available: https://doi.org/10.1016/j.ipm.2020.102317 [13] K.-C. Yang, P.-M. Hui, and F. Menczer, “How twitter data sampling biases u.s. voter behavior characteri- zations,” ArXiv, vol. abs/2006.01447, 2020. [Online]. Available: https://doi.org/10.48550/arxiv.2006.01447 [14] K. C. Yang, P. M. Hui, and F. Menczer, “Bot elec- tioneering volume: Visualizing social bot activity during elections,” in Companion Proceedings of The 2019 World Wide Web Conference, ser. WWW ’19. New York, NY , USA: Association for Computing Machinery, 2019, p. 214–217. [Online]. Available: https://doi.org/10.1145%2F3308560.3316499 [15] P. Pham, L. T. Nguyen, B. V o, and U. Yun, “Bot2vec: A general approach of intra-community oriented representation learning for bot detection in different types of social networks,” Information Sys- tems, vol. 103, p. 101771, 2022. [Online]. Available: https://doi.org/10.1016/j.is.2021.101771 [16] L. Nizzoli, S. Tardelli, M. Avvenuti, S. Cresci, M. Tesconi, and E. Ferrara, “Charting the landscape of online cryptocurrency ma- nipulation,” IEEE Access, vol. 8, pp. 113 230–113 245, 2020. [Online]. Available: https://doi.org/10.1109%2Faccess.2020.3003370 [17] F. Giglietto, N. Righetti, L. Rossi, and G. Marino, “Coordinated link sharing behavior as a signal to surface sources of problematic information on facebook,” in International Conference on Social Media and Society, ser. SMSociety’20. New York, NY , USA: Association for Computing Ma- chinery, 2020, p. 85–91. [Online]. Available: https://doi.org/10.1145/3400806.3400817 [18] L. Luceri, A. Deb, A. Badawy, and E. Ferrara, “Red bots do it better:comparative analysis of social bot partisan behavior,” in Companion Proceedings of The 2019 World Wide Web Conference, ser. WWW ’19. New York, NY , USA: Association for Computing Machinery, 2019, p. 1007–1012. [Online]. Available: https://doi.org/10.1145/3308560.3316735 [19] L. Luceri, F. Cardoso, and S. Giordano, “Down the bot hole: Actionable insights from a one-year analysis of bot activity on twitter,” First Mon- day, vol. 26, no. 3, 2021. [Online]. Available: https://doi.org/10.5210/fm.v26i3.11441 [20] E. Ferrara, O. Varol, F. Menczer, and A. Flammini, “Detection of promoted social media campaigns,” in Proceedings of the International AAAI Conference on Web and Social Media, vol. 10, 2016, pp. 563–566. [21] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spog- nardi, and M. Tesconi, “Dna-inspired online be- havioral modeling and its application to spam- bot detection,” IEEE Intelligent Systems, vol. 31, no. 5, pp. 58–64, 2016. [Online]. Available: https://doi.org/10.1109/MIS.2016.29 [22] S. Kudugunta and E. Ferrara, “Deep neural net- works for bot detection,” Information Sciences, vol. 467, pp. 312–322, 2018. [Online]. Available: https://doi.org/10.1016/j.ins.2018.08.019 [23] K. Yang, O. Varol, P.-M. Hui, and F. Menczer, “Scal- able and generalizable social bot detection through data selection,” in AAAI, 2020. [Online]. Available: https://doi.org/10.1609/aaai.v34i01.5460 Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 157 [24] M. Heidari and J. H. Jones, “Using bert to ex- tract topic-independent sentiment features for social media bot detection,” in 2020 11th IEEE Annual Ubiquitous Computing, Electronics Mo- bile Communication Conference (UEMCON), 2020, pp. 0542–0547. [Online]. Available: https://doi.org/10.1109/UEMCON51285.2020.9298158 [25] M. Sayyadiharikandeh, O. Varol, K.-C. Yang, A. Flammini, and F. Menczer, “Detection of novel social bots by ensembles of specialized classifiers,” Proceedings of the 29th ACM Inter- national Conference on Information and Knowl- edge Management, 2020. [Online]. Available: http://doi.org/10.1145/3340531.3412698 [26] S. Kumar, S. Garg, Y . Vats, and A. S. Parihar, “Content based bot detection using bot lan- guage model and bert embeddings,” in 2021 5th International Conference on Computer, Communication and Signal Processing (IC- CCSP), 2021, pp. 285–289. [Online]. Available: https://doi.org/10.1109/ICCCSP52374.2021.9465506 [27] V . Gaurav, S. Singh, A. Srivastava, and S. Shidnal, “Codescan: A supervised machine learning approach to open source code bot detection,” in Applied Infor- mation Processing Systems. Singapore: Springer Singapore, 2022, pp. 381–389. [Online]. Available: https://doi.org/10.1007/978-981-16-2008-9_37 [28] A. Praveena and S. Smys, “Effective spam bot de- tection using glow worm-based generalized regres- sion neural network,” in Mobile Computing and Sus- tainable Informatics. Singapore: Springer Sin- gapore, 2022, pp. 469–487. [Online]. Available: https://doi.org/10.1007/978-981-16-1866-6_34 [29] M. Chakraborty, S. Das, and R. Mamidi, “Detection of fake users in twitter using network representation and nlp,” in 2022 14th International Conference on COMmunication Systems NETworkS (COM- SNETS), 2022, pp. 754–758. [Online]. Available: https://doi.org/10.1109/COMSNETS53615.2022.9668371 [30] C. Zhao, Y . Xin, X. Li, H. Zhu, Y . Yang, and Y . Chen, “An attention-based graph neural network for spam bot detection in social networks,” Applied Sciences, vol. 10, no. 22, 2020. [Online]. Available: https://doi.org/10.3390/app10228160 [31] B. Muthu, K. Natesapillai, K. Subburathinam, R. Varatharajan, G. Manogaran, and C. B. Sivaparthipan, “An enhanced graph-based semi- supervised learning algorithm to detect fake users on twitter,” The Journal of Supercom- puting, vol. 75, 09 2019. [Online]. Available: https://doi.org/10.1007/s11227-019-02948-w [32] E. Shaabani, A. Sadeghi-Mobarakeh, H. Alvari, and P. Shakarian, “An end-to-end framework to identify pathogenic social media accounts on twitter,” 2019 2nd International Conference on Data Intelligence and Security (ICDIS), pp. 128–135, 2019. [Online]. Available: https://doi.org/10.48550/arxiv.1905.01553 [33] R. Alharthy, A. Alhothali, and K. Moria, “De- tecting and characterizing arab spammers cam- paigns in twitter,” Procedia Computer Science, vol. 163, pp. 248–256, 01 2019. [Online]. Available: https://doi.org/10.1016/j.procs.2019.12.106 [34] Q. Guo, H. Xie, Y . Li, W. Ma, and C. Zhang, “Social bots detection via fusing bert and graph convolutional networks,” Symmetry, vol. 14, no. 1, 2022. [Online]. Available: https://doi.org/10.3390/sym14010030 [35] J. Rodríguez-Ruiz, J. I. Mata-Sánchez, R. Mon- roy, O. Loyola-González, and A. López-Cuevas, “A one-class classification approach for bot detection on twitter,” Computers & Security, vol. 91, p. 101715, 2020. [Online]. Available: https://doi.org/10.1016/j.cose.2020.101715 [36] C. A. Davis, O. Varol, E. Ferrara, A. Flammini, and F. Menczer, “Botornot: A system to evaluate social bots,” in Proceedings of the 25th International Con- ference Companion on World Wide Web, ser. WWW ’16 Companion. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee, 2016, p. 273–274. [Online]. Available: https://doi.org/10.1145/2872518.2889302 [37] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, Scotts- dale, Arizona, USA, May 2-4, 2013, Work- shop Track Proceedings, 2013. [Online]. Available: https://doi.org/10.48550/arxiv.1301.3781 [38] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceed- ings of the 2014 Conference on Empirical Meth- ods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Lin- guistics, 2014, pp. 1532–1543. [Online]. Available: https://doi.org/10.3115/v1/D14-1162 [39] J. W. Chotlos, “Iv. a statistical and comparative anal- ysis of individual written language samples,” Psycho- logical Monographs, vol. 56, p. 75–111, 1944. [On- line]. Available: https://doi.org/10.1037/h0093511 [40] M. Templin, Certain Language Skills in Children: Their Development and Interrelationships. Min- neapolis,MN: University of Minnesota Press, 1957. [Online]. Available: https://doi.org/10.1086/459642 158 Informatica46 (2022) 143–158 Z.F. Hamida et al. [41] P. Lissón and N. Ballier, “Investigating lexi- cal progression through lexical diversity met- rics in a corpus of french l3,” Discours [En ligne], vol. 23, 2018. [Online]. Available: https://doi.org/10.4000/discours.9950 [42] N. Chipere, D. Malvern, and B. Richards, “Us- ing a corpus of children’s writing to test a solution to the sample size problem affecting type-token ra- tios,” in Corpora and language learners. John Benjamins, 2004, pp. 139–147. [Online]. Available: https://doi.org/10.1075/scl.17.10chi [43] K. Kettunen, “Can type-token ratio be used to show morphological complexity of lan- guages?” Journal of Quantitative Linguistics, vol. 21, p. 223–245, 2014. [Online]. Available: https://doi.org/10.1080/09296174.2014.911506 [44] H. Heaps, Information Retrieval: Computa- tional and Theoretical Aspects. New York: Academic Press, 1978. [Online]. Available: https://doi.org/10.5860/crl_40_03_276 [45] P. M. MacCarthy, “An assessment of the range and usefulness of lexical diversity measures and the po- tential of the measure of textual, lexical diversity,” Ph.D. dissertation, University of Memphis, 2005. [46] P. McCarthy and S. Jarvis, “Mtld, vocd-d, and hd-d: A validation study of sophisticated approaches to lex- ical diversity assessment,” Behavior Research Meth- ods, vol. 42, pp. 381–92, 2010. [Online]. Available: https://doi.org/10.3758/BRM.42.2.381 [47] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely ran- domized trees,” Mach Learn, vol. 63, p. 3–42, 2006. [Online]. Available: https://doi.org/10.1007/s10994- 006-6226-1 [48] Y . Lecun, “Modeles connexionnistes de l’apprentissage (connectionist learning models),” Ph.D. dissertation, Universite de Paris VI, 1987. [49] H. Bourlard and Y . Kamp, “Auto-association by multilayer perceptrons and singular value decomposition,” Biological Cybernetics, vol. 59, p. 291–294, 1988. [Online]. Available: https://doi.org/10.1007/BF00332918 [50] G. Hinton and R. Zemel, “Autoencoders, minimum description length and helmholtz free energy,” in Ad- vances in Neural Information Processing Systems, vol. 6. Morgan-Kaufmann, 1994, pp. 3–10. [Online]. Available: https://doi.org/10.5555/2987189.2987190 [51] K. Lopyrev, “Generating news headlines with recurrent neural networks,” CoRR, vol. abs/1512.01712, 2015. [Online]. Available: https://doi.org/10.48550/arxiv.1512.01712 [52] A. Graves, S. Fernández, and J. Schmidhuber, “Bidi- rectional lstm networks for improved phoneme clas- sification and recognition,” in Artificial Neural Net- works: Formal Models and Their Applications – ICANN 2005. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 799–804. [Online]. Available: https://doi.org/10.1007/11550907_126 [53] A. Kulkarni and A. Shivananda, Natural Lan- guage Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning us- ing Python. Apress, 2019. [Online]. Available: https://doi.org/10.1007/978-1-4842-4267-4 [54] J. Yosinski, J. Clune, Y . Bengio, and H. Lipson, “How transferable are features in deep neural net- works?” Advances in Neural Information Processing Systems (NIPS), vol. 27, 2014. [Online]. Available: https://doi.org/10.48550/arxiv.1411.1792 [55] D. Rumelhart, G. Hinton, and R. Williams, “Learn- ing representations by back-propagating errors,” Na- ture, vol. 323, p. 533–536, 1986. [Online]. Available: https://doi.org/10.1038/323533a0 [56] S. Hochreiter and J. Schmidhuber, “Long Short- Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735 [57] S. Kalyoncu, A. Jamil, E. Karata¸ s, J. Rasheed, and C. Djeddi, “Stock market value prediction us- ing deep learning,” Data Science and Applications, vol. 3, no. 2, pp. 10–14, 2020. [Online]. Available: https://doi.org/10.1186/s40537-020-00333-6 [58] K. Cho, B. van Merriënboer, D. Bahdanau, and Y . Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Pro- ceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Lin- guistics, 2014, pp. 103–111. [Online]. Available: https://doi.org/10.3115/v1/W14-4012 [59] M. Schuster and K. K. Paliwal, “Bidirectional recur- rent neural networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997. [Online]. Available: https://doi.org/10.1109/78.650093 [60] C. Xiong, S. Merity, and R. Socher, “Dynamic mem- ory networks for visual and textual question answer- ing,” ArXiv, vol. abs/1603.01417, 2016. [Online]. Available: https://doi.org/10.48550/arxiv.1603.01417 [61] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi, “The paradigm-shift of social spam- bots: Evidence, theories, and tools for the arms race,” in Proceedings of the 26th International Conference on World Wide Web Companion, ser. WWW ’17 Hybrid-MELAu: A Hybrid Mixing Engineered Linguistic Features. . . Informatica46 (2022) 143–158 159 Companion. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee, 2017, p. 963–972. [Online]. Available: https://doi.org/10.1145/3041021.3055135 [62] S. Cresci, “Mib datasets,” http://mib.projects.iit.cnr.it/dataset.html, 2017, accessed: 2021-01-12. [63] K.-C. Yang, O. Varol, C. A. Davis, E. Ferrara, A. Flammini, and F. Menczer, “Arming the pub- lic with artificial intelligence to counter social bots,” Human Behavior and Emerging Technologies, vol. 1, no. 1, pp. 48–61, 2019. [Online]. Available: https://doi.org/10.1002/hbe2.115 [64] G. Fergadiotis, H. H. Wright, and S. B. Green, “Psychometric evaluation of lexical diversity in- dices: Assessing length effects,” Journal of Speech, Language, and Hearing Research, vol. 58, no. 3, pp. 840–852, 2015. [Online]. Available: https://doi.org/10.1044/2015_JSLHR-L-14-0280 [65] F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Pretten- hofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Pas- sos, D. Cournapeau, M. Brucher, M. Perrot, and Édouard Duchesnay, “Scikit-learn: Machine learning in python,” Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011. [Online]. Avail- able: https://doi.org/10.5555/1953048.2078195 [66] J. Han and C. Moraga, “The influence of the sig- moid function parameters on the speed of backpropa- gation learning,” in From Natural to Artificial Neural Computation. Berlin, Heidelberg: Springer Berlin Heidelberg, 1995, pp. 195–201. [Online]. Available: https://doi.org/10.1007/3-540-59497-3_175 [67] F. Chollet, “Keras: Theano-based deep learning li- brary, 2015„” http://keras. io, 2015, accessed: 2021- 01-18. [68] P. Baldi, S. Brunak, Y . Chauvin, C. A. F. Andersen, and H. Nielsen, “Assessing the ac- curacy of prediction algorithms for classifica- tion: an overview,” Bioinformatics, vol. 16, no. 5, pp. 412–424, 2000. [Online]. Available: https://doi.org/10.1093/bioinformatics/16.5.412