https://doi.org/10.31449/inf.v47i5.4617 Informatica 47 (2023) 63–68

Classifying Argument Component using Deep Learning on English Dataset

William Gunawan 1* and Derwin Suhartono 2
1 Computer Science Department, BINUS Graduate Program, Master of Computer Science, Bina Nusantara University, Jakarta, Indonesia
2 Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
E-mail: william.gunawan001@binus.ac.id, dsuhartono@binus.edu

Keywords: argumentation mining, argument component, bidirectional encoder representations from transformers

Received: January 14, 2023

The study focuses on the argument component in argumentation mining, specifically examining claim and premise types. Various datasets exist for argumentation components, each with different classes. The study evaluates the performance of deep learning architectures, particularly those using contextual embedding as the initial layer. Six datasets with diverse argument components are used for validation. The research provides a comprehensive comparison of deep learning architectures, combining layers such as BERT or word embeddings with LSTM, GRU, or CNN. The results and their implications are discussed in the concluding section of the paper. After conducting several experiments, the study demonstrates significant results with the BERT-BiGRU-CRF architecture.

Povzetek: Študija preučuje komponente argumentacije pri rudarjenju argumentov, pri čemer je osredotočena na arhitekture globokega učenja in kontekstno vstavljanje.

1 Introduction

Argumentation is a major element of human intelligence. The ability to argue is fundamental for humans to understand new problems, perform scientific reasoning, and express, clarify, and defend opinions in everyday life [1]. Therefore, argumentative sentences frequently appear in public spaces such as social media debates, reviews, and scientific articles.
Nevertheless, determining the meaning of an argumentative sentence requires complex processing, and deep learning can accelerate the completion of the task. Natural language processing (NLP) techniques such as argumentation mining can identify and classify the components of arguments contained in a text. By focusing on the automatic identification of argument structures in natural language [2], argumentation mining can capture the structure of an argument so that the reasons behind the stated opinions can be recovered [3]. In addition, the technique is not limited to understanding the meaning of each word. This advantage yields valid argumentative sentences, because the arguments are supported by relevant facts [4]. Furthermore, a comprehension of the relationships between argumentative sentences is needed to obtain the meaning of a sentence [5].

Table 1: Each dataset with its argument components and the count of argument components per class.

Dataset | Argument Components
Web discourse | Backing (205), Claim (183), Premise (499), Rebuttal (65), Refutation (23)
Persuasive essays | Claim (1,160), Major Claim (465), Premise (3,336)
Hotel reviews | Background (157), Claim (936), Implicit Premise (112), Major Claim (259), Premise (385), Recommendation (118)
News comments | Premise (4,294)
Various (Araucaria) | Premise (1,229), Claim (496)
Wiki discussions | Premise (1,299), Claim (1,039)

Datasets of argumentative sentences have been developed over time, as can be seen in Table 1. Several argument components are well established: components such as backing, rebuttal, and refutation have been described in other studies [6]. The majority of datasets consist of premise and claim arguments, while the Web and Hotel datasets contain additional components, namely Backing, Rebuttal, Refutation, Background, Implicit Premise, Major Claim, and Recommendation.
Numerous studies have addressed argument component classification with various datasets and approaches. They produce valuable results; unfortunately, some limitations remain. The datasets in use include social media [7], news [8], essays or articles [9], and Wikipedia discussions [10]. Moreover, the approaches span a wide range of algorithmic complexity: Support Vector Machines (SVM) with maximum entropy [11,12], deep learning [4,13], probabilistic modeling [14], and Transformer-based models such as BERT [15]. This paper focuses on classifying argument components across various types of datasets using several deep learning architectures, specifically BERT-related models. Given the different number of classes in each dataset, the results are expected to provide more insight into each model and dataset. Several previous works notably inspired this research. First, a study applying multi-task learning (MTL) and single-task learning (STL) to six datasets showed that the amount of data and the diversity of classes in existing datasets do not provide the same improvement [2]. Second, work using contextual language models provided promising results in classifying argumentation components [16]. Third, the BLSTM-CNNs model for sequence labeling data [17], and finally, research combining BERT with a bidirectional RNN (recurrent neural network) architecture [18-20]. As a disclaimer, this study does not provide a comparison between the conducted research and the previous ones.

2 Related works

Studies on argumentation mining, especially on the argument component, have produced insightful ideas. Moreover, they apply diverse architectures, and a great number of them deliver strong results. Research on argumentation mining started from its roots in philosophy [21].
The evolution of Artificial Intelligence (AI) algorithms, followed by advances in Machine Learning (ML), has produced impressive progress that attracts the scientific community [22]. The use of machine learning techniques combined with statistical knowledge such as maximum entropy and the rules of Context-Free Grammar (CFG) obtained promising results [11]. Researchers have used plentiful approaches that combine NLP disciplines. Great improvement in the argumentation mining area was achieved using semantic textual similarity (STS) combined with textual entailment [23], and by research using a combination of eight feature groups: structural, lexical, syntactic, contextual, indicator, embedding, probability, and similarity [24]. This shows that the interaction within arguments can be used to recognize them. Another approach identifies argument components using multiclass classification with a Support Vector Machine (SVM), followed by identifying the structure of the argument [25]. Further research applied Word2Vec and semi-supervised learning to sequence-structured argument data in Greek [8]. Deep Learning (DL) techniques have enlarged the research area in argumentation mining, especially for argument components. One study compared Bidirectional Long Short-Term Memory under Single-Task Learning (STL) and Multi-Task Learning (MTL), experimenting on six argumentation mining datasets with the same model. It showed that a complex model can beat a shallow one: MTL outperformed STL on every dataset [2]. Besides that, the problem of imbalanced argument component data has become one of the research branches of argumentation mining; using SVM and a Partial Tree Kernel (PTK), it was shown that imbalanced data can be handled [26]. In many respects, argumentation mining resembles other NLP tasks.
Argumentation mining has been studied using Transformer models [15,16]. Not limited to classifying argumentation components, the Transformer model has also been used to classify argument relations [13], and even to summarize arguments, with promising results [27].

Table 2: Summary of related works results.

Ref. | Year | Data | Technique | Results
[2] | 2018 | Table 1 | MTL and STL using BLSTM | Table 5, Column 2
[13] | 2020 | Corpus US2016 & Moral Maze cross-domain | Transformer model | 70% for US2016 and 61% for Moral Maze cross-domain
[15] | 2020 | Extended MEDLINE Corpus | Fine-tuning SciBERT | F1-score 87%
[25] | 2014 | Persuasive Essays | Multi-class SVM with feature selection | F1-score 72.2%
[26] | 2019 | IBM Topic Corpus | SVM and Partial Tree Kernel (PTK) | F1-score 74%
[27] | 2021 | IBM Debater(R)-ArgKP | Text-to-Text Transfer Transformer | F1-score 98.5%

Table 2 provides a concise summary of similar research that has been published. Most prior research has focused on only one dataset, which motivated the researchers to conduct this study. Following the knowledge gained from previous research, we developed several Transformer-based models, several deep learning models representing the sequence-to-sequence family, and a combination of BERT and deep learning models. This study provides an overview of how each Transformer-based model is able to learn the six datasets, which have different labels and an uneven distribution of data. The comparison between models is given in Section 4.

3 Proposed method

Figure 1: Research Frameworks.

Based on several conducted experiments, the proposed approach is constructed from several deep learning models and uses the framework shown in Figure 1. Initially, the datasets are loaded, then they are trained with a predefined model, and finally compared using the testing metrics.
3.1 Dataset

The experiments utilize six different datasets that have been preprocessed. They are transformed to the token level using BIO tags [28]: B (Begin) marks the first token of a component, I (Inside) marks tokens that follow the begin tag, and O (Outside) marks words that are not part of any class. Furthermore, the amount of data is not well distributed and each dataset has different labels; hence, each dataset is trained and evaluated separately.

Table 3: The datasets used in this research, with the total documents used for training, validation, and testing.

Dataset | Total Training Data | Total Validation Data | Total Testing Data | Total Training Tokens
Web | 136 | 60 | 338 | 21,542
Essays | 108 | 45 | 598 | 21,013
Hotel | 138 | 64 | 36 | 21,042
News | 196 | 86 | 1,645 | 21,031
Various | 192 | 81 | 263 | 21,084
Wikipedia | 130 | 57 | 954 | 21,066

Table 3 presents the sources of the original datasets, which are expected to perform well in identifying argumentation components. The datasets consist of Various (Araucaria) [29], Wikipedia Discussions [10], Hotel Reviews [30], Web Discourse [31], News Comments [32], and Persuasive Essays [25]. The total training, validation, and testing data refer to numbers of documents. Moreover, our research uses roughly 21K training tokens per dataset, as already used in previous research [2].

Table 4: Label distribution (token counts).

Dataset | Labels
Web | Backing (2,557), Claim (953), Premise (5,733), Rebuttal (529), Refutation (472), Other (11,298)
Essays | Claim (3,387), Major Claim (1,454), Premise (9,539), Other (6,633)
Hotel | Background (1,495), Claim (8,110), Implicit Premise (1,626), Major Claim (1,402), Premise (4,574), Recommendation (1,241), Other (2,594)
News | Premise (10,999), Other (10,032)
Various | Premise (9,817), Claim (3,355), Other (7,912)
Wikipedia | Premise (5,340), Claim (1,706), Other (14,020)

The number of tokens for each label in each dataset can be seen in Table 4.
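The BIO transformation described above can be illustrated with a minimal sketch. The `bio_tag` helper and the example sentence are ours for illustration; they are not taken from the authors' preprocessing code.

```python
# Minimal sketch of BIO tagging for argument components, assuming the
# component spans are already known as (start, end, label) token ranges.

def bio_tag(tokens, spans):
    """spans: list of (start, end, label) token index ranges, end exclusive."""
    tags = ["O"] * len(tokens)          # Outside by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # Begin tag on the first token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # Inside tags on the rest
    return tags

tokens = ["Smoking", "should", "be", "banned", "because", "it", "harms", "others"]
# One Claim span (tokens 0-3) and one Premise span (tokens 5-7).
tags = bio_tag(tokens, [(0, 4, "Claim"), (5, 8, "Premise")])
print(list(zip(tokens, tags)))
# [('Smoking', 'B-Claim'), ('should', 'I-Claim'), ('be', 'I-Claim'),
#  ('banned', 'I-Claim'), ('because', 'O'), ('it', 'B-Premise'),
#  ('harms', 'I-Premise'), ('others', 'I-Premise')]
```

Because classification happens at the token level, a component spanning several words contributes several labeled tokens, which is why the token counts in Table 4 are much larger than the component counts in Table 1.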
Each dataset has a relatively broad distribution of labels, but the Web and Wikipedia datasets stand out as having significantly fewer positive than negative labels. In the other datasets, the positive and negative labels are almost evenly distributed.

3.2 Deep learning architectures

Several deep learning approaches are applied in this research. Models that handle sequence problems are preferred over traditional machine learning algorithms. The architectures implement three different types of embedding layers. The first is a BiGRU model without pre-trained word embeddings. The second group consists of BiGRU and BLSTM-CNNs models that use 200-dimensional GloVe vectors trained on six billion tokens as pre-trained word embeddings. The last group consists of models with contextual embeddings: BERT, DistilBERT [33], and BERT-BiGRU-CRF.

Figure 2: BERT-BiGRU-CRF Architecture.

On the prediction layer, BERT-BiGRU-CRF uses a CRF, while the other models use a dense layer with Softmax. Uncased pre-trained models are used for the architectures that include contextual embeddings, and every token is lowercased. In the BLSTM-CNNs architecture, five CNN layers are set in parallel to the word embedding layer; the CNN outputs are later concatenated and followed by two BLSTM layers of 200 units each. The BiGRU models, with or without pre-trained word embeddings, use two BiGRU layers with 200 units each. In the BERT-BiGRU-CRF model, two BiGRU layers with 200 units each and a CRF prediction layer are stacked sequentially. In general, every architecture uses a dropout layer with a rate of 0.5 after the embedding layer.

4 Experiment

Two treatments are applied equally to the proposed architecture models. First, each model is run on the six datasets with two batch sizes, namely 8 and 12, yielding 72 experiments in total.
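The BERT-BiGRU-CRF stack described in Section 3.2 can be sketched at the shape level as follows. This is a simplified sketch, not the authors' implementation: the BERT encoder is replaced by a plain embedding layer so the code runs without the `transformers` library, and the CRF is reduced to its per-token emission scores (a real CRF layer, e.g. from the pytorch-crf package, would add learned transition scores and Viterbi decoding).

```python
import torch
import torch.nn as nn

class BertBiGruCrfSketch(nn.Module):
    """Shape-level sketch: embedding -> dropout -> 2x BiGRU(200) -> tag scores."""
    def __init__(self, vocab_size=30522, hidden=200, num_tags=7, bert_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, bert_dim)  # stand-in for BERT
        self.drop = nn.Dropout(0.5)                      # dropout after embedding
        self.gru1 = nn.GRU(bert_dim, hidden, bidirectional=True, batch_first=True)
        self.gru2 = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)  # CRF emission scores

    def forward(self, token_ids):
        x = self.drop(self.embed(token_ids))
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        return self.emissions(x)  # (batch, seq_len, num_tags)

model = BertBiGruCrfSketch()
scores = model(torch.randint(0, 30522, (8, 512)))  # batch size 8, 512 tokens
print(scores.shape)  # torch.Size([8, 512, 7])
```

The seven output tags here correspond to a dataset such as Hotel after BIO encoding is collapsed to component labels plus Other; datasets with fewer components would use a smaller `num_tags`.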
Second, the models use a sequence length of 512 tokens: sequences shorter than 512 are filled with padding, while tokens beyond 512 are discarded. Other procedures are applied uniformly to every experiment. For instance, the Adam optimizer is used with a learning rate of 3×10^-4 and an epsilon of 10^-8, and the data is trained for up to 100 epochs. Early stopping, a form of regularization, is chosen to counter overfitting. It is configured with a maximum of four errors, meaning that training stops if four iterations fail to reduce the loss value.

5 Results and discussion

This research does not provide a direct comparison with other research; hence, the result of using the full dataset would diverge. All models are evaluated using the Macro-F1 score because we focus on the positive classes. The batch-size-8 experiments produce better results than the batch-size-12 ones for most datasets, except the Web dataset. The Web dataset is difficult to learn and yields unacceptable results with every model. In contrast, the News and Essays datasets are easier to learn. Table 5 shows the comparison results on each dataset. Following the results in Table 5, we can conclude that BERT-BiGRU-CRF outperforms the other models, including the previous research with the same dataset distribution, although the improvement is not significant on some datasets. Poor results are obtained on datasets that have many classes. However, on the datasets with two components, such as Various (Araucaria) and Wikipedia, the score increased by about 10% over the other experiments. Imbalanced data plays a big role in the results: datasets with imbalanced argument components generally yield a poor averaged F1-score.
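The sensitivity of the macro-averaged score to a single rare class can be seen in a small pure-Python computation. The toy labels below are ours for illustration, not drawn from the datasets.

```python
# Why imbalance hurts macro-F1: every class weighs equally in the average,
# so one rare, poorly-learned class drags the score down even when the
# overwhelming majority of tokens are classified correctly.

def f1_per_class(y_true, y_pred, cls):
    tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
    fp = sum(p == cls and t != cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# 9 Premise tokens predicted perfectly, 1 rare Refutation token missed.
y_true = ["Premise"] * 9 + ["Refutation"]
y_pred = ["Premise"] * 10
print(round(macro_f1(y_true, y_pred), 3))  # 0.474, although 9 of 10 tokens are correct
```

A micro-averaged score on the same predictions would be 0.9, which is why the macro average is the stricter and more informative choice when the positive classes are the focus.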
In this case, the class imbalance is the uneven distribution between the positive and the negative class, as well as between the individual positive classes. The models run for very few iterations because of the small amount of data: given the total training data and batch sizes, a model sees only about 14 to 24 batches per epoch. The BERT model, however, reaches 50 epochs without being stopped by the early stopping function. From Table 5, it can be concluded that word embeddings are very influential. Moreover, the vanilla BERT model does not give good results on the Hotel and Web datasets, which have many classes. The combination of BERT and GRU extended with a CRF gives better results on every dataset. In general, the authors discovered a number of flaws in the native Transformer model, including its inability to perform well on data with an uneven distribution, such as Web and Wikipedia. On the other datasets, however, it is competitive with deep learning hybrid models built on Transformers and with non-BERT models. The hybrid approach also cannot perform well on the Web and Wikipedia data. As a result, the pattern of subpar results for these two datasets traces back to the comparatively large gap between the numbers of positive and negative labels.
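The sequence handling and early-stopping rule from Section 4 can be sketched in a few lines of pure Python. This is our reading of the setup, under two stated assumptions: the four-error budget counts consecutive failures to improve the loss, and `fit` consumes a precomputed list of losses instead of training a real model.

```python
MAX_LEN = 512  # fixed sequence length used by every model

def pad_or_truncate(token_ids, pad_id=0):
    """Shorter sequences are padded, tokens beyond MAX_LEN are discarded."""
    return (token_ids + [pad_id] * (MAX_LEN - len(token_ids)))[:MAX_LEN]

def fit(losses_per_epoch, patience=4, max_epochs=100):
    """Stop once the loss fails to improve `patience` times in a row."""
    best, bad, ran = float("inf"), 0, 0
    for loss in losses_per_epoch[:max_epochs]:
        ran += 1
        if loss < best:
            best, bad = loss, 0   # improvement resets the error counter
        else:
            bad += 1
            if bad >= patience:
                break             # four consecutive non-improvements: stop
    return ran

assert len(pad_or_truncate([5, 7, 9])) == MAX_LEN
# The loss stalls after epoch 3, so training stops at epoch 7 (3 + patience 4).
print(fit([1.0, 0.8, 0.6, 0.6, 0.7, 0.6, 0.9, 0.5]))  # 7
```

With 108 to 196 training documents and batch sizes of 8 or 12, each epoch is only the handful of batches mentioned above, which is why the patience is measured in whole epochs rather than steps.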
To further support the analysis, it should be noted that the Hotel dataset, despite a small distribution of positive-label data, still yields promising results due to the distribution among its positive classes.

Table 5: Macro-F1 for AM component.

Dataset | Previous Research [2] | BiGRU | BiGRU + GloVe Embedding | BLSTM-CNNs + GloVe Embedding | BERT | DistilBERT | BERT-BiGRU-CRF
Web | 0.234 | 0.173 | 0.265 | 0.262 | 0.066 | 0.075 | 0.299
Essays | 0.605 | 0.257 | 0.612 | 0.594 | 0.604 | 0.504 | 0.655
Hotel | 0.479 | 0.214 | 0.435 | 0.458 | 0.476 | 0.427 | 0.504
News | 0.577 | 0.477 | 0.632 | 0.619 | 0.558 | 0.407 | 0.677
Var | 0.474 | 0.356 | 0.505 | 0.354 | 0.481 | 0.368 | 0.590
Wiki | 0.325 | 0.305 | 0.380 | 0.397 | 0.312 | 0.251 | 0.434

6 Conclusion

In short, the experimental results show that the BERT model achieves better performance on data with a small number of classes, yet performs poorly on data with numerous classes, such as the Hotel and Web datasets. Besides, the BLSTM-CNNs model obtains stable, though unremarkable, performance. Several conclusions can be drawn from this research:
• BERT-BiGRU-CRF outperforms every other model on each dataset.
• BERT and DistilBERT deliver poor results on datasets with diverse, imbalanced classes. However, they produce results comparable to BLSTM-CNNs or BiGRU with GloVe embeddings on datasets with a small number of classes.
• The CNN layers in the BLSTM-CNNs model yield steady results, as shown by the comparison with BiGRU with GloVe embeddings.
• Imbalanced argument component classes produce worse results for the rarer labels than for the frequent ones. For example, the Refutation component in the Web dataset amounts to only about 5% of the Claim component.

Several improvements in feature engineering and parameter tuning could advance this research. The first is to extend the experiments with another BERT variant such as Big Bird [34]. The second is to use a hybrid model of BERT followed by BLSTM-CNNs in one architecture.
The last is to use bigger word embeddings for the non-BERT models as a point of comparison for this research.

7 References
[1] Mochales, R., & Moens, M. F. (2011). Argumentation Mining. Artificial Intelligence and Law, 19, 1-22. doi:10.1007/s10506-010-9104-x.
[2] Schulz, C., Eger, S., Daxenberger, J., Kahse, T., & Gurevych, I. (2018). Multi-Task Learning for Argumentation Mining in Low-Resource Settings. NAACL-HLT. doi:10.18653/v1/n18-2006.
[3] Lawrence, J., & Reed, C. (2019). Argument Mining: A Survey. Computational Linguistics, 45, 765-818. doi:10.1162/COLI_a_00364.
[4] Suhartono, D., Gema, A. P., Winton, S., David, T., Fanany, M. I., & Arymurthy, A. M. (2020). Argument annotation and analysis using deep learning with attention mechanism in Bahasa Indonesia. Journal of Big Data, 7(90). Springer. doi:10.1186/s40537-020-00364-z.
[5] Gema, A. P., Winton, S., David, T., Suhartono, D., Shodiq, M., & Gazali, W. (2017). It Takes Two To Tango: Modification of Siamese Long Short Term Memory Network with Attention Mechanism in Recognizing Argumentative Relations in Persuasive Essay. Procedia Computer Science, 116, 449-459. doi:10.1016/j.procs.2017.10.036.
[6] Hunter, A. (2007). Elements of Argumentation. doi:10.1007/978-3-540-75256-1_3.
[7] Lippi, M., & Torroni, P. (2016). Argumentation Mining: State of the Art and Emerging Trends. ACM Trans. Internet Techn., 16, 10:1-10:25. doi:10.1145/2850417.
[8] Sardianos, C., Katakis, I. M., Petasis, G., & Karkaletsis, V. (2015). Argument Extraction from News. ArgMining@HLT-NAACL. doi:10.3115/v1/w15-0508.
[9] Stab, C., Kirschner, C., Eckle-Kohler, J., & Gurevych, I. (2014). Argumentation Mining in Persuasive Essays and Scientific Articles from the Discourse Structure Perspective. ArgNLP.
[10] Biran, O., & Rambow, O. (2011). Identifying Justifications in Written Dialogs by Classifying Text as Argumentative. Int. J. Semantic Comput., 5, 363-381. doi:10.1142/s1793351x11001328.
[11] Palau, R., & Moens, M. (2009).
Argumentation mining: the detection, classification and structure of arguments in text. ICAIL. doi:10.1145/1568234.1568246.
[12] Lippi, M., & Torroni, P. (2015). Argument Mining: A Machine Learning Perspective. TAFA.
[13] Ruiz-Dolz, R., Barberá, S. H., Alemany, J., & García-Fornes, A. (2020). Transformer-Based Models for Automatic Identification of Argument Relations: A Cross-Domain Evaluation. ArXiv, abs/2011.13187. doi:10.1109/mis.2021.3073993.
[14] Culotta, A., McCallum, A., & Betz, J. (2006). Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text. HLT-NAACL. doi:10.3115/1220835.1220873.
[15] Mayer, T., Cabrio, E., & Villata, S. (2020). Transformer-Based Argument Mining for Healthcare Applications. ECAI.
[16] Hidayaturrahman, Dave, E., Suhartono, D., & Arymurthy, A. M. (2021). Enhancing argumentation component classification using contextual language model. Journal of Big Data, 8(1), 103. doi:10.1186/s40537-021-00490-2.
[17] Gunawan, W., Suhartono, D., Purnomo, F., & Ongko, A. (2018). Named-Entity Recognition for Indonesian Language using Bidirectional LSTM-CNNs. Procedia Computer Science, 135, 425-432. doi:10.1016/j.procs.2018.08.193.
[18] Yu, Q., Wang, Z., & Jiang, K. (2021). Research on Text Classification Based on BERT-BiGRU Model. Journal of Physics: Conference Series, 1746(1). doi:10.1088/1742-6596/1746/1/012019.
[19] Tobias, M., Elena, C., & Serena, V. (2020). Transformer-based Argument Mining for Healthcare Applications. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), Santiago de Compostela, Spain. doi:10.3233/faia325.
[20] Qin, Q., Zhao, S., & Liu, C. (2021). A BERT-BiGRU-CRF Model for Entity Recognition of Chinese Electronic Medical Records. Complexity, 2021, 1-11. doi:10.1155/2021/6631837.
[21] Toulmin, S. (2008). The Uses of Argument, Updated Edition.
[22] Lytos, A., Lagkas, T., Sarigiannidis, P., & Bontcheva, K. (2019). The evolution of argumentation mining: From models to social media and emerging tools. ArXiv, abs/1907.02258. doi:10.1016/j.ipm.2019.102055.
[23] Boltuzic, F., & Šnajder, J. (2015). Identifying Prominent Arguments in Online Debates Using Semantic Textual Similarity. ArgMining@HLT-NAACL. doi:10.3115/v1/w15-0514.
[24] Winata, R., Haryono, E. G., & Suhartono, D. (2021). Toward Better Argument Component Classification in English Essays. ICIC Express Letters, vol. 12, 111-119.
[25] Stab, C., & Gurevych, I. (2014). Identifying Argumentative Discourse Structures in Persuasive Essays. EMNLP. doi:10.3115/v1/d14-1006.
[26] Kusmantini, H. A., Asror, I., & Bijaksana, M. A. (2019). Argumentation mining: classifying argumentation components with Partial Tree Kernel and Support Vector Machine for constituent trees on imbalanced persuasive essay. doi:10.1088/1742-6596/1192/1/012009.
[27] Harly, W., Kwee, R. H., & Suhartono, D. (2022). Quantitative Argument Summarization Using Text-to-Text Transfer Transformer. ICIC Express Letters Part B: Applications, vol. 13, 749-756. doi:10.24507/icicelb.13.07.749.
[28] Ratinov, L., & Roth, D. (2009). Design Challenges and Misconceptions in Named Entity Recognition. Conference on Computational Natural Language Learning. doi:10.3115/1596374.1596399.
[29] Reed, C., Palau, R., Rowe, G., & Moens, M. (2008). Language Resources for Studying Argument. LREC.
[30] Liu, H., Gao, Y., Lv, P., Li, M., Geng, S., Li, M., & Wang, H. (2017). Using Argument-based Features to Predict and Analyse Review Helpfulness. ArXiv, abs/1707.07279. doi:10.18653/v1/d17-1142.
[31] Habernal, I., & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics, 43, 125-179.
[32] Habernal, I., Wachsmuth, H., Gurevych, I., & Stein, B. (2017). The Argument Reasoning Comprehension Task. ArXiv, abs/1708.01425. doi:10.1162/coli_a_00276.
[33] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108. [34] Zaheer, M., Guruganesh, G., Dubey, K.A., Ainslie, J., Alberti, C., Ontañón, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). Big Bird: Transformers for Longer Sequences. ArXiv, abs/2007.14062.