https://doi.org/10.31449/inf.v48i7.5347 Informatica 48 (2024) 29–38

Hierarchical Model Rule Based NLP for Semantic Training Representation Using Multi Level Structures

Fangmian Liu 1, Qiyuan Bian 2,*
1 Henan Vocational College of Tuina, Luoyang, Henan 471023, China
2 Fudan University, Yangpu, Shanghai 200438, China
E-mail: liufangmian@163.com
*Corresponding author

Keywords: NLP, hierarchical model, semantic training, multi-level structures

Received: October 20, 2023

When evaluating large amounts of natural language text, multi-level models are essential for extracting relevant knowledge. Completing these tasks is necessary to solve a variety of problems relating to the generation of textual information as well as its analysis. Developing multi-level models for natural language texts requires a substantial quantity of annotated texts containing several levels of lexical, syntactical, semantic, and narrative information. Because the syntactical annotations are maintained in a tree structure, these annotated texts are frequently referred to as text corpora or treebanks. Semantic treebanks are a relatively recent development in this area: they combine syntactical trees with logically expressed formal representations of phrase meaning. Over the last few years, a great number of semantic treebanks containing both shallow and deep semantic information have been constructed, using a wide range of manual and automatic construction methods. Because few universally accepted standards exist in this quickly developing field, different semantic treebanks include vastly varied kinds of information; this is especially true at the lexical level. The authors of this work investigate a variety of semantic treebanks and the ways in which such treebanks can be used for text modelling. They examine the various kinds of information, such as semantic, narrative, syntactical, and lexical data, that are stored in these treebanks. The authors also study the quantity and character of the relevant corpora, in addition to the key tools used for working with the data contained in treebanks. These methods have a wide range of applications in decision-making processes concerned with the generation and analysis of text. One example of their usage is annotating and retrieving information resources to facilitate collaborative development of an ontology-based domain information space, particularly in scientific research and learning. They can also be used to create and rewrite texts for a variety of purposes, including fiction writing, marketing, and scientific communication.

Povzetek: The research addresses the development and analysis of textual information using hierarchical NLP models, which employ multi-level structures for semantic training, in particular semantic tree structures.

1 Introduction

The use of multi-level modeling for natural language texts is beneficial for solving various problems related to text generation, analysis, annotation, and retrieval. To create reliable multi-level models, a significant amount of annotated textual data must be analyzed. Treebanks, which are annotated text corpora, are a useful resource for modeling purposes. Text analysis requires different types of input data depending on the dimensions being analyzed.
There are several significant classes of modeling data. Semantic data represents the text's direct meaning and is often expressed through first-order logic. Narrative data captures the narrative elements of the text, including the genre, intended audience, and author's style. Syntactical information describes sentence structure and functions, and can be stored using constituency and dependency structure trees. Lastly, lexical information includes word-specific details, such as part-of-speech tags and lemmas. The objective of modeling the lexical aspect of text is to provide guidance for selecting the appropriate phrase to convey the intended meaning. This involves picking a synset from a list of alternative expressions and determining the level of evidence for the synset. To create a robust model for word selection, it is necessary to consider lexical, syntactical, and narrative information. Labeling the meanings of words, constructing dependency trees for phrases and clauses, and analyzing description components all contribute to this process. Synsets rely on semantic relations between other synsets, especially hyponymy, as opposed to predicative semantic constraints, which characterise only a small subset of WordNet relations [19-21]. This recursive approach to language allows for the definition of arbitrary constraints on meaning. Another attempt to establish semantic roles focuses exclusively on verbs; examples include VerbNet [22] and PropBank [23]. Dependency trees are important for modeling natural language texts because the choice of a dependent word may be influenced by the head word in the dependency link. In addition to lexical information, narrative data plays a crucial role, because the selection among restricted alternative expressions can be influenced by numerous elements, such as the text's genre, the background of the narrator, and the intended audience. Modeling the syntactical dimension of text is also essential to aid decision-makers in choosing the appropriate type of sentence, the number and type of clauses, and other structural choices. This requires considering both the narrative information that affects sentence length and structure, and the semantic representation of the sentence's content in predicate logic. Syntactical data can be represented using constituency trees, which can be constructed from dependency trees. The semantic component of text modeling involves organizing a text fragment to convey information effectively. This involves determining the appropriate length of the text fragment, the level of detail required, how to organize the content, and where to add specific details to important information. It also involves determining the number of sentences and their content to create a coherent narrative structure. Modeling this level requires predicate logic semantic information, high-level syntactical information (such as the tops of dependency and constituency trees), and detailed narrative annotation, including text fragments and their components, mode of narration, narrative goals and links, and other relevant information.
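As a concrete illustration of the lexical and syntactical layers described above, the following minimal sketch extracts lemmas, part-of-speech tags, and a dependency tree with Stanza, the parser used later in this paper for sentence splitting; the example sentence is ours, chosen for illustration only.

```python
import stanza

# Download the English models once, then build a pipeline with the
# lexical (tokenize, pos, lemma) and syntactical (depparse) processors.
stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("The tax authority issued a private ruling.")

# Each word carries lexical annotations (lemma, POS tag) and a link to
# its head word; together these links form the dependency tree.
for sent in doc.sentences:
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text:10} lemma={word.lemma:10} pos={word.upos:6} "
              f"head={head:10} deprel={word.deprel}")
```

Constituency trees, where needed, can then be derived from such dependency structures, as noted above.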
The advancement of NLP technology in recent years has been remarkable, with the BERT paper by Devlin et al. [24] being a pivotal moment: it introduced a new neural network architecture and training method that had a significant impact on the expansion of NLP. BERT is a highly versatile tool for a wide range of NLP applications, improving the performance of many benchmarks by over 20%. The BERT neural network is based on self-attention: during training, the algorithm infers a hidden word from its left and right context by attending to each surrounding word. A trained BERT model, also known as a Masked Language Model, can perform various NLP tasks, including paraphrase extraction, question answering, and semantic similarity testing. However, BERT has some limitations, such as its computational complexity: the number of floating-point operations in self-attention grows as O(N²), where N is the length of the input sequence. As a result, input sequence lengths are typically limited to 512 or 1024 tokens.
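As a concrete illustration of the masked-language-model behaviour just described, the following sketch queries a stock BERT checkpoint through the Hugging Face fill-mask pipeline; the example sentence and the bert-base-uncased checkpoint are illustrative assumptions, not artefacts of this paper.

```python
from transformers import pipeline

# A trained BERT model infers a hidden word from its left and right
# context; the fill-mask pipeline exposes this behaviour directly.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("The taxpayer requested a private tax [MASK]."):
    # Each candidate carries a predicted token and its probability.
    print(f"{candidate['token_str']:>12}  {candidate['score']:.3f}")
```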
When existing queries produce unsatisfactory results, business users can modify their search queries by selecting more specific or more general concepts. Adding new terms to existing business taxonomies to better reflect dynamic world news poses two challenges: first, locating existing BI-specific datasets and vocabularies to enrich the taxonomies, and second, adding new concepts while preserving the way business concepts are structured in the current taxonomy.

2 Literature review

According to the International Monetary Fund [18], private tax rulings, also known as PTRs, are a form of guidance that taxpayers can request from tax authorities in order to gain a better understanding of how tax rules apply to their particular circumstances. When taxpayers rely on PTRs, they are often shielded from further taxes, penalties, and interest, and the tax authority is obligated to follow the ruling. PTRs, on the other hand, almost always benefit only the person who requests them and do not set a precedent for subsequent taxpayers. Both taxpayers and tax authorities benefit from the increased consistency and clarity that the private tax ruling system brings to the administration of tax legislation. For intricate or unusual economic transactions, taxpayers can submit tailored PTR applications. To improve transparency and predictability in tax systems, the IMF suggests disclosing private rulings with appropriate redactions. Private tax rulings are a useful instrument for reducing or even eliminating the tax risks involved in large commercial transactions. They frequently serve as the basis for subsequent interpretations and reveal the initial attitude of tax officials in the area. While private tax rulings are frequently used by tax planners and advisors for large corporations, smaller taxpayers also use them as a safeguard. However, preparing a request for a tax ruling is often too complex for the average taxpayer, which is why requests are usually submitted by tax advisors or lawyers and require significant effort from highly skilled tax professionals. Nonetheless, obtaining a private tax ruling is generally safer, less expensive, and quicker than litigating taxes in court. Taxpayers do not submit requests for a ruling lightly; they typically do so only in complicated and uncommon cases. The body of tax rulings therefore provides a direct glimpse into taxpayers' everyday issues and illuminates previously hidden patterns of behaviour. As a result, policymakers can benefit from quantitative analysis of the corpus, since it identifies problematic regions that could be resolved by modifying the tax code.

Table 1: Comparison of state-of-the-art models

Ref  | Technology          | Challenges
[18] | Private Tax Rulings | Often shielded from further taxes, penalties
[19] | WordNet relations   | Opposed to enhancing predicative semantic restraints
[20] | WordNet relations   | Rely on semantic relations
[21] | WordNet relations   | Focuses exclusively on verbs
[22] | VerbNet models      | Difficulty of extending a manually-curated resource
[23] | PropBank            | Takes a practical approach to semantic representation, adding a layer of predicate-argument information
[24] | BERT                | Expensive and requires more computation because of its size

Table 1 summarises the drawbacks of the existing state-of-the-art models; computational complexity stands out as a major challenge of the domain. Relevant knowledge must be extracted through the analysis of vast volumes of natural language texts and the application of multi-level models. Completing these tasks is necessary in order to address a number of issues regarding the creation and interpretation of textual data. To create multi-level models for texts in natural language, a significant amount of annotated texts with different levels of lexical, syntactical, semantic, and narrative information is required. Text corpora or treebanks are the common terms used to describe these annotated texts, whose syntactical annotations are kept in a tree structure.

3 Proposed work

"Semantic Textual Similarity" (STS) is a measure of the degree to which two sections of text are semantically equivalent. Rather than giving a straightforward yes or no answer, algorithms that evaluate semantic similarity typically produce a ranking or percentage indicating the extent of textual similarity. Unfortunately, there is no globally acknowledged definition of semantic equivalence, which means there is no universally accepted definition of STS either. In order to properly evaluate semantic equivalence and similarity, it is essential to take into account the context in which a word or phrase is employed: the context establishes its meaning, which in turn defines how semantically similar it is to other words or phrases. Over the years, various algorithms and techniques have been developed for measuring semantic (textual) similarity, including knowledge-based, corpus-based, and deep learning approaches. By leveraging large corpora and deep learning, semantic similarity techniques are able to quantify the semantic similarity between phrases. The "distributional hypothesis", which assumes that "similar words frequently co-occur", forms the basis of these techniques, but it does not consider the actual meaning of words. The application of transformer-based deep neural network techniques has shown higher performance than the majority of traditional approaches, and the recent success of these techniques has completely reshaped the field of semantic similarity.
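As a minimal sketch of how such a graded similarity score can be computed in practice, the snippet below scores one sentence pair with an off-the-shelf SBERT checkpoint; all-MiniLM-L6-v2 and the example sentences are stand-ins, since the paper uses its own fine-tuned Polish model described later.

```python
from sentence_transformers import SentenceTransformer, util

# Any SBERT checkpoint works for illustration; the paper fine-tunes
# its own Polish model instead of this off-the-shelf one.
model = SentenceTransformer("all-MiniLM-L6-v2")

a = "The landlord bills tenants for electricity consumption."
b = "Tenants are charged by the landlord for the power they use."

emb = model.encode([a, b], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()  # graded score, not yes/no
print(f"semantic similarity: {score:.2f}")
```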
Devlin and colleagues [3] conducted a ground-breaking study in which they presented a novel neural network as well as a new training approach, which together they referred to as BERT. The natural language processing (NLP) algorithm known as BERT, as well as its several variants, is regarded as one of the most effective algorithms currently available, according to a number of studies and benchmarks [29–34]. By analysing both the left and right contexts in which a word or phrase appears, BERT-based language models are able to discern between several meanings of the same word or phrase. The proposed project's workflow is depicted in Figure 1.

Figure 1: Flow process of NLP in semantic training representation

Over the course of several centuries, the legal profession has developed its own specialised language, commonly known as legal (sub)language, which lawyers use to discuss the law. The foundation of this language is legislative language, the terminology used to draft legislation. Linguists often classify legal language as its own distinct sublanguage [39]. A sublanguage has its own unique grammar, a narrow scope [35], specific lexical [36], syntactic [37], and semantic constraints [38], and allows for 'deviant' grammatical rules that are not permitted in the grammar of the dominant language. Given that the legal profession has been dubbed "a profession of words," a strong command of legal language is crucial. We will not delve into the intricacies of legal jargon here; interested readers can find a plethora of literature on the topic.

3.1 NLP for hierarchical modeling

Processing legal content presents a challenge for natural language processing systems. To get the best possible results from NLP algorithms, they should be trained on vast corpora of plain text. A good example of this kind of corpus is the Open Super-large Crawled Aggregated (OSCAR) corpus, which was made public only recently and has a size of many terabytes. For the purpose of processing legal language, ordinary-text language models (LMs) need to be adapted to legal language, ideally using a large corpus of legal texts. However, it is extremely difficult to acquire legal corpora, and many of them are kept private, making them inaccessible to academic scholars. A comprehensive legal corpus would include scholarly works, statutory instruments, judicial decisions, court memoranda and pleadings, commentary on statutes, and administrative law decisions. In our study, we utilized the corpus of private tax rulings to enhance a pre-existing BERT model for Polish, as no corpus of this kind is currently available.
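A sketch of how such a domain adaptation could look with the Hugging Face Trainer is given below; the Polish checkpoint name and the file ptr_corpus.txt are placeholder assumptions, since the paper does not spell out its exact training configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "dkleczek/bert-base-polish-cased-v1"  # assumed Polish BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForMaskedLM.from_pretrained(BASE)

# ptr_corpus.txt is a hypothetical one-ruling-per-line dump of the corpus.
raw = load_dataset("text", data_files={"train": "ptr_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Standard masked-language-model objective: 15% of tokens are masked.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-bert-pl",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator)
trainer.train()
```

Continued pretraining on in-domain text like this is one standard way to adapt a general-purpose LM to a sublanguage; the paper's actual procedure may differ in its details.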
3.2 Similarity among legal semantic texts

In the field of law, a straightforward semantic or linguistic similarity comparison may not be enough. Legal scholars look for content similarities within a legal framework, such as a statute, rule, or judgement, where the words and phrases must have the same legal meaning for the similarities to be considered relevant. The fact that two different sections of text contain the same terms is not sufficient. Legal Semantic Text Similarity (LSTS) is a measure of legal and semantic textual similarity between two segments of text, always evaluated in the same legal context. We want to stress that LSTS applies only to textual similarity: an LSTS algorithm cannot identify the relevance of a textual similarity. In other words, a system that evaluates legal semantic similarity needs to be able to differentiate between the legal meanings of a word or phrase when it is used in a variety of distinct legal situations. In matters pertaining to taxes, the legal framework is often determined by the statute or legislative instrument to which the terminology in question applies. It can be difficult to ascertain what a particular word, phrase, or paragraph in a document is supposed to signify, and the ability to discriminate between the statutory and common meaning is essential in a great number of financial and legal disputes.

We suggest a fresh approach to finding tax rulings with legal semantic textual similarities. Our goal is to find a way to identify judgements that are semantically similar, as a single ruling may run to tens of pages of legal language. Experiments revealed that the tax authority position, which cites pertinent provisions of statutes, other tax rulings, court decisions, and other sources, is the most helpful element of the ruling for semantic similarity searches. This section, which uses formal legal terminology, offers in general a summary of Sections 2, 3, and 4 of the decision (circumstances, enquiries, and the taxpayer's legal standing). In our experiments, we determine the degree to which the various tax authority positions share semantic similarity. We are aware that the tax authority section provides further information about the specific legal setting of the ruling.

Our methodology is founded on SBERT vector embeddings, cosine similarity, and clustering. We compute the semantic textual similarity of SBERT sentence vector embeddings. We determine the cosine similarity of a selected vector to the other vectors by analysing the vector embeddings of every component of every PTR. The cosine similarity between two vectors u, v ∈ ℝ^N is defined by Equation (1):

similarity(u, v) = (uᵀv) / (‖u‖ ‖v‖)        (1)

Using cosine similarity, we identify the K tax authority positions that are the K nearest neighbours in the embedding space. The embedding vectors are then projected using UMAP, and the HDBSCAN clustering algorithm is applied to the UMAP output. Using this strategy, we can identify discrete clusters that accurately reflect the fine semantic characteristics of the judgements.
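The following sketch shows the neighbour-search step under stated assumptions: legal-sbert-pl is a placeholder name for the fine-tuned model, and a few dummy strings stand in for the tax authority sections of the full PTR corpus.

```python
import torch
from sentence_transformers import SentenceTransformer, util

# "legal-sbert-pl" is a placeholder for the fine-tuned Polish SBERT model.
model = SentenceTransformer("legal-sbert-pl")

# In practice this list holds the tax authority section of every PTR;
# a few dummy strings keep the sketch self-contained.
tax_authority_sections = [
    "Tax authority position on the VAT rate for electricity billing.",
    "Tax authority position on the VAT rate for water and sewage.",
    "Tax authority position on IP BOX income qualification.",
]
ref_idx = 0  # index of the reference PTR

# Normalised embeddings make cosine similarity a simple dot product.
embeddings = model.encode(tax_authority_sections, convert_to_tensor=True,
                          normalize_embeddings=True)

scores = util.cos_sim(embeddings[ref_idx], embeddings)[0]  # Equation (1)
k = min(500 + 1, scores.numel())     # K = 500 plus the self-match
top = torch.topk(scores, k=k)

for score, idx in zip(top.values[1:], top.indices[1:]):  # drop self-match
    print(f"PTR {idx.item()}  similarity {score.item():.3f}")
```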
4 Experimental analysis

Processing of the PTR corpus begins with our specialized feature extraction pipeline. The pipeline compiles a comprehensive database of information pertaining to each ruling it processes, including citations to previous judicial decisions, legislation, rules, and schedules, as well as references to other relevant PTRs. The pipeline also breaks each document up into paragraphs and sentences, using the Polish model of Stanford's Stanza. The wording of the PTR is divided into four sections: an explanation of the circumstances (the facts), the taxpayer's questions, the taxpayer's legal position, and an explanation of the tax authority's position. Next, we independently build sentence BERT (SBERT) vector embeddings for each part using our improved BERT model, which is derived from Polish BERT [6]. When a section of text has more than 512 BERT tokens, we break it up into subsections composed of sentences and paragraphs; where possible, one sentence from the preceding subsection is carried over into the next. If embedding a PTR component yields more than one vector, we follow the established protocol and take the mean of all the vectors.

In our research, we employ a number of well-known, open-source Python packages:
• tokenizers, sentence-transformers, and Hugging Face transformers
• HDBSCAN and UMAP
• scikit-learn

The following technique is used to locate PTRs with comparable semantic features. We start by selecting a PTR of interest from the precomputed data frame to use as the reference PTR, and extract its SBERT embedding. We compare the reference PTR with every other PTR using the cosine similarity metric, and then select the top K nearest neighbours; for our experiment, K is set to 500. The cosine similarity of the three PTR sections' 500 nearest neighbours is depicted in Figure 2. To find the PTRs with the highest similarity coefficients, the tax authority position of the reference PTR is compared with those of the top K PTRs. For the next stage of processing, we select the list of 500 PTRs whose tax authority sections are most similar to that of the reference PTR.

Figure 2: Similarity index of the 500 nearest neighbours to the reference PTR

A tax specialist must evaluate the list of the 500 most similar PTRs, because textual similarity alone does not establish legal resemblance between a PTR and the reference PTR. The certified tax advisor who carried out this manual analysis is familiar with the issue in question. The outcome of the analysis was excellent, indeed surprising: all 500 PTRs were validly comparable. Similar outcomes were obtained in experiments using other reference PTRs. The list makes it clear that the system discovered PTRs that were related in both syntactic and semantic terms. Four instances of question-related PTRs identified by the algorithm are provided in Table 2. Because translation might remove some syntactic and semantic elements, we chose not to translate the question wording. However, readers who do not speak Polish may still notice that the questions are phrased differently and have distinct syntax. All of the listed decisions relate to the same matter and are identical in terms of their subject matter and legal standing. It is important to note that while there is only a small degree of word and phrase overlap among the taxpayers' questions, the cosine similarity is very high.
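Before turning to visualisation, the sketch below illustrates the chunking-and-averaging protocol described above for PTR components longer than 512 BERT tokens; the helper functions and the off-the-shelf checkpoint are our own illustrative assumptions, with sentence splitting assumed done beforehand (e.g. by Stanza).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_sentences(sentences, tokenizer, max_tokens=512):
    """Greedily pack sentences into chunks of at most max_tokens BERT
    tokens, carrying one sentence over from each chunk into the next."""
    chunks, current, length = [], [], 0
    for sent in sentences:
        n = len(tokenizer.tokenize(sent))
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current = [current[-1]]                      # one-sentence overlap
            length = len(tokenizer.tokenize(current[0]))
        current.append(sent)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def embed_component(sentences, model):
    """Embed a PTR component; when the text exceeds the token limit,
    average the chunk vectors as the established protocol prescribes."""
    chunks = chunk_sentences(sentences, model.tokenizer)
    vectors = model.encode(chunks)           # one SBERT vector per chunk
    return np.mean(vectors, axis=0)          # mean of all chunk vectors

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the legal model
vec = embed_component(["First sentence of the ruling.",
                       "Second sentence of the ruling."], model)
print(vec.shape)
```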
4.1 Visualisation

The SBERT vector embeddings were used to analyse the 500 PTRs that shared the highest degree of similarity. Because the dimensionality of the embedding vector space (N = 768) is too high for direct viewing, we use the UMAP technique to project the embedding vectors into a two- or three-dimensional space, making the data easier to visualise. UMAP, which stands for uniform manifold approximation and projection for dimension reduction, is a dimensionality-reduction technique that can be used to depict high-dimensional vectors. The results of plotting the projections are illustrated in Figure 3. The two-dimensional diagram already hints that the embedded vectors cluster together, and the three-dimensional figure confirms what the two-dimensional diagram suggests.

Figure 3: Plots of the 500 most comparable decisions' three-dimensional embedding vectors

Because the visualisations show that the vector embeddings contain clusters, we applied a clustering technique. We ran the popular HDBSCAN [9] algorithm on the 2-dimensional UMAP projection using the Scikit-learn toolkit. The same outcomes were obtained when HDBSCAN was applied directly to the vector embeddings, although the runtime was significantly longer.

Figure 4: Plot of detected clusters, coded by color

Figure 5: Plot of the internal structure of clusters with edge bundling

Three clusters were present, as evidenced by the clustering algorithm's output (Fig. 6). The "hammer" plot, the second plot in Fig. 6, displays the inner organisation of the clusters with edge bundling; for more information about this plot, see the HDBSCAN documentation. Red ovals in Fig. 6's dendrogram of samples (rulings) denote clusters [40].

Figure 6: Dendrogram cluster plot

4.2 Cluster analysis

Interestingly, we observed the formation of clusters containing obviously distinct embedding vectors. Upon examination, we discovered that judgements within a cluster share nontrivial semantic properties. The largest cluster we identified consists of decisions addressing more general issues, such as the applicable VAT rate [41] that landlords should use when billing tenants for various services, including water, gas, electricity, and sewage. The second, more concentrated cluster is made up of decisions addressing a more specific question, such as the VAT rate landlords should apply when billing tenants for electricity consumption. A further cluster featured unusual judgements, such as special situations and agreements, and judgements decided in a manner that deviated from accepted practice (case law). We also looked at two other examples: real estate sales and tax rulings pertaining to the IP BOX tax structure [42]. In both cases, cluster searches revealed distinct clusters. It is remarkable how well the semantic information is preserved even in the averaged embedding vectors. There is unquestionably more to explore in this area.
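A sketch of this projection-then-clustering step is shown below; random vectors stand in for the 500 SBERT embeddings, and the parameter values are illustrative rather than the paper's.

```python
import hdbscan
import numpy as np
import umap

# Stand-in for the 500 SBERT embedding vectors (N = 768).
embeddings = np.random.rand(500, 768).astype(np.float32)

# Project the high-dimensional vectors to 2-D, both for visualisation
# and, as in the paper, as input to the clustering step.
projection = umap.UMAP(n_components=2, metric="cosine",
                       random_state=42).fit_transform(embeddings)

# Density-based clustering; the label -1 marks points treated as noise.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(projection)
print("clusters found:", labels.max() + 1)

# Plots like the condensed-tree ("hammer") and dendrogram views of
# Figs. 5-6 come directly from the fitted clusterer:
# clusterer.condensed_tree_.plot(select_clusters=True)
# clusterer.single_linkage_tree_.plot()
```

Clustering on the 2-D projection rather than the raw 768-dimensional vectors mirrors the runtime observation reported above.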
We conclude that our two-step search method, which entails discovering the K nearest neighbours (the K most similar rulings) using the cosine similarity metric and then clustering their embeddings, facilitates both the discovery of semantically and legally related judgements and the homogeneous grouping of the rulings.

5 Conclusion

This study presents a method for extracting thematic characteristics from news articles using corpus-based extraction, pre-trained word embeddings, linked open data, and lexical datasets. Approximately 91% of taxonomy concepts were present in the selected datasets, and their corresponding semantic data was retrieved for taxonomy enrichment. By adjusting the cosine similarity threshold when selecting relevant concepts, the depth of the enhanced business taxonomy can be altered. The scope of a taxonomy built with a high threshold is more limited, while a low similarity threshold may permit the extraction of meaningless concepts. This paper offers original contributions in two areas: a way to search for the decisions most similar to a source decision, and a way to locate groups among the set of most comparable decisions. The findings of this manuscript suggest a number of future research areas, including semantic similarity in legal research and studying the composition of clusters of similar rulings.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.

Funding Statement
This study did not receive any funding in any form.

References
[1] Abeillé, A., 2012. Treebanks: Building and Using Parsed Corpora. Text, Speech and Language Technology, Springer Netherlands.
[2] Abend, O., Rappoport, A., 2013a. UCCA: A semantics-based grammatical annotation scheme, in: IWCS.
[3] Abend, O., Rappoport, A., 2013b. Universal conceptual cognitive annotation (UCCA), in: ACL.
[4] Anikin, A., Litovkin, D., Kultsova, M., Sarkisova, E., 2016. Ontology-based collaborative development of domain information space for learning and scientific research, in: Ngonga Ngomo, A.C., Křemen, P. (Eds.), Knowledge Engineering and Semantic Web: 7th International Conference, KESW 2016, Prague, Czech Republic, September 21-23, 2016, Proceedings, pp. 301–315.
[5] Anikin, A., Sychev, O., Gurtovoy, V., 2019. Multi-level modeling of structural elements of natural language texts and its applications. Advances in Intelligent Systems and Computing 848, 1–8.
[6] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., Schneider, N., 2013. Abstract meaning representation for sembanking, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Association for Computational Linguistics, pp. 178–186.
[7] Bonial, C., Bonn, J., Conger, K., Hwang, J.D., Palmer, M., 2014. PropBank: Semantics of new predicate types, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), European Language Resources Association (ELRA).
[8] Bos, J., 2011. A survey of computational semantics: Representation, inference and knowledge in wide-coverage text understanding. Language and Linguistics Compass 5, 336–366.
[9] Bos, J., Basile, V., Evang, K., Venhuizen, N., Bjerva, J., 2017. The Groningen Meaning Bank, in: Ide, N., Pustejovsky, J. (Eds.), Handbook of Linguistic Annotation. Springer, volume 2, pp. 463–496.
[10] Butler, A., 2017. Treebank semantics parsed corpus. Carnie, A., 2013. Syntax: A Generative Introduction. Introducing Linguistics, Wiley.
[11] Dixon, R., 2009. Basic Linguistic Theory Volume 1: Methodology. Basic Linguistic Theory, OUP Oxford.
[12] D. Prabakaran, S. Sriuppili, "Speech Processing: MFCC Based Feature Extraction Techniques-An Investigation", Journal of Physics: Conference Series, Vol. 1717, No. 1, pp. 1-7, 2021.
[13] Dixon, R., 2012. Basic Linguistic Theory Volume 3: Further Grammatical Topics. Basic Linguistic Theory, OUP Oxford.
[14] Fellbaum, C. (Ed.), 1998. WordNet: An Electronic Lexical Database. MIT Press.
[15] Grimm, S., Hitzler, P., Abecker, A., 2007. Knowledge representation and ontologies, in: Studer, R., Grimm, S., Abecker, A. (Eds.), Semantic Web Services: Concepts, Technologies, and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 51–105.
[16] Hershcovich, D., Abend, O., Rappoport, A., 2017. A transition-based directed acyclic graph parser for UCCA, in: Proc. of ACL, pp. 1127–1138.
[17] Kamp, H., Reyle, U., 1993. From Discourse to Logic. Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer, Dordrecht.
[18] Kultsova, M., Anikin, A., Zhukova, I., 2015. Ontology-based method of electronic learning resources retrieval and integration, in: 2015 6th International Conference on Information, Intelligence, Systems and Applications (IISA), pp. 1–6.
[19] McDonald, R., Crammer, K., Pereira, F., 2005. Online large-margin training of dependency parsers, in: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 91–98.
[20] Palmer, M., Gildea, D., Kingsbury, P., 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics 31.
[21] Parsons, P., Parsons, T., 1990. Events in the Semantics of English: A Study in Subatomic Semantics. Current Studies in Linguistics Series, MIT Press.
[22] Ruppenhofer, J., Ellsworth, M., Petruck, M., Johnson, C., Scheffczyk, J., 2016. FrameNet II: Extended Theory and Practice. Institut für Deutsche Sprache, Bibliothek.
[23] Schuler, K.K., 2005. VerbNet: A Broad-coverage, Comprehensive Verb Lexicon. Ph.D. thesis, Philadelphia, PA, USA. AAI3179808.
[24] Taylor, A., Marcus, M., Santorini, B., 2003. The Penn Treebank: An overview, in: Treebanks. Text, Speech and Language Technology, vol 20. Springer.
[25] D. Prabakaran, H. Sathyapriya, "A Review on Methodologies and Performance Analysis of Device Identity Masking Techniques", International Journal of Scientific & Technology Research, Vol. 8, No. 12, pp. 2018-2022, 2019.
[26] De Martino, Graziella, Pio Gianvito, Ceci Michelangelo. (2021) "PRILJ: an efficient two-step method based on embedding and clustering for the identification of regularities in legal case judgments." Artificial Intelligence and Law. https://doi.org/10.1007/s10506-021-09297-1.
[27] Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. (2019) "BERT: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186.
[28] Dhivya Chandrasekaran and Vijay Mago. (2021) "Evolution of Semantic Similarity—A Survey." ACM Comput. Surv. 54, 2, Article 41 (March 2022), 37 pages. https://doi.org/10.1145/3440755.
[29] IRS (2008) "The Complexity of the Tax Code", Taxpayer Advocate Service, 2008 Annual Report to Congress, Volume One, Internal Revenue Service, Washington, DC.
[30] D. Prabakaran and R. Shyamala, "A Review on Performance of Voice Feature Extraction Techniques," 2019 3rd International Conference on Computing and Communications Technologies (ICCCT), Chennai, India, 2019, pp. 221-231.
[31] Kumar Ankit, Makhija Piyush, Gupta Anuj. (2020) "Noisy text data: Achilles' heel of BERT". Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 16–21.
[32] Leland McInnes, John Healy, James Melville. (2018) "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", arXiv e-prints 1802.03426, 2018.
[33] Leland McInnes, John Healy, Steve Astels. (2017) "hdbscan: Hierarchical density based clustering". Journal of Open Source Software, The Open Journal, volume 2, number 11, 2017.
[34] Mandal Arpan, Chaki Raktim, Saha Sarbajit, Ghosh Kripabandhu, Pal Arindam, Ghosh Saptarshi. (2017) "Measuring similarity among legal court case documents". Proceedings of the 10th Annual ACM India Compute Conference, Association for Computing Machinery, Compute '17, pp. 1–9.
[35] Mandal Arpan, Chaki Raktim, Ghosh Kripabandhu, Ghosh Saptarshi, Mandal Sekhar. (2021) "Unsupervised approaches for measuring textual similarity between legal court case reports". Artificial Intelligence and Law, volume 29, pp. 417–451, 2021.
[36] Mellinkoff David (1963) "The Language of the Law." Little, Brown and Co., 1963, pp. xiv, 454.
[37] Reimers, Nils, and Iryna Gurevych. (2019) "Sentence-BERT: Sentence embeddings using Siamese BERT-networks." arXiv preprint arXiv:1908.10084.
[38] Ricardo J. G. B. Campello, Davoud Moulavi, Joerg Sander. (2013) "Density-Based Clustering Based on Hierarchical Density Estimates", in: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (Eds.), Advances in Knowledge Discovery and Data Mining, PAKDD 2013, Lecture Notes in Computer Science, vol 7819, Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37456-2_14.
[39] Shao Yunqiu, Mao Jiaxin, Liu Yiqun, Ma Weizhi, Satoh Ken, Zhang Min, Ma Shaoping. (2020) "BERT-PLI: Modeling paragraph-level interactions for legal case retrieval." Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3501–3507.
[40] Strąk Tomasz, Tuszyński Michał. (2020) "Quantitative analysis of a private tax rulings corpus". Procedia Computer Science, 2020.
[41] Waerzeggers, Christophe and Cory Hillier (2016) "Introducing An Advance Tax Ruling (ATR) Regime", Tax Law IMF Technical Note 2016 (2), International Monetary Fund, Washington, DC, 20016.