https://doi.org/10.31449/inf.v48i5.5337 Informatica 48 (2024) 41–54

Web News Media Retrieval Analysis Integrating with Knowledge Recognition of Semantic Grouping Vector Space Model

Wenting Xiong
School of Journalism and Communication, Hunan Mass Media Vocational and Technical College, Changsha, Hunan, 410100, China
E-mail: 2016150180@jou.edu.cn

Keywords: vector space model, semantic grouping, searching for information, web news media retrieval

Received: October 26, 2023

Traditional Web news media retrieval technology can only meet the specific requirements of customers. Because of its universal characteristics, it cannot simultaneously meet the needs of different environments, different purposes, and different times. Researchers have proposed a search method for online news media based on computing a semantic grouping vector space model. The customer's interest model is analyzed through the characteristics of the user's different classification areas. In this paper, we propose a vector space model that performs semantic grouping based on feature words. The model divides the feature words of a news report into four groups that are relatively independent in meaning: time, place, person, and event. It then forms four vector spaces and calculates the weight and similarity for each vector space. Theoretical analysis and experimental results show that the improved model is suitable for searching Web news information and improves the calibration rate and query speed.

Povzetek (summary): The study presents an improved vector space model for searching Web news, which incorporates semantic grouping by keywords into a few categories.

1 Introduction

Early search engines generally used content-based search methods, which were developed on the basis of the theory and technology of traditional information retrieval.
The primary considerations are the relationship between web pages and search terms and the frequency and location of the query terms in documents. This approach improves search quality and accuracy to a certain extent, but it is based on keyword queries. Synonyms and polysemous words in natural language cannot be handled, so the retrieval rate is not high and the approach has certain limitations [1-2].

In recent years, the rapid progress of information technology has ushered in the digital age, and paper files are gradually being replaced by digital and electronic files. How to quickly and accurately find the information a user needs in a Web news media database, and how to balance the supply of and demand for archive information, are significant current problems. Web news media retrieval is a helpful solution. By using a personalized search system that draws on the information shared among users, new calculation methods for the semantic grouping vector space model can be developed. The most important factor is the user type, but user participation is required because of the calculation method and its collaborative nature. Personalized search methods based on the semantic grouping vector space model have their own characteristics.

With the continuous development of the multimedia entertainment field, Web news media has gradually become part of people's lives. However, adapting large volumes of Web news media data to different environments and user characteristics, and recommending Web news media suitable for each user, remains a challenge. This paper presents the knowledge recognition of the semantic grouping vector space model in the Web news media retrieval process. It conducts Web news media retrieval by using the deep belief network algorithm in the audio segmentation stage of the Web news media.
Finally, the experimental results show that the method proposed in this paper can quickly retrieve the corresponding type of Web news media according to the user's preferences.

| References | Key findings | Methodology | Limitation |
|---|---|---|---|
| [10] | The study focused on the Hermes Framework for Personalized News Services, with an emphasis on implementing Semantic Web standards and utilizing topic data. | The study included approaches such as Ontology-based Knowledge Representation, Natural Language Processing (NLP) for Semantic Text Analysis, and Semantic Query Languages for Information Specification. | The challenges included complexity, difficulty in acquiring knowledge, reliance on Semantic Web standards, and scalability concerns. |
| [11] | The study examined the effective extraction of events from a news corpus, emphasizing the strength of Latent Dirichlet Allocation (LDA) in collecting semantic connections. It also evaluated the efficacy of Continuous Bag of Words (CBOW) and Skip-gram models in obtaining semantic similarity. | The process of preparing the online news corpus involved diverse research approaches. | The study addressed limitations within the news corpus, generalization difficulties, and reliance on algorithmic parameters. |
| [12] | The research focused on gathering data using web crawling and extracting features using NLP techniques. | The study comprised three primary stages: Online News Corpus Preparation, NLP for Feature Extraction, and the Selection of Machine Learning Models. | The study examined many factors pertaining to generalization, the ever-changing nature of fake news, and ethical implications. |
| [13] | The study discussed the challenges of collecting high-level information, proposed a compact event representation, and introduced a hypothesis for the facilitation of social and geopolitical analysis in the context of the rise of the Social Web. | The study emphasized the use of data mining tools to examine event representations, integrating case studies and user evaluation procedures. | The constraints included data velocity and volume, dependence on data quality, and the extent of user examination. |
| [14] | The study focused on the deployment of the Hermes Framework for Personalized News in the Hermes News Portal. | Three main approaches were examined: the implementation of the Hermes News Portal, semantic query languages, and semantic text analysis. | The limitations were complexity, the learning curve, and dependence on the quality of the ontology. |
| [15] | The study concentrated on the challenges associated with traditional search engines and the integration of NLP. | The study used Web Crawling and NLP techniques to gather and preprocess data, followed by applying Sentiment Analysis algorithms. | The study factored in reliance on data sources, sentiment analysis limitations, and resolving plurality. |
| [16] | The paper focused on the Semantic Model that employed Term Frequency-Inverse Document Frequency (TF-IDF) and evaluated its performance using evaluation metrics. | The research focused on three key aspects: data preprocessing, the utilization of machine learning and deep learning algorithms, and the creation of a robust prediction model. | The study presented challenges in generalizing findings, allocating resources, and accounting for temporal variables. |
| [17] | The study focused on analyzing the level of financial area analysis and the importance of material on news websites. | The study utilized methods such as data collection, the implementation of a supervised machine learning model, and hyper-parameter optimization using Grid Search. | The complexities of the Lithuanian language and algorithmic sensitivity restricted the research. |
| [18] | The study focused on the various challenges faced by conventional search engines, the integration of sentiment analysis technologies, and the subsequent deployment of intelligent search capabilities. | The research involved data collection, web crawling, NLP for text preprocessing, and the implementation of sentiment analysis algorithms. | The research depended on the data source and required careful consideration of temporal aspects. |
| [19] | The primary findings included an analysis of annual publication trends and subject distribution, highly cited literature and research hotspots, and an evaluation of module functionality and integration. | The research involved analyzing annual trends and subject distribution, as well as creating a deep-learning model. | The findings are limited by assumptions made in vote prediction, the complexity of the model, and resource constraints. |

To overcome these challenges, we propose a vector space model that focuses on semantic grouping based on feature words. The paper aims to organize news information into different categories based on the meaning of specific words.

2 Semantic grouping vector space model

The knowledge recognition process based on the semantic grouping vector space model mainly addresses the problems that occur during database retrieval. To solve these problems fully, the knowledge algorithm of the semantic grouping vector space model must be used to optimize the parameters of the model. This paper adopts the strategy of weight sharing. Weight sharing means that the semantic grouping vector space model differs from the conventional vector model in its connection mode and parameter sharing mode: the model can be locally connected, and data information can be shared. Weight sharing mainly refers to the collection of parameter data across multiple nodes in the hierarchical process.
The feasibility of sharing data parameters is primarily related to the various goals in the calculation process. Unlike the traditional method, the semantic grouping vector space model takes the initial feature values of the collected input signal and learns the input signal through hierarchical retrieval [3]. The time-domain features usually include the average amplitude difference; this article uses the short-time energy, among others, as the initial input feature of the model. The Web news media search signal is denoted by $x(n)$, and the short-time energy is expressed as follows:

$$E_n = \sum_{m=-\infty}^{\infty} \left[ x(m)\, w(n-m) \right]^2 = \sum_{m=-\infty}^{\infty} x^2(m)\, h(n-m) = x^2(n) * h(n) \tag{1}$$

where $w(n)$ is the window function and $h(n) = w^2(n)$.

However, the high dimensionality of the initial time-domain features introduces considerable interference and noise. Therefore, the dimensionality of the input search signal needs to be reduced. Principal component analysis is applied to the multiple variables under investigation, and the internal structure among the variables is analyzed through a small number of principal components. After the Web news media data has been reduced in dimensionality, the input data can be retrieved. In deep learning, the semantic grouping vector space model can further process the output data. The Web news media retrieval structure at this stage is shown in Figure 1.

Figure 1: Semantic grouping vector space structural diagram.

2.1 Deep belief network algorithm in the audio segmentation stage

The Deep Belief Network (DBN) algorithm, utilized in audio segmentation, employs a hierarchical, generative model consisting of multiple layers of stochastic, latent variables. Initially trained layer by layer, the DBN captures intricate patterns in audio data through unsupervised learning.
The top layer of the network functions as a discriminative model, facilitating the identification of relevant features for segmentation. Through iterative fine-tuning of weights during training, the DBN learns hierarchical representations, enabling the extraction of meaningful audio features. This hierarchical approach enhances the algorithm's efficiency in discriminating between various audio segments, thereby improving its ability to segment and classify different components within the audio signal accurately.

2.2 Knowledge recognition algorithm of semantic grouping vector space model

A news report contains four elements: time, place, person, and event. Therefore, for Web news information, feature words are first distinguished according to these four elements, and the associated semantic groups are defined to form four vectors. The vector space to which each feature word belongs is then determined, an inverted index is established for each vector space, and the weight and similarity of the feature words are calculated for each semantic group. Finally, the weighted sum of the similarities is obtained, and the search results that exceed a specific threshold are sorted using link analysis [4-5].

2.3 Weight and similarity of feature word

In Web news information, feature words appear at different positions in a document, and their importance for expressing the content of the document differs accordingly. The feature word level describes this characteristic. If the feature word $T$ appears $n$ times in the document, its level score is given by formula (2), where $T_k$ denotes the level of the $k$-th appearance of the feature word $T$ in the document.
( ) ln 1 1 2 k n T k score T = =  (2)

The weight of the feature word is

$$w(T_k) = score(T) \times idf \tag{3}$$

Because the conventional VSM similarity is expensive in computation and time, the weighted similarity between the query $Q$ and a specific news feature word group $D$ is calculated as the ratio of the total weight of the feature words shared by $Q$ and $D$ to the sum of all weights in $Q$ and $D$, as in formula (4).

$$sim(Q, D) = \frac{\sum_{k=1}^{|Q \cap D|} \left( w(T_{qk}) + w(T_{dk}) \right)}{\sum_{k=1}^{|Q|} w(T_{qk}) + \sum_{k=1}^{|D|} w(T_{dk})} \tag{4}$$

2.4 Semantic grouping vector space calculation method

For the vector space model, the conventional calculation method is the cosine similarity between vectors. The similarity between user $u$ and Web news media $d$ in the semantic grouping vector space can be defined as:

$$Sim(u, d) = \frac{u \cdot d}{|u| \times |d|} \tag{5}$$

For a probability model, however, the cosine similarity of vectors cannot be calculated directly [6-7]. The following proposition is introduced to express the diversity of user interests.

Proposition 1. Assuming that user $u$ is conditionally independent of multimedia digital archive $d$ given the predetermined classification model $C = \{c_1, c_2, \ldots, c_n\}$, the probability that multimedia digital archive $d$ is recommended to user $u$ is:

$$p(u \mid d) = p(u) \sum_{j=1}^{n} \frac{p(c_j \mid u)\, p(c_j \mid d)}{p(c_j)} \tag{6}$$

Proof: By the total probability formula,

$$p(u, d) = \sum_{j=1}^{n} p(u, d \mid c_j)\, p(c_j) \tag{7}$$

Since user $u$ is assumed to be independent of the multimedia digital file $d$ under condition $C$, we have $p(u \mid d, c_j) = p(u \mid c_j)$, and therefore $p(u, d \mid c_j) = p(u \mid c_j)\, p(d \mid c_j)$.
Therefore, formula (7) can be transformed into

$$p(u, d) = \sum_{j=1}^{n} p(u \mid c_j)\, p(d \mid c_j)\, p(c_j) \tag{8}$$

Dividing by $p(d)$ gives

$$p(u \mid d) = \frac{\sum_{j=1}^{n} p(u \mid c_j)\, p(d \mid c_j)\, p(c_j)}{p(d)} \tag{9}$$

Applying Bayes' rule, $p(u \mid c_j) = p(c_j \mid u)\, p(u) / p(c_j)$ and $p(d \mid c_j) = p(c_j \mid d)\, p(d) / p(c_j)$, to formula (9) yields formula (6). The purpose is to transform the semantic grouping vector space problem of the probability model into the calculation of conditional probabilities, which expresses the diversity of user interests.

The system records the user's search history and clicks and continuously collects them as the data source for the user behavior model. This is done automatically, so the user experience is not disturbed. First, the user's historical search information in the browser is saved to learn the user's interests; then the user's interest in the search results is inferred from the user's operations on those results. Time stamps are added to the interest data so that the points of interest can be updated toward what the user currently cares about most. The design process of the user interest model is shown in Figure 2.

Figure 2: Design process of user modeling.

After Chinese word segmentation has been completed with IK Analyzer, the vector space model is constructed, and the weight $w_i$ is obtained from the frequency of occurrence of keyword $k_i$ in the document through the TF-IDF formula.

$$w_i = tf_i \times idf_i \tag{10}$$

Here $tf_i$ is the frequency with which the keyword $k_i$ appears in the generated texts, and $idf_i$ is the inverse document frequency of $k_i$ over all generated texts. It is calculated as follows:

$$idf_i = \log \frac{N}{n} \tag{11}$$

where $N$ is the number of generated texts and $n$ is the number of texts containing the keyword $k_i$. The weight of keyword $k_i$ can then be adjusted as follows.
$$w_i' = w_i\, e^{-t} \tag{12}$$

$$d = \left( (k_1, w_1'), (k_2, w_2'), \ldots, (k_i, w_i') \right) \tag{13}$$

When a document $X = (x_1, x_2, \ldots, x_n)$ is compared with the user interest model $W = (w_1, w_2, \ldots, w_n)$, the angle $\theta$ between the two vectors is calculated. The angle is inversely related to the degree of user interest: the smaller $\theta$ is, the higher the correlation between the file and the user's interests and preferences. The calculation formula is as follows.

$$sim(X, W) = \cos\theta = \frac{\sum_{i=1}^{n} x_i w_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\, \sqrt{\sum_{i=1}^{n} w_i^2}} \tag{14}$$

2.5 Web news media retrieval analysis

In multimedia digital archives, user Web news media retrieval consists of three stages: collecting and analyzing user data, constructing the user interest model, and updating the user interest model. To obtain user information, the Web news media system must first collect the user's explicit information, such as the registered account, age, education, occupation, employer, and keywords of interest [8-9]. The user can modify and supplement this explicit information to improve it gradually. However, some users are unwilling to provide accurate registration information because of privacy or time concerns. To solve this problem, implicit information can be used to extract knowledge about the user, such as the keywords searched for, the bookmarks kept, and the files downloaded and saved. From the bookmarks maintained by the user and the downloaded and saved documents, the user's long-term concerns, research field, and similar attributes can be determined, which become important sources of information for establishing the model.
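Once interest keywords have been collected, the weighting of Eqs. (10)-(12) and the matching of Eq. (14) can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: the function names are hypothetical, and the unit of the elapsed time `t` is an assumption, since the text does not state one.

```python
import math


def idf(num_docs, docs_with_term):
    """Eq. (11): idf_i = log(N / n)."""
    return math.log(num_docs / docs_with_term)


def decayed_weight(tf, num_docs, docs_with_term, elapsed):
    """Eqs. (10) and (12): TF-IDF weight damped by exp(-t).

    `elapsed` is the time since the keyword was last observed; its unit
    (for example days) is an assumption made for this sketch.
    """
    return tf * idf(num_docs, docs_with_term) * math.exp(-elapsed)


def cosine_similarity(x, w):
    """Eq. (14): cosine of the angle between document x and interest model w."""
    dot = sum(a * b for a, b in zip(x, w))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0


# Hypothetical interest model over three keywords and two candidate documents.
interest = [decayed_weight(3, 100, 10, 0.0), decayed_weight(2, 100, 5, 1.0), 0.0]
doc_a = [1.0, 1.0, 0.0]  # mentions the two keywords the user cares about
doc_b = [0.0, 0.0, 1.0]  # mentions only a keyword with zero interest weight
print(cosine_similarity(doc_a, interest) > cosine_similarity(doc_b, interest))
```

In this toy setting `doc_a` is closer to the interest model than `doc_b`, which corresponds to a smaller angle θ in Eq. (14).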
Because the user's interest is not fixed, an update mechanism must be established when building the model so that forgotten topics are removed in time, new content is added, the weights of the user's interests are recalculated, and the interests are ranked by weight. Human forgetting is fast at first and then gradually slows down. In the interest model system, the weight of each keyword of interest is multiplied by a factor that depends on the update time, the weights are sorted, and the forgotten interest topics are deleted. The user's effective behavior is then tracked, and each newly harvested keyword has its weight recalculated. If the weight exceeds the threshold, the keyword is added to the user interest model, which completes the model update.

It follows from Proposition 1 that query results can be ranked by recommendation probability using the semantic grouping vector space model. Since $p(u)$ is the same for every archive presented to a given user, it does not affect the ranking by recommendation probability and can be omitted. The retrieval calculation for multimedia digital archive users is detailed below.

Algorithm 1. Web news media retrieval analysis algorithm based on knowledge recognition of semantic grouping vector space model.
Input: domain classification model, user interest model, retrieval keywords, search engine.
Output: Web news media search results for the multimedia digital archive user.
(1) According to the search keywords, the search engine is used to generate a preliminary search result set X.
(2) Set the number of iterations i=0.
(3) For the i-th Web news media in the set X, formula (1) is used to calculate the probability distribution in the domain classification model.
(4) Equation (9) is used to calculate the probability that the multimedia digital file i is recommended to the current user, and the result is added to the list Y.
(5) If the multimedia digital file i is the last multimedia digital file in the set X, go to (6); otherwise, set i=i+1 and return to (3).
(6) Sort the multimedia digital files in the list Y by probability in descending order and output them.

Because the algorithm is built on top of another search engine, the probability distribution in the domain classification model must be calculated for each multimedia digital archive in the search results. This has a great impact on the performance of the algorithm. If the search engine calculates the probability distribution of each Web news media item in the domain classification model in advance, the performance of the algorithm improves significantly and can meet the needs of real-time processing.

3 Web news media retrieval analysis process

3.1 Web news media retrieval

The experimental system consists of four parts: the browser plug-in, the personal manager, the user model learner, and the personalized information searcher, as shown in Figure 3. The browser plug-in provides users with convenient tools. After the user has registered and logged in, the browser plug-in can be used to perform Web news media retrieval over the multimedia digital archives without logging in to the server. In addition, the browser plug-in collects the user's personal information and transmits it to the server. The personal manager manages the user's personal information, hobbies, and bookmarks. The user model learner tracks user behavior in order to learn user interests. The personalized information searcher (the Web news media retriever) completes the user's queries and recommendations over the multimedia digital archives using the semantic grouping vector space model.

Figure 3: System architecture.
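Returning to Algorithm 1 of Section 2.5, its ranking loop can be sketched as below. The sketch assumes that the category distributions p(c_j|u), p(c_j|d), and p(c_j) are already available as dictionaries; how they are estimated is outside its scope. It ranks by the sum in Eq. (6) and drops the constant factor p(u), which does not change the order. All function names and data are hypothetical.

```python
def recommendation_score(user_dist, doc_dist, prior):
    """Sum over categories of p(c_j|u) * p(c_j|d) / p(c_j), i.e. Eq. (6) without p(u)."""
    return sum(
        user_dist[c] * doc_dist.get(c, 0.0) / prior[c]
        for c in user_dist
        if prior.get(c, 0.0) > 0.0  # skip categories with no prior mass
    )


def rank_results(results, user_dist, prior):
    """Steps (2)-(6) of Algorithm 1.

    `results` is the preliminary result set X: a list of (doc_id, doc_dist)
    pairs, where doc_dist maps each category to p(c_j|d).
    """
    scored = [
        (doc_id, recommendation_score(user_dist, doc_dist, prior))
        for doc_id, doc_dist in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical user who mostly reads sports news, and two candidate items.
user_dist = {"sports": 0.8, "business": 0.2}
prior = {"sports": 0.5, "business": 0.5}
results = [
    ("item_a", {"sports": 0.1, "business": 0.9}),
    ("item_b", {"sports": 0.9, "business": 0.1}),
]
ranked = rank_results(results, user_dist, prior)
print([doc_id for doc_id, _ in ranked])
```

Precomputing `doc_dist` for every item, as the text suggests for the underlying search engine, would leave only the cheap summation to be done at query time.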
Our system can track the behavior of users; it is distributed between the client and the server and does not affect the user's reading or the system performance.

3.2 News page judgment and related information extraction

Through retrieval and learning, the connection weights and node biases of each layer are obtained, and the network initialization is completed. The backpropagation (BP) algorithm is then adopted to fine-tune the deep belief network model with top-down supervision, which overcomes the shortcomings of local optima and long search time. Although the deep belief network model shows strong feature learning ability, it follows from the above principles that it requires a large amount of sample data to fit its many parameters. The Web news media recommendation search problem, however, lacks such a large amount of sample data, and generating a large number of parameters takes a long time, which is unfavorable for practical applications.

During the calculation, a feature vector is used to represent each web page. The keyword weight $W_e$ is determined by the TF*IDF method, and if the term is a named entity, its weight is appropriately increased. The specific definition is as follows.

$$W_e = \begin{cases} \alpha \times tf_e \times idf_e, & e \text{ is a named entity} \\ tf_e \times idf_e, & \text{otherwise} \end{cases} \tag{15}$$

Here $\alpha$ is the weighting factor, which is 5 in this experiment. Finally, the $m$ terms with the largest weights are selected to form the web page feature vector. The number of terms shared by two web page feature vectors is used as the basis for judging similarity: if the number of shared terms is greater than the threshold, the two web pages are similar. After the reprinting or similarity relationship has been determined, the relevant information is extracted and recorded.
The main information recorded is the reprinting website, the source site of the reprinting website, the number of responses on the reprinting website, and the release time of the news. The reprinting website and the source site recorded here only describe the reprinting relationship; they are not yet the finalized source site and reprinting website. The final source site is determined in the next step.

3.3 Judgment of news reprinting relationship and calculation of authority of news source sites

The news reprint rate (denoted as Trans) is defined as

$$Trans = \frac{\text{reprint times}}{\text{source website clicks}} \tag{16}$$

However, since Web news can be reprinted either directly or indirectly, the source site cannot be determined initially, and the two attribute values of all nodes in the entire network are initialized to 1. Here $pt \rightarrow qt$ denotes that website $pt$ reprints news from website $qt$. The content quality attribute value and the reprint attribute value are calculated using the following iterative formulas. The attribute values of all web pages are normalized to 1 when each iteration is completed.

$$A_0(pt) = \sum_{qt \rightarrow pt} A_1(qt) \tag{17}$$

$$A_1(pt) = \sum_{pt \rightarrow qt} A_0(qt) \tag{18}$$

$$A_0(pt) = \frac{A_0(pt)}{\left( \sum_{pt} A_0(pt)^2 \right)^{1/2}} \tag{19}$$

$$A_1(pt) = \frac{A_1(pt)}{\left( \sum_{pt} A_1(pt)^2 \right)^{1/2}} \tag{20}$$

The attributes $A_0(pt)$ and $A_1(pt)$ of each node are updated iteratively according to the above formulas. The extracted reprinting information is first used to derive the relationships between news reprinting sites, covering both direct and indirect reprinting. The authority value of each site is then calculated, the website reprinted the most times is taken as the source site, and its authority value is used as the reprint rate of the news.
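The iteration of Eqs. (17)-(20) resembles a HITS-style mutual reinforcement scheme and can be sketched as follows. This is an illustrative sketch only: the mapping of `a0` to content quality and `a1` to the reprint attribute, the fixed iteration count, and the use of the freshly updated `a0` when computing `a1` are assumptions, and the site names are hypothetical.

```python
import math


def _normalize(values):
    """Eqs. (19)-(20): divide every value by the Euclidean norm over all sites."""
    norm = math.sqrt(sum(v * v for v in values.values()))
    if norm == 0.0:
        return dict(values)
    return {site: v / norm for site, v in values.items()}


def reprint_authority(edges, iterations=20):
    """Iterate Eqs. (17)-(20).

    `edges` is a list of (pt, qt) pairs meaning that site pt reprints
    news from site qt. Returns the two attribute dictionaries (a0, a1).
    """
    nodes = sorted({site for edge in edges for site in edge})
    a0 = {site: 1.0 for site in nodes}  # assumed: content quality attribute
    a1 = {site: 1.0 for site in nodes}  # assumed: reprint attribute
    for _ in range(iterations):
        # Eq. (17): A0 of a site sums A1 over the sites that reprint it.
        new_a0 = {site: 0.0 for site in nodes}
        for reprinter, source in edges:
            new_a0[source] += a1[reprinter]
        # Eq. (18): A1 of a site sums A0 over the sites it reprints from.
        new_a1 = {site: 0.0 for site in nodes}
        for reprinter, source in edges:
            new_a1[reprinter] += new_a0[source]
        a0, a1 = _normalize(new_a0), _normalize(new_a1)
    return a0, a1


# Hypothetical reprint graph: b and c reprint a, and c also reprints b.
edges = [("site_b", "site_a"), ("site_c", "site_a"), ("site_c", "site_b")]
a0, a1 = reprint_authority(edges)
print(max(a0, key=a0.get))
```

In this toy graph the site that is reprinted by the others ends up with the largest `a0`, which matches the idea of choosing the most reprinted site as the source.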
3.4 Calculation of news response rate

The response rate (denoted as Rep) directly reflects people's reaction to Web news. It is usually defined as

$$Rep = \frac{\text{number of responses}}{\text{number of clicks}} \tag{21}$$

Observation shows that most news pages only provide the number of responses rather than the number of clicks or views. The number of clicks or views is stored on the server side and cannot be obtained through simple crawling and information extraction. Therefore, based on a large number of observations, a relative response rate is derived from the number of news responses and used as the news response rate. Here, the number of responses is the total of the responses on the source site and on the reprinting websites [8-9]. Figure 4 shows the distribution of the number of news responses.

Figure 4: Statistics of the person-time number of responses.

It can be seen from Figure 4 that most news items receive fewer than 1,000 responses, and very few receive more than 3,000. Following the statistical pattern in the figure, the relative response rate values are given in Table 1. For example, if the number of responses to a news item falls in the range 0~500, the respondents are considered to account for 5% of the viewers. If there are more than 5,000 responses, practically everyone who read the report is assumed to have responded, and the relative response rate is 100%.

Table 1: List of relative response rates.
| Number of responses | Relative response rate (%) |
|---|---|
| 5000~ | 100 |
| 4500~5000 | 90 |
| 4000~4500 | 80 |
| 3500~4000 | 70 |
| 3000~3500 | 60 |
| 2500~3000 | 50 |
| 2000~2500 | 40 |
| 1500~2000 | 30 |
| 1000~1500 | 20 |
| 500~1000 | 10 |
| 0~500 | 5 |

3.5 The influence of time factors on news ranking

People's interest in news usually follows one of two trends, as shown in Figure 5. Attention is measured here by the number of news viewers per unit of time. The first is the slow-growing type, which applies to knowledge-oriented news such as national policy news. News in these categories is not strongly time-sensitive, and people's attention increases slowly over time. The other type grows rapidly and then declines. It mainly applies to news on current events, which is very time-sensitive: attention rises rapidly within a short period and then drops quickly [10-12]. Therefore, news must first be classified before it is ranked, and the influence of time must be taken into account. From this perspective, the importance of news is inversely related to the time since publication.

Figure 5: News attention.

In addition, the longer a news item has been published, the higher the probability that it has been reprinted and replied to, and the greater the number of responses and reprints. If the time factor is not taken into consideration, newly published news is treated unfairly. Therefore, the time factor is included when selecting parameters: for news published long ago, the contribution of responses and reprints is discounted. Summarizing these two points and following the definition of the news decay time parameter in [4], the time parameter is defined as follows.

$$D(t, t_s) = e^{-\alpha (t - t_s)} \tag{22}$$

where $t_s$ is the publication time of the news. The value of $\alpha$ depends on the recession time of the news category to which the news belongs.
Recession time refers to the time from the release of a news item until it no longer attracts attention. It is defined here through the point at which the time parameter has dropped to 0.5:

$$\begin{cases} e^{-\alpha \beta} = 0.5, & \text{current affairs news} \\ e^{-\alpha \gamma} = 0.5, & \text{non-current affairs news} \end{cases} \tag{23}$$

Here $\beta$ is the recession time of current affairs news, and $\gamma$ is the recession time of non-current affairs news.

3.6 Judgment of news influence

Through the above steps, the news reprint rate, the news response rate, the influence factor of the news source site ($W_S$), and the time parameter of the news release $D(t, t_s)$ are obtained. Reprinting and replying to news are both regarded as recognition of the news, so the Web news recognition rate (denoted as Rec) is defined as Rec = a × Trans + b × Rep, where Trans is the reprint rate and Rep is the response rate. To ensure that the recognition rate does not exceed 1, a and b are constrained by a + b = 1. Since there is no suitable corpus, the values of a and b cannot be obtained by training, so they are chosen according to the 80/20 rule. Many people may read a news item, but few respond to it, and even fewer repost it. The reprint rate is therefore considered to reflect the influence of news better. Experiments show that this definition is feasible [13-14]. Finally, combining the above information, the influence of news ($N_F$) is defined as follows.

$$N_F = e^{-\alpha (t - t_s)} \times W_S \times (a \times Trans + b \times Rep) \tag{24}$$

4 Experiment results and analysis

The performance of an information retrieval system is generally evaluated with precision and recall as benchmarks; the comprehensive evaluation measure F can also be used, as in formula (25).

$$F = \frac{2 \times precision \times recall}{precision + recall} \tag{25}$$

4.1 Experimental data set

The top nine news websites in the Chinese website rankings were tracked for a week.
As many features based on the summary of Chinese webpages appeared in the algorithm research, Chinese webpages were still used as experimental subjects in the experiment. The news on the homepages of these nine websites was captured every hour. The captured experimental data is summarized in Figure 6.

[Figure 6: bar chart of news page volume (axis 0 to 2000) for Tencent News, Sina News, Xinhua, PRC News, Sohu News, NetEase News, Dongfang News, Tom News, and China News.]

Figure 6: List of experimental web pages.

After deduplication within each website, the pages are classified into six types according to their content. The news distribution of each category is shown in Figure 7. Here, the strength of news timeliness is obtained from the attention model to which the news belongs.

[Figure 7: bar chart of the percentage of total pages (axis 0 to 60%) for the categories Events/News, Business, Health, Sports/Entertainment, Military, and Technology.]

Figure 7: Classification list of experimental web pages.

4.2 Experimental result

The recommendation task is executed using Python 3.5. NumPy, Pandas, Scikit-learn, the Natural Language Toolkit (NLTK), and Matplotlib must be installed alongside Python to carry out the procedure.

(1) News influence ranking for the week from September 10 to 16th, 2007

Figure 8 shows the top 10 news items and their influence values in the week from September 10 to 16th, 2007. Here, the recession time of current affairs news is defined as 72 hours instead of 120 hours. Figure 9 shows the distribution of news influence values within the week.

[Figure 8: bar chart of influence ("Force value") for ranks 1 to 10. Values in rank order: 0.682, 0.6517, 0.6517, 0.6366, 0.6168, 0.6168, 0.6108, 0.6108, 0.6108, 0.6093. Truncated titles for ranks 4 to 10: Shangha …, Chinese …, Chinese …, Japanes …, A study …, Chinese …, The CPC ….]

Figure 8: TOP10 news list from September 10 to 17th, 2007.

Figure 9: Distribution of news influence values collected from September 10, 2007, to September 16, 2007.
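The influence values ranked above follow Eq. (24). A minimal sketch of that computation is given below under several stated assumptions: the recession time is treated as a half-life, so that α = ln 2 / recession time; the weights a = 0.8 and b = 0.2 are one reading of the 80/20 rule and are not given explicitly in the text; and the boundaries of the bins in Table 1 are assigned to the upper bin. Function names are hypothetical.

```python
import math


def relative_response_rate(responses):
    """Table 1 lookup: 5% below 500 responses, then 10% per 500-response bin, capped at 100%."""
    if responses < 500:
        return 0.05
    return min(1.0, 0.10 * (responses // 500))


def news_influence(elapsed_hours, recession_hours, site_weight, trans, rep,
                   a=0.8, b=0.2):
    """Eq. (24): N_F = exp(-alpha * (t - t_s)) * W_S * (a * Trans + b * Rep).

    alpha is derived from the recession time under the half-life assumption;
    a and b are assumed values satisfying a + b = 1.
    """
    alpha = math.log(2) / recession_hours
    decay = math.exp(-alpha * elapsed_hours)
    return decay * site_weight * (a * trans + b * rep)


# Hypothetical current affairs item: published 24 hours ago, 72-hour recession
# time, source site weight 0.9, reprint rate 0.6, and 1,200 responses.
rep = relative_response_rate(1200)
print(round(news_influence(24, 72, 0.9, 0.6, rep), 4))
```

With these assumptions, an item whose age equals its recession time keeps half of the influence it would have had at publication.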
(2) Designated news influence ranking

The algorithm is also suitable for ranking designated news. Given several news items on different topics, the topic of each item is used as a keyword to search for relevant news pages with popular search engines, and the top 50 search results are selected for statistical calculation. The top 50 are used because they include essentially all reprinted pages and almost all similar pages, as well as the pages of the top 50 news websites in the Chinese website rankings. This is sufficient to determine the source website of the news and its reprint rate. Browsing many web pages shows that the comments of netizens appear on the top websites, while netizens on other websites make almost no responses, so selecting these pages also yields a reasonably accurate news response rate. After the pages and reprinted pages for each news topic are obtained, the relevant information is extracted, the influence coefficient of each topic is calculated according to the above algorithm, and the topics are sorted by influence coefficient to give the ranking from quantitative analysis. Next, several people were asked to rank the same topics, and their rankings were combined into a ranking from manual qualitative analysis. Finally, the consistency of the two rankings was compared. Across multiple comparisons, the rankings calculated by this method are almost the same as the manual rankings. The comparison results are shown in Figure 10 and Figure 11. This comparison concerns the influence of news on unrelated topics; experiments show that the method is also applicable to related topics.
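The paper does not state how the consistency of the computed and manual rankings was measured. One common choice, used here purely as an assumption, is Spearman's rank correlation, which equals 1 for identical orderings and decreases as items move apart. The function name `spearman_rho` and the topic labels A to J are hypothetical; the example moves a single item from last place to sixth.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same distinct items."""
    n = len(order_a)
    if n != len(order_b) or len(set(order_a)) != n or set(order_a) != set(order_b):
        raise ValueError("orderings must contain the same distinct items")
    if n < 2:
        raise ValueError("need at least two items")
    # Position of each item in the second ordering.
    rank_b = {item: i for i, item in enumerate(order_b)}
    # Sum of squared rank differences.
    d2 = sum((i - rank_b[item]) ** 2 for i, item in enumerate(order_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


manual = list("ABCDEFGHIJ")    # hypothetical manual ranking
computed = list("ABCDEJFGHI")  # hypothetical computed ranking
rho = spearman_rho(manual, computed)
```

A value close to 1 would support the statement that the two rankings are almost the same; a threshold for "almost the same" would still have to be chosen by the analyst.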
[Figure 10: Results of manual ranking of designated news influence. Ranks 1 to 10 with post time: NPC … (7.27), Seventy … (7.27), Basic … (7.24), The … (7.19), The SDR … (7.25), Shandon … (7.19), Pork … (7.2), Chinese … (7.25), The … (7.21), Iraq held … (7.29).]

[Figure 11: Results of the designated news ranking by Web news media retrieval analysis. Ranks 1 to 10 with post time and force value: NPC … (7.27, 0.6014), Seventy … (7.27, 0.5774), Basic … (7.24, 0.0833), The … (7.19, 0.072), The SDR … (7.25, 0.0341), Iraq held … (7.29, 0.0043), Shandon … (7.19, 0.0042), Pork … (7.2, 0.0038), Chinese … (7.25, 0.0037), The … (7.21, 0.0035).]

Query speed measures the efficiency of a system in processing and responding to user queries: the time the system takes to execute a search or retrieval operation and deliver the relevant results to the user. Table 2 and Figure 12 show the query speed results. The proposed method (SGVSM) answers a query in 5 sec, compared with 10 sec for the Generalized Vector Space Model (GVSM) and 12 sec for TF-IDF, so the proposed method is the fastest of the three for Web news media retrieval.

[Figure 12: Query speed. Bar chart of query speed (sec) by model for the Generalized Vector Space Model, the TF-IDF Vector Space Model, and the proposed Semantic Vector Space Model.]

Table 2: Query speed
Model                                     Query speed (sec)
Generalized Vector Space Model [20]       10
TF-IDF Vector Space Model [21]            12
Semantic Vector Space Model [proposed]    5

Calibration rate is a metric used to assess the accuracy of the predicted probabilities of a predictive model. It measures how well the predicted probabilities align with the actual outcomes or events. Table 3 and Figure 13 show the calibration rate results.
The proposed method (SGVSM) reaches a calibration rate of 82%, compared with 70% for GVSM and 68% for TF-IDF, so it achieves the highest calibration rate of the three for Web news media retrieval.

Figure 13: Calibration rate

Table 3: Calibration rate
Model                                     Calibration rate (%)
Generalized Vector Space Model [20]       70
TF-IDF Vector Space Model [21]            68
Semantic Vector Space Model [proposed]    82

Accuracy assesses the correctness of a model's predictions by comparing them with observed values; it is the percentage of all cases that are predicted correctly. Figure 14 and Table 4 compare the accuracy of the proposed and traditional methods. The existing methods GVSM and TF-IDF have accuracy values of 85% and 82%, respectively, whereas the proposed SGVSM achieves an accuracy of 90%. The proposed method thus provides superior results for Web news media retrieval.

[Figure 14: Accuracy. Bar chart of accuracy (%) by model for the Generalized Vector Space Model, the TF-IDF Vector Space Model, and the proposed Semantic Vector Space Model.]

Table 4: Accuracy
Model                                     Accuracy (%)
Generalized Vector Space Model [20]       85
TF-IDF Vector Space Model [21]            82
Semantic Vector Space Model [proposed]    90

Precision is the proportion of retrieved items that are actually relevant. Figure 15 and Table 5 compare the precision of the proposed and traditional methods. The existing methods GVSM and TF-IDF have precision values of 82% and 78%, respectively, whereas the proposed SGVSM achieves a precision of 92%. The proposed method thus provides superior results for Web news media retrieval.
Figure 15: Precision

Table 5: Precision
Model                                     Precision (%)
Generalized Vector Space Model [20]       82
TF-IDF Vector Space Model [21]            78
Semantic Vector Space Model [proposed]    92

Recall is a performance metric used in data classification that represents the proportion of actual positives that a model correctly retrieves. The recall results are shown in Table 6 and Figure 16. The existing methods GVSM and TF-IDF have recall values of 88% and 85%, respectively, whereas the proposed SGVSM achieves a recall of 92.4%. The proposed method thus provides superior results for Web news media retrieval.

Figure 16: Recall

Table 6: Recall
Model                                     Recall (%)
Generalized Vector Space Model [20]       88
TF-IDF Vector Space Model [21]            85
Semantic Vector Space Model [proposed]    92.4

Computational time, also known as execution time or runtime, is the time a computer program or algorithm takes to complete its execution. It is a crucial metric for evaluating the efficiency and performance of computational processes. The computational time results are shown in Table 7 and Figure 17. The existing methods GVSM and TF-IDF have computational times of 10.9 sec and 8.7 sec, respectively, whereas the proposed SGVSM takes 6.4 sec. The proposed method thus provides superior results for retrieving news media from the web.
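As a worked illustration of the comprehensive evaluation rate F of Eq. (25), the sketch below combines the precision and recall values reported in the precision and recall tables above. The resulting F values are computed here for illustration and are not reported in the paper; the function name `f_measure` and the dictionary layout are hypothetical.

```python
def f_measure(precision, recall):
    """F = 2 * precision * recall / (precision + recall), as in Eq. (25)."""
    if precision + recall == 0:
        return 0.0  # convention for the degenerate case
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) as fractions, taken from the reported tables.
reported = {
    "GVSM": (0.82, 0.88),
    "TF-IDF": (0.78, 0.85),
    "SGVSM": (0.92, 0.924),
}

scores = {name: f_measure(p, r) for name, (p, r) in reported.items()}
best = max(scores, key=scores.get)
```

Because F is the harmonic mean of precision and recall, it always lies between the two values, so under these inputs the proposed model also has the highest F of the three.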
[Figure 17: Computational time. Bar chart of computational time (sec) by model for the Generalized Vector Space Model, the TF-IDF Vector Space Model, and the proposed Semantic Vector Space Model.]

Table 7: Computational time
Model                                     Computational time (sec)
Generalized Vector Space Model [20]       10.9
TF-IDF Vector Space Model [21]            8.7
Semantic Vector Space Model [proposed]    6.4

5 Implication

The combination of Web News Media Retrieval Analysis and Knowledge Recognition of Semantic Grouping Vector Space Model has significant practical applications for information retrieval and analysis in the field of online news media. This strategy seeks to improve the efficiency and relevancy of web news searches by integrating sophisticated approaches in information retrieval, semantic grouping, and knowledge recognition. The model enhances the extraction of significant insights by identifying semantic relationships within the material, allowing for more precise categorization and collection of articles based on their underlying knowledge structures. This integration enhances a news retrieval system by making it more sophisticated and contextually aware. Additionally, it has the potential to improve the user experience by delivering more coherent and informative results. Furthermore, this approach may be utilized in several domains, such as journalism, research, and data analysis, providing a valuable instrument for effectively navigating and understanding the extensive and ever-changing realm of web-based news media.

6 Discussion

The Generalized Vector Space Model may struggle to capture semantic nuances and ambiguity in language. Words with multiple meanings or contexts may be represented by a single vector, leading to potential confusion. The TF-IDF Vector Space Model faces challenges with synonymy and polysemy, where different words may have similar meanings or a single word may have multiple meanings.
The proposed Semantic Vector Space Model makes it possible to capture semantic nuances better and to address these challenges of ambiguity. The model represents words in a multidimensional space that takes their semantic relationships into account, thereby offering a more nuanced and context-aware representation. By considering the inherent meanings of words and the associations between them, the Semantic Vector Space Model aims to enhance accuracy and effectiveness.

7 Conclusions

Compared with previous vector space models, the vector space model based on the semantic grouping of feature words proposed in this paper is more accurate for Web news information and is suitable for Web news information retrieval systems. The calibration rate of document queries, the comprehensive evaluation rate F, and the query speed have all been significantly improved. The current trend is toward personalized services, and the general retrieval systems of the past can no longer meet retrieval requirements across different environments, purposes, and times. This paper has carried out a series of research and analysis on Web news media retrieval. The experiments reveal the interference factors in the calculation of the semantic grouping vector space model. They also show that the accuracy of the analysis has improved and that the interests and needs of users can be expressed correctly, so that the accuracy of Web news media retrieval is further enhanced. Keeping up with real-time updates and delivering the latest news remains a challenge: systems may struggle to provide up-to-the-minute information because of processing delays or the dynamic nature of news. Future research may focus on improving semantic understanding to provide more accurate and contextually relevant results. This could involve advanced natural language processing techniques, including sentiment analysis, entity recognition, and topic modeling.
Acknowledgements

2022 Hunan Natural Science Foundation "Research on the Construction of Teaching Innovative Team of Higher Vocational Teachers Based on the Background of Improving Quality and Cultivating Excellence" (No.: 2022JJ60020)

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding Statement

This study did not receive any funding in any form.

References

[1] Solainayagi, P., & Ponnusamy, R. (2019). Trustworthy media news content retrieval from web using truth content discovery algorithm. Cognitive Systems Research, 56(AUG.), 26-35.
[2] Solainayagi, P., & Ponnusamy, R. (2019). Trustworthy media news content retrieval from web using truth content discovery algorithm. Cognitive Systems Research, 68(5), 566-588.
[3] Kai, S., Wang, S., & Liu, H. (2018). Understanding User Profiles on social media for Fake News Detection. FakeMM'18 Workshop, 31(6), 1-7.
[4] Corney, D., Gonzalo, J., Martinez, M., Poblete, B., & Valochas, A. (2018). Recent Trends in News Information Retrieval, 24, 1012–1017.
[5] Davies, R. J. (2016). Digital computing techniques in the manufacture and operation of engine management systems. Aeronautical Journal, 79(776), 349-353.
[6] Solainayagi, P., & Ponnusamy, R. (2019). Trustworthy media news content retrieval from web using truth content discovery algorithm. Cognitive Systems Research, 56(8), 26-35.
[7] Chen, H., Huang, B., Liu, W. Z., Gao, Y. B., & Jiang, X. Y. (2019). Python-based web news crawler and retrieval. Software Guide, 47(1), 1-38.
[8] Dewandaru, A., Supriana, I., & Akbar, S. (2017). Keyword and event extraction for thematic map retrieval from Indonesian online news site. Journal of Physics Conference, 38(8), 10320-10342.
[9] George Dimitrakopoulos, Panagiotis Demestichas, & Vera Koutra. (2012). Intelligent management functionality for improving transportation efficiency by means of the car pooling concept. IEEE Transactions on Intelligent Transportation Systems, 13(2), 424-436.
[10] Fei-Yue Wang. (2011). Intelligent systems and social management. IEEE Intelligent Systems, 26(6), 2-3.
[11] Meesad, P. (2021). Thai fake news detection based on information retrieval, natural language processing and machine learning. SN Computer Science, 2(6), 1-17.
[12] Jing Fan, Tianyang Dong, Xinxin Guan, & Ying Tang. (2013). A rapid simulation system for decision making in intelligent forest management. IEEE Intelligent Systems, 28(5), 2-9.
[13] Peña-Araya, V., Quezada, M., Poblete, B., & Parra, D. (2017). Gaining historical and international relations insights from social media: spatio-temporal real-world news analysis using Twitter. EPJ Data Science, 6(1), 25.
[14] Abbas Rajabifard, Russell G. Thompson, & Yiqun Chen. (2015). An intelligent disaster decision support system for increasing the sustainability of transport networks. Natural Resources Forum, 39(2), 83-96.
[15] Frasincar, F., Borsje, J. and Levering, L., 2009. A semantic web-based approach for building personalized news services. International Journal of E-Business Research (IJEBR), 5(3), pp.35-53.
[16] Nkongolo Wa Nkongolo, M., 2023. News Classification and Categorization with Smart Function Sentiment Analysis. International Journal of Intelligent Systems, 2023.
[17] Bangyal, W.H., Qasim, R., Rehman, N.U., Ahmad, Z., Dar, H., Rukhsar, L., Aman, Z. and Ahmad, J., 2021. Detection of fake news text classification on COVID-19 using deep learning approaches. Computational and Mathematical Methods in Medicine, 2021, pp.1-14.
[18] Štrimaitis, R., Stefanovič, P., Ramanauskaitė, S. and Slotkienė, A., 2021. Financial context news sentiment analysis for the Lithuanian language. Applied Sciences, 11(10), p.4443.
[19] Sun, N. and Du, C., 2021. News text classification method and simulation based on the hybrid deep
[20] Tsatsaronis, G. and Panagiotopoulou, V., 2009, April. A generalized vector space model for text retrieval based on semantic relatedness. In Proceedings of the Student Research Workshop at EACL 2009 (pp. 70-78).
[21] SUBA, C., Retrieval of Information Document Using TF-IDF Algorithms and Vector Space Model Representation.