ELEKTROTEHNIŠKI VESTNIK 90(3): 75–89, 2023
ORIGINAL SCIENTIFIC PAPER

Feature embedding in click-through rate prediction

Samo Pahor 1, Davorin Kopič 1, Jure Demšar 2,†
1 Outbrain Slovenia, Dunajska cesta 5, 1000 Ljubljana
2 Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana
† E-mail: jure.demsar@fri.uni-lj.si

Received 22 May 2023; accepted 9 June 2023

Abstract. We tackle the challenge of feature embedding for the purpose of improving the click-through rate prediction process. We select three models: logistic regression, factorization machines and deep factorization machines, as our baselines and propose five different feature embedding modules: embedding scaling, FM embedding, embedding encoding, NN embedding and the embedding reweighting module. The embedding modules act as a way to improve baseline model feature embeddings and are trained alongside the rest of the model parameters in an end-to-end manner. Each module is individually added to a baseline model to obtain a new augmented model. We test the predictive performance of our augmented models on a publicly accessible dataset used for benchmarking click-through rate prediction models. Our results show that several of the proposed embedding modules provide an important increase in predictive performance without a drastic increase in training time.

Keywords: real-time bidding, click-through rate prediction, feature embedding, feature transformation

Feature embedding in click-through rate prediction (Slovenian abstract). The goal of this study was to determine whether machine learning models for click-through rate prediction can be improved through advanced feature embeddings. We evaluated five different feature embedding approaches (scaling embeddings, FM embeddings, encoding embeddings, NN embeddings and reweighting embeddings) in combination with three machine learning models (logistic regression, factorization machines and deep factorization machines). All developed approaches are modular; they can be trained separately and attached to arbitrary models in our prediction pipeline. Through mutual comparisons, we thoroughly evaluated all of the aforementioned models as well as the models augmented with the added feature embedding modules. Our results show that feature embedding modules can significantly improve the predictive results of the models without drastically prolonging training time.

1 INTRODUCTION

Online advertising, as opposed to traditional advertising (e.g., in newspapers, on billboards, on television, on radio, etc.), is typically featured on websites or mobile applications. Online advertising allows companies to reach a worldwide user base and engage key demographics to market their products. It offers a way for companies to increase brand awareness as well as their understanding of target audiences. For online advertising to succeed, ads need to be prominently featured and presented to relevant consumers. The vast majority of online advertising space is sold through programmatic advertising. In programmatic advertising, whenever a user opens a web page or an app, an auction is executed in the background for each ad space on that page. The ad space in these real-time bidding (RTB) auctions is thus dynamically sold to the highest bidder [1], [2]. Bidders are typically specialized companies that offer their services to advertisers in order to participate in RTB auctions as efficiently as possible.
Since the loading of web pages needs to be as fast as possible and any delay would ruin the user's browsing experience, a key characteristic of RTB auctions is that they are close to instantaneous, occurring in less than 100 milliseconds [3]. The sale of advertisement space in programmatic advertising is completely automated: the whole advertising process is performed, managed and optimized directly by software. The ad that wins the auction is displayed to the user visiting the website, in the hope that said user will respond positively [4] and potentially click on the ad. Since such a click directly translates to online traffic and therefore value for the advertiser, predicting the click-through rate (CTR) is a key part of the advertising business [5]. CTR prediction demands very quick response times so that it can function within the RTB environment. Additionally, the amount of constantly generated information in the online advertising space is overwhelming and impossible to organize or oversee manually. Both issues, the required quick response times and the overwhelming amount of data, are addressed with sophisticated machine learning models, which are able to extract important knowledge from past data, can be trained on-the-fly and are able to perform extremely fast predictions.

Models predict user clicks based on a variety of inputs, including contextual data related to the user, historical data and other data which varies across domains. The mentioned input features are predominantly categorical and can usually take on a very large number of different values. A typical approach with such data is to transform them into high-dimensional sparse binary vectors via one-hot encoding. These sparse vectors can be problematic for some of the more popular and modern machine learning techniques, such as deep learning. In these cases further embedding and dimensionality reduction is necessary to produce dense numeric feature vectors that can then be used in model training.

Recent research in the field of CTR prediction focuses on optimizing prediction models, such as logistic regression (LR) [6], factorization machines (FM) [7], deep neural networks (DNN) [8] or hybrid approaches like DeepFM [9]. LR is essentially a linear model, assigning a specific weight to each observed feature and using a logistic function for the final prediction. A more sophisticated method is the FM, where the linear model is upgraded with an additional interaction term. This gives FMs the ability to capture second-order interactions between different features by approximating the weights for any given co-occurring feature pair. The main quality of DNN models, in turn, is their ability to capture higher-order feature interactions, which is not feasible for LRs or classic (also known as second-order) FM versions. Besides DNNs, FMs and LRs are among the most popular models in the CTR prediction space. Because of their good performance and relative simplicity, they were selected as our model baselines. Finally, we also selected the more sophisticated DeepFM, which combines the functionality of FM and DNN models and is currently considered one of the state-of-the-art approaches to CTR prediction. While the model is clearly the most important aspect of CTR prediction, some recent studies show that the way we handle features should rightfully get a significant portion of our attention.
Since the dimensionality of the mentioned categorical features is typically extremely high, Weinberger et al. [10] proposed a dimensionality reduction approach called the hashing trick. The hashing trick aims to reduce feature dimensionality by using a hashing function to map features from their original space to a smaller feature space. Unlike random projections, the hashing trick introduces no additional overhead for storing projection matrices; on the contrary, it reduces storage needs by reducing feature dimensionality.

Our research builds on the work by He et al. [11] and Zhou et al. [12], which suggests that intelligent feature embedding and extraction could increase the predictive performance of models and reduce the need for manual feature engineering. The main goal of our research is thus to implement various feature embedding approaches and evaluate whether they can indeed improve the training efficiency and accuracy of predictive models. To achieve this goal, we explore different feature embedding approaches in the context of CTR prediction. While the presented findings are primarily focused on CTR prediction, they could easily be transferred to and used in other fields as well, particularly with other types of tabular classification or regression problems. This research is also a direct continuation of [13], where we explored the performance of similar embedding improvements on a private dataset provided by Zemanta.

2 METHODS

In this section we first present the steps traditionally required for performing model prediction. Next, we present the five embedding modules we developed: the embedding scaling module, the FM embedding module, the embedding encoding module, the NN embedding module and the embedding reweighting module. Finally, we present all of the details of our experimental setup.

2.1 The general predictive framework

To adequately investigate different embedding and enrichment approaches, we first formulate the typical steps that we have to take when using a prediction model. We can break the process into four steps: input, hashing, embedding and prediction. The following subsections briefly describe each step.

2.1.1 Input: The first step is serving raw input data to our model. Since raw input data is usually gathered from different sources and tries to capture as much information as possible, it features different data types. In the context of CTR, both numerical and categorical features are common. While some predictive models, such as decision trees, are able to learn directly from raw categorical data, other approaches (e.g., FM, DNN, etc.) require numerical inputs. In such cases, some kind of feature embedding is usually required to encode categorical data so it can be used for model training and prediction.

2.1.2 Hashing: Categorical variables often have a very large dimensionality, which can make model training and prediction problematic. A simple and practical solution to this issue is feature hashing, also called the hashing trick [10]. The hashing trick is formally described as follows. For a given feature with value $x_i$, where $i \in N$, implying that the feature has dimensionality $|N|$, define a hashing function $h_{\text{hashing}}$:

$$h_{\text{hashing}}: N \to M, \quad x_i \mapsto j,$$

where $j \in M$ and $|M| \ll |N|$. The hashing trick essentially maps a high-dimensional feature into a smaller-dimensional space by computing a hash value of the original feature value.
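As an illustration, the following minimal Python sketch (not the implementation used in our pipeline; the choice of hash function and the size of the hashed space are illustrative assumptions) shows how a raw categorical value can be mapped to an index in a much smaller space:

```python
import hashlib

HASH_SPACE = 2 ** 20  # |M|, the reduced index space; size chosen for illustration


def hash_feature(value: str, hash_space: int = HASH_SPACE) -> int:
    """Hashing trick: map a raw categorical value x_i to an index j in [0, |M|)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % hash_space


# Distinct values may collide in the smaller space, but with a sufficiently
# large hash space the effect on predictive performance is negligible.
print(hash_feature("mobile"), hash_feature("tablet"))
```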
A scenario where two different values get mapped to the same hash value is called a collision. Collisions occur because we are mapping from a larger to a smaller space. However, if we assume a sufficiently large hashing space, the performance loss due to collisions becomes negligible, making this approach very successful in practice. Furthermore, since feature hashing essentially reduces model complexity, it can be considered a form of regularization [14]. We can perform feature hashing in two ways: either hash each feature separately and treat values from different columns differently, or hash the entire sample. In both cases, we obtain a set of index values, which are forwarded to the embedding layer. The difference is that separate feature hashing avoids cases where a collision occurs between two feature values of different columns that happen to have the same original value. Conversely, hashing the entire sample is faster and easier to implement. Our experiments use the latter approach and hash the entire sample.

2.1.3 Embedding: Since predictive models like LR, FM and DeepFM require numerical inputs, categorical data needs to be mapped into a numerical space. We first describe the process of one-hot encoding, which shows how categorical information is transformed into numerical vectors, and afterwards describe an analogous index-based approach that avoids generating large sparse vectors.

The most popular approach to categorical data embedding is called one-hot encoding, which maps categorical values to sparse vectors. We formally describe single-feature one-hot encoding as follows. For a given categorical feature $f$, define the set $N$, which contains all possible values of $f$:

$$N = \{x_1, x_2, \cdots, x_{|N|}\}.$$

For example, if our categorical feature is a device type, then the corresponding set $N$ is $\{\text{PC}, \text{laptop}, \text{mobile}, \text{tablet}\}$. Using $N$, we define the following map:

$$h_{\text{one-hot}}: N \to \{0, 1\}^{|N|}, \quad x_i \mapsto v,$$

where $v$ is an $|N|$-dimensional vector of zeroes, with the $i$-th component equal to 1. In our example above, this translates to:

$$h_{\text{one-hot}}(\text{PC}) = [1, 0, 0, 0]^T,$$
$$h_{\text{one-hot}}(\text{laptop}) = [0, 1, 0, 0]^T,$$
$$h_{\text{one-hot}}(\text{mobile}) = [0, 0, 1, 0]^T,$$
$$h_{\text{one-hot}}(\text{tablet}) = [0, 0, 0, 1]^T.$$

After obtaining such a sparse vector for each categorical feature, we concatenate them. The concatenated vector can then be multiplied with our model weights and used to compute the final prediction. This is however computationally expensive, since we always store all values of each of our one-hot encoded vectors. To avoid storing large vectors, we can instead simply use the relevant index values to retrieve the relevant model weights directly and avoid the multiplication. For example, instead of embedding our feature value $x_i$ into a vector $e_i$ whose $i$-th component equals 1 and multiplying it with our weight vector $w$:

$$w = [w_0, w_1, \cdots, w_{i-1}, w_i, w_{i+1}, \cdots, w_{|N|}],$$

which would obtain the component $w_i$, we instead use the index value $i$ to obtain the component directly.

We illustrate the advantage of the index-based approach in the following example. Let us say we are dealing with a categorical feature device type, with its corresponding value set:

$$N = \{\text{PC}, \text{laptop}, \text{mobile}, \text{tablet}\}.$$

We wish to embed this categorical feature as a 2-dimensional numeric vector. Note that this implies that the weight vector from the above definition actually becomes a weight matrix with dimensions $(2 \times 4)$. An example of such an embedding/weight matrix can be seen below:

$$W = \begin{bmatrix} 0.33 & 0.12 & 2.57 & 3.04 \\ 1.43 & 0.50 & 1.26 & 7.55 \end{bmatrix}.$$
Obtaining a numerical embedding for the value "laptop" requires the following steps. We first obtain the appropriate one-hot vector:

$$h_{\text{one-hot}}(\text{laptop}) = v_{\text{laptop}} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix},$$

then we proceed to multiply the embedding/weight matrix with the one-hot vector to obtain the final numeric feature embedding:

$$W \cdot v_{\text{laptop}} = \begin{bmatrix} 0.33 & 0.12 & 2.57 & 3.04 \\ 1.43 & 0.50 & 1.26 & 7.55 \end{bmatrix} \cdot \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.12 \\ 0.50 \end{bmatrix}.$$

The index-based approach simplifies this process by obtaining the index value of the feature value laptop. In this context, the index value indicates the position of the value laptop in the set $N$:

$$h_{\text{index}}(\text{laptop}) = 1.$$

Finally, the index is used to extract* the embedding directly from the embedding/weight matrix:

$$\text{col}_1(W) = \text{col}_1\begin{bmatrix} 0.33 & 0.12 & 2.57 & 3.04 \\ 1.43 & 0.50 & 1.26 & 7.55 \end{bmatrix} = \begin{bmatrix} 0.12 \\ 0.50 \end{bmatrix}.$$

* Here, the function $\text{col}_i(M)$ returns the $i$-th column of matrix $M$.

Not generating the one-hot vectors is an important optimization, since these vectors are typically very large due to the high dimensionality of categorical features.

2.1.4 Prediction: We selected logistic regression (LR), factorization machines (FM) and deep factorization machines (DeepFM) as our baseline models. The following subsections provide a basic theoretical description of the mentioned models.

LR is a type of predictive analysis that attempts to explain the relationship between one dependent binary variable and one or more independent variables. When performing LR on $m$-dimensional real vectors, the model consists of two components:
• an $m$-dimensional weight vector $\theta$,
• a bias $\theta_0$.
For a given sample $x \in \mathbb{R}^m$, our model prediction $f(x)$ equals:

$$f(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_m x_m) = \sigma\left(\theta_0 + \sum_{i=1}^{m} \theta_i x_i\right),$$

where $\sigma$ is the logistic function. The logistic function is a member of the sigmoid function family and is defined by the formula:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

LR is essentially a linear model that outputs the probability of a positive outcome for a binary event given a set of independent real variables. It is attractive in the CTR prediction context due to its simplicity, training and prediction speeds and decent performance [6]. The LR model is visualized in Figure 1.

FMs attempt to capture interactions between features by using factorized parameters. They can be utilized to model any order of feature interactions, although second-order interactions are the most common. When performing prediction on $m$-dimensional real vectors via a second-order FM, we require three components:
• an $(m \times k)$-dimensional factorized interaction matrix $V$; here $k$ denotes the size of the interaction vectors,
• an $m$-dimensional weight vector $\theta$,
• a bias $\theta_0$.
For a given sample $x \in \mathbb{R}^m$, our model prediction $f(x)$ equals:

$$f(x) = \sigma\left(\theta_0 + \sum_{i=1}^{m} \theta_i x_i + \sum_{i=1}^{m} \sum_{j=i+1}^{m} \langle v_i, v_j \rangle\, x_i x_j\right),$$

where $v_i$ denotes the $i$-th row of the interaction matrix $V$ and $\langle \cdot, \cdot \rangle$ denotes the dot product.
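To make the connection between the index-based embedding lookup and the FM prediction concrete, the following NumPy sketch (a minimal illustration, not the authors' implementation; the hash-space size, embedding size and variable names are assumptions) computes a second-order FM prediction for a sample given as hashed feature indices, where each active categorical feature contributes $x_i = 1$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
HASH_SPACE, K = 2 ** 20, 8                   # |M| and interaction vector size k (illustrative)
theta0 = 0.0                                 # bias
theta = rng.normal(0.0, 0.01, HASH_SPACE)    # first-order weights, one per hashed index
V = rng.normal(0.0, 0.01, (HASH_SPACE, K))   # factorized interaction matrix


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))


def fm_predict(indices: list[int]) -> float:
    """Second-order FM prediction; embeddings are looked up by index instead of
    multiplying one-hot vectors with the weight matrices."""
    linear = theta0 + theta[indices].sum()
    emb = V[indices]                         # index-based embedding lookup, shape (n_features, k)
    # Pairwise interaction term rewritten as 0.5 * (square of sums - sum of squares).
    pairwise = 0.5 * ((emb.sum(axis=0) ** 2) - (emb ** 2).sum(axis=0)).sum()
    return float(sigmoid(linear + pairwise))


print(fm_predict([17, 4242, 987654]))
```

The rewriting of the pairwise sum avoids the explicit double loop over feature pairs, so the interaction term is computed in time linear in the number of active features and the embedding size.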