https://doi.org/10.31449/inf.v46i1.3875
Informatica 46 (2022) 13–17

Improving Modeling of Stochastic Processes by Smart Denoising

Jakob Jelenčič and Dunja Mladenić
E-mail: jakob.jelencic@ijs.si, dunja.mladenic@ijs.si
Jožef Stefan Institute, Ljubljana, Slovenia
Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

Keywords: de-noising, deep learning, stochastic process

Received: December 16, 2021

This paper proposes a novel method for modeling stochastic processes, which are notoriously hard to predict accurately. State-of-the-art methods quickly overfit, creating a large gap between performance on the train and test datasets. We present a method based on smart noise addition to data obtained from an unknown stochastic process, which is capable of reducing overfitting. The proposed method works as an addition to current state-of-the-art methods in both supervised and unsupervised settings. We evaluate the method on an equities dataset and a cryptocurrency dataset, specifically chosen for their chaotic and unpredictable nature. We show that our method significantly reduces overfitting and increases performance compared to several commonly used machine learning algorithms: random forest, generalized linear model and an LSTM deep learning model.

Povzetek: This paper presents a method that limits the overfitting of neural networks on time series.

1 Introduction

Time series prediction has always been an interesting challenge. Deep learning structures designed for time series are prone to overfitting, especially if the underlying time series is stochastic by nature. Many researchers' first attempt when dealing with time series is to learn a model that will predict future prices, whether in equities, commodities, forex or cryptocurrencies. Unfortunately it is not that simple: one can easily build a near-perfect model on the train dataset only to find it is completely useless on the test dataset.

Overfitting is a difficult problem, especially in deep learning models with many parameters, where optimizers based on gradient descent are prone to it. There are many ways to limit it, such as regularizers in the loss function that penalize large weights, random masking of the training data, or adding noise to it. However, the most effective way to prevent overfitting is to obtain more data, which is unfortunately not always easy or even possible.

We have previously proposed a novel method that is capable of effectively combatting overfitting [5], which proves especially difficult when one is dealing with a problem directly applicable in practical situations. In this work we expand on the denoising part of our method and provide additional parameter analysis. The main idea is to add noise drawn from the same distribution as the training data, which is applicable to both supervised and unsupervised problems. The longer the training goes on, the lower the amplitude of the noise and the less focus is placed on the denoising process.

We have evaluated the proposed method on an equities dataset and a cryptocurrency dataset, in both cases achieving extraordinary results on the test dataset. We have also shown the importance of the noise distribution and how the denoising fails if the distributions of the data and the noise do not align.

The rest of the paper is organised as follows. Section 2 describes the data we used. In Section 3 we introduce the proposed method. In Section 4 we present empirical results. In Section 5 we conclude by pointing out the main results and outlining directions for future work.
2 Data

The proposed method works well for stochastic processes. Equities are assumed to follow some form of stochastic process [10], either the Black-Scholes one or a more complex process with unknown formulation. In order to evaluate our method, we collected daily data of more than 5000 equities listed on NASDAQ from 2007 onward. The data is freely available on the Yahoo Finance website [2]. We transformed the data using technical analysis [11] and took every instance occurring after 2019 as the test set. We calculated a moving average over 10 days of closing prices and then tried to predict the direction of the change of this trend line. As a second experiment we tried to predict the change of the equity price on the following day.

The smoothed equity data turned out to be somewhat tame, not chaotic enough to demonstrate the full ability of the proposed method in the unsupervised part of the experiment. This is why we also collected minute data of the cryptocurrencies Ethereum and Bitcoin and used the method on them as well. The data is available on the crypto exchange Kraken [1]. We used the same transformation as for the equities, but with a faster-moving trend. For the test set we took every instance with a timestamp after December 2020.

The reader should note that the end goal is not to accurately predict the future equity price, since that is next to impossible: as soon as there is a pattern, someone will profit from it and the pattern will change. By predicting the future trend line, one can obtain a meaningful confidence interval and estimates of where the price could be, and then design, for example, a derivatives strategy that searches for favourable risk-versus-reward trades.

3 Proposed method

We propose a method designed for the prediction of stochastic processes. The method achieves significant improvements in metrics and loss values on unseen data, where standard deep learning models are prone to overfitting. The main advantage is a reduced gap between training and testing performance, sometimes to the degree that one sacrifices a little on the training side so that the model actually performs better on the test data. This is very important in time series, where a prediction model is usually just one part of a bigger strategy and where overfitting on the training data is the biggest issue; for example, designing a trading strategy on overfitted predictions can lead to large capital losses.

The proposed method can be broken down into two parts: normalization and noise addition. Each part can easily be integrated into an existing pipeline in both supervised and unsupervised settings.

3.1 Empirical normalization

Normalization plays an important role in deep learning models. It has been shown that normalization significantly speeds up gradient descent, almost independently of where the normalization takes place: it can be weight normalization [12] during the actual optimization, batch normalization [9], or simply normalization of the whole input data [8].

In the proposed method it is important that the 3-dimensional input data come from the same distribution as the generated noise. Since it is fairly straightforward to sample data from a 3-dimensional normal distribution, we normalize the input data using the empirical cumulative distribution function [13] and the empirical copula [4][6]. In this way we align the central moments of the unknown distribution with those of the centered and standardised normal distribution. The normalization takes place before the data is reshaped into a 3-dimensional tensor.
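As an illustration of this step, the following is a minimal sketch of rank-based normalization to standard normal margins using NumPy and SciPy. It is a simplification of the empirical CDF and copula machinery cited above, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import norm, rankdata

def to_normal_margins(X):
    """Map each feature column of X (n_samples, n_features) to approximately
    standard normal margins via its empirical CDF. The rank correlation
    structure (the empirical copula) of the data is preserved."""
    n = X.shape[0]
    # Empirical CDF values in (0, 1); ranks are divided by (n + 1)
    # to avoid hitting 0 or 1, where the normal quantile diverges.
    u = rankdata(X, axis=0) / (n + 1)
    # The inverse standard normal CDF turns uniform margins into normal ones.
    return norm.ppf(u)

# Example: heavy-tailed, autocorrelated input is mapped to normal margins.
rng = np.random.default_rng(0)
raw = rng.standard_t(df=3, size=(1000, 4)).cumsum(axis=0)
Z = to_normal_margins(raw)
print(Z.mean(axis=0).round(2), Z.std(axis=0).round(2))
```

At test time one would apply the empirical CDF estimated on the training data rather than re-ranking, so the sketch above covers only the training-time half of the transformation.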
3.2 Noise addition

Introducing noise is not new in unsupervised learning, where it has been shown to have a positive effect [15]. Adding noise to the input data and then forcing the model to learn how to ignore it has also been successful in generative adversarial networks [3], where convergence can be very tricky to achieve. We adapted that idea and embedded it into a supervised learning procedure. The noise addition is described in Algorithm 1, which uses the following notation.

– X = [bs, ts, np] stands for the input tensor with 3 dimensions: batch size, time steps and number of features used for prediction.

– α and β are parameters that control how fast the noise decreases during the training procedure. They should be between 0 and 1, where a lower value corresponds to a faster decrease in the amplitude of the added noise; sd denotes the initial amplitude (standard deviation) of the noise.

– mvn stands for a function sampling a two-dimensional array from a correlated Gaussian distribution with covariance Σ, and matmul stands for matrix multiplication.

Algorithm 1: Noise definition
 1: Inputs: X, α, β, sd, epoch
 2: Y = [ts, ts, np]                ▷ array holding Cholesky decompositions of the time-correlation matrices
 3: for t ∈ {1, ..., np} do
 4:     Σ_t = cov(X[·, ·, t])
 5:     Y[·, ·, t] = chol(Σ_t)      ▷ in practice the closest positive definite matrix to Σ_t is computed before the Cholesky decomposition
 6: end for
 7: Z = [bs, ts, np]                ▷ array holding noise samples
 8: for i ∈ {1, ..., ts} do
 9:     Σ_i = cov(X[·, i, ·])
10:     Z[·, i, ·] = mvn(bs, Σ_i)
11: end for
12: for j ∈ {1, ..., np} do
13:     Z[·, ·, j] = matmul(Z[·, ·, j], Y[·, ·, j])    ▷ correct the initially time-independent noise samples with respect to time
14: end for
15: for w ∈ {1, ..., ts} do
16:     Z[·, w, ·] = Z[·, w, ·] · α^(ts−w) · β^epoch · sd    ▷ decrease the noise over time steps and during the training procedure
17: end for
18: R = X + Z
19: return R

The most common issue with deep learning optimization is falling into a local optimum and being unable to move past it [14]. We expect that the addition of noise will force the model to learn how to ignore both the noise that we added and the noise that is already present in the data by the nature of the stochastic process [16], and with that escape some of the local optima and converge to a deeper one. It is important to tune the noise amplitude by fine-tuning the sd parameter. We optimized the model using the Adam optimizer [7].
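To make the procedure concrete, here is a minimal NumPy sketch of one possible reading of Algorithm 1. The decay factor α^(ts−w)·β^epoch·sd and the diagonal jitter used in place of the closest-positive-definite correction are assumptions made for illustration, not the authors' exact implementation.

```python
import numpy as np

def add_correlated_noise(X, alpha=0.99, beta=0.99, sd=1.25, epoch=0):
    """One reading of Algorithm 1: add noise that shares the input's
    feature correlation (per time step) and time correlation (per feature),
    with an amplitude that decays over time steps and training epochs.
    X has shape (bs, ts, np_) = (batch size, time steps, features)."""
    bs, ts, np_ = X.shape
    eps = 1e-6  # small jitter as a simple stand-in for the closest-PD correction

    # Cholesky factors of the time-correlation matrix of each feature.
    Y = np.empty((ts, ts, np_))
    for t in range(np_):
        cov_t = np.cov(X[:, :, t], rowvar=False)
        Y[:, :, t] = np.linalg.cholesky(cov_t + eps * np.eye(ts))

    # Noise with the per-time-step feature covariance of the data.
    Z = np.empty((bs, ts, np_))
    for i in range(ts):
        cov_i = np.cov(X[:, i, :], rowvar=False) + eps * np.eye(np_)
        Z[:, i, :] = np.random.multivariate_normal(np.zeros(np_), cov_i, size=bs)

    # Correct the initially time-independent samples with respect to time.
    for j in range(np_):
        Z[:, :, j] = Z[:, :, j] @ Y[:, :, j]

    # Decay the amplitude over time steps and training epochs.
    for w in range(ts):
        Z[:, w, :] *= alpha ** (ts - w) * beta ** epoch * sd

    return X + Z

# Usage: X_noisy = add_correlated_noise(X_batch, epoch=current_epoch)
```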
4 Results

We have divided the results section into three parts: unsupervised, supervised and parameter analysis. In the first part we demonstrate why the noise distribution is important; due to hardware constraints, we only used the cryptocurrency dataset for the unsupervised experiments, since we deemed it more demanding than the equity one. In the second part we demonstrate how our method increases the test metric on both datasets. In the third part we focus on how the method's parameters affect its performance.

4.1 Unsupervised learning results

In order to test the efficiency of distribution-matched noise versus plain random noise, we created three models. The baseline model was a deep learning model with 3 stacked LSTM layers, an encoding layer, and then again 3 stacked LSTM layers for the decoded output. We used Adam as the optimizer and mean squared error as the loss function. We stopped the learning after there was no improvement on the validation set for 25 epochs. The validation set was randomly taken out of the train set. Parameters α and β were both set to 0.99 and sd was initially set to 1.25. The noise decreases during the learning procedure; interestingly, keeping the noise constant did not achieve any improvement.

Figure 1: Test loss of the autoencoder model with uncorrelated random noise (green) versus no noise (blue).

Initially we tested the baseline model against the de-noising model with uncorrelated noise. Figure 1 plots the de-noising test loss in green and the baseline test loss in blue. Training was stopped relatively early compared to Figure 2, and the de-noising test loss is clearly even worse than that of the classic autoencoder. In the second experiment we switched from uncorrelated noise to noise with the same distribution as the input data. As is apparent in Figure 2, where the de-noising test loss is again plotted in green and the classic test loss in blue, the de-noising autoencoder achieved a lower test loss than the classic one.

Figure 2: Test loss of the autoencoder model with correlated noise (green) versus no noise (blue).

We expected the train and validation losses to then be worse than with the classic autoencoder. Surprisingly, that was not the case: with the de-noising autoencoder using noise with the same distribution as the input data, both the train and validation losses were better than with the classic one. This result is definitely worth further investigation and experimentation.
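For reference, the following is a minimal Keras sketch of a stacked-LSTM autoencoder of the kind described above. The layer sizes, batch size and encoding dimension are illustrative assumptions, and add_correlated_noise refers to the Algorithm 1 sketch given earlier, not to the authors' exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(ts, n_features, encoded_dim=16):
    """Stacked-LSTM autoencoder: 3 LSTM layers encode the window into a
    single vector, which is repeated and decoded by 3 more LSTM layers.
    Sizes are illustrative, not the paper's exact configuration."""
    inputs = layers.Input(shape=(ts, n_features))
    x = layers.LSTM(64, return_sequences=True)(inputs)
    x = layers.LSTM(32, return_sequences=True)(x)
    encoded = layers.LSTM(encoded_dim)(x)              # encoding layer
    x = layers.RepeatVector(ts)(encoded)
    x = layers.LSTM(encoded_dim, return_sequences=True)(x)
    x = layers.LSTM(32, return_sequences=True)(x)
    x = layers.LSTM(64, return_sequences=True)(x)
    outputs = layers.TimeDistributed(layers.Dense(n_features))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Denoising usage (sketch): regenerate noise each epoch and train on
# (noisy input -> clean target); early stopping with patience 25 would
# track the validation loss across iterations of this loop.
# model = build_autoencoder(ts=30, n_features=10)
# for epoch in range(max_epochs):
#     X_noisy = add_correlated_noise(X_train, epoch=epoch)
#     model.fit(X_noisy, X_train, epochs=1, batch_size=256,
#               validation_data=(X_val, X_val), verbose=0)
```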
4.2 Supervised learning results

In the previous section we have shown that the distribution of the noise matters in the unsupervised setting. In this section we show that the addition of noise significantly improves metrics on unseen data. Each model was run 10 times, and in the tables below we present the average results. Since we operate under the stochastic assumption, this was done to eliminate any doubt that the presented results are good merely due to luck. We believe 10 runs is enough; more was simply not possible due to hardware and time constraints. As before, α and β were both set to 0.99 and sd was initially set to 1.25 when comparing the proposed method to existing state-of-the-art structures.

Since we now operate in a supervised environment, we can compare our models to the majority class. To really demonstrate the effectiveness of the method, we chose to compare the following models:

– Majority class, which serves as a sanity check.
– Random forest with 500 trees.
– Generalized linear model.
– Deep learning model with 3 stacked LSTM layers.
– Deep learning model with 3 stacked LSTM layers and correlated noise addition.

The two deep learning models are identical; both are optimized with Adam and use categorical cross-entropy as the loss function. Initially we only tested the models on the equities data, but it turned out that the equities were not chaotic enough: even with the deep learning models, the difference between train and test loss was not large enough to be problematic. From previous work experience we know that overfitting is a big issue on the cryptocurrency dataset, so we decided to test that dataset in a supervised setting as well.

In Table 1 we show the results on the equity dataset. Our method managed to improve test accuracy (from 0.672 to 0.677) without decreasing train accuracy (0.679). Maintaining test accuracy and keeping it comparable to the train accuracy is important if one needs to build an additional strategy upon the predictions.

Table 1: Supervised results on the equity dataset.

Method         Train Acc.  Test Acc.
Majority       0.513       0.537
Random Forest  0.651       0.653
GLM            0.664       0.655
LSTM           0.679       0.672
noise LSTM     0.679       0.677

In Table 2 we show the results on the cryptocurrency dataset. As on the equity dataset, our method behaves as intended: the overfitting that is apparent in the plain LSTM model is reduced. With these results we can conclude that the proof of concept works, but further claims will require more testing and a deeper parameter analysis.

Table 2: Supervised results on the cryptocurrency dataset.

Method         Train Acc.  Test Acc.
Majority       0.512       0.556
Random Forest  0.692       0.690
GLM            0.682       0.695
LSTM           0.749       0.693
noise LSTM     0.702       0.706

4.3 Parameter analysis

So far we have demonstrated that the noise addition increases the test metric and reduces overfitting, but only with α fixed at 0.99 and sd fixed at 1.25. For now we leave β fixed and focus on the analysis of the noise amplitude and its decrease over the learning process. We have chosen combinations of α, sd and the noise correlation to show how the method behaves with respect to these parameters.

For this experiment we used the equity dataset, where the target variable was the change of the corresponding equity on the next day. We thus created an extremely difficult problem, very likely to overfit, in order to really test the method. We did this on a single deep learning structure, evaluated each parameter setting 25 times, and display the average results in Table 3. To further demonstrate the necessity of the noise structure, we also evaluated a setting where the added noise was random and constant throughout the learning process.

The results in Table 3 confirm the assumptions drawn earlier. It is clear that a suitable combination of noise and its decrease reduces overfitting compared to the model without noise or the model without the amplitude decrease.

Table 3: Parameter analysis on the raw equity dataset.

SD    Alpha  Train Acc.  Test Acc.
0.5   0.9    0.576       0.527
0.75  0.9    0.577       0.526
1.25  0.9    0.578       0.525
1     0.99   0.577       0.524
1     0      0.577       0.524
1.25  0      0.578       0.524
1.25  0.99   0.578       0.523
0     0      0.579       0.521
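A rough sketch of the kind of averaging loop behind Table 3 is shown below; train_and_eval is a hypothetical helper that would build the noise LSTM, train it with the given sd and alpha, and return its train and test accuracy.

```python
import itertools
import numpy as np

def parameter_sweep(train_and_eval, sds=(0.0, 0.5, 0.75, 1.0, 1.25),
                    alphas=(0.0, 0.9, 0.99), runs=25):
    """Average train/test accuracy over repeated runs for each (sd, alpha)
    combination; train_and_eval(sd, alpha) is assumed to return a
    (train_acc, test_acc) tuple for one trained model."""
    results = {}
    for sd, alpha in itertools.product(sds, alphas):
        accs = np.array([train_and_eval(sd=sd, alpha=alpha) for _ in range(runs)])
        results[(sd, alpha)] = accs.mean(axis=0)  # (mean train acc, mean test acc)
    return results
```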
Rych- lik. Copula theory and its applications, volume 198. Springer, 2010. [5] J. Jelencic and D. Mladenic. Modeling stochastic pro- cesses by simultaneous optimization of latent repre- sentation and target variable. 2020. [6] H. Joe. Dependence Modeling with Copulas. CRC Press, 2014. Improving Modeling of Stochastic Processes by. . . Informatica 46 (2022) 13–17 17 [7] D. Kingma and J. Ba. Adam: A Method for Stochas- tic Optimization. 2014. https://arxiv.org/ abs/1412.6980. [8] K. Y . Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016. [9] M. Liu, W. Wu, Z. Gu, Z. Yu, F. Qi, and Y . Li. Deep learning based on batch normalization for p300 signal detection. Neurocomputing, 275:288–297, 2018. [10] R. C. Merton. Option pricing when underlying stock returns are discontinuous. Journal of financial eco- nomics, 3(1-2):125–144, 1976. [11] J. J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Meth- ods and Applications. New York Institute of Finance Series. New York Institute of Finance, 1999. [12] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural informa- tion processing systems, 29:901–909, 2016. [13] B. W. Turnbull. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society: Series B (Methodological), 38(3):290–295, 1976. [14] R. Vidal, J. Bruna, R. Giryes, and S. Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017. [15] P. Vincent, H. Larochelle, Y . Bengio, and P.-A. Man- zagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008. [16] N. Wax. Selected papers on noise and stochastic pro- cesses. Courier Dover Publications, 1954. 18 Informatica 46 (2022) 13–17 J. Jelenˇ ciˇ c et al.