Proceedings of the 27th Computer Vision Winter Workshop Marija Ivanovska, Matej Dobrevski, Žiga Babnik, Domen Tabernik (eds.) Terme Olimia, Slovenia, February 14-16, 2024 Proceedings of the 27th Computer Vision Winter Workshop February 14-16, 2024, Terme Olimia, Slovenia © Slovenian Pattern Recognition Society, Ljubljana, February 2024 Volume Editors: Marija Ivanovska, Matej Dobrevski, Žiga Babnik, Domen Tabernik https://cvww2024.sdrv.si/proceedings/ Publisher Slovenian Pattern Recognition Society, Ljubljana 2024 Electronic edition Slovenian Pattern Recognition Society, Ljubljana 2024 © SDRV 2024 CIP zapis: Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani. COBISS.SI-ID 185271043 ISBN 978-961-96564-0-2 (PDF) 2 Contents Preface 4 Organizers 5 Keynote Speaker 6 Invited Presentations Speaker List 6 Contributed Papers 7 Pose and Facial Expression Transfer by using StyleGAN . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Dense Matchers for Dense Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Enhancement of 3D Camera Synthetic Training Data with Noise Models . . . . . . . . . . . . . . . . . 29 Detecting and Correcting Perceptual Artifacts on Synthetic Face Images . . . . . . . . . . . . . . . . . 38 Cross-Dataset Deepfake Detection: Evaluating the Generalization Capabilities of Modern DeepFake Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Video shutter angle estimation using optical flow and linear blur . . . . . . . . . . . . . . . . . . . . . 57 Measuring Speed of Periodical Movements with Event Camera . . . . . . . . . . . . . . . . . . . . . . 66 Weather-Condition Style Transfer Evaluation for Dataset Augmentation . . . . . . . . . . . . . . . . . 75 3 Preface The Computer Vision Winter Workshop (CVWW) is an annual international meeting fostered by computer vision research groups from Ljubljana, Prague, Vienna, and Graz. The workshop aims to encourage interaction and exchange of ideas among researchers and PhD students. The focus of the workshop spans a wide variety of computer vision and pattern recognition topics, such as image analysis, 3D vision, biometrics, human-computer interaction, vision for robotics, machine learning, and applied computer vision and pattern recognition. CVWW2024 was organized by the Slovenian Pattern Recognition Society (SPRS), and held in Terme Olimia, Slovenia, from February 14th to February 16th, 2024. We received a total of 10 contributed paper submissions from multiple countries and institutions. The paper selection was coordinated by the Program Chairs and included a rigorous double-blind review process. The international Program Committee consisted of 38 computer vision experts, who conducted the reviews. Each submission was examined by at least three experts, who were asked to comment on the strengths and weaknesses of the papers and justify their recommendation for accepting or rejecting a submission. The Program Chairs used the reviewers’ comments to render the final decision on each paper. As a result of this review process, 8 original contributed papers were accepted for publication. These were presented at the workshop as oral or poster presentations. Complementing this, we were privileged to host 27 invited presentations from both experienced researchers and researchers in the early stages of their professional careers. 
These presentations were selected by the Program Chairs in consultation with the Program Committee. The workshop featured a keynote by Prof. Mohamed Daoudi, a Full Professor at IMT Nord Europe and Head of the Image Group at CRIStAL Laboratory. The Workshop Chairs would like to thank the Steering Committee for their advice, directions, and discussions. We also thank the Program Committee for their high-quality and detailed comments, which served as a valuable source of feedback for all authors. Their time and effort made CVWW2024 possible. We thank Prof. Mohamed Daoudi for taking time from his busy schedule to deliver the keynote. We also extend our thanks to the Slovenian Pattern Recognition Society, through which the workshop was organized, and we would like to acknowledge and thank our supporters from the Faculty of Electrical Engineering and the Faculty of Computer and Information Science, University of Ljubljana, for their contributions. We would also like to thank our sponsor, the SMASH MSCA postdoctoral program. Finally, we wish to thank all authors, presenters, and attendees for making the 27th iteration of the Computer Vision Winter Workshop a success!

Official sponsor
Hosts
Laboratory for Machine Intelligence

Organizers

Workshop Chairs
Marija Ivanovska, University of Ljubljana
Matej Dobrevski, University of Ljubljana
Žiga Babnik, University of Ljubljana
Domen Tabernik, University of Ljubljana

Program Chairs
Marija Ivanovska, University of Ljubljana
Matej Dobrevski, University of Ljubljana

Steering Committee
Matej Kristan, University of Ljubljana
Janez Perš, University of Ljubljana
Danijel Skočaj, University of Ljubljana
Vitomir Štruc, University of Ljubljana

Program Committee
Alan Lukežič, University of Ljubljana
Csaba Beleznai, Austrian Institute of Technology
Darian Tomašević, University of Ljubljana
Domen Tabernik, University of Ljubljana
Friedrich Fraundorfer, Graz University of Technology
Janez Perš, University of Ljubljana
Janez Križaj, University of Ljubljana
Jer Pelhan, University of Ljubljana
Jiri Matas, Czech Technical University in Prague
Jon Muhovič, University of Ljubljana
Lea Bogensperger, Graz University of Technology
Levente Hajder, Eötvös Loránd University
Lojze Žust, University of Ljubljana
Luka Čehovin Zajc, University of Ljubljana
Marco Peer, Vienna University of Technology
Marko Rus, University of Ljubljana
Marko Brodarič, University of Ljubljana
Martin Matoušek, Czech Technical University in Prague
Martin Zach, Graz University of Technology
Martin Kampel, Vienna University of Technology
Matej Kristan, University of Ljubljana
Matic Fučka, University of Ljubljana
Matthias Wödlinger, Vienna University of Technology
Nicholas Hockings, University of Veterinary Medicine, Vienna
Oleksandr Shekhovtsov, Czech Technical University in Prague
Ondrej Chum, Czech Technical University in Prague
Pavel Krsek, Czech Technical University in Prague
Pedro Hermosilla-Casajus, Vienna University of Technology
Peter Rot, University of Ljubljana
Robert Harb, Graz University of Technology
Robert Sablatnig, Vienna University of Technology
Roman Pflugfelder, Vienna University of Technology
Sebastian Zambanini, Vienna University of Technology
Tim Oblak, University of Ljubljana
Torsten Sattler, Czech Technical University in Prague
Vitjan Zavrtanik, University of Ljubljana
Walter Kropatsch, Vienna University of Technology
Žiga Babnik, University of Ljubljana

Distinguished Keynote Speaker

Learning to Synthesize 3D Faces and Human Interactions
Prof.
Mohamed Daoudi IMT Nord Europe This talk will summarize various aspects of 3D human face and body motion generation. I will first present our recent results on 3D and 4D face synthesis. We propose a new model that generates transitions between different expressions, and synthesizes long and composed 4D expressions. Second, I will present results on two-person interaction synthesis, a crucial element for designing 3D human motion synthesis frameworks. It can open up a wide variety of new applications in entertainment media, interactive mixed and augmented reality, human-AI interaction, and social robotics. Invited Presentations Speaker List Anja Delić, University of Zagreb Christian Stippel, Vienna University of Technology Csongor Csanád Karikó, Eötvös Loránd University Emina Ferzana Uzunović, University of Ljubljana Ivan Sabolic, University of Zagreb Jer Pelhan, University of Ljubljana Jiri Matas, Czech Technical University in Prague Jon Muhovic, University of Ljubljana Lisa Weijler, Vienna University of Technology Marco Peer, Vienna University of Technology Marko Rus, University of Ljubljana Matic Fučka, University of Ljubljana Miroslav Purkrabek, Czech Technical University in Prague Nela Petrželková, Czech Technical University in Prague Nikolaos Efthymiadis, Czech Technical University in Prague Nikolaos-Antonios Ypsilantis, Czech Technical University in Prague Paolo Sebeto, Vienna University of Technology Peter Rot, University of Ljubljana Petr Vanc, Czech Technical University in Prague Rafael Johannes Sterzinger, Vienna University of Technology Shawn-Moses Cardozo, Valeo R&D Prague Sinisa Stekovic, Graz University of Technology Tamás Tófalvi, Eötvös Loránd University Tomáš Jel´ınek, Czech Technical University in Prague Tong Wei, Czech Technical University in Prague Václav Vávra, Czech Technical University in Prague Viktor Kocur, Comenius University Viktoria Pundy, Vienna University of Technology 6 Contributed papers 27th Computer Vision Winter Workshop Terme Olimia, Slovenia, February 14–16, 2024 Pose and Facial Expression Transfer by using StyleGAN Petr Jahoda, Jan Cech Faculty of Electrical Engineering, Czech Technical University in Prague Abstract. We propose a method to transfer pose and expression between face images. Given a source and target face portrait, the model produces an output image in which the pose and expression of the source face image are transferred onto the target identity. The architecture consists of two encoders and a mapping network that projects the two inputs into the latent space of StyleGAN2, which finally generates the output. The training is self-supervised from video sequences of many individuals. Manual labeling is not required. Our model enables the synthesis of random identities with controllable pose and expression. Close-to-real-time performance is achieved. 1. Introduction Animating facial portraits in a realistic and controllable way has numerous applications in image Source Target Generated editing and interactive systems. For instance, a pho-Figure 1. Results of our method. Pose and expression torealistic animation of an on-screen character per-from the source image is transferred onto the identity of forming various human poses and expressions driven the target image. The method generalizes to paintings, de-by a video of another actor can enhance the user spite being trained on videos of real people. experience in games or virtual reality applications. 
Achieving this goal is challenging, as it requires repration of non-linear editing methods and example-resenting the face (e.g. modeling in 3D) in order to based control of the synthesis remains relatively un-control it and developing a method to map the desired explored. form of control back onto the face representation. This work presents a method that synthesizes a With the advent of generative models, it has be-new image of an individual by taking a source (driv-come increasingly easier to generate high-resolution ing) image and a target (identity) image as input, in-human faces that are virtually indistinguishable from corporating the pose and expression of the person in real images. StyleGAN2 [14] achieves the state-of-the source image into the generated output from the the-art level of image generation with high quality target image, as shown in Fig. 1. and diversity among GANs [11]. Although extensive The main idea of our method is to encode both im-research has been conducted on editing images in the ages into pose/expression and identity embeddings. latent space of StyleGANs, most studies have primar-The embeddings are then mapped into the latent ily explored linear editing approaches. StyleGAN is space of the pre-trained StyleGAN2 [14] decoder that popular for latent space manipulation using learned generates the final output. The model is trained from semantic directions, e.g. making a person smile, ag-a dataset of short video sequences each capturing a ing, change of gender or pose. However, the explo-single identity. The training is self-supervised and 8 does not require labeled data. We rely on neural the input images. These features together with the rendering in a one-shot setting without using a 3D global latent vector predict two 3D warpings. The graphics model of the human face. first warping removes the source motion from the By using pre-trained components of our model, volumetric features, and the second one imposes the we avoid the complicated training of a generative target motion. The features are processed by a 3D model. Our results confirm high flexibility of the generator network and together with the target mo-StyleGAN2 model, which produces various poses tion are input into a 2D convolutional generator that and facial expressions, and that the output can be outputs the final image. Their architecture is com-efficiently controlled by another face of a different plex and is made up of many custom modules that identity. are not easily reproducible. Our model is much sim-Our main contributions are: (1) Method for pose pler since it is composed of well-understood open-and expression transfer with close to real-time infer-source publicly available models. We rely on pre-ence. (2) A Generative model that allows the syn-trained StyleGAN2 [14] to generate the final output thesizing of random identities with controllable pose and pre-trained ReStyle image encoder [4] to project and expression. real input images into the latent space. Regarding image editing in the latent space of 2. Related Work GANs, paper [19] pointed out the arithmetic prop-Before deep learning methods, the problem of erties of the generator’s latent space. Since then, re-expression transfer was often approached using searchers have extensively studied the editing possi-parametric models. The 3D Morphable Model bilities that can be done in this domain. Specifically (3DMM) [5] was used in e.g., [26, 27]. 
for StyleGAN, many works have been published re-More recently deep models have become promi- garding latent space exploration [12, 23, 3, 2, 18]. nent. For instance, X2Face [33] demonstrates that InterFaceGAN [23] shows that linear semantic di-an encoder-decoder architecture with a large collec-rections can be easily found in a supervised manner. tion of video data can be trained to synthesize hu-However, the latent directions are heavily entangled, man faces conditioned by a source frame without any meaning that one learned latent direction will likely parametric representation of the face or supervision. influence other facial attributes as well. For exam-Furthermore, the paper shows that the expression can ple, given a learned latent direction of a pose change, be driven not only by the source frame but also by au-when applied, the person might change expression, dio to some degree of accuracy. Similarly, [36] em-hairstyle, or even identity. However, manipulating ploys a GAN architecture with an additional embed-real input images requires mapping them to the gending network that maps facial images with estimated erator’s latent space. facial landmarks into an embedding that controls the The process of finding a latent code that can gen-generator. This allows for conditioning the generated erate a given image is referred to as the image in-image only on facial landmarks. version problem [7, 38, 30]. There are mainly two The approach proposed in [32] enables the gen-approaches to image inversion. Either through di-eration of a talking-head video from a single input rect optimization of the latent code to produce the frame and a sequence of 3D keypoints, learned in specified image [2, 1, 21, 39] or through training an an unsupervised way, that represent the motions in encoder on a large collection of images [20, 4, 28]. the video. By utilizing this keypoint representation, Typically, direct optimization gives better results, but the method can efficiently recreate video conference encoders are much faster. In addition, the encoders calls. Moreover, the method allows for the extrac-show a smoother behavior, producing more coherent tion of 3D keypoints from a different video, enabling results on similar inputs [29]. cross-identity motion transfer. Another reason why we chose to use an encoder Recently, Megaportraits [9] have achieved an im-for the image inversion is that we require many train-pressive level of cross-reenactment quality in one ing images to be inverted and direct optimization of shot. Their method utilizes an appearance encoder, each training sample would not be computationally which encodes the source image into a 4D volumet-feasible. We chose ReStyle [4], which uses an itera-ric tensor and a global latent vector, and a motion tive encoder to refine the initial estimate of the latent encoder, which extracts motion features from both of code. This approach is a suitable fit for our purpose, 9 as it leverages smoother behavior over similar inputs through the mapping network into a latent code z from encoders as well as better reconstruction quality ∈ W+ that is then used as an input for the generator from iterative optimization. Currently, the encoders that finally produces an output image g. Formally, supported in ReStyle are pSp (pixel2style2pixel) [20] and e4e (encoder4editing) [28]. Although both en-g M E , coders embed images into the extended latent space s→t = G m(s) ⊕ Ei(t) W+, Tov et al. 
[28] argue that by designing an encoder that predicts codes in where symbol W+ which reside close to ⊕ denotes concatenation. ResNet-IR SE 50 has been shown to embed vari- W they can better balance the distortion-editability trade-off. However, we chose to use ReStyle with a ous entities into the latent space of StyleGAN2 such pSp encoder in our network as the baseline method as cartoons [20], hair [25] and much more. There-with the e4e encoder had trouble preserving the tar-fore, we utilize this network as encoder Em. For the get identity. encoder Ei, we use a pre-trained ReStyle with the An approach similar in spirit to ours, in the sense pSp configuration. For the mapping network M, we of using StyleGAN for expression transfer, is taken employ a single fully connected linear layer. For the by Yang et al. [35]. Nevertheless, they do not trans-generator, we use the pre-trained StyleGAN2 which fer the pose, but the expression only. Moreover, their produces high-resolution images of 1024 × 1024 px. method relies on optimization, which is much slower. 3.2. Training They report running times for a single image in minutes, while our method runs in fractions of seconds We employ self-supervised training to optimize and is thus more practical for generating videos. the parameters of the encoder Em and the mapping network M, while keeping the parameters of the gen-3. Method erator G and the encoder Ei fixed. The training is performed on an unlabeled dataset of short video Our framework takes two face images as input, a clips, each containing a single person. source (driving) face image, and a target (identity) During each iteration of the training procedure, we face image. The network produces an output image randomly sample two pairs of frames (sA, tA) and where the pose and expression from the source face (sB, tB) from two video clips of identities A and B, image are transferred onto the target identity. respectively. We then generate two images gsA→tA where the source and target frames are of identity 3.1. Architecture A and gs where the source is of identity A and the A→tB Fig. 2 depicts the proposed architecture. The net-target is of identity B. We employ the following loss work consists of a motion (pose+expression) encoder functions: Em, an identity encoder Ei, a mapping network M , and a generator network G. The encoder E Pixel-wise loss. It is Euclidean distance between the i embeds the identity of the target face image. The encoder source and generated image intensities Em embeds motion, the pose and expression of the source face image. The mapping network then mixes L2 = ∥sA − gs ∥ A→tA 2. (1) the two embeddings and projects the output into the latent space of the pre-trained StyleGAN2 genera-where sA is the source frame of identity A and tor. This approach offers the advantage of generating gs is a generated image using both inputs from A→tA high-quality images through StyleGAN while avoid-identity A. ing the intricate GAN training process. The network Perceptual loss. LPIPS (Learned Perceptual Image architecture is inspired by [25]. Patch Similarity) [37] was shown to correlate with Specifically, a source image s and a target image t human perception of image similarity. In praticular, are aligned and resized to 256 × 256 pixels and then fed into their corresponding encoders, where they LLPIPS = 1 − ⟨P (sA), P (gs )⟩, (2) A→tA are embedded in the extended latent space W+ of 18 × 512 dimensions. 
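The data flow just described can be summarized in the following minimal sketch. The class, method, and argument names are illustrative placeholders rather than the authors' released code, and the assumption that the mapping network mixes the two W+ codes per style layer is ours.

```python
import torch
import torch.nn as nn

class PoseExpressionTransfer(nn.Module):
    """Minimal sketch of g_{s->t} = G(M(E_m(s) concatenated with E_i(t))).

    E_m is the trained motion (pose + expression) encoder, E_i the frozen
    ReStyle-pSp identity encoder, and G the frozen StyleGAN2 generator.
    """

    def __init__(self, motion_encoder, identity_encoder, generator,
                 n_styles=18, style_dim=512):
        super().__init__()
        self.E_m = motion_encoder      # trained during optimization
        self.E_i = identity_encoder    # frozen ReStyle (pSp configuration)
        self.G = generator             # frozen StyleGAN2, 1024x1024 output
        # mapping network M: a single fully connected linear layer; here it
        # is assumed to mix the two W+ codes independently per style layer
        self.M = nn.Linear(2 * style_dim, style_dim)

    def forward(self, source, target):
        # both inputs are aligned 256x256 face crops
        z_s = self.E_m(source)               # (B, n_styles, style_dim)
        z_t = self.E_i(target)               # (B, n_styles, style_dim)
        z = torch.cat([z_s, z_t], dim=-1)    # concatenation of embeddings
        w_plus = self.M(z)                   # latent code in W+
        return self.G(w_plus)                # generated image g_{s->t}
```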
Embeddings zs for pose and where P is a perceptual feature extractor expression of source image s and zt for the identity of (AlexNet) [16] that outputs unit-length normal-target image t are then concatenated and transformed ized features and ⟨., .⟩ denotes dot product. 10 Motion Encoder Map StyleGAN2 Identity Encoder Pose and Expression transfer Figure 2. The architecture of the proposed model. The Motion encoder and Mapping network weights are trained, while the Identity encoder and StyleGAN2 weights stay fixed during training. Identity loss. To ensure that the generated image area of 188 × 188 pixels of the original 256 × 256 preserves the identity of the target image, we employ aligned image. The losses L2 crop and LLPIPS crop the pre-trained facial recognition model ArcFace [8]. are used exactly as their aforementioned counter-We calculate it in a similar fashion to the previous parts. The cropped losses turned out to be important. loss: Otherwise, we observed the model struggled to transfer the expression precisely, probably being disturbed LID = 1 − ⟨D(tB), D(gs )⟩, (3) A→tB by the complex texture of hair and background. where D produces unit-length normalized embed- The total loss which is used to train the network is dings of respective frames. the weighted sum of the individual losses CosFace loss. Finally, we implement the CosFace L = wL L 2 2 + wLP IP S LLP IP S + wIDLID loss [31] that we use in a similar way to Megapor- +wcosLcos + wL L (6) 2 crop 2 crop traits [9]. The purpose of the loss is to make the +wLPIPS cropLLPIPS crop. embeddings of coherent pose and expressions similar, while maintaining the embeddings of indepen-3.3. Dataset dent pose and expressions uncorrelated. For this For our goal, we need a dataset consisting of nu-loss, only motion descriptors embedded by Em, are merous unique identities and a wide range of images necessary. We calculate motion descriptors zA = with varying poses and facial expressions for each Em(sA), zB = Em(sB) of the inputs, and of the identity. To meet this requirement, it was necessary outputs fed to the encoder zA→A = Em(gs ), to use video data despite a potential trade-off in im-A→tA zA→B = Em(gs ). We then arrange them into age quality. A→tB positive pairs P that should align with each other: We decided to use the VoxCeleb2 dataset [6] P = (zA→A, zA), (zA→B, zA), and negative pairs: which was collected originally for speaker recogni-N = (zA→A, zB), (zA→B, zB). These pairs are then tion and verification. It has since been used for talk-used to calculate the following cosine distance: ing head synthesis, speech separation, and face generation. It contains over a million utterances from d(zi, zj) = a · (⟨zi, zj⟩ − b), (4) 6 112 identities, providing us with a vast array of where both a and b are hyperparameters. Finally, subjects to work with. The dataset is primarily composed of celebrity interview videos, offering a broad X exp{d(z L k, zl)} spectrum of poses and expressions to utilize. The cos = − log P . exp{d(zk, zl)} + exp{d(zi, zj)} videos are categorized by identity and trimmed into (zk,zl)∈P (zi,zj)∈N (5) shorter utterances that range from 5 to 15 seconds in duration. They have also already undergone pre-Furthermore, we used cropped versions of the L2 processing that includes cropping the frames to the loss and the LLPIPS losses. The crop is the central bounding boxes around each speaker’s face. 
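The combination of the loss terms defined above (Eq. 6) can be sketched as follows. The wrapper names for the pre-trained LPIPS (AlexNet) and ArcFace networks are hypothetical stand-ins, the weights w correspond to the coefficients of Eq. 6, and the zero-weighted CosFace term is omitted for brevity.

```python
import torch.nn.functional as F

def center_crop(img, size=188):
    """Central 188x188 crop of the 256x256 aligned face (drops hair/background)."""
    h, w = img.shape[-2:]
    top, left = (h - size) // 2, (w - size) // 2
    return img[..., top:top + size, left:left + size]

def cosine_loss(a, b):
    # 1 - <a, b> on unit-length descriptors, cf. Eqs. (2) and (3)
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

def total_loss(s_A, g_AA, t_B, g_AB, lpips_net, arcface, w):
    """Weighted sum of Eq. (6) for one training step."""
    # pixel-wise loss (Eq. 1; mean-squared error used here as a stand-in
    # for the Euclidean distance) and perceptual loss (Eq. 2), both on the
    # same-identity pair
    l2 = F.mse_loss(g_AA, s_A)
    lpips = lpips_net(g_AA, s_A).mean()
    # the same two terms on central crops, which proved important
    l2_crop = F.mse_loss(center_crop(g_AA), center_crop(s_A))
    lpips_crop = lpips_net(center_crop(g_AA), center_crop(s_A)).mean()
    # identity preservation on the cross-identity output (Eq. 3)
    l_id = cosine_loss(arcface(t_B), arcface(g_AB))
    return (w["l2"] * l2 + w["lpips"] * lpips + w["id"] * l_id
            + w["l2_crop"] * l2_crop + w["lpips_crop"] * lpips_crop)
```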
On 11 top of that, we use the official preprocessing script and expression and zA is the latent code of the same 1 provided by StyleGAN to normalize the images to person with a different pose and expression. Scalar α 224 × 224 pixels [13]. represents the magnitude of the edit and the resulting As the number of videos per individual differs, we latent code zA1→B fed into StyleGAN generates the balanced it out by only using a maximum number of output, ideally a person B with the pose and expres-videos per person. We extracted 10 frames at half-sion of A1. In our case, we always set α to one, to second intervals from each video. Subsequently, we get the same expression and pose. eliminate images with extreme poses that would be However, this approach requires the initial pose difficult to generate with StyleGAN. The final train-and facial expression in frame A0 to match the pose ing set contains around 6k different identities, each and expression of the person in frame B. This is with around 10 images from 5 different video clips, a very strict requirement, as there will probably be resulting in a little under 300k images. The dataset no frame in a video where the pose and expression was split into disjoint training-validation-test sets 80-match perfectly. 10-10 percent, respectively. No identity appears in Instead of searching for two frames that match any of the splits simultaneously. pose and expression the best, we utilize an arithmetic 3.4. Implementation details property of the latent space. We flip each frame in a video by the vertical axis and invert them along with The model was trained for about a million steps their non-flipped counterparts. Then we calculate the with a batch size of 8. The best model checkpoint mean latent code for all the frames. This results in a was selected based on the error statistics measuring frontal pose with an average expression across the the expression transfer fidelity and identity preserva-video, typically a neutral expression. We do this tion, see Sec. 4.3. for both videos, which provides us with the same We used the ranger optimizer [34], which compose and a similar expression for the initial frames. bines the Rectified Adam algorithm and Look Ahead. We then used the aforementioned method to transfer We set the learning rate to 1 · 10−5. For our model pose and expression from one person to another. The with the best performance, we used the following hy-downside of this method is that it does not work with perparameters for the losses: wL = 0, w single images, but requires a short video of each in-2 LP IP S = 0.05, wID = 0.3, wcos = 0, wL = 2, dividual. Moreover, inverting all the frames within 2 crop wLPIPS crop = 0.3. We set parameters a = 5 and the videos is required, which is computationally de-b = 0.2 in the CosFace loss. manding. 4. Experiments We consider two versions of the baseline method. Both invert all the images with ReStyle [4], but one 4.1. Comparison of methods with the pSp encoder configuration [20] and the other Baseline method. To the best of our knowledge, with the e4e configuration [28]. we are not aware of any publicly available implementation of our problem. Therefore, we compare Variants of our method. Besides the default the proposed method with a linear StyleGAN latent model presented in Sec. 3 denoted as (Ours), we space manipulation as the baseline method. tested the other two variants. 
(Ours-Gen) does not Given two frames A0 and A1 (sampled from the have the StyleGAN generator fixed, but its weights same video) where the pose and expression of the are optimized during the training of the entire model. person differ, the edit vector is represented by the (Ours-Cos) is the model where the CosFace loss difference between the latent codes corresponding to is engaged during training. CosFace loss has zero the inverted frames. The pose and expression can weight and the SyleGAN generator is fixed in the de-then be imposed on a different person in image B by fault model. adding the edit vector to the latent code of image B. Formally, 4.2. Qualitative evaluation In Fig. 3 we present several examples of pose and zA − z ), (7) 1→B = zB + α · (zA1 A0 expression transfer between a variety of identities. where zB is the latent code of the target person, zA The pairs are challenging since the input frames dif-0 is the latent code of the person A with the initial pose fer in ethnicity, gender, and illumination. Another 12 Target Source Identity 1 Identity 2 Identity 3 Identity 4 Identity 5 Figure 3. Pose and expression transfer results. The top row depicts the target (identity) input images, leftmost column the source (driving) input images. The grid shows the transfer results. The identities are preserved column-wise, and the poses and expressions are preserved row-wise. challenge is the accessories that people wear such as ferred correctly. This can be seen in the second and glasses or earrings. last columns of the Fig. 4. Our best model represents The pose and expression are seen to be transferred eye movement better than other variants while also while still preserving the input identity. The model generating more realistic images. learned to transfer pose, expression, and eye movement. The network also correctly identifies that if eyeglasses are present in the identity image, they are preserved in the output image. Surprisingly, the net-Expression transfer to synthetic faces. Our work is able to model eye movement even behind method allows for transferring pose and expression glasses. However, the model is not perfect for pre-onto a randomly generated identities via StyleGAN. serving hair or background. We sample a random latent code z from the Gaus- In Fig. 4, we compare the results of the base- sian distribution, which is then mapped by StyleGAN line method with several variants of our proposed mapping network to w ∈ W. To obtain a valid iden-method. The baseline method does not use the tartity latent code for our network, we first generate an get image, but rather a frontal representation with an image using StyleGAN with w and then invert it us-average expression across the video of the identity, ing ReStyle. This is due to the fact that ReStyle en-as explained in Sec. 4.1. The figure shows that the codes the images into a specific subspace of Style-baseline methods have trouble preserving the iden-GAN’s latent space and our model is trained to opertity of the target person and several visual artifacts ate in this subspace. Feeding w directly into our map-are present. Some expressions are transferred rel-ping network M often results in certain artifacts. In atively faithfully. However, it can happen that the this way, we can efficiently generate images of ran-average expression in one video is not the same as dom identities with a specific pose and expressions in the other, and then the expressions are not trans-given an example. 
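A minimal sketch of this synthetic-identity pipeline is given below; the method names (mapping, synthesis, invert) are placeholders standing in for the actual StyleGAN2 and ReStyle interfaces.

```python
import torch

@torch.no_grad()
def random_identity_latent(stylegan, restyle_encoder, batch=1, z_dim=512,
                           device="cuda"):
    """Sample a random identity and return a code usable by the transfer model.

    A random z is mapped to w, an image is generated with StyleGAN2, and the
    image is then re-inverted with ReStyle so that the identity code lies in
    the sub-space the transfer model was trained on.
    """
    z = torch.randn(batch, z_dim, device=device)      # z ~ N(0, I)
    w = stylegan.mapping(z)                           # w in W
    image = stylegan.synthesis(w)                     # random identity image
    # feeding w directly into the mapping network M often causes artifacts,
    # so the generated face is inverted back with the ReStyle encoder
    w_plus_identity = restyle_encoder.invert(image)   # code in ReStyle's W+
    return image, w_plus_identity
```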
Figure 4. Pose and expression transfer comparison. The top two rows show the input source and target images. The following rows show the results: the baseline methods (pSp and e4e inversion) and the three variants of our method, Ours-Gen with optimized generator weights, Ours-Cos with the CosFace loss, and Ours as our best model.

4.3. Quantitative evaluation

We evaluate the proposed method on pose and expression transfer fidelity, as well as on identity preservation. We then compare the results with the baseline methods and other variants of our method. The evaluation is done on the test split of the VoxCeleb2 dataset [6], which contains 120 different identities. Our evaluation focuses on a cross-reenactment scenario, i.e., the source and target images are from different identities. In particular, for each video in the test set, every frame is taken, one by one, as the source (driving) image, and a single random frame of another video is taken as the target image (of a different identity) and fed into the model to generate output videos.

For pose transfer evaluation, we use a pre-trained CNN estimator [22]. The network predicts yaw, pitch, and roll; however, we consider only yaw and pitch, since all the pre-processed and generated images have the same roll. The pose error is the mean absolute error of yaw and pitch between the generated images and their corresponding source (driving) images.

For the evaluation of identity preservation, we use ArcFace [8]. The ID error is the cosine similarity between the generated and the target (identity) frame descriptors.

To the best of our knowledge, there is no straightforward method for measuring expression transfer fidelity. In theory, the expression, independent of identity and pose, should be described by the activation of Facial Action Units (FAU) [10]. However, using a recent state-of-the-art FAU extractor [17] did not yield meaningful results on our data. The reason is probably that only strong activations are detected and subtle expression changes are not captured at all. Therefore, we opted to utilize Facial Landmarks (FL). To detect facial landmarks we utilize the Dlib library [15], which predicts 68 landmarks on a human face. We first calculate the aspect ratios of certain facial features following [24]. Specifically, we calculate the aspect ratios of both eyes, the mouth, and mea-

Method      Pose (MAE) ↓   FL (CORR) ↑   ID (CSIM) ↑
Base pSp    8.491          0.656         0.671
Base e4e    8.720          0.621         0.563
Ours Gen    8.325          0.556         0.760
Ours Cos    7.968          0.528         0.762
Ours        7.673          0.620         0.801

Table 1. Quantitative comparison of the baseline method and variants of our method. Pose error, expression fidelity (measured by facial landmarks), and identity preservation are evaluated. Symbol ↑ indicates that larger is better and ↓ that smaller is better.

Computational demands. The speed of inference is very important in practical applications. Our method needs to invert the identity image via ReStyle, which takes approximately half a second on a modern GPU. Then it can generate up to 20 images per second with that identity, given all the images are already aligned. On the other hand, the baseline method requires the inversion of all the images from the source video and the target video, but then can generate up to 50 images per second. Given two short
5-sec videos with 24 frames per second, which are typical for the VoxCeleb2 dataset, our method gener-sure the movement of the eyebrows by calculating ates the entire video in less than 6 secs, whereas the the aspect ratio between the eyebrows and the eyes. baseline method would require a little over 2 mins. Instead of measuring expression fidelity between single images, we calculate cross-correlation of aspect 5. Conclusions ratios between (source and generated) videos, to be insensitive to individual facial proportions. In par-We presented a method for transferring the pose ticular, each aspect ratio in the source and generated and expression of a source face image to a target videos is calculated for all the frames of the videos. face image while preserving the identity of the tar-This gives us two signals of the same length that get face. The proposed method is self-supervised and are cross-correlated. Finally, the cross-correlations does not require labeled data. We reviewed the exist-of all aspect ratios are averaged, giving us the final ing methods and proposed a new one that is based on FL statistic. the StyleGAN generator. We extensively evaluated This is a proxy statistic, since it does not capture our method on pose and expression transfer fidelity eyeball movements and does not measure well asym-as well as on identity preservation. We compare our metric facial expressions, but seems to correlate with method to the baseline that utilizes the arithmetic subjective quality of facial expression transfer. property of StyleGANs latent space. We showed Tab. 1 shows the quantitative comparison of the that our model transfers pose, expression, and even baseline method and variants of our method on the eye movement under challenging conditions such as VoxCeleb2 test set. The baseline methods strug-different ethnicity, gender, pose, or illumination be-gle to preserve the identity of the generated person tween the source and target images. Our method can and generate a correct pose, while they are good be used to generate images of random identities with or comparable in expression transfer fidelity. Our controllable pose and facial expressions by coupling best model achieves ArcFace cosine similarity of our model with the StyleGAN generator. The infer-0.8, which is very good considering that the cosine sim-ence runs in close to real-time; thus, it is practically ilarity between the original and inverted images via usable to generate videos having a driving video and ReStyle with pSp configuration is a single still image of a target face. 0.83. Therefore, our method achieves identity preservation close to The limitation is that certain expressions are not the maximum possible with ReStyle encoder. transferred faithfully. For instance, problematic are fully closed eyes, which is probably due to the diffi-Our method performs worse with the CosFace loss culty of StyleGAN in producing such images. Face function (Ours Cos). While the loss function appears images with eyes completely closed were probably to improve image illumination, as reported by [9], it not often seen when StyleGAN was trained. The significantly slowed training and hindered expression remedy could be a fine-tuning of the generator on and eye movement transfer. The variant with (Ours problematic images and a certain regularization of Gen) optimized generator weights produces overall the loss function. inferior output compared to the default model, where We will make the code and the trained model pub-the generator is fixed. 
The generated images suffer licly available. from unpleasant artifacts while also having a less realistic color scheme. This is probably a consequence Acknowledgement The research was supported by of overfitting. project SGS23/173/OHK3/3T/13. 15 References [13] T. Karras, S. Laine, and T. Aila. Flickr-faces- hq dataset (ffhq). https://github.com/ [1] R. Abdal, Y. Qin, and P. Wonka. Image2StyleGAN: NVlabs/ffhq-dataset, 2019. 5 How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF Interna- [14] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehti-tional Conference on Computer Vision, pages 4432– nen, and T. Aila. Analyzing and Improving the Im-4441, 2019. 2 age Quality of StyleGAN. In Proceedings of the IEEE/CVF conference on computer vision and pat- [2] R. Abdal, Y. Qin, and P. Wonka. Image2stylegan++: tern recognition, pages 8110–8119, 2020. 1, 2 How to Edit the Embedded Images? In Proceedings [15] D. E. King. Dlib-ml: A Machine Learning of the IEEE/CVF conference on computer vision and Toolkit. The Journal of Machine Learning Research, pattern recognition, pages 8296–8305, 2020. 2 10:1755–1758, 2009. 7 [3] R. Abdal, P. Zhu, N. J. Mitra, and P. Wonka. Style- [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Im-flow: Attribute-conditioned exploration of stylegan-agenet classification with deep convolutional neural generated images using conditional continuous nor-networks. In F. Pereira, C. Burges, L. Bottou, and malizing flows. ACM Transactions on Graphics K. Weinberger, editors, Advances in Neural Infor- (ToG), 40(3):1–21, 2021. 2 mation Processing Systems, volume 25. Curran As- [4] Y. Alaluf, O. Patashnik, and D. Cohen-Or. Restyle: sociates, Inc., 2012. 3 A Residual-based StyleGAN Encoder via Iterative [17] C. Luo, S. Song, W. Xie, L. Shen, and H. Gunes. Refinement. In Proceedings of the IEEE/CVF In- Learning Multi-dimensional Edge Feature-based au ternational Conference on Computer Vision, pages Relation Graph for Facial Action Unit Recognition. 6711–6720, 2021. 2, 5 arXiv preprint arXiv:2205.01782, 2022. 7 [5] V. Blanz and T. Vetter. A morphable model for the [18] N. Petrželková. Face image editing in latent space synthesis of 3d faces. In Proceedings of the 26th of generative adversarial networks, Prague, 2021. annual conference on Computer graphics and inter-Bachelor thesis. CTU in Prague, Faculty of Electri-active techniques, pages 187–194, 1999. 2 cal Engineering, Department of Cybernetics. 2 [6] J. S. Chung, A. Nagrani, and A. Zisserman. Vox- [19] A. Radford, L. Metz, and S. Chintala. Unsu- Celeb2: Deep Speaker Recognition. In Proc. Inter-pervised representation learning with deep convo-speech 2018, pages 1086–1090, 2018. 4, 7 lutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 2 [7] A. Creswell and A. A. Bharath. Inverting the Generator of a Generative Adversarial Network. IEEE [20] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, transactions on neural networks and learning sys-Y. Azar, S. Shapiro, and D. Cohen-Or. Encoding tems, 30(7):1967–1974, 2018. 2 in Style: a StyleGAN Encoder for Image-to-Image Translation. In Proceedings of the IEEE/CVF con- [8] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arc-ference on computer vision and pattern recognition, face: Additive Angular Margin Loss for Deep Face pages 2287–2296, 2021. 2, 3, 5 Recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, [21] D. Roich, R. Mokady, A. H. Bermano, and pages 4690–4699, 2019. 4, 7 D. Cohen-Or. 
Pivotal Tuning for Latent-based Editing of Real Images. ACM Transactions on Graphics [9] N. Drobyshev, J. Chelishev, T. Khakhulin, (TOG), 42(1):1–13, 2022. 2 A. Ivakhnenko, V. Lempitsky, and E. Zakharov. [22] N. Ruiz, E. Chong, and J. M. Rehg. Fine-grained Megaportraits: One-shot megapixel neural head head pose estimation without keypoints. In Proceed-avatars. arXiv preprint arXiv:2207.07621, 2022. 2, ings of the IEEE conference on computer vision and 4, 8 pattern recognition workshops, pages 2074–2083, [10] P. Ekman and W. V. Friesen. Facial action coding 2018. 7 system. Environmental Psychology & Nonverbal Be- [23] Y. Shen, J. Gu, X. Tang, and B. Zhou. Interpreting havior, 1978. 7 the Latent Space of GANs for Semantic Face Edit- [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, ing. In Proceedings of the IEEE/CVF conference D. Warde-Farley, S. Ozair, A. Courville, and Y. Ben-on computer vision and pattern recognition, pages gio. Generative adversarial networks. In Advances in 9243–9252, 2020. 2 Neural Information Processing Systems, volume 27, [24] T. Soukupova and J. Cech. Eye blink detection us-pages 2672–2680. Curran Associates, Inc., 2014. 1 ing facial landmarks. In 21st computer vision winter [12] E. Härkönen, A. Hertzmann, J. Lehtinen, and workshop, Rimske Toplice, Slovenia, page 2, 2016. S. Paris. Ganspace: Discovering interpretable gan 7 controls. Advances in Neural Information Process- [25] A. Subrtova, J. Cech, and V. Franc. Hairstyle trans-ing Systems, 33:9841–9850, 2020. 2 fer between face images. 2021 16th IEEE Interna-16 tional Conference on Automatic Face and Gesture [38] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Recognition (FG 2021), pages 1–8, 2021. 3 Efros. Generative visual manipulation on the natural [26] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, image manifold. In Computer Vision–ECCV 2016: M. Stamminger, and C. Theobalt. Real-time Expres-14th European Conference, Amsterdam, The Nethersion Transfer for Facial Reenactment. ACM Trans. lands, October 11-14, 2016, Proceedings, Part V 14, Graph., 34(6):183–1, 2015. 2 pages 597–613. Springer, 2016. 2 [27] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, [39] P. Zhu, R. Abdal, Y. Qin, J. Femiani, and P. Wonka. and M. Nießner. Face2Face: Real-time Face Capture Improved StyleGAN Embedding: Where are the and Reenactment of RGB Videos. In Proceedings of Good Latents? arXiv preprint arXiv:2012.09036, the IEEE conference on computer vision and pattern 2020. 2 recognition, pages 2387–2395, 2016. 2 [28] O. Tov, Y. Alaluf, Y. Nitzan, O. Patashnik, and D. Cohen-Or. Designing an Encoder for StyleGAN Image Manipulation. ACM Transactions on Graph- ics (TOG), 40(4):1–14, 2021. 2, 3, 5 [29] R. Tzaban, R. Mokady, R. Gal, A. Bermano, and D. Cohen-Or. Stitch it in Time: GAN-based Facial Editing of Real Videos. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 2 [30] A. Šubrtová, D. Futschik, J. Čech, M. Lukáč, E. Shechtman, and D. S´ykora. ChunkyGAN: Real image inversion via segments. In Proceedings of European Conference on Computer Vision, pages 189– 204, 2022. 2 [31] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large Margin Cosine Loss For Deep Face Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018. 4 [32] T.-C. Wang, A. Mallya, and M.-Y. Liu. One-shot Free-view Neural Talking-head Synthesis for Video Conferencing. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021. 2 [33] O. Wiles, A. Koepke, and A. Zisserman. X2face: A Network For Controlling Face Generation Using Images, Audio, and Pose Codes. In Proceedings of the European conference on computer vision (ECCV), pages 670–686, 2018. 2 [34] L. Wright and N. Demeure. Ranger21: a synergistic deep learning optimizer. CoRR, abs/2106.13731, 2021. 5 [35] C. Yang and S.-N. Lim. Unconstrained facial expression transfer using style-based generator. arXiv preprint arXiv:1912.06253, 2019. 3 [36] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9459–9468, 2019. 2 [37] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 3 17 27th Computer Vision Winter Workshop Terme Olimia, Slovenia, February 14–16, 2024 Dense Matchers for Dense Tracking Tomáš Jel´ınek, Jonáš Šer´ych, Jiř´ı Matas CMP Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague {tomas.jelinek,serycjon,matas}@fel.cvut.cz Abstract. Optical flow is a useful input for various computes optical flow not only for consecutive applications, including 3D reconstruction, pose es-frames but also for pairs of more temporally distant timation, tracking, and structure-from-motion. De-frames, including flow computation between the ref-spite its utility, the problem of dense long-term trackerence and every other frame of the video. At eving, especially over wide baselines, has not been ery frame, optical flow is computed w.r.t. the pre-extensively explored. This paper extends the convious, first, and a constant number of logarithmically cept of combining multiple optical flows over log-spaced frames. Such approach is linear in the number arithmically spaced intervals as proposed by MFT. of frames and thus not computationally prohibitive. We demonstrate the compatibility of MFT with two In the original MFT[38], all optic flow computa-dense matchers, DKM and RoMa. Their incorpora- tions are based on RAFT [50], which has performed tion into the MFT framework optical flow networks well in both standard benchmarks [2, 36] and in ap-yields results that surpass their individual perfor-plications. However, the RAFT optical flow network mance. Moreover, we present simple yet effective en-was trained on pairs of consecutive frames, which is sembling strategies that prove to be competitive with likely sub-optimal for large baselines. more sophisticated, non-causal methods in terms of Recently, dense matchers such as DKM [14] and position prediction accuracy, highlighting the poten-RoMa [15] have been published. This development tial of MFT in long-term tracking applications. opens the possibility to apply the MFT framework with different dense matchers, or to use RAFT for 1. Introduction pairs of frames with short temporal, and thus probably spatial, baseline. The only requirement of the Obtaining point-to-point correspondences is a MFT “meta optic flow algorithm” is that the basis classical task in computer vision, useful for a wide dense two view optic flow or matcher provides con-range of applications including tracking, structure-fidence in its predictions. from-motion, and localization. 
Despite the extensive In this paper, we evaluate the MFT approach with research in wide baseline stereo methods, including the DKM and RoMa matchers instead of RAFT. We those with a time baseline, the domain of dense point show that both of these matchers provide accurate correspondences in videos has not been explored un-matches, but inaccurate occlusion predictions. Ad-til recently [38, 53]. The emergence of the TAP-Vid dressing the strengths and weaknesses of optical-dataset [11] has further fueled interest in long-term flow-based and dense-matching-based methods, we point-tracking methods. propose a combined tracker, that outperforms the Point-trackers usually [12, 27, 45, 11] track sparse original MFT design. sets of points. However, dense correspondences are In summary, our contributions are: (1) We show useful in various applications, such as video editing, how to adapt dense matchers DKM and RoMa for object tracking, and 3D reconstruction. While optical use in the MFT framework, and experimentally eval-flow techniques provide dense correspondences, they uate their performance. (2) We show that the MFT are typically limited to pairs of consecutive frames. algorithm outperforms both direct flow between the Long-term dense tracking has been recently ad-first and the current frame, and the chaining of opti-dressed by Neoral et al. [38] MFT tracker, which cal flows computed on consecutive frames for RAFT, 18 DKM, and RoMa. (3) Based on better results of limitations of concatenation-based approaches for RoMa over DKM in our experiments, we propose a long-term dense point tracking. These algorithms dense long-term tracker that combines the strengths create extended dense point tracks by merging op-of RAFT-based MFT and RoMa-based MFT. tical flow estimates across variable time steps, effectively managing temporarily occluded points by 2. Related Work bypassing them until their re-emergence. However, Tracking, 3D Reconstruction, and SLAM Object their dependence on the brightness constancy as-tracking algorithms [1, 26, 10] traditionally outputted sumption renders them less effective over distant the track of an object specified in the first frame in frames. Subsequent works in multi-step-flow, such the form of bounding boxes. Later, the focus shifted as the multi-step integration and statistical selection towards segmentation-based tracking [29, 41, 34]. (MISS) approach by Conze et al. [4, 5], further re-Modern model-free trackers based on differen- fine this process. This approach relies on generating tiable rendering [54, 43], that can simultaneously a multitude of candidate motion paths from random track and reconstruct any object specified in the first reference frames, with the best path selected through frame are naturally able to provide point-to-point a global spatial smoothness optimization process. correspondences for the tracked object; however, to However, this strategy makes these methods compu-the best of our knowledge, they can track a sintationally demanding. Although certain optical flow gle object only or require multi-camera input [33]. techniques [24, 39, 22, 59, 31, 58] address occlusions Additionally, recent methods [56, 57, 55], involv-and flow uncertainty, most leading optical flow mething differentiable rendering of neural radiance fields ods, influenced by standard benchmarks like those in (NeRFs) [37], show potential in creating deformable Butler et al. [3] and Menze et al. [35], do not detect 3D models for point tracking. Nonetheless, the occlusions. Jiang et al. 
[25], building on RAFT [50], extensive computational demands of these methods has taken a different approach in which they handle limit their practical applicability in real-world sce-occlusion implicitly by computing hidden motions of narios. the occluded objects. However, the method still falls The traditional SLAM methods [46] produced short in the context of tracking dynamic, complex sparse point clouds. Later on, semi-dense [16, motions. 51] SLAM methods appeared. Some SLAM-based We now describe in greater depth three methods trackers, [17] can densely estimate point positions that are most relevant to our paper: RAFT [38], in static scenes, and recent advances in differen-DKM [14], and RoMa [15]. While the latter two are tiable rendering opened the avenue for differentiable-in fact dense matchers, we will use the term inter-rendering-based monocular SLAMs [42] but their changeably with long-ranged optical flow estimation application remains constrained to static scenes. with occlusion prediction. Optical Flow estimation is a classical problem in MFT extends optical flow into dense long-term tra-computer vision, with the early works [32, 20] re-jectories by constructing multiple chains of optical lying on the brightness-constancy assumption. With flows and selecting the most reliable one [38]. The the advent of deep neural networks, the focus shifted flow chains consist of optical flow computed both be-towards learning-based approaches [13, 49, 23, 50, tween consecutive frames, and between more distant 21] trained on synthetic data. frames, which allows for re-detecting points after oc-Optical flow estimation in state-of-the-art clusions. The intervals between distant frames are methods, exemplified by RAFT [50] and Flow-chosen to be logarithmically spaced. Former [21], is achieved through the analysis of a MFT extends the RAFT optical flow method with 4D correlation cost volume, considering features of two heads, estimating occlusion and uncertainty for all pixel-pairs. These techniques excel in densely each flow vector. Like the optical flow, the uncer-estimating flow between consecutive frames, yet tainty and the occlusion are accumulated over each they encounter challenges in accurately determining chain, and the non-occluded flow chain with the least flow across distant frames, particularly in scenar-overall uncertainty is selected as the most reliable ios with large displacements or significant object candidate. The long-term tracks of different points deformation. thus chain possibly different sequences of optical Multi-step-flow algorithms [7, 6, 8] address the flows. This strategy on one hand takes into account 19 that changes in appearance and viewpoint gradually ness within fixed-sized temporal windows, making accumulate over time, which makes it more reli-it unable to re-detect the target after longer occlu-able to chain flows on easier-to-match frames rather sions. TAPIR [12] combines TAP-Net’s track ini-than estimating matches directly between the tem-tialization with PIPs’ refinement while removing the plate and the current frame. On the other hand, short PIPs’ temporal chunking, using a time-wise convo-chains containing longer jumps with low uncertainty lution instead. CoTracker [27] models the temporal result in less error accumulation. correlation of different points via a sliding-window transformer, modeling multiple tracks’ interactions. DKM proposed by Edstedt et al. 
[14], a dense point-While these methods are designed for sparse track-matching method, employing a ResNet [19]-based ing, they can provide dense tracks by querying all encoder pre-trained on ImageNet-1K [44] for gen-points in the first frame. erating both fine and coarse features. The coarse Notably, differentiable rendering has been lever-features undergo sparse global matching, modeled aged in recent approaches, with OmniMotion repreas Gaussian process regression, to determine embed-senting 3D points’ motion implicitly using learned ded target coordinates and certainty estimates. Fine bijections [53] enabling it to provide dense tracks. features are refined using CNN refiners, following a Alternative methods like [33] which models the methodology similar to Truong et al. [52] and Shen et scene as temporally-parametrized Gaussians[28]. al. [47]. DKM’s match certainty estimation relies on However, these methods have their limitations, such depth consistency, necessitating 3D supervision. The as OmniMotion’s quadratic complexity and the process concludes by filtering matches below a cer-multi-camera requirement of [33]. tainty threshold of 0.05 weighted sampling for match selection. Edstedt et al. [14] released outdoor and in-3. Method door models trained on MegaDepth [30]) and ScanNet [9] respectively. For a stream {I1, ..., IN} of N video frames defined on a common image domain Ω, we denote RoMa similarly to DKM, RoMa [15] is a dense the optical flow between frames i and j as F(i,j). matching method that provides pixel displacement Moreover, we use σ(i, j) ∈ RΩ to denote the esti- + vectors along with their estimated certainty, building mated flow variance, and ρ(i, j) ∈ [0, 1]Ω to repre-upon the foundation set by DKM [14]. RoMa differsent the estimated certainty of F(i,j). Finally, oc-entiates itself by employing a two-pronged approach clusion score o(i, j) ∈ [0, 1]Ω denotes the estimated for feature extraction: using frozen DINOv2 [40] probability of pixels appearing in frame i being oc-for sparse features and a specialized ConvNet with cluded in frame j. To simplify notation, although a VGG19 backbone [48] for finer details. Unique F(i, j), ρ(i, j), and σ(i, j) are 2D or 3D tensors, we will to RoMa is their transformer-based match decoder, use these symbols to denote their values at a specific which matches features through a regression-by-point p = (x, y) in the image. Moreover, for every classification approach, better handling the multi-point p in frame in frame i i, its predicted position pj modal nature of coarse feature matching. In con-j relates to the optical flow F(i, j) as follows: trast to DKM, RoMa’s pipeline omits the use of dense depth maps for match certainty supervision, pj = pi + F(i, j)(pi). (1) relying instead on pixel displacements for match su-Let us denote by ϕRAFT, ϕDKM, ϕRoMa, ϕMFT the pervision. Their model is trained on datasets like functions computed by RAFT, DKM, RoMa, and MegaDepth [30] and ScanNet [9], similar to DKM. MFT respectively. By RAFT we mean the MFT’s adaptation of RAFT with additional uncertainty and Long-Term Point Tracking aiming to track a set of occlusion heads [38]. The output vectors of these physical points in a video has emerged significantly methods are as follows: since the release of TAP-Vid [11]. The dataset’s baseline method TAP-Net [11] computes a cost vol- ϕRAFT = (F(i, j), σ(i, j), ρ(i, j)) (2) ume for each frame, employing a technique akin ϕW = (F(i, j), ρ(i, j)) (3) to RAFT’s approach [50]. 
It focuses on tracking individual query points. PIPs [18] takes this ap- ϕMFT = (F(i, j), σ(i, j), o(i, j)), (4) proach to an extreme by completely trading off spa-where W is one of the wide-baseline methods, either tial awareness about other points for temporal aware-DKM or RoMa. 20 3.1. MFT Flow Chaining F(j−∆K, j), p1 s(j−∆K , j) MFT [38] achieves long-term optical flow esti- mation by combining multiple optical flows. These (1, j−∆ . K−1) . flows are obtained from ϕ FMFT . RAFT over logarithmi- cally spaced distances. When estimating the flow F(j−∆K−1, j), F(1, j), MFT utilizes a sequence of intermediate pj−∆K−1 F(1, j−∆1) s(j−∆K−1, j) M F T flows. This sequence, denoted as S, comprises flows F(j−∆ . 1, j), ..., F(j−∆K, j). Here, ∆i represents log- .. arithmic spacing and is defined as 2i−1 for i < K, F(j−∆1, j), with s(j−∆1, j) ∆K = j − 1. We limit the number of interme- p diate flows, denoted by j−∆1 K, to a maximum of 5 and ensure that j − ∆K−1 > 1. . Additionally, MFT employs a scoring function .. for evaluating the quality of the intermediate flows chaining for each image point P = {j − ∆ p in the reference k | 1 ≤ k ≤ K} p 1 j frame i 1. The scoring function s(j−∆k, j) utilizes M = arg maxi∈P s(i, j)(p1) chaining of estimated flow variances and occlusion scores over an intermediate frame i: F(iM , j) σ(1, i, j)(p Figure 1: Illustration of the MFT flow chaining as 1) = σ(1, i) (p M F T 1) + σ(i, j)(pi), (5) defined in Equation 7. The optical flows and scoring o(1, i, j)(p1) = max{o(1, i) MFT (p1), o(i, j)(pi)}, (6) functions are evaluated on the points in the outbound The point p is computed using and the rela- nodes of their respective arcs. i F(1, i) M F T tion in Equation 1. The scoring function is then defined as s(j−∆k, j)(p1) = −σ(1,j−∆k,j)(p1). If the chained occlusion score o(1, j−∆ 3.2. Integration of DKM and RoMa k , j)(p1) exceeds an occlusion threshold θo, we set s(j−∆k, j)(p1) = −∞. As we mentioned in the Introduction, we make the This score is used to select the best flow for every conjecture that training RAFT for optical flow pre-point p , that is the flow with the lowest estimated 1 diction on consecutive video frames is suboptimal variance computed on chains that do not contain oc-for wide baselines. We therefore integrate DKM and cluded points. RoMa, capable of handling wider baselines. How- MFT computes long-term flow for any point p in 1 ever, integrating these methods with MFT poses cer-the reference frame 1 iteratively via chaining as tain challenges due to their incompatible outputs. F(1, j) (p (p ), (7) In the first place, neither RoMa nor DKM provides M F T 1) = F (1, iM ) M F T 1) + F (iM , j)(piM an occlusion score o, but only an estimate of the flow where iM ∈ {j − ∆k | 1 ≤ k ≤ K} such prediction certainty ρ. We therefore artificially set that the score s(iM, j)(p1) is maximal. Again, the their occlusion scores as o = 1 − ρ. Furthermore, point p is obtained using i F(1, iM)(p M M F T 1) and Equa- although σ and ρ both represent the quality of esti-tion 1. F(iM,j) is the flow obtained from an arbitrary mated optical flow, they are not directly comparable. method that can also estimate its variance σ(iM, j) But in order to integrate them into the MFT frame-and occlusion score o(iM, j). The flow chaining is work, we need to converse between them. visualized in Figure 1. Through empirical analysis, we established a flow The estimated variance and occlusion score for certainty threshold θ frame ρ. 
When ρ exceeds this thresh- j are then obtained from the chain over old, we deem the optical flow reliable, assigning frame iM as σ(1, j) (p M F T 1) = σ(1, iM , j)(p1), respec- σ = 0. Conversely, when ρ is below this thresh- tively o(1, j) MFT (p1) = o(1, iM , j)(p1). A pixel observed old, σ is set to 1000, correlating higher uncertainties in frame i is considered occluded in frame j if its with increased variances in predicted flow. Addition-value o(i, j) MFT is above a threshold θo . In practice, we ally, we observed that while oMFT, oDKM, and oRoMa set different thresholds for different backbone net-fundamentally represent the same concept, their re-works as we discuss in Subsection 3.2. spective occlusion thresholds θo and θ vary. RAFT oRoMa 21 In our experiments in Section 4, we use Accuracy (OA) evaluates the accuracy of classifying the points as occluded. We measure the quality of θo = 0.02, θ = θ = 0.95. (8) RAFT oDKM oRoMa the predicted positions, using average displacement error, denoted as <δx . This metric calculates the For a visual comparison between the original MFT avg fraction of visible points with a positional error be-and the integration of RoMa into MFT, see Figure 2. low specific thresholds, averaged over thresholds of 3.3. Ensembling 1, 2, 4, 8, and 16 pixels. These accuracies for individual thresholds are denoted as < i with i representing We observed that, in terms of occlusion predic-the threshold. Additionally, the Average Jaccard (AJ) tion, MFT’s modification of RAFT achieves higher as defined in [11] index is used to collectively assess accuracy compared to RoMa. Conversely, RoMa ex-both occlusion and position accuracy. hibits better performance in optical flow prediction relative to RAFT. Based on these findings, we de-4.1. MFT Chaining veloped an integrated approach that combines the A key aspect of our analysis involves contrasting strengths of both methods. Specifically, our method the performance of RAFT, DKM, and RoMa within utilizes occlusion data from RAFT, while RoMa is the MFT framework against direct optical flow pre-employed for position prediction, with both prodiction with the first frame serving as a reference, cesses executed in parallel within the MFT frame-and chaining of the optical flows computed on con-work. As detailed in Section 4, our most effec- secutive video frames. The results presented in Tative strategy involves employing RAFT for occlusion ble 1 clearly show that for each base method (RAFT, score prediction and RoMa for position prediction, DKM, RoMa), the MFT strategy consistently outper-provided the point is not predicted as occluded; in forms the other strategies in all metrics by a large cases of occlusion, RAFT’s predictions are preferred. margin. These results underscore the effectiveness of MFT in handling complex motion trajectories over 4. Experiments extended periods, surpassing the limitations of di-In this section, we evaluate our proposed method. rect prediction and simple chaining methods. A key Initially, we compare the MFT framework with di-observation exemplified in Figure 2 is that RoMa is rect optical flow prediction and simple optical flow substantially less prone to predict mismatches in the chaining. Subsequently, we explore RoMa’s optical background than RAFT. 
flow prediction performance within the MFT frame-The results in Table 1 also show that RoMa within work depending on whether it predicts the point as the MFT paradigm achieves arguably the best results occluded or non-occluded, which serves as a founda-in position prediction, while RAFT outperforms all tional finding for our most effective ensembling strat-other methods in the occlusion classification accu-egy. The final part of our experimentation serves as racy. This finding serves as a foundation for our en-a comparison of different ensembling strategies, jus-semble strategies in Subsec. 3.3. Due to the consis-tifying the design of our most effective architecture, tently better performance of RoMa over DKM in the and comparing it to other tracking methods. evaluation benchmark in all, average Jaccard, average displacement error, and occlusion accuracy we from now on focus our experiments on RoMa even if Evaluation setup Our experiments were con-DKM runs slightly faster. ducted on all 30 tracks of the TAP-Vid-DAVIS dataset [11] with a resolution of 512×512 using the 4.2. RoMa Visibility first evaluation mode. This approach aligns with the While RoMa demonstrates high accuracy in posi-methodology described in MFT [38]. It is important tion prediction, its capability in occlusion detection is to stress that in the dataset, the tracks are annotated relatively limited in comparison to RAFT. However, only sparsely with more focus on the foreground ob-the quality of occlusion prediction is vital for scoring jects rather than the static background. the optical flows as described in Subsec. 3.1, and thus for computing new flows. We hence conjecture that Evaluation metrics In assessing the performance if we only use the RoMa’s optical flow predictions of our approach, we employ three key metrics as de-that are predicted as not occluded, we can achieve fined by the TAP-Vid benchmark. The Occlusion even better tracking results. The results, as shown 22 (a) Reference frame (b) RAFT-based MFT Strategy. (c) RoMa-based MFT Strategy. (d) DKM-based MFT Strategy. (e) Direct matching between frames #0 and #140 using (f) Direct matching between frames #0 and #140 using RAFT. RoMa. (g) Combined RAFT and RoMa strategy. (h) Selective RoMa position prediction. Figure 2: Visual comparison of selected dense tracking methods: (a) reference frame #0; (b)-(h) predicted positions of points in frame #140. All blue points are invisible in frame #140; blue points in (b)-(h) thus indicate false matches. Green points are visible both in frame #0 and frame #140. Red points highlight the points on the body of the lioness. Different shades are used to identify different points. The sequence is available at https://cmp.felk.cvut.cz/˜serycjon/MFT/visuals/ugsJtsO9w1A-00.00. 24.457-00.00.29.462_HD.mp4. 23 main metrics base strategy AJ <δx OA avg < 1 < 2 < 4 < 8 < 16 direct 38.4 50.8 65.6 29.0 44.1 54.6 60.4 65.7 RAFT chain 38.7 55.0 69.5 25.2 43.8 59.4 70.4 76.3 MFT 47.4 67.1 77.7 34.0 57.3 74.3 82.8 86.9 chain 27.3 63.5 48.2 36.4 56.2 69.4 76.0 79.6 DKM direct 34.0 60.7 52.8 37.0 54.5 65.3 70.9 76.0 MFT 47.8 72.0 70.2 43.0 65.8 79.0 84.5 87.8 direct 37.7 63.7 57.6 37.5 55.9 67.8 75.5 81.5 RoMa chain 40.3 63.1 60.7 36.8 55.3 68.1 75.5 79.8 MFT 48.8 72.7 71.7 43.0 65.5 79.2 85.5 90.1 Table 1: TAP-Vid DAVIS evaluation of different optical flow combination strategies. The MFT strategy outperforms both simple chaining and direct matching for all base optical flow methods on all the metrics. 
predicted <δx diction accuracy, highlighting the trade-offs between avg < 1 < 2 < 4 < 8 < 16 occluded 47.4 18.7 32.7 52.0 62.6 71.1 these two aspects. visible 77.2 46.9 70.9 84.5 89.8 93.7 Combined RAFT and RoMa Strategy Our next any 72.7 43.0 65.5 79.2 85.5 90.1 strategy involved a simple combination of RAFT and RoMa: RAFT for occlusion prediction and RoMa for Table 2: TAP-Vid DAVIS evaluation of MFT- position prediction. This hybrid approach resulted RoMa separated by the occlusion prediction. Us-in enhanced performance across all metrics, outper-ing only the points predicted as not occluded leads to forming the aforementioned individual strategies. improved position accuracy on all error thresholds. Selective RoMa Position Prediction However, further refinement was achieved by integrating findings from Subsection 4.2. We found that RoMa’s in Tab. 2, indicate a marked improvement in tracking position predictions are more accurate for points it accuracy when measured only on points predicted as identifies as visible. Therefore, we devised a strategy non-occluded. where MFT-RoMa’s position predictions are used 4.3. Ensembling Strategies only if the points are marked as visible; otherwise, RAFT’s predictions are utilized. This selective stratIn the concluding part of our experimental analy-egy led to improvements in both position predic-sis, we compare various ensembling strategies within tion accuracy and occlusion accuracy. We visually the MFT framework, building on the insights from compare this strategy with other two best-performing the previous sections. The results, detailed in Ta-strategies and MFT with RAFT in Figure 3. ble 3, demonstrate the effectiveness of the ensemble strategy. Comparison with Point Trackers We observe RAFT-based MFT Strategy For comparison we that our approach closely rivals or exceeds the per-show the original MFT strategy, utilizing RAFT for formance of established sparse point tracking meth-both position and occlusion predictions. This ap-ods like CoTracker and TAPIR in the average posi-proach, while achieving the highest occlusion accu-tion accuracy while achieving worse performance in racy among all ensembling strategies tested, exhibits the occlusion prediction accuracy. It is noteworthy suboptimal performance in position precision. that our method attains these results within a strictly causal framework, contrasting with CoTracker and RoMa-based MFT Strategy Substituting RAFT TAPIR, which utilize attention-based temporal re-entirely with RoMa, we observed an improvement in finement strategies. Moreover, it is important to position prediction accuracy. However, this modifi-highlight that, unlike our approach, CoTracker and cation led to a significant decrease in occlusion pre-TAPIR are designed as sparse trackers. 24 MFT base main metrics visibility position occlusion AJ <δx OA precision recall avg (1) RAFT RAFT 47.4 67.1 77.7 78.0 91.5 (2) RoMa RoMa 48.8 72.7 71.7 74.5 85.3 (3) RoMa RAFT 50.2 72.7 77.7 78.0 91.5 (4) RAFT/RoMa RAFT 51.6 73.4 77.7 78.0 91.5 TAPIR 56.2 70.0 86.5 CoTracker 61.0 75.9 89.4 Table 3: TAP-Vid DAVIS evaluation of combinations of two trackers. We run MFT-RAFT and MFT-RoMa independently in parallel, using the two outputs for the final position and occlusion prediction. RAFT-based MFT (1) has good occlusion accuracy (OA), RoMa-based MFT (2) has good position accuracy <δx . Using avg MFT-RAFT to predict occlusion and MFT-RoMa to predict position (3) achieves better AJ. 
The best results (4) are achieved when the position is predicted by MFT-RoMa, but only when it predicts visible (see Tab. 2). 5. Conclusion We have showcased the benefits of employing the MFT framework over direct optical flow computation and optical flow chaining. We have also demonstrated the flexibility of the MFT paradigm which can be readily used together with different optical flow computation methods. Without complex ar- (a) RAFT-based MFT Strategy chitectural modifications and using simple ensemble strategies, we were able to demonstrate position prediction accuracy on the Tap-Vid dataset competing with that of state-of-the-art sparse trackers that utilize non-causal tracking refinement. Limitations and Future Work Our current ap- (b) RoMa-based MFT Strategy proach does not take into account the speed of the baseline optical flow networks. The main limitation is the need for two optical flow networks to operate concurrently within the ensemble strategy. Exploring co-training strategies that enable a single network to deliver similar performance could be a viable solution. A key task is to bridge the existing gap in occlusion prediction accuracy between our method and the state-of-the-art. We also put forward the need (c) Combined RAFT and RoMa Strategy for new datasets featuring dense annotations of point Figure 3: Images show the first frames of two se-tracks in both the foreground and background. lected TAP-Vid DAVIS sequences. Dots represent ground-truth tracking points, with shades of green showing the improvement in <δx achieved by the avg Acknowledgments This work was supported by Selective RoMa Position Prediction ensemble over Toyota Motor Europe and by the Grant Agency methods (a)-(c), shades of red show the converse. of the Czech Technical University in Prague, grant No.SGS23/173/OHK3/3T/13. 25 References [12] C. Doersch, Y. Yang, M. Vecerik, D. Gokay, A. Gupta, Y. Aytar, J. Carreira, and A. Zisserman. [1] D. S. Bolme, J. R. Beveridge, B. A. Draper, and TAPIR: Tracking any point with per-frame initializa-Y. M. Lui. Visual object tracking using adaptive cor-tion and temporal refinement. In Proceedings of the relation filters. In 2010 IEEE computer society con-IEEE/CVF International Conference on Computer ference on computer vision and pattern recognition, Vision (ICCV), pages 10061–10072, October 2023. pages 2544–2550. IEEE, 2010. 2 1, 3 [2] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. [13] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, A naturalistic open source movie for optical flow C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cre-evaluation. In A. Fitzgibbon et al. (Eds.), editor, Eu-mers, and T. Brox. FlowNet: Learning optical flow ropean Conf. on Computer Vision (ECCV), Part IV, with convolutional networks. In Proceedings of the LNCS 7577, pages 611–625. Springer-Verlag, Oct. IEEE international conference on computer vision, 2012. 1 pages 2758–2766, 2015. 2 [3] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. [14] J. Edstedt, I. Athanasiadis, M. Wadenbäck, and Black. A naturalistic open source movie for op-M. Felsberg. DKM: Dense kernelized feature match-tical flow evaluation. In Computer Vision–ECCV ing for geometry estimation. In Proceedings of the 2012: 12th European Conference on Computer Vi-IEEE/CVF Conference on Computer Vision and Pat- sion, Florence, Italy, October 7-13, 2012, Proceed-tern Recognition, pages 17765–17775, 2023. 1, 2, 3 ings, Part VI 12, pages 611–625. Springer, 2012. 2 [15] J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, [4] P.-H. 
Conze, P. Robert, T. Crivelli, and L. Morin. and M. Felsberg. RoMa: Revisiting robust Dense long-term motion estimation via statistical losses for dense feature matching. arXiv preprint multi-step flow. In 2014 International Conference arXiv:2305.15404, 2023. 1, 2, 3 on Computer Vision Theory and Applications (VIS- [16] J. Engel, J. Sturm, and D. Cremers. Semi-dense vi-APP), volume 3, pages 545–554. IEEE, 2014. 2 sual odometry for a monocular camera. In Proceed- [5] P.-H. Conze, P. Robert, T. Crivelli, and L. Morin. ings of the IEEE international conference on com-Multi-reference combinatorial strategy towards puter vision, pages 1449–1456, 2013. 2 longer long-term dense motion estimation. Com- [17] M. Gladkova, N. Korobov, N. Demmel, A. Ošep, puter Vision and Image Understanding, 150:66–80, L. Leal-Taixé, and D. Cremers. Directtracker: 2016. 2 3d multi-object tracking using direct image align- [6] T. Crivelli, P.-H. Conze, P. Robert, M. Fradet, and ment and photometric bundle adjustment. In 2022 P. Pérez. Multi-step flow fusion: Towards accurate IEEE/RSJ International Conference on Intelligent and dense correspondences in long video shots. In Robots and Systems (IROS), pages 3777–3784. British Machine Vision Conference, 2012. 2 IEEE, 2022. 2 [7] T. Crivelli, P.-H. Conze, P. Robert, and P. Pérez. [18] A. W. Harley, Z. Fang, and K. Fragkiadaki. Particle From optical flow to dense long term correspon-video revisited: Tracking through occlusions using dences. In 2012 19th IEEE International Conference point trajectories. In Computer Vision–ECCV 2022: on Image Processing, pages 61–64. IEEE, 2012. 2 17th European Conference, Tel Aviv, Israel, October [8] T. Crivelli, M. Fradet, P.-H. Conze, P. Robert, and 23–27, 2022, Proceedings, Part XXII, pages 59–75. P. Pérez. Robust optical flow integration. IEEE Springer, 2022. 3 Transactions on Image Processing, 24(1):484–498, [19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual 2014. 2 learning for image recognition. In Proceedings of [9] A. Dai, A. X. Chang, M. Savva, M. Halber, the IEEE conference on computer vision and pattern T. Funkhouser, and M. Nießner. Scannet: Richly-recognition, pages 770–778, 2016. 3 annotated 3d reconstructions of indoor scenes. In [20] B. K. Horn and B. G. Schunck. Determining optical Proc. Computer Vision and Pattern Recognition flow. Artificial Intelligence, 17(1):185–203, 1981. 2 (CVPR), IEEE, 2017. 3 [21] Z. Huang, X. Shi, C. Zhang, Q. Wang, K. C. Che- [10] M. Danelljan, G. Bhat, F. S. Khan, and M. Fels-ung, H. Qin, J. Dai, and H. Li. Flowformer: A trans-berg. Atom: Accurate tracking by overlap maxi- former architecture for optical flow. arXiv preprint mization. In Proceedings of the IEEE Conference arXiv:2203.16194, 2022. 2 on Computer Vision and Pattern Recognition, pages [22] J. Hur and S. Roth. Iterative residual refinement 4660–4669, 2019. 2 for joint optical flow and occlusion estimation. In [11] C. Doersch, A. Gupta, L. Markeeva, A. R. Conti-Proceedings of the IEEE Conference on Computer nente, L. Smaira, Y. Aytar, J. Carreira, A. Zisserman, Vision and Pattern Recognition, pages 5754–5763, and Y. Yang. TAP-Vid: A benchmark for tracking 2019. 2 any point in a video. Advances in Neural Informa- [23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovit-tion Processing Systems, 2022. 1, 3, 5 skiy, and T. Brox. Flownet 2.0: Evolution of optical 26 flow estimation with deep networks. In 2017 IEEE [35] M. Menze and A. Geiger. Object scene flow for au-Conference on Computer Vision and Pattern Recog-tonomous vehicles. 
In Proceedings of the IEEE con-nition (CVPR), pages 1647–1655, 2017. 2 ference on computer vision and pattern recognition, [24] E. Ilg, T. Saikia, M. Keuper, and T. Brox. Occlu-pages 3061–3070, 2015. 2 sions, motion and depth boundaries with a generic [36] M. Menze, C. Heipke, and A. Geiger. Object scene network for disparity, optical flow or scene flow es-flow. ISPRS Journal of Photogrammetry and Remote timation. In Proceedings of the European conference Sensing (JPRS), 2018. 1 on computer vision (ECCV), pages 614–630, 2018. [37] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Bar-2 ron, R. Ramamoorthi, and R. Ng. NeRF: Represent- [25] S. Jiang, D. Campbell, Y. Lu, H. Li, and R. Hart-ing scenes as neural radiance fields for view syn-ley. Learning to estimate hidden motions with thesis. Communications of the ACM, 65(1):99–106, global motion aggregation. In Proceedings of the 2021. 2 IEEE/CVF International Conference on Computer [38] M. Neoral, J. Šer`ych, and J. Matas. MFT: Long-Vision, pages 9772–9781, 2021. 2 term tracking of every pixel. In Proceedings of [26] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-the IEEE/CVF Winter Conference on Applications learning-detection. IEEE transactions on pat- of Computer Vision, pages 6837–6847, 2024. 1, 2, tern analysis and machine intelligence, 34(7):1409– 3, 4, 5 1422, 2011. 2 [39] M. Neoral, J. Šochman, and J. Matas. Continual oc- [27] N. Karaev, I. Rocco, B. Graham, N. Neverova, clusion and optical flow estimation. In Asian Confer-A. Vedaldi, and C. Rupprecht. CoTracker: It is better ence on Computer Vision, pages 159–174. Springer, to track together. arXiv preprint arXiv:2307.07635, 2018. 2 2023. 1, 3 [40] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, [28] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Dret-M. Szafraniec, V. Khalidov, P. Fernandez, D. Haz-takis. 3d gaussian splatting for real-time radiance iza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, field rendering. ACM Transactions on Graphics, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rab- 42(4), July 2023. 3 bat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bo- [29] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, janowski. Dinov2: Learning robust visual features R. Pflugfelder, J.-K. Kämäräinen, M. Danelljan, without supervision, 2023. 3 L. Č. Zajc, A. Lukežič, O. Drbohlav, et al. The [41] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, eighth visual object tracking vot2020 challenge re-M. Gross, and A. Sorkine-Hornung. A benchmark sults. In European Conference on Computer Vision, dataset and evaluation methodology for video ob-pages 547–601. Springer, 2020. 2 ject segmentation. In Computer Vision and Pattern [30] Z. Li and N. Snavely. Megadepth: Learning single-Recognition, 2016. 2 view depth prediction from internet photos. In Com- [42] A. Rosinol, J. J. Leonard, and L. Carlone. Nerf-slam: puter Vision and Pattern Recognition (CVPR), 2018. Real-time dense monocular slam with neural radi-3 ance fields. In 2023 IEEE/RSJ International Con- [31] S. Liu, K. Luo, N. Ye, C. Wang, J. Wang, and ference on Intelligent Robots and Systems (IROS), B. Zeng. Oiflow: Occlusion-inpainting optical flow pages 3437–3444. IEEE, 2023. 2 estimation by unsupervised learning. IEEE Trans- [43] D. Rozumnyi, J. Matas, M. Pollefeys, V. Ferrari, and actions on Image Processing, 30:6420–6433, 2021. M. R. Oswald. Tracking by 3d model estimation of 2 unknown objects in videos. In Proceedings of the [32] B. D. Lucas and T. Kanade. 
An iterative image reg-IEEE/CVF International Conference on Computer istration technique with an application to stereo vi-Vision, pages 14086–14096, 2023. 2 sion. In Proceedings of the 7th international joint [44] O. Russakovsky, J. Deng, H. Su, J. Krause, conference on Artificial intelligence-Volume 2, pages S. Satheesh, S. Ma, Z. Huang, A. Karpathy, 674–679, 1981. 2 A. Khosla, M. Bernstein, et al. Imagenet large scale [33] J. Luiten, G. Kopanas, B. Leibe, and D. Ra-visual recognition challenge. International journal manan. Dynamic 3D gaussians: Tracking by per-of computer vision, 115(3):211–252, 2015. 3 sistent dynamic view synthesis. arXiv preprint [45] P. Sand and S. Teller. Particle video: Long-range arXiv:2308.09713, 2023. 2, 3 motion estimation using point trajectories. Interna- [34] A. Lukezic, J. Matas, and M. Kristan. D3S – a dis-tional Journal of Computer Vision, 80:72–91, 2008. criminative single shot segmentation tracker. In Pro-1 ceedings of the IEEE/CVF Conference on Computer [46] J. L. Schönberger and J.-M. Frahm. Structure-from-Vision and Pattern Recognition, pages 7133–7142, Motion Revisited. In Conference on Computer Vi-2020. 2 sion and Pattern Recognition (CVPR), 2016. 2 27 [47] X. Shen, F. Darmon, A. A. Efros, and M. Aubry. with learnable occlusion mask. In Proceedings of Ransac-flow: generic two-stage image alignment. the IEEE/CVF conference on computer vision and In Computer Vision–ECCV 2020: 16th European pattern recognition, pages 6278–6287, 2020. 2 Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 618–637. Springer, 2020. 3 [48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 3 [49] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8934–8943, 2018. 2 [50] Z. Teed and J. Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision, pages 402–419. Springer, 2020. 1, 2, 3 [51] Z. Teed and J. Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021. 2 [52] P. Truong, M. Danelljan, and R. Timofte. Glu-net: Global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6258–6268, 2020. 3 [53] Q. Wang, Y.-Y. Chang, R. Cai, Z. Li, B. Hariharan, A. Holynski, and N. Snavely. Tracking everything everywhere all at once. arXiv:2306.05422, 2023. 1, 3 [54] B. Wen, J. Tremblay, V. Blukis, S. Tyree, T. Muller, A. Evans, D. Fox, J. Kautz, and S. Birchfield. Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects. CVPR, 2023. 2 [55] S. Wu, T. Jakab, C. Rupprecht, and A. Vedaldi. DOVE: Learning deformable 3d objects by watch- ing videos. IJCV, 2023. 2 [56] G. Yang, D. Sun, V. Jampani, D. Vlasic, F. Cole, H. Chang, D. Ramanan, W. T. Freeman, and C. Liu. Lasr: Learning articulated shape reconstruction from a monocular video. In CVPR, 2021. 2 [57] G. Yang, M. Vo, N. Neverova, D. Ramanan, A. Vedaldi, and H. Joo. Banmo: Building animat- able 3d neural models from many casual videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2863–2873, June 2022. 2 [58] C. Zhang, C. Feng, Z. Chen, W. Hu, and M. Li. 
Parallel multiscale context-based edge-preserving optical flow estimation with occlusion detection. Signal Processing: Image Communication, 101:116560, 2022. 2 [59] S. Zhao, Y. Sheng, Y. Dong, E. I. Chang, Y. Xu, et al. Maskflownet: Asymmetric feature matching 28 27th Computer Vision Winter Workshop Terme Olimia, Slovenia, February 14–16, 2024 Enhancement of 3D Camera Synthetic Training Data with Noise Models Katar´ına Osvaldová1 Lukáš Gajdošech1 Viktor Kocur1 Martin Madaras1,2 1Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava 2Skeletex Research, Slovakia lukas.gajdosech@fmph.uniba.sk, viktor.kocur@fmph.uniba.sk, madaras@skeletex.xyz Abstract. The goal of this paper is to assess the im-thetic data creation may involve the intentional ad-pact of noise in 3D camera-captured data by moddition of artificial noise. Noise is, however, a com-eling the noise of the imaging process and apply-plex topic. Countless factors influence its behaviour, ing it on synthetic training data. We compiled a from the technology employed by the device, design dataset of specifically constructed scenes to obtain and quality, through environmental variables, such as a noise model. We specifically model lateral noise, ambient light and temperature, to properties of the affecting the position of captured points in the image scene. The presence of some noise can be avoided, plane, and axial noise, affecting the position along and in some cases, the noise can be modelled. the axis perpendicular to the image plane. The esti-Noise has been the topic of numerous studies [1, mated models can be used to emulate noise in syn-5, 9, 17, 18]. Most of them, however, focus on in-thetic training data. The added benefit of adding ar-vestigation of one specific device or principle. Some tificial noise is evaluated in an experiment with ren-works focus on theoretical models of noise. These dered data for object segmentation. We train a se-models serve as a guide for investigation of the noise ries of neural networks with varying levels of noise of specific devices, as the parameters of the devices in the data and measure their ability to generalize on needed for the employment of the models are usually real data. The results show that using too little or not publicly available and are subject to trade secrets. too much noise can hurt the networks’ performance Axial and lateral noise of 3D cameras were chosen indicating that obtaining a model of noise from real for a comprehensive investigation, as the theoreti-scanners is beneficial for synthetic data generation. cal models describing their behaviour rely heavily on knowledge of publicly undisclosed parameters. We have collected a dataset of several thousands scans 1. Introduction from three different devices to fit probabilistic mod-In the past, 3D cameras were rare and expensive. els of noise with respect to the distance of the imaged Nowadays, a plethora of 3D cameras of various qual-objects and angles of their surface. ity and price are commercially available. As is the We also performed an experiment with a segmen-case with any camera, range data captured by these tation neural network trained on synthetic data. We devices suffer from the presence of noise. varied the amount of noise added to the generated The intersection of machine learning and com-data. Evaluation on real scans shows that using too puter vision has emerged as a dynamic field with di-little or too much noise can hurt the network’s perfor-verse applications. 
Notably, the synthesis of these mance. The knowledge of noise parameters of real domains has become increasingly prominent. Ma-devices can thus be beneficial when employing syn-chine learning requires training data, manual creation thetic data for training deep neural networks. of which is not only time-consuming but also expen-2. Related Work sive. The advent of computer graphics has facilitated the cost-effective generation of synthetic data. Various approaches have been explored to en- Nevertheless, synthetic data lacks the inherent noise hance the accuracy and efficiency of 3D scanning present in real 3D camera-captured data, leading to a technologies, with a particular focus on training domain gap. To bridge this gap, the process of syn-datasets that fuel machine learning models behind 29 the processing pipelines. Understanding the artifacts 2.3. Sources of Noise and Errors in 3D Scanning inherent in scanning technologies is crucial for gen-Scanning devices in real life are prone to various erating training data that accurately reflects the real sources of noise and errors, related to the environ-world’s variance. Different 3D scanning methods, mental conditions and limitations of underlying tech-such as structured light triangulation and time-of-nology. flight measurements, introduce unique artifacts that Temporal noise in 3D scanning devices refers to can impact data quality. Some common artifacts invariations in the captured data over time, introduc-clude noise, distortions, and systematic errors. ing fluctuations or inconsistencies in the measurements. Temporal noise is often correlated between 2.1. Structured Light Scanning consecutive scans and can arise from a range of factors, including electronic instability, sensor char-Structured Light (SL) triangulation is based on the acteristics, or environmental conditions [17]. The principles of two-view geometry. One camera is re-amount of temporal noise can also be influenced placed by a light source that projects a sequence of by colour and material properties of the observed patterns onto the scene. The patterns projected get objects [9, 22, 24, 25] and the geometry of the deformed by the geometric shapes of the objects in scene [17]. the scene. A camera situated at a fixed distance from The presence of a different source of similar radia-the projector then captures the scene with the pro-tion can interfere with the device’s ability to correctly jected pattern [21]. By analysing the distortion of the calculate the distance of the objects in the scene. pattern, information about position of the objects in Such interference can be caused by ambient light [8], the scene can be determined. radiation from other active imaging devices [2, 3] or Various patterns have been proposed [20]. For even radiation emitted from the device itself when example, the Kinect v1 camera uses a fixed dot the scene contains reflective surfaces [20]. pattern [10]. Photoneo’s MotionCam-3D camera Systematic errors may also arise during 3D scan-utilises parallel structured light technology which enning. This type of errors result in consistent differ-ables the device to capture the scene depth at high ences between the scans and the actual scene geom-resolution and frame-rate at the same time [16]. etry. For SL cameras, it is mainly caused by inad-equate calibration, low resolution, and coarse value 2.2. Time-of-Flight Scanning quantisation [14]. 
In the case of ToF cameras, the measurement is based on mixing of different opti-Time-of-Flight (ToF) measurement technology is cal signals and approximation of their shapes. The based on the principle of calculating the distance of mentioned approximation is one of the contributions an object in the scene by measuring the time it takes to the effect referred to as wiggling [20], periodic for an emitted signal to travel to the object and back. change of the systematic error with distance. Both The distance is calculated from measurements of SL and ToF cameras may also suffer from temper-phase difference [13]. The exact type of waves em-ature drift [22, 25]. Systematic error of devices can ployed varies based on the application. RADAR and be modelled well when precise information about the LIDAR include ToF measurements [21]. The most scene is known [22]. common approach in ToF cameras is the continuous-wave intensity modulation IR LIDAR [12]. The dis-2.4. Training NNs using Synthetic Data tance is calculated from the observed phase delay of In the context of machine learning and neural net-the amplitude envelope of the reflected light [22]. work training, the fusion of synthetic data genera-The range and accuracy of ToF devices are pri- tion, domain randomization and data augmentation marily influenced by the wavelength and energy of can be leveraged as powerful tools to avoid expen-the emitted light, necessitating safety precautions, insive creation of real datasets. cluding energy capping in human environments [21]. A widely recognized tool for generating synthetic However, such devices often exhibit reduced preci-data is for example NVIDIA replicator1. Synthetic sion outdoors due to sunlight interference, as sun-data can further be enhanced by GANs [7], analyti-light has higher power compared to the emitted sig-1https://developer.nvidia.com/omniverse/ nal [13, 22]. replicator 30 while the axial noise can be observed in the individual depth values themselves. An example of axial noise is presented in Figure 1b. Multiple factors are known to influence axial noise, from geometry of the scene to properties of the material of the surfaces in the scene [17, 18, 22]. For SL cameras, according to pin-hole camera (a) (b) model and the disparity-depth model, the standard Figure 1: (a) Cropped range image of a white pa-deviation of axial noise σz increases quadratically per (blue rectangular area) positioned 1.25 m away with increasing depth and can be calculated as [17]: from the camera at a 20° angle captured by Kinect m v1. White pixels represent missing values. Lateral σz = z2σρ , (1) noise can be seen at the paper boundaries which are f b straight in the real scene. (b) Cropped range image where z refers to depth, σρ to the standard deviation of a planar wall captured by Kinect v2 at 90 cm dis-of normalised disparity values, f to the focal length, tance with notable axial noise. b to the length of the baseline, and m to the parameter of internal disparity normalisation. In this paper cal emulation of known imaging errors, artifacts and we estimate the noise levels for both axial and lateral noise [15], or domain randomization which intro-noise directly from the observed data without relying duces variability by altering key factors such as ob-on knowledge of the camera intrinsics. ject properties, lighting conditions, and camera perspectives [23]. 3.2. Custom Dataset In order to estimate the levels of lateral and axial 3. 
Estimating 3D Camera Noise Parameters noise in various 3D scanning devices we collected a In this section we describe the process of estimat-custom dataset. The dataset consists of scenes with ing the parameters of two types of noise occurring in a large planar surface (white rectangular cardboard) real 3D scans and their dependence on the distances under various rotations. of objects as well as the angle of the imaged surfaces. We captured the scene using three 3D cameras: In section 4 we perform an experiment showing that • Kinect v1 utilises IR SL projector combined the estimated parameters can be used to improve the with a monochrome CMOS sensor for depth performance of models trained on synthetic data. sensing, supplying range images with 640×480 3.1. Lateral and Axial Noise resolution at 30 fps. Its default depth range is 0.8 m - 4.0 m, 0.4 m - 3.0 m in near mode. We specifically investigate two types of noise: lat- • Kinect v2 employs a ToF camera for depth eral and axial. These are the two most dominant sensing. Compared to its predecessor, it has a types of noise present in real 3D scans. wider field of view and offers depth measure- ments with greater accuracy and wider depth Lateral noise Lateral noise refers to error in the range, 0.5 m - 4.5 m. The resolution of the range reported position in the camera’s xy-plane. Even images is, however, slightly smaller, 512 × 424. though lateral noise affects all measurements, it is • MotionCam-3D camera by Photoneo is based most visible at object boundaries, as illustrated in on SL range sensing. Thanks to parallel struc-Figure 1a. Existing research [18, 17] suggests the tured light technology [16], the camera is able distance of the object and its angle influences the to capture dynamic scenes. Overall, the cam-amount of lateral noise. era offers resolution up to 1680 × 1200. The MotionCam-3D can run in two different modes, Axial noise Axial noise refers to noise orthogonal the static scanner mode where the resolution to the imaging plane, parallel to the z-axis of the and scanning time are higher, and dynamic cam-camera. The lateral noise presents itself by alter-era mode where the scanning time and the out- ing the positions of depth values in the range image, put resolution are lower. 31 (a) stand (b) Kinect v1 and v2 (c) MotionCam-3D Figure 4: Normalised histograms of lateral error val-Figure 2: Physical setup for capturing surface at dif-ues. Collected from 200 images by Kinect v1 and ferent angles and distances. Kinect v2, and 100 images by MotionCam-3D. Each histogram represents a scene containing the white paper at 0° angle. The distances differ for each camera, being the shortest at which the paper was captured completely; 1m for Kinect v1, 0.75 m for Kinect v2, (a) Kinect v1 (b) Kinect v2 (c) MotionCam-3D 0.5 m for MotionCam-3D. Each histogram contains fitted normal distribution (dashed line). Figure 3: Range images captured by devices. The scene contains a white paper at 1 m distance and 30° angle captured as portrayed in Figure 2. To mitigate the effects of thermal drift the devices were warmed up by capturing range images of a blank wall in 1-minute intervals for 60 minutes prior to collecting the samples in the dataset. To investigate the influence of surface distance and angle on noise a set of range images containing a white planar paper at various positions was captured. To minimise any distortion of the paper, heavy-weight card stock was mounted on a rigid stand, displayed in Figure 2a. 
The stand is comprised of two plastic boards mounted to two wooden beams Figure 5: Visualisation of the relationship between attached to a wide plastic pipe with a plug. The rub-the standard deviation of lateral noise, measured in ber seal between the pipe and the plug was shaved mm, surface angle (left column), and distance (right to allow smooth rotation while preserving the posi-column). Each row contains data from a different de-tion when idle. The stand was constructed to have vice. The plots in each column and row share the x the centre of rotation in the horizontal centre of the and y axes respectively. In the plots of the left col-paper, with markings noting the rotation angle. umn, the underlying angle values are all multiples of With a mounted paper, this stand was positioned 10. A random shift of horizontal position between at various distances from the cameras and was ro-frames was added for legibility. tated for the capture of various scenes. For each such stationary scene 200 range images were captured by In order to estimate the effects of angle and dis-each camera. To minimise the impact of temporal tance of planar surfaces on noise levels we segment noise, for each set of range images capturing one the paper in the range images using manual annota-scene, an average range image was computed by av-tion in conjunction with the Canny edge detection [4] eraging captured depth values for individual pixels. and Hough transformation [6]. To ensure all the cameras captured the same scene, the entire process was repeated, as all the cameras 3.3. Lateral Noise Estimation did not reasonably fit into the same space at once. To estimate the lateral noise levels we focus on The setups are portrayed in Figure 2, while examples the paper boundary. We first estimate the position of of captured range images are displayed in Figure 3. the boundary by fitting a line using orthogonal dis-32 tance regression on the edge pixels. We perform this regression jointly for all scans with a given scene setup. We then calculate the distances of the edge pixels from the estimated boundary line. Example histograms of the distances from the line fit are shown in Figure 4. The Kolmogorov-Smirnov test rejected the normality of the distribution, probably due to the effects of quantization in pixel positions. However, we note that the error distributions closely resemble normal distributions. As seen in Figure 5, the level of lateral noise does not significantly change with surface angle. Previous research indicates hyperbolic increase of lateral noise at angles greater than 60° for Kinect v1 [18]. Our Figure 6: Visualisation of the relationship between experiments did not indicate such increase, however, standard deviation of axial noise, the surface angle thanks to large number of invalid pixels, we were not (left column), and the distance (right column). Each able to capture data for angles greater than 70°, and row contains data from a different device. The plots subsequently extract lateral noise. MotionCam-3D in each column and row share x and y axes. exhibited similar inability to capture surfaces at extreme angles. Contrastingly, Kinect v2 had no prob-much closer to the centre than the left edge. On mul-lem with 80° angle and exhibited no increase in noise tiple occasions, the right edge was captured as a per-with increasing angle. fectly vertical line in all 200 images captured for the A slight decline in the standard deviation with ris-scene, while the left edge was not. 
This can be ob-ing angle can be observed. We note that this decline served in Figure 5 as some values are reported with may not be caused by the change in angle directly, standard deviation of 0. From our limited data, a but as a result of presence of other noise causing a correlation of lateral noise with the pixel’s position great number of invalid pixels and thus preventing seems likely. Further experimentation would be re-lateral noise analysis. This type of noise increases by quired to fully explore this relationship. rising distance, as surfaces with progressively lower The results show that MotionCam-3D exhibits angles with the camera view are affected. Hence, sur-overall lower levels of lateral noise than both Kinect faces at greater angles are harder to measure from cameras with Kinect v2 achieving lower noise levels greater distances, leaving fewer samples resulting in of the two. lower standard deviation. Unlike in the case of the paper’s angle, the stan-3.4. Axial Noise Estimation dard deviation of the errors is not constant through-To obtain the distributions of axial noise we first out all distances, as seen in Figure 5. Notewor-performed low-pass filtering jointly on all scans of thy is the elevated standard deviation at shorter dis-scenes with the same scanner distances and angles. tances, between 50 cm and 1 m, for Kinect v2 and We then calculated the standard deviations of dif-MotionCam-3D. Kinect v1 was not able to capture ferences of depth from the values obtained by fil-the paper at such short distances at all. The standard tering. We have opted for this approach as despite deviation of errors in millimetres increases linearly using heavy stock paper, the paper surface was not with increasing distance, at different rate for each perfectly planar. We have also tested different types camera, depending on the camera’s physical param-of filtering which led to similar results. eters [18]. Note that this is equivalent to the standard Similar to lateral noise, the relationship between deviation remaining constant under varying distances angle and distance on the standard deviation of noise when measured in pixel coordinates. has been investigated. The results are visualised in By aiming to capture the scenes simultaneously Figure 6. MotionCam-3D exhibits least axial noise, with multiple cameras, the position of the paper was followed by Kinect v2 and Kinect v1 with the great-not always perfectly centred for all cameras. As a est magnitude of noise. result, for the Kinect v2, the paper’s right edge was From the right column in Figure 6, the influence 33 4.1. Real Evaluation Dataset To create our real data we manufactured five 3D models of the Stanford Armadillo. The objects were printed on J750 using the Vero family of materials. This allowed us to capture 55 real scans using 3 different variants of the MotionCam-3D. The real data contain samples from a close distance of around 70 cm, mid-range captures from around 100 cm, and longer-range shots from 150 cm. This should model the various use cases of the 3D scanning device, with Figure 7: Fitted polynomial function of degree 2 for varying amounts of noise. Apart from the Armadil-axial noise of MotionCam-3D, displayed as the sur-los, various cuboid-shaped objects were included in face, with the measured values, displayed as points the scene, some of which had a slightly reflective ma-colored by respective standard deviation of the sam-terial causing further noise. The real data was split ple. 
into a validation set with 20 samples, a test set of 25 captures, and 10 samples were used for training. of surface distance on the standard deviation can be 4.2. Training Data clearly seen for both SL cameras, Kinect v1 and MotionCam-3D. For Kinect v2 camera, the standard To evaluate the benefit of adding axial and lateral deviation does not change much with increasing dis-noise into synthetic data, we have rendered training tance compared to the other two cameras. The influ-data for the task of object segmentation using spe-ence of surface angle can also be seen. In the case cialized data generator [11], implemented to simu-of Kinect v1, the values of standard deviation seem late the MotionCam-3D and other Photoneo scan-to fluctuate unpredictably with changing angle. This ners. We are dealing with a simplified setting - seg-may be caused by different sources of noise such as mentation of a singular object, the Armadillo fig-systematic noise arising from the imaging process. urine.2 Due to a current limitation of the renderer, we were unable to account for the angle of the surface, thus the amount of noise is only affected by dis-3.5. Noise Models tance. This simplification should not hinder the eval-In previous subsections we have shown that stanuation, as per our analysis the surface angle does not dard deviations of both types of noise depend on greatly affect the standard deviation of the noise, but both the distance of objects to the scanners as well the amount of missing samples instead. Some sam-as the angle of the imaged surface. To model the ples also contain cuboid-shaped walls of containers, noise we fit the data shown in Figure 5 and Figure 6 which served as boundaries for the physical simula-with degree two polynomials using the ordinary least tion of placing the Armadillos into the scene. squares method. The resulting coefficients for both The dataset contains 180 synthetic samples. Ad-lateral σ ditionally, we have included 10 real samples, which L and axial noise σz are in Table 1. The resulting fit for the axial noise of MotionCam-3D is helped to avoid over fitting and permitted longer shown in Figure 7. . training. The dataset was designed to empirically evaluate the generalization of UNet-like CNN [19]. 4. Enhancement of Synthetic Training Data As different types of noise are abundant in the real samples, a network trained on clean rendered data is with Emulated Noise often unable to generalize. In this section we present an experiment that ver-4.3. Training ifies the importance of selecting an optimal level of noise when generating synthetic training data for A 4-channel input image with surface normals deep neural network training. We evaluate the effects and range image was used as an input to the U-Net of noise on a simple segmentation task. We train the shaped CNN. We have performed purely stochastic networks on synthetic data and evaluate them on real-2http://graphics.stanford.edu/data/ world scans. 3Dscanrep/ 34 Table 1: Fitted standard deviations of lateral noise (σL - in pixels) and axial noise (σz - in millimeters). Parameter θ represents the surface angle and z the distance from the camera center. 
Kinect v1 σL (z, θ) [px] = 0.94 + 4.51 · 10−5 · z + 6.20 · 10−4 · θ Kinect v2 σL (z, θ) [px] = 0.736 − 6.20 · 10−4 · z + 5.35 · 10−3 · θ + 2.13 · 10−7 · z2 − 1.40 · 10−6 · z · θ − 4.13 · 10−5 · θ2 MotionCam-3D σL (z, θ) [px] = 0.915 − 6.91 · 10−5 · z + 2.84 · 10−3 · θ Kinect v1 σz (z, θ) [mm] = −0.422 + 6.89 · 10−4 · z + 2.24 · 10−2 · θ + 5.99 · 10−7 · z2 − 2.70 · 10−6 · z · θ − 1.52 · 10−4 · θ2 Kinect v2 σz (z, θ) [mm] = 1.17 + 9.72 · 10−5 · z − 1.37 · 10−2 · θ − 6.35 · 10−9 · z2 + 7.86 · 10−6 · z · θ + 1.17 · 10−4 · θ2 MotionCam-3D σz (z, θ) [mm] = 0.599 − 1.43 · 10−3 · z − 8.94 · 10−3 · θ + 8.84 · 10−7 · z2 + 1.27 · 10−5 · z · θ + 2.75 · 10−5 · θ2 (a) Mn = 0 (b) Mn = 1 (c) Mn = 2 (d) Mn = 3 Figure 8: Synthetic sample from our data with varying amount of lateral noise added to range images. training with batch size = 1, Adam optimizer with results. We hypothesize that by the slight increase of 10−4 initial learning rate, and binary cross-entropy the σsynth arising from analysis, other noise types are as the loss function. The number of epochs was de-implicitly modeled, effectively making the network termined by a training callback. It observed the IoU more robust to noise uncaptured in the synthetic por-on the real validation set and picked the best model, tion of the training data. This results indicates that which was then evaluated on the test set. adding slightly more noise than estimated allows the network to be more robust while retaining good ac-4.4. Varying Noise Levels curacy. To verify the effect of various levels of emulated The results also show an interesting second peak lateral and axial noise we train multiple models each at M using a different level of added noise. To control the n = 2. By qualitative evaluation, we have verified that the network performance was better for noise we define a noise multiplicator: scans captured from larger distances. In such cases, σ large amounts of interference noise is present, arisM synth n = , (2) σest ing from the character of structured light technology where and the presence of ambient lightning. This case is σest denotes the estimated standard deviation of noise as presented in subsection 3.5 and visualized in Figure 10, where the network trained σsynth denotes the standard deviation of noise added to the on data without noise wrongly segments the rough, training data. Note that the estimated standard de-noisy surfaces. viations of noise depend on the surface angles and On the other hand, we have samples captured from distances and the type of noise (axial, lateral), but the a close distance, where the captured surfaces are ratio Mn is independent of these variables. The value smoother and objects have sharper boundaries. In of Mn = 0 indicates no noise added and Mn = 1 in-these situations, networks trained with greater noise dicates noise added according to the levels estimated levels (Mn ≥ 1.5) have trouble with the segmenta-in the previous section. The effects of various values tion of fine details and manage to only detect larger of Mn on the produced synthetic samples are shown blobs of the objects, see Figure 11. By a com-in Figure 8. bined quantitative and qualitative analysis we conclude from our experiment that the network trained 4.5. Results on data with noise Mn = 1.25 delivers the most ro-We evaluated models for varying values of Mn bust performance over various cases. Lastly, we note on the real testing data. 
Figure 9 shows the seg-that setting Mn = 1.75 failed to both segment fine mentation IoU metric for the evaluated models. The details in close-shots and was not as robust as net-network trained on synthetic data with slightly more works trained for more extreme noise, delivering the noise than estimated (Mn = 1.25) achieved the best weakest performance overall. 35 es 0.70 0.65 0.60 eal Captur 0.55 0.50 IoU on R 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 4.00 Mn Figure 9: Performance of neural network for object segmentation trained over data with varying amounts of noise. Multiplier Mn of the sigma from our analysis is shown on the horizontal axis with 0.25 interval. The resulting average IoU on a test set of real captures is visualized by height of the bars. For clarity, the top 6 values are depicted in green, 6 worst are in red and the remaining 5 middle results are in orange. Our data3 and code4 used for noise analysis and network training is publicly available. 5. Conclusion (a) (b) (c) In this paper we have presented an approach for Figure 10: Qualitative evaluation on a real distant modeling axial and lateral noise of real 3D scanning capture. As per our analysis, with larger distance, devices. Using our proposed methodology it is pos-more noise is present, see normals in (a). In this sit-sible to obtain a model of these types types of noise uation, network trained on clean data without noise with respect to imaged object distance and surface wrongly segments the rough surface as an object, see angles. Knowledge of the noise parameters can be (c). On the other hand, network trained on data with valuable when processing obtained 3D scans. noise (Mn = 1.25) is resistant to the noise (b). We also show that emulating noise when training a deep learning segmentation model on synthetic data is beneficial. Our experiment shows that the performance of the segmentation network on real data is best when the emulated noise is slightly stronger than estimated from the real scans. (a) (b) (c) In future, other types of noise should be modeled. Figure 11: Qualitative evaluation on a real scene cap-Furthermore, the combined range image with surface tured in close distance. Image of surface normals normals should be compared to other data representa-used as input to the networks is shown in (a). The tions. We plan to expand the evaluation of the effects masks produced by the trained network are shown of noise levels with extended data as well as decou-for (b) Mn = 1.25 (c) Mn = 2. Note the inability of pling the effects of different types of emulated noise. the latter network to segment fine details. Lastly, the interaction of added noise with other data augmentation techniques is worth investigating. Albeit limited in scope, the experiment presented in this section provides some insights into the ef-Acknowledgments: The work presented in this paper was car-fect of noise emulation during synthetic training data ried out in the framework of the TERAIS project, a Horizon-Widera-2021 program of the European Union under the Grant generation on real-world performance of the trained agreement number 101079338. Research result was obtained us-networks. Our experiment verifies the importance of ing the computational resources procured in the project National noise inclusion in synthetic training data. 
Additionally, we can observe that adding too much noise may lead to poor models which are unable to detect fine details in the scene structure. We thus conclude that the ability to model noise as it occurs in real 3D cameras is an important aspect of synthetic training data generation.

Our data (https://doi.org/10.5281/zenodo.10581278) and code (https://doi.org/10.5281/zenodo.10581562) used for the noise analysis and network training are publicly available.

5. Conclusion

In this paper we have presented an approach for modeling axial and lateral noise of real 3D scanning devices. Using our proposed methodology it is possible to obtain a model of these types of noise with respect to imaged object distance and surface angles. Knowledge of the noise parameters can be valuable when processing obtained 3D scans.

We also show that emulating noise when training a deep learning segmentation model on synthetic data is beneficial. Our experiment shows that the performance of the segmentation network on real data is best when the emulated noise is slightly stronger than estimated from the real scans.

In the future, other types of noise should be modeled. Furthermore, the combined range image with surface normals should be compared to other data representations. We plan to expand the evaluation of the effects of noise levels with extended data, as well as to decouple the effects of different types of emulated noise. Lastly, the interaction of added noise with other data augmentation techniques is worth investigating.

Acknowledgments: The work presented in this paper was carried out in the framework of the TERAIS project, a Horizon-Widera-2021 program of the European Union under Grant agreement number 101079338. The research results were obtained using the computational resources procured in the project National competence centre for high performance computing (project code: 311070AKF2), funded by the European Regional Development Fund, EU Structural Funds Informatization of Society, Operational Program Integrated Infrastructure. We thank Michal Piovarči for his help in preparing printing trays for the 3D models that were used for real dataset scanning.
Detecting and Correcting Perceptual Artifacts in Synthetic Face Images

Adéla Šubrtová, Jan Čech (Faculty of Electrical Engineering, Czech Technical University in Prague, subrtade@fel.cvut.cz) and Akihiro Sugimoto (National Institute of Informatics, Tokyo, Japan)

Abstract. We propose a method for detecting and automatically correcting perceptual artifacts on synthetic face images. Recent generative models, such as diffusion models, can produce photorealistic images. However, these models often generate visual defects on the faces of people, especially at low resolutions, which impairs the quality of the images. We use a face detector and a binary classifier to identify perceptual artifacts. The classifier was trained on our dataset of manually annotated synthetic face images generated by a diffusion model, half of which contain perceptual artifacts. We compare our method with several baselines and show that it achieves a superior accuracy of 93% on an independent test set. In addition, we propose a simple mechanism for automatically correcting the distorted faces using inpainting. For each face with an artifact response, we generate several replacement candidates by inpainting and choose the best one by the lowest artifact score. The best candidate is then back-projected into the image. Inpainting ensures a seamless connection between the corrected face and the original image. Our method improves the realism and quality of synthetic images.
1. Introduction

Synthetic image generation has made a giant leap in recent years, thanks to the development of powerful generative models, such as generative adversarial networks (GANs) [10, 15] and diffusion models [24, 23]. These models generate photorealistic images that are often indistinguishable from real photographs by human observers. However, they also sometimes produce visually unpleasant and distracting artifacts, including distorted faces.

In this paper, we focus on detecting and correcting perceptual artifacts in synthetic face images. We use the Stable Diffusion – Realistic Vision model [5], which is a popular text-to-image model that can generate high-quality images from complex captions. We observe that, although this model can generate amazing images, it often produces artifacts on the faces of people, especially at low resolutions.

Unlike GANs, which have a known "truncation trick" [3] to avoid artifacts by restricting the latent codes to a narrow range (near the mean latent vector), diffusion models do not have such a simple technique to control the trade-off between the quality and the diversity of generated images. Therefore, we propose to train a detector to identify perceptual artifacts on synthetic face images, and to use its output to automatically correct the generated faces. See Fig. 1 for an example. Our contributions are as follows.

- We trained a binary classifier to detect perceptual artifacts on face images generated by the diffusion model by learning on our dataset. We manually annotated a set of 1274 images where half of the samples contained perceptual artifacts.

- We compared our method with several baselines, such as the size of the synthetic face, the response score of the face detector, the response of the LAION Aesthetics predictor [25], and a recent perceptual artifact detector, PAL4VST [33], showing that our method achieves superior accuracy in detecting artifacts.

- We proposed a fully automatic method for fixing distorted faces generated by the diffusion model, using inpainting. For each face with an artifact response, we generate several replacement candidates by inpainting and choose the best one by the lowest artifact score.

Figure 1: Detection and correction of perceptual artifacts on synthetic faces performed fully automatically by our method. The left image is the input, an original image generated by the Realistic Vision model [5] with the prompt "A family enjoying a picnic in a vibrant, flower-filled meadow". The right image shows the result of our method. The bottom images are zoomed details of distorted/corrected face pairs.

The rest of the paper is structured as follows. Related work is reviewed in Sec. 2, the method is presented in Sec. 3, experiments are given in Sec. 4, and finally, Sec. 5 concludes the paper.
2. Related work

Long before the availability of photo-realistic synthetic generators, researchers aimed to assess the quality of images from a rather technical perspective (sharpness, noise, compression, etc.) [26, 18]. Early attempts to assess perceptual image quality were made even before the boom of deep learning. Paper [28] classified photos taken by amateurs and professional photographers, and paper [7] learned a simple classifier on hand-crafted features using a dataset from a peer-rated photo website.

Recently, many works on image aesthetics assessment have emerged. To name a few, the LAION Aesthetics predictor [25] learns a simple multi-layer perceptron on CLIP embeddings [22], given a crowd-sourced aesthetics score. Paper [16] learns the aesthetic score indirectly from user comments on online images. The 'naturalness' of an image is learned in [4]. For a comprehensive review of these methods, we recommend the surveys [8, 1].

A standard approach to assess the quality of a generative model is to use the Fréchet Inception Distance (FID) [11]. However, it assesses both the quality and diversity of generated images and is not defined for a single sample, but needs a large set of generated images.

The recent FreeU [27] promises a universal improvement of the visual quality of diffusion models, without any additional training, by simply re-weighting the skip connections in the denoising U-Net. However, the quality improvement seems to come at the cost of diversity and even prompt fidelity. A different approach [2] to improve generator quality is to train the diffusion model by reinforcement learning, possibly using an aesthetic reward.

More closely related to our work are papers that learn perceptual artifacts in synthetic images. Paper [31] detects artifacts in super-resolution GANs, and paper [34] detects artifacts in inpainting. The recent work [33] learns a predictor to localize perceptual artifacts in images produced by recent generator models, including Stable Diffusion [24]. The paper also proposes a mechanism similar to ours to correct the artifacts. We compare with their results and show that our method has superior artifact detection accuracy. Our automatic correction differs in the mechanism used to select the best one out of several candidate replacements.

Our problem is indirectly related to the out-of-distribution (OOD) detection problem [32], where only in-distribution samples are available for training. Although face images form a relatively compact domain, we observe that the artifacts generated by the diffusion model are so specific that the supervised classification formulation is more appropriate. A natural drawback of this choice is that we are model-dependent and have to retrain for a new model.

Another related problem is the forensic detection of synthetic images, a.k.a. 'deepfake' detection [20]. It might sound easy to train a synthetic vs. real face image classifier and use it to spot images with artifacts. However, it is not true that such a classifier will respond with a higher synthetic score on images with obvious perceptual artifacts. We will show this experiment among our baselines. The reason is that the real vs. synthetic classifier learns low-level signal features (as reported, e.g., by [30, 6]), while the higher-level content seems to be overlooked.

Figure 2: Examples of generated images alongside their prompts: (a) "A face framed by a hooded sweatshirt on a chilly day." (b) "A group of friends gathered around a bonfire, their faces illuminated by the flames." (c) "A farmer driving a tractor through a field of corn."
3. Method

Our aim is to develop a method to detect artifacts in synthetic images and correct them automatically. This work focuses on artifacts in the facial area, firstly because human perception is very sensitive to faces, and secondly because many artifacts in recent generative models are concentrated in the facial area. Specifically, our data-oriented method consists of two modules: a detection module, see Sec. 3.1, and an automatic face artifact removal module, see Sec. 3.2.

3.1. Artifact detection module

The artifact detection module consists of an off-the-shelf face detector [13] and a face artifact (binary) classifier. For the architecture we choose the powerful vision transformer for image classification [9]. The training is done in a supervised manner on our manually annotated synthetic dataset.

Synthetic dataset. Realistic Vision [5] is a popular text-to-image diffusion model. Each generated image requires as input a Gaussian noise and a textual prompt to guide the diffusion process. To make the synthesis fully automatic, we generated random prompts using ChatGPT [21]. The queries for ChatGPT aimed to produce textual prompts describing images containing (1) people with a focus on whole-body shots (e.g., Fig. 2b) and (2) people's portraits (e.g., Fig. 2a). We obtained 200 prompts in each of the queries, 400 in total; the image dataset with the prompts will be released. During the dataset synthesis, we randomly sampled a prompt and an initial Gaussian noise to produce the images. We used the default negative prompt for the Realistic Vision model as recommended by its authors.

With this process, we synthesized a set of 3k images and manually separated the samples into two classes – with and without artifacts. The presence of artifacts is in fact not a binary property, as the boundary appears rather fuzzy, and for certain images it is very challenging and subjective to decide on one of the two classes. Hence, in our dataset we include only the most severe and disturbing artifacts. Given the random nature of the generated prompts, some images had to be completely discarded because they did not contain any visible face (see Fig. 2c).

Subsequently, we detected faces in the collected images using the YOLO v8 Face detector [13]. Faces with a size smaller than 50 pixels were discarded. All faces were aligned so that the eye-keypoint line was parallel with the horizontal axis (the facial keypoints are also returned by the YOLO Face detector). In total, the dataset of 1274 images was randomly split such that the training set consisted of 406 images per class, the validation set of 97 per class, and the test set of 134 faces per class.

3.2. Automatic face artifact removal

We propose a simple mechanism to automatically and seamlessly rectify faces with artifacts in synthetic images. The idea is to replace faces with detected artifacts by generative inpainting. Inpainting is a process used in image editing where unwanted parts of an image are filled in seamlessly to fit the overall context. We used the same generative model to do the inpainting [5]. Since the model struggles with generating faces at low resolution, we zoom in around the face bounding box to increase the likelihood that the inpainted face will be artifact-free. Moreover, we always generate several inpainting candidates and decide on the best one by our classifier response.

Our method consists of the following steps:

1. In the generated image, we find a face for which our classifier is positive for artifacts.
2. Using inpainting, we generate N candidates for replacement. Note that we zoom in such that the face bounding box is magnified by a factor m, and inpaint the pixels inside the original bounding box.
3. For each of the N replacement candidates, we measure the response of our artifact classifier and choose the winner as the one with the lowest score. See Fig. 3 for an example of replacement candidates sorted from highest to lowest artifact score.
4. The winning candidate is finally subsampled by 1/m to the original scale and projected back into the original image.

Figure 3: Replacement candidate ranking. To find a replacement for the original face image with artifacts (left), we generate multiple candidates using inpainting and sort them based on the response of our artifact classifier. Subsequently, we select the one with the best response as a replacement for the original face.

We cannot enlarge the face to the maximum possible size, because inpainting requires some context. If the context is insufficient, i.e., the area around the face region is too small and uninformative, the resulting inpainting does not match the original image (in terms of content, geometry, and lighting/shading). Therefore, we zoom in by a factor m = 2, which was empirically found as a trade-off between model realism and consistency with the context. Inpainting itself ensures that the connection with the original image is seamless and no additional blending is needed. We set the number of replacement candidates to N = 10, as a trade-off between quality and computational time. More candidates increase the chance of finding a better candidate, but the system is less responsive. In this way we can effectively remove face artifacts and thus improve the perceptual quality and realism of generated images.
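The correction loop can be summarized with the following sketch. The inpaint_face and artifact_score callables are assumed helpers (the former would wrap the inpainting call configured as in Sec. 3.3, the latter the classifier from Sec. 3.1); the cropping and paste-back details are illustrative rather than the exact released implementation, and any resampling to the inpainting model's working resolution is assumed to happen inside inpaint_face.

```python
import numpy as np

N = 10  # number of replacement candidates
M = 2   # zoom factor around the face bounding box

def correct_face(image, box, inpaint_face, artifact_score):
    """Replace one face flagged by the artifact classifier (steps 1-4 above).

    image: H x W x 3 uint8 array; box: (x, y, w, h) face bounding box;
    inpaint_face(crop, mask) -> new crop of the same size (wraps the inpainting call);
    artifact_score(face) -> float, higher means more artifacts.
    """
    x, y, w, h = box

    # Take a crop M times larger than the face box so the model sees some context.
    cx, cy = x + w // 2, y + h // 2
    half_w, half_h = (M * w) // 2, (M * h) // 2
    x0, y0 = max(cx - half_w, 0), max(cy - half_h, 0)
    x1, y1 = min(cx + half_w, image.shape[1]), min(cy + half_h, image.shape[0])
    crop = image[y0:y1, x0:x1].copy()

    # Inpaint only the pixels of the original (un-zoomed) bounding box.
    mask = np.zeros(crop.shape[:2], dtype=np.uint8)
    mask[y - y0:y - y0 + h, x - x0:x - x0 + w] = 255

    # Generate N candidates and keep the one with the lowest artifact score.
    candidates = [inpaint_face(crop, mask) for _ in range(N)]
    best = min(candidates, key=lambda c: artifact_score(
        c[y - y0:y - y0 + h, x - x0:x - x0 + w]))

    # Paste the winning candidate back; inpainting keeps the seam consistent.
    out = image.copy()
    out[y0:y1, x0:x1] = best
    return out
```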
3.3. Implementation details

We initialized the classifier network with weights pretrained on the ImageNet dataset. The network was trained for 10 epochs with the AdamW [19] optimizer and an initial learning rate of 5e-05. During training, we employed a linear learning rate scheduler and augmented our dataset by mirroring each example and adding it to the dataset. The images were resampled to the ViT input resolution of 224 × 224 pixels. Following the preprocessing of the pretrained ViT, we use normalization across the RGB channels with mean [0.5, 0.5, 0.5] and standard deviation [0.5, 0.5, 0.5].

For inpainting in the correction module, we used the same generative model and HuggingFace's diffusers library [12] (v0.17.1) with the following settings: num_inference_steps=200, strength=0.45, guidance_scale=15.5.
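A minimal PyTorch sketch of this classifier training setup follows; the torchvision ViT variant, the batch size, and the ImageFolder-style dataset layout are assumptions on our part, while the remaining hyperparameters mirror the ones listed above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models, transforms
from torchvision.datasets import ImageFolder

# Preprocessing: 224x224 input, RGB normalization with mean/std 0.5,
# and horizontal mirroring as augmentation.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])

# Assumed directory layout: one sub-folder per class (artifact / clean).
train_set = ImageFolder("faces/train", transform=train_tf)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

# ImageNet-pretrained ViT with a new 2-class head.
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 2)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

epochs = 10
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```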
4. Experiments

To evaluate our method, we conducted a number of experiments. Firstly, we report a quantitative evaluation, comparing our classifier to other methods for artifact detection. To the best of our knowledge, there exists only one paper contributing directly to this topic, namely PAL4VST [33]. For that reason, we propose several additional baselines to compare our model with. Secondly, we present a qualitative evaluation of the baselines by ranking the test set according to the responses of each classifier. Finally, we show results of the entire detection and automatic correction pipeline.

4.1. Baselines

Face-size based classifier. We observe a high correlation between face size and the severity of artifacts. The size was determined from the face detections found by the YOLO v8 face detector [13]. For non-square bounding boxes, we took the longer side. The classification threshold that maximizes classification accuracy was determined on the validation set.

Laion Aesthetics predictor. The Laion Aesthetics predictor [25] was trained to predict an aesthetics score in the range [0, 10] based on the visual appearance of an image, 10 being the best. The threshold was again found to maximize the validation accuracy. The model was trained on whole images, thus we assess this baseline in two modes, one with whole images as inputs and a second mode with the face crops.

Face-detection-score based classifier. The YOLO v8 face detector [13] is our next choice for a baseline; specifically, the confidence score for each bounding box. Yet again, we determine the classification threshold on the validation set.

Perceptual artifact localisation (PAL4VST). Zhang et al. [33] train a segmentation transformer for artifact localization in synthetic images generated by multiple generative models (including Stable Diffusion [24]). The output is a segmentation mask where active pixels mark the areas with artifacts. Since the method expects whole images, we again test two scenarios, face crops and whole images. To compare this method to our facial artifact detection, we inferred the classification labels as follows. We consider the prediction as "with artifacts" if at least one pixel in the output mask is active for the face crop, or inside the face bounding box in the case of whole images. Otherwise, the predicted label is "no artifacts".

Synthetic vs Real classification baseline. As the next baseline, we consider a classifier between real and synthetic images. The classifier was trained in a supervised manner with 10k images in each class. The synthetic class was generated as described in Sec. 3.1 with the recommended negative prompt. As the real class, we used a randomly selected subset of images from the FFHQ dataset [14] and cropped the faces in the same way as in the synthetic class. We trained a ViT, starting from the ImageNet model, but trained only the last layer and kept the other weights frozen. This model achieved 99% accuracy in discriminating synthetic vs. real images. We observed that its artifact detection accuracy was higher than when training the entire model. We hypothesize that the latter option learns the low-level signal features, as reported by [30, 6], and not the image content. The threshold for artifact detection was again set on the validation set.
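Several of these baselines share the same calibration step: a scalar response is thresholded, and the threshold is chosen to maximize accuracy on the validation set. A minimal sketch of that selection follows (function and variable names are ours; for baselines where a lower response indicates artifacts, the comparison direction would be flipped).

```python
import numpy as np

def best_threshold(scores, labels):
    """Pick the decision threshold that maximizes classification accuracy.

    scores: 1-D array of baseline responses on the validation set;
    labels: 1-D array of {0, 1} ground-truth labels (1 = artifact-free here).
    """
    candidates = np.unique(scores)
    accs = [np.mean((scores >= t).astype(int) == labels) for t in candidates]
    return candidates[int(np.argmax(accs))]
```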
4.2. Quantitative results

The comparison between our artifact detector and the baselines is presented in Table 1. Namely, we report classification accuracy (Acc) and the area under the precision-recall curve (AUC). Our method achieves superior results for both metrics.

Model                                  Acc     AUC
Face-size                              0.8731  0.9213
Laion Aesthetics [25]                  0.8134  0.9420
Face detector score                    0.5896  0.6475
PAL4VST [33] (face crops)              0.7164  0.7981
Synth/Real (last layer finetuned)      0.7761  0.8651
Ours                                   0.9254  0.9678
Laion Aesthetics [25] (whole images)   0.5633  0.5805
PAL4VST [33] (whole images)            0.6531  0.7766

Table 1: Quantitative results. Classification accuracy (Acc) and area under the precision-recall curve (AUC) calculated on our test set.

As expected, the simple face-size based classifier is a strong baseline. It confirms that artifacts are most common in low-resolution faces, but may be present at higher resolutions, too.

The Laion Aesthetics predictor in the whole-image setting is weaker. Likely, the mismatch between detecting artifacts and predicting aesthetic quality is significant. The ranking in Fig. 5 suggests that most of the aesthetics of an image resides in colorfulness and not in structural correctness. We also observe that the version with cropped faces is significantly more accurate, probably because the artifacts are more prominent.

The face detector score is a surprisingly weak baseline. We hypothesize that, unlike the classical scanning Viola-Jones [29] detector, YOLO [13] decides on a larger context (i.e., a human body), so distorted faces do not impact the score much. Faces with severe artifacts were confidently detected on our dataset.

PAL4VST [33] does not perform very well in detecting face artifacts on either the face crops or the whole images, despite being a recent method trained on a much larger dataset of generated images including Stable Diffusion.

The Synth/Real classifier is another rather weak baseline. It is a proxy problem that does not solve the target artifact detection task very well.

4.3. Ranking experiment

To qualitatively compare all the models, we conduct a ranking experiment on the held-out test set. Each test image is evaluated using each model and ranked by its response; in the case of PAL4VST, we rank by the size of the region with artifacts, i.e., the number of active pixels in the segmentation mask. Images with the most severe artifacts are depicted in Fig. 4, and the cleanest or most photo-realistic ones in Fig. 5. We can see that the different baselines returned different rankings, which indicates that each model focuses on different features. The Laion Aesthetics predictor returned rather visually pleasant (colorful) images as the best. PAL4VST returned very distorted images as the best ones, and the YOLO detector response returned several good images among the worst ones. The ranking confirms the quantitative results in Tab. 1.

Figure 4: Image ranking – worst first. Each row depicts the five worst images from the test set. The ranking is based on the response of each classifier.

Figure 5: Image ranking – best first. Each row depicts the five best images from the test set. The ranking is based on the response of each classifier.

4.4. Results of the entire pipeline

Finally, we show results of the entire pipeline (detection and correction) on several images. See Figs. 1, 6, 7, 8 for examples. Our detector finds distorted faces and correctly selects a good replacement candidate. The result is a seamless correction of faces with artifacts.

Figure 6: Example of the application of our method. The original image (top) contains severe artifacts in the facial area. Artifacts are discovered using our pretrained classifier and multiple candidates for replacement are generated using inpainting. The candidates are again evaluated by our classifier and ranked according to its response. The one with the best score is selected as the replacement. The corrected images are shown in the middle row, while details of the faces are depicted in the bottom row.

Figure 7: An example of automatic correction of face artifacts in a synthetic image. The original image contains unnatural facial features. The newly generated faces look much more realistic.
5. Conclusion

In this work we propose an artifact classifier for synthetic face images trained on our manually annotated dataset. We provide a comparison with several baselines, such as a face-size based classifier, the LAION Aesthetics predictor, and the recent perceptual artifact detector [33], showing that our method achieves superior classification metrics in face artifact detection.

Furthermore, we demonstrate that our method is applicable to the automatic correction of facial artifacts caused by recent diffusion models. Specifically, we generate multiple replacement candidates of the face with artifacts using standard inpainting. Subsequently, we evaluate the new face candidates with our classifier and, in the end, we select the candidate with the lowest artifact score as the replacement.

Limitations and future work. One of the weaknesses of our method is the fact that, during the automatic artifact correction, we use the quite ambiguous prompt "face" to regenerate the image. Due to this, we do not have any guarantee that the corrected face will be of the same age or gender; we only rely on the context. In minor cases, semantically incompatible faces are found. That might be avoided by keeping the original prompt if available, or by estimating the prompt with an off-the-shelf image captioning model such as BLIP [17].

Figure 8: Example of automatically rectified face artifacts, produced by our method.

Acknowledgements. This work was supported by the NII international internship program and by the CTU Study Grant SGS23/173/OHK3/3T/13.

References

[1] A. Anwar, S. Kanwal, M. Tahir, M. Saqib, M. Uzair, M. K. I. Rahmani, and H. Ullah. A survey on image aesthetic assessment. arXiv preprint arXiv:2103.11616, 2021.
[2] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
[3] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[4] Z. Chen, W. Sun, H. Wu, Z. Zhang, J. Jia, X. Min, G. Zhai, and W. Zhang. Exploring the naturalness of AI-generated images. arXiv preprint arXiv:2312.05476, 2023.
[5] CivitAI. Realistic Vision, v5.1, 2023. https://civitai.com/models/4201/realistic-vision.
[6] R. Corvi, D. Cozzolino, G. Poggi, K. Nagano, and L. Verdoliva. Intriguing properties of synthetic images: from generative adversarial networks to diffusion models. In Proc. CVPR, pages 973–982, 2023.
[7] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Studying aesthetics in photographic images using a computational approach. In ECCV 2006, Part III, pages 288–301. Springer, 2006.
[8] Y. Deng, C. C. Loy, and X. Tang. Image aesthetic assessment: An experimental survey. IEEE Signal Processing Magazine, 34(4):80–106, 2017.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[11] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[12] Hugging Face. Diffusers – Inpainting, 2023. https://huggingface.co/docs/diffusers/using-diffusers/inpaint.
[13] A. Kanametov. YOLO v8 face detector, 2023. https://github.com/akanametov/yolov8-face.
[14] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, 2019.
[15] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.
[16] J. Ke, K. Ye, J. Yu, Y. Wu, P. Milanfar, and F. Yang. VILA: Learning image aesthetics from user comments with vision-language pretraining. In Proc. CVPR, pages 10041–10051, 2023.
[17] J. Li, D. Li, C. Xiong, and S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
[18] Q. Li and Z. Wang. Reduced-reference image quality assessment using divisive normalization-based image representation. IEEE Journal of Selected Topics in Signal Processing, 3(2):202–211, 2009.
[19] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[20] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen, D. T. Nguyen, T. Huynh-The, S. Nahavandi, T. T. Nguyen, Q.-V. Pham, and C. M. Nguyen. Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223:103525, 2022.
[21] OpenAI. ChatGPT v3.5, 2023. https://chat.openai.com/chat.
[22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
[23] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[24] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022.
[25] C. Schuhmann and LAION team. LAION-AESTHETICS, 2022. https://laion.ai/blog/laion-aesthetics/.
[26] H. R. Sheikh and A. C. Bovik. Image information and visual quality. IEEE Transactions on Image Processing, 15(2):430–444, 2006.
[27] C. Si, Z. Huang, Y. Jiang, and Z. Liu. FreeU: Free lunch in diffusion U-Net. arXiv preprint arXiv:2309.11497, 2023.
[28] H. Tong, M. Li, H.-J. Zhang, J. He, and C. Zhang. Classification of digital photos taken by photographers or home users. In PCM 2004, Part I, pages 198–205. Springer, 2005.
[29] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, 2001.
[30] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros. CNN-generated images are surprisingly easy to spot... for now. In Proc. CVPR, 2020.
[31] L. Xie, X. Wang, X. Chen, G. Li, Y. Shan, J. Zhou, and C. Dong. DeSRA: Detect and delete the artifacts of GAN-based real-world super-resolution models. In ICML, 2023.
[32] J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
[33] L. Zhang, Z. Xu, C. Barnes, Y. Zhou, Q. Liu, H. Zhang, S. Amirghodsi, Z. Lin, E. Shechtman, and J. Shi. Perceptual artifacts localization for image synthesis tasks. In Proc. ICCV, pages 7579–7590, 2023.
Com-ternational Conference on Computer Vision, pages puter Vision and Image Understanding, 223:103525, 7579–7590, 2023. 1, 3, 4, 5, 6, 7 2022. 3 [34] L. Zhang, Y. Zhou, C. Barnes, S. Amirghodsi, [21] OpenAI. Chatgpt v3.5, 2023. https://chat. Z. Lin, E. Shechtman, and J. Shi. Perceptual artifacts openai.com/chat. 3 localization for inpainting. In European Conference [22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, on Computer Vision, pages 146–164. Springer, 2022. G. Goh, S. Agarwal, G. Sastry, A. Askell, 3 P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. 46 27th Computer Vision Winter Workshop Terme Olimia, Slovenia, February 14–16, 2024 Cross-Dataset Deepfake Detection: Evaluating the Generalization Capabilities of Modern DeepFake Detectors Marko Brodarič, Vitomir Štruc Peter Peer University of Ljubljana, University of Ljubljana, Faculty of Electrical Engineering, Faculty of Computer and Information Science, Tržaška cesta 25, 1000 Ljubljana Večna pot 113, 1000 Ljubljana marko.brodaric@fe.uni-lj.si peter.peer@fri.uni-lj.si Abstract. Due to the recent advances in generative X c deep learning, numerous techniques have been pro-e AUC p posed in the literature that allow for the creation of t FF++ i + o E so-called deepfakes, i.e., forged facial images com-n DiffFace V monly used for malicious purposes. These devel- A H L opments have triggered a need for effective deep-F U - A fake detectors, capable of identifying forged and ma-F F T D E nipulated imagery as robustly as possible. While a considerable number of detection techniques has S B Grad-CAM been proposed over the years, generalization across I a wide spectrum of deepfake-generation techniques Figure 1: We evaluate the performance of three still remains an open problem. In this paper, we study conceptually distinct deepfake detection methods a representative set of deepfake generation methods in a cross-dataset setup on the FaceForensics++ and analyze their performance in a cross-dataset set-database and investigate the reasons for the different ting with the goal of better understanding the reasons generalization capabilities using Gradient-weighted behind the observed generalization performance. To Class Activation Mappings (Grad-CAM). To facili-this end, we conduct a comprehensive analysis on tate the analysis, we also introduce a dataset of deep-the FaceForensics++ dataset and adopt Gradient- fakes, generated with a diffusion-based generator. weighted Class Activation Mappings (Grad-CAM) to provide insights into the behavior of the evaluated detectors. Since a new class of deepfake genera-synthesizing forged and/or manipulated images and tion techniques based on diffusion models recently videos. The most widespread synthesis methods are appeared in the literature, we introduce a new subset based on Generative Adversarial Networks (GANs) of the FaceForensics++ dataset with diffusion-based [6,11,21], and in recent years, solutions utilizing the deepfake and include it in our analysis. The results concept of denoising diffusion [8, 13, 40]. Human of our experiments show that most detectors over-faces have always been one of the most popular tar-fit to the specific image artifacts induced by a given gets for such synthesis and manipulation techniques, deepfake-generation model and mostly focus on lo-as this allows for the design of numerous practical cal image areas where such artifacts can be expected. 
1. Introduction

With the advances in generative deep neural networks, there has been a surge in methods capable of synthesizing forged and/or manipulated images and videos. The most widespread synthesis methods are based on Generative Adversarial Networks (GANs) [6, 11, 21] and, in recent years, solutions utilizing the concept of denoising diffusion [8, 13, 40]. Human faces have always been one of the most popular targets for such synthesis and manipulation techniques, as this allows for the design of numerous practical applications, ranging from applications in the entertainment industry (e.g., movies and smartphone applications), security systems, privacy-enhancing solutions and many more [22]. However, due to the high level of realism ensured by these methods, they can also be employed for malicious purposes, such as creating fake news or falsifying evidence. All of this has prompted the development of so-called deepfake detectors to alleviate this threat.

Among the first detectors developed were techniques that work as binary classifiers. Such discriminative detectors are commonly trained on a dataset to perform classification between original/pristine, unaltered images and images that have been manipulated using one of the existing deepfake generation methods [1, 3]. A limitation of the discriminatively-trained approach is that the errors made by the synthesis method in generating deepfakes are quite specific to that method. This results in poor generalization of the detector, which learns to classify a specific type of deepfake. In real-life deployment scenarios, where we lack information about how the forgery was created, it is crucial for the detector to perform well regardless of the type of deepfake encountered. Some solutions have addressed these problems by introducing a specific pipeline before the classifier that extracts additional information from the given image, either by considering multiple modalities [20] or by manipulating the image [27, 39]. The latter proves to be one of the more effective approaches to improving generalization. The idea behind these methods is that they generate so-called pseudo deepfakes and use them as an extension of the training dataset, or they learn exclusively on them. Images can be augmented in various ways, which determines the types of artifacts that are injected into the training set of the detector. However, even these methods can only improve generalization to a certain extent, as they are fundamentally discriminative. In this domain, approaches have also been proposed that use only one class for training [12, 15]. These methods learn only from samples of unaltered images, defining in a way what a normal image is, and anything deviating from it is marked as an anomaly—indicating a potential deepfake. These methods are expected to be robust to different types of deepfakes, as they do not encounter any real deepfake samples during training.

In this paper, we aim to explore the generalization capabilities of existing deepfake detectors in cross-dataset experiments, where the term cross-dataset refers to the fact that the detectors are tested on deepfake types that are distinct from those used for training. Additionally, we are interested in the performance of existing detectors on the more recent diffusion-based deepfake generation techniques, which have not yet been studied widely in the literature. Finally, our goal is also to understand the causes behind the observed performances. To this end, we conduct a comprehensive cross-dataset evaluation of various types of detectors on deepfakes from the FaceForensics++ dataset [25] and study the results quantitatively as well as qualitatively through Gradient-weighted Class Activation Mappings (Grad-CAM) [26].
2. Related work

In this section, we present a brief overview of relevant works on deepfake detection. For a more comprehensive review of existing detectors, the reader is referred to some of the excellent surveys on this topic available in the literature [22, 23, 36].

Early Detectors. Early deepfake detectors primarily relied on the identification of known artifacts introduced into the forged images by the deepfake generation techniques. As a result, this group of detectors used conventional (hand-crafted) descriptors and classifiers to detect blending signs [2, 38], deviations of the face from the surrounding background (e.g., incorrect lighting) [28], face warping artifacts [19], and even methods that observe the broader context of a video, such as detecting unusual eye blinking patterns [18] or observing lip synchronization and the corresponding speech [14]. Such detectors provided promising initial results, but were limited in their performance due to their focus on explicit (human-defined) image artifacts induced by the deepfake-generation models.

Discriminative Detectors. To mitigate the dependence on manual modeling of image artifacts, a more recent group of detectors approached deepfake detection from a machine learning perspective and formulated the problem as a binary classification task. Solutions from this group commonly learn a discriminative model, e.g., a convolutional neural network (CNN), on a dataset of real and fake images, and during the training process simultaneously learn relevant features for detection. It turns out that even standard (off-the-shelf) CNN architectures already perform better in addressing deepfake detection than the early hand-crafted techniques discussed above, while more specialized solutions further improve on these results. In [3], for example, the authors introduced Xception, a CNN model that, with minor modifications, was demonstrated to be highly effective for deepfake detection [24]. Tariq et al. [33] showed that vanilla CNN detectors, based on Xception [3] or DenseNet [9] backbones, perform poorly with low-resolution deepfakes. To address this issue, they proposed an ensemble of three Shallow Convolutional Networks with different layer configurations, effectively handling various input image resolutions. Similarly, Afchar et al. [1] argued that microscopic image analysis based on image noise is not suitable for compressed images, where the noise induced by the deepfake generation process is strongly degraded, and, similarly, that the analysis of high-level semantics is also unsuitable due to the subtle appearance differences between real and fake images. Therefore, an intermediate approach was proposed, where a neural network classifies images based on mesoscopic features, a mid-level image representation. Although discriminative detectors perform well in detecting forgeries when they are tested with the same type of deepfakes that was also used for training, their performance tends to deteriorate when applied to deepfakes created using a previously unseen method. This generalization issue is generally considered one of the main problems of modern deepfake detectors, and the causes of the poor generalization are still poorly understood.
Beyond Discriminative Detectors. The problem of generalization was addressed in [20] by introducing a dedicated feature extractor that incorporates specific domain knowledge before the classifier. The feature extractor infers task-specific and information-rich features at multiple scales from the input image, combining them into a discriminative representation that is then fed to a classifier. In [4], the authors followed a similar idea and proposed the Hierarchical Memory Network to decide whether an image represents a deepfake or not. The proposed network considers both the current facial content to be classified as well as previously seen faces. Facial features are extracted using a pretrained neural network, consisting of a bidirectional GRU (Gated Recurrent Unit) and an attention mechanism. The resulting output is then compared to previously seen faces to make a decision on whether the input face is a deepfake or not.

One of the more effective methods for improving the generalization of deepfake detectors is the synthesis of forged images, which are then used together with real/pristine face images to train discriminative detection models. These so-called pseudo-deepfake methods are in essence learned from real data only and never observe a real deepfake image. Instead, they simulate deepfake artifacts through various augmentation and synthesis strategies, leading to highly effective detection models. Li et al. [17], for example, proposed the Face X-ray method, which focuses on identifying image artifacts resulting from the blending process. In the learning stage, real faces are initially blended together to generate blended images, and a detector is then trained on these samples to distinguish between original and blended images. This idea was later extended in [27], where the authors synthesized training samples by blending a face back into its original frame. Because the same face is used as the target and source for swapping, the proposed self-blending process introduces very subtle artifacts from which a deepfake detector is learned, leading to very competitive detection performance.

Since the primary task of deepfake detectors is to distinguish forgeries of any kind from pristine images, solutions have also been proposed that approach the problem within a one-class anomaly detection setting. In [12], Khalid et al. proposed the OC-FakeDect method, which is based on a One-Class Variational Autoencoder. Here, the input images are classified based on the reconstruction score obtained through the encoder-decoder architecture. Similarly, in [15], a one-class method called SeeABLE was presented, where the model learns low-dimensional representations of synthetic local image perturbations. To detect forgeries, an anomaly score derived from a prototype matching procedure is used.

Our Contribution. While the evolution of deepfake detectors discussed above has led to obvious progress in detection performance and improvements in generalization capabilities, the characteristics of these models that impact cross-deepfake detection performance are still underexplored. In the experimental section, we therefore study the behavior of a representative set of existing deepfake detectors in cross-dataset detection experiments and analyze class activation mappings to better understand which image areas contribute to the detection decisions. Additionally, we also explore the performance of the detectors with a new class of deepfakes, generated with modern diffusion-based models. To the best of our knowledge, this issue has not yet been widely explored in the open literature.
3. Methodology

To facilitate the analysis, we select three conceptually distinct deepfake detectors: (i) a discriminative model based on the Xception architecture that learns to distinguish between real and forged images through a binary classification problem [24], (ii) the High-Frequency Face Forgery Detection (HF-FFD) method [20] that aims to improve the generalization capabilities of discriminatively learned deepfake detectors by extracting informative task-specific features, and (iii) a pseudo-deepfake detector relying on Self-Blended Images (SBI) [27] that learns from pristine images only and simulates deepfake-induced artifacts for the training process through a dedicated blending process. Details on the selected deepfake detectors are given in the following sections.

3.1. The Discriminative Xception-Based Detector

The Xception method conceptually originates from the family of Inception methods [10, 29–31]. Unlike traditional convolutional layers that learn filters in 3D space (two spatial dimensions and one channel dimension), processing both the spatial and cross-channel correlations with each convolutional kernel, the fundamental idea of Inception modules is to divide this process into multiple operations that independently handle the mapping of these correlations. Specifically, in Inception modules, cross-channel correlations are first computed using 1 × 1 convolutional filters, followed by all other correlations using 3 × 3 convolutions. If we simplify the module by omitting the average pooling tower and reformulate the architecture as one large 1 × 1 convolutional layer followed by 3 × 3 convolutions, we get a streamlined version of the Inception layer. Taking this idea to the extreme, by mapping spatial correlations for each output channel separately, we get a module very similar to depthwise separable convolution. Xception is a convolutional neural network architecture that replaces Inception modules with depthwise separable convolution layers, assuming that the mapping of cross-channel correlations and spatial correlations in the feature maps of a convolutional neural network is completely decoupled. The proposed architecture consists of 36 convolutional layers structured into 14 modules, each with a linear residual connection (except the first and last). At the end, there is logistic regression and an optional fully-connected layer. The first detector used in this work uses the Xception model to learn a discriminative deepfake detector.
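As a minimal illustration of the building block discussed above (not the full 36-layer Xception architecture), a depthwise separable convolution can be sketched in PyTorch as follows.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: spatial correlations are handled by a
    per-channel (depthwise) 3x3 convolution and cross-channel correlations by a
    1x1 (pointwise) convolution, mirroring the decoupling assumed by Xception."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```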
3.2. High-Frequency Face Forgery Detection

Luo et al. [20] identified that face manipulation procedures generally consist of two stages: fake face creation and face blending. Since only the facial part of the image is altered while the background remains the same, the blending stage disrupts the original data distribution, and this characteristic discrepancy can be utilized for forgery detection. As a result of this observation, the authors proposed a method that employs both RGB spatial features and high-frequency noise for detecting forgeries. The pipeline comprises three parts: the entry, middle, and exit flows. The input image is first converted into a residual image X_h using SRM filters [5]. The entry flow takes both the RGB image X and the residual image X_h, performing convolution on both to obtain feature maps F^1 and F_h^1. To extract more high-frequency information, an SRM filter followed by a 1 × 1 convolution is applied to F_h^1, resulting in F̃_h^1. This result is then added to F_h^1, and the operations are repeated. The output of the entry flow consists of feature maps of two modalities, where the high-frequency F_h carries much more information than the input X_h. The output spatial feature map F is element-wise multiplied with an attention map M obtained from the residual image as M = f_att(X_h), where f_att is an attention block inspired by CBAM [37]. In the middle flow, the feature maps of the two modalities are fed into a dual cross-modality attention module (DCMA), which captures dependencies between low-frequency textures and high-frequency noise. Each input is divided into two components: a value, representing domain-specific information, and a key, measuring the correlation between the two domains. In the exit flow, high-level features of the two modalities are merged. Classifier training to distinguish between genuine and forged images can then be performed on these obtained features. In this work, we again use the Xception [3] model to learn a deepfake detector over the extracted features.
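The following PyTorch sketch illustrates only the first idea of this pipeline: computing a high-frequency residual and using it to gate the RGB feature maps. The 3 × 3 high-pass kernel is a simplified stand-in for the SRM filter bank, and the attention block is a loose, CBAM-inspired approximation; neither matches the exact HF-FFD architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative high-pass kernel standing in for the SRM filter bank;
# the actual method uses the fixed SRM kernels, not this Laplacian-like filter.
HIGH_PASS = torch.tensor([[[[-1., -1., -1.],
                            [-1.,  8., -1.],
                            [-1., -1., -1.]]]]) / 8.0

def residual_image(x):
    """Per-channel high-pass residual X_h of an RGB batch x of shape (B, 3, H, W)."""
    kernel = HIGH_PASS.repeat(3, 1, 1, 1).to(x.device)  # one filter per channel
    return F.conv2d(x, kernel, padding=1, groups=3)

class ResidualAttention(nn.Module):
    """Attention map M = f_att(X_h) that gates the spatial feature map F,
    loosely following the CBAM-inspired block described in the text."""

    def __init__(self, channels):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat, x_h):
        # Resize the attention map to the feature resolution and gate the features.
        m = self.att(x_h)
        m = F.interpolate(m, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        return feat * m
```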
3.3. Self-Blended Images [27]

The third approach considered in our analysis, Self-Blended Images [27], falls into the category of detectors that address the generalization issue by generating synthetic forgeries, on which a discriminative detector is learned. Typically, these methods synthesize training samples by blending two distinct faces and generating artifacts based on the gap between source and target images. In contrast, this method blends a slightly altered version of the same face, actively generating artifacts with selected transformations. The so-called Self-Blended Images (SBIs) are generated in three steps. First, the source-target generator creates pseudo source and target images for blending. The input image I is initially duplicated, and both images are augmented to introduce statistical inconsistencies (RGB and HSV color space values are randomly shifted, as well as brightness and contrast; the images are downsampled or upsampled). Blending boundaries and landmark mismatches are reproduced by resizing the source image, zero-padding or center-cropping it, and finally translating it. The pseudo source and target images end up with the same size as the original image. In the next step, the mask generator creates a grayscale mask used for blending the previously generated images. This is done by having a landmark detector first determine parts of the face, based on which a convex hull is calculated. To increase the diversity of the mask, the obtained shape is deformed with elastic deformation and then eroded or dilated. Lastly, the blending ratio of the source image is determined by multiplying the mask by a constant r ∈ (0, 1]. In the final step, the blending of the source image I_s and target image I_t is performed with the blending mask M to generate the self-blended image. With such synthetically generated samples, a binary classifier is then trained to distinguish between genuine images and deepfakes. Following [27], we also use EfficientNet-b4 [32] for this task.
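The final blending step can be summarized with a short sketch (NumPy, hypothetical helper names; the source/target augmentation and mask generation described above are assumed to have been performed already).

```python
import numpy as np

def self_blend(source, target, mask, r):
    """Blend the pseudo source into the pseudo target with mask M scaled by
    the blending ratio r in (0, 1] (final step of SBI generation).

    source, target: H x W x 3 float arrays; mask: H x W array in [0, 1].
    """
    m = np.clip(mask * r, 0.0, 1.0)[..., None]  # grayscale mask scaled by r
    return m * source + (1.0 - m) * target
```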
4. Experiments and results

4.1. Datasets

For the experiments, we select the FaceForensics++ dataset [25], which is one of the most popular and challenging datasets publicly available for the development and testing of deepfake detectors in cross-deepfake-type experiments. Additionally, to make the analysis more comprehensive, we generate two novel subsets of the FaceForensics++ dataset, one based on a recent GAN-based face swapping procedure and one based on a diffusion-based model. These two subsets also represent one of the tangible contributions of this work. Below, we provide details on the FaceForensics++ dataset and the novel InsightFace and DiffFace subsets.

FaceForensics++. For the training and testing of models, we utilize the FaceForensics++ dataset [25], which comprises 1000 videos. These videos are divided into three groups: 720 for training, 140 for validation, and 140 for testing. The dataset is partitioned into several subsets that are generated using 5 distinct deepfake-generating methods: Deepfakes¹, Face2Face [35], FaceShifter [16], FaceSwap¹, and NeuralTextures [34]. These deepfakes are created using predefined target and source face pairs and are mostly based on methods relying on Generative Adversarial Networks (GANs). Additionally, each group includes authentic, unaltered videos. We augment the dataset with two additional subsets. The first uses the InsightFace [7] face swapping procedure, and the second the diffusion-based DiffFace approach [13]. Because deepfakes based on diffusion models have so far not been widely discussed in the literature and no relevant datasets are available, we discuss the generated DiffFace subset of FaceForensics++ (FF++) in a separate section below.

¹ https://github.com/deepfakes/faceswap

The DiffFace FF++ Subset. We structure the DiffFace subset in the same way as all others from the FaceForensics++ collection: it consists of frames from 1000 videos, divided into training, validation, and test sets, with only every tenth frame processed for each recording. Forged images generated using the DiffFace approach are highly convincing and difficult to distinguish from authentic ones at first glance. In Figures 2e to 2h, we see that the generated deepfake can even look more convincing than the original images. However, the method yields poorer results when faced with more challenging scenarios, such as under face orientations that cause the face to be only partially visible (e.g., a profile view in Figure 2a) and various occlusions on the face (e.g., glasses in Figure 2b). As the process is of a sequential stochastic nature, artifacts such as shadows (Figure 2c) or hair segments (Figure 2d) are sometimes transferred to the output as well.

Figure 2: Examples of images generated using DiffFace. DiffFace produces convincing deepfakes that are almost indistinguishable from real images, e.g., see the pristine images in (e) and (g) and their deepfakes in (f) and (h), but also leads to failure cases in challenging scenarios, e.g., a profile view in (a) or facial occlusions, e.g., a visible border around glasses in (b). Sometimes artifacts also remain in the images, e.g., shadows in (c) or hair segments in (d).

4.2. Performance metrics

Following standard evaluation methodology [12, 15, 23], we evaluate the performance of the selected methods based on the Area Under the Receiver Operating Characteristic Curve (AUC). We also conduct a qualitative analysis of the results, comparing the characteristics of images and Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps of samples where the methods are successful and those where they are not [26]. We use Grad-CAM as the primary tool for understanding the generalization capabilities of the tested detectors.

4.3. Results

Xception Results. For the evaluation, we trained the Xception model using deepfakes generated with one of the face forgery methods that constitute the FF++ dataset, and tested the model on the entire testing set to obtain insight into the method's performance in detecting various types of deepfakes. The results are compiled in Table 1. It is evident that the method performs best on forgeries generated using the same method as used for the generation of the training samples. Clearly, the detector overfits to the textural errors specific to the given deepfake generation method. Consequently, when applied to images manipulated using a different method, the detector's performance decreases significantly.

Train set \ Test set (AUC)   Deepfakes  DiffFace  Face2Face  FaceShifter  FaceSwap  InsightFace  NeuralTextures
Deepfakes                    0.9974     0.7018    0.8844     0.5699       0.6434    0.6130       0.9174
DiffFace                     0.6111     0.9959    0.5079     0.6128       0.5151    0.6072       0.5199
Face2Face                    0.9420     0.6475    0.9903     0.6946       0.6562    0.5316       0.8106
FaceShifter                  0.6533     0.9368    0.5197     0.9969       0.5161    0.6156       0.5696
FaceSwap                     0.6647     0.6928    0.8608     0.5050       0.9955    0.5361       0.7730
InsightFace                  0.6981     0.6473    0.5851     0.8027       0.5473    0.9298       0.6535
NeuralTextures               0.9931     0.6765    0.9497     0.7302       0.6847    0.5516       0.9862
Table 1: Performance (AUC) of Xception trained on different databases in a cross-dataset scenario.

Figure 3: Grad-CAM analysis of the last convolutional layer of the Xception network. The model trained on the Deepfakes (a), Face2Face (b), and NeuralTextures (c) databases typically activates in the regions around the eyes, mouth, and nose. The classifier trained on deepfakes from the DiffFace database (d) typically activates in a circular pattern.
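The cross-dataset protocol behind Tables 1–3 (train a binary detector on one FF++ subset, evaluate frame-level AUC on every subset) can be expressed compactly. The helpers load_split, train_detector and score_frames are hypothetical placeholders standing in for the respective training pipelines; only the AUC computation itself is a standard library call.

```python
from sklearn.metrics import roc_auc_score

SUBSETS = ["Deepfakes", "DiffFace", "Face2Face", "FaceShifter",
           "FaceSwap", "InsightFace", "NeuralTextures"]

def cross_dataset_table(load_split, train_detector, score_frames):
    """Returns {(train_subset, test_subset): AUC} for all pairs of subsets."""
    table = {}
    for train_name in SUBSETS:
        model = train_detector(load_split(train_name, split="train"))
        for test_name in SUBSETS:
            frames, labels = load_split(test_name, split="test")
            scores = score_frames(model, frames)          # P(fake) per frame
            table[(train_name, test_name)] = roc_auc_score(labels, scores)
    return table
```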
Additionally, we observe that the model exhibits significantly better generalization across the Deepfakes, Face2Face, and NeuralTextures databases compared to other types of deepfakes. These forgeries contain visually similar artifacts, e.g., blending edges, distortions in facial landmarks, and color mismatches. An analysis of the detector using Grad-CAM [26] reveals that the last convolutional layer of the method trained on one of these subsets activates in similar regions during inference, i.e., areas around the eyes, mouth, and nose, as seen in Figures 3a to 3c.

The results also indicate that training the detector on diffusion-based deepfakes leads to poor generalization. Diffusion-based forgeries appear markedly different at first glance and do not exhibit the typical artifacts. This suggests that the detector is attentive to entirely different features, as evident in the Grad-CAM analysis shown in Figure 3d, i.e., the triggering area of the last convolutional layer is typically circular, unlike for any other training database.

HF-FFD Results. In the case of the HF-FFD detector, we are dealing with a discriminative model that uses the Xception architecture for classification and a specialized pipeline for feature extraction, as described in Section 3.2. We conduct training and testing of this model in the same way as with Xception. The results are shown in Table 2. As can be seen, the introduction of the pipeline significantly improves generalization. However, a more in-depth analysis using Grad-CAM is needed for a better understanding. Based on the Grad-CAM analysis, we can roughly categorize the learned models into three groups based on the focus of the last convolutional layer: nose, mouth, and philtrum (the area between the nose and mouth).

Train set \ Test set (AUC)   Deepfakes  DiffFace  Face2Face  FaceShifter  FaceSwap  InsightFace  NeuralTextures
Deepfakes                    0.9971     0.7494    0.9403     0.6615       0.5666    0.6353       0.9596
DiffFace                     0.5166     0.9999    0.5302     0.5076       0.5210    0.5086       0.5294
Face2Face                    0.9965     0.5277    0.9912     0.7614       0.7343    0.6638       0.9591
FaceShifter                  0.7750     0.8228    0.8491     0.9987       0.7094    0.6229       0.7910
FaceSwap                     0.9407     0.9897    0.9934     0.8823       0.9969    0.4995       0.9274
InsightFace                  0.6896     0.7830    0.6146     0.5203       0.5447    0.9725       0.5843
NeuralTextures               0.9928     0.8561    0.9891     0.9220       0.9302    0.6603       0.9933
Table 2: Performance of HF-FFD with an Xception classifier in a cross-dataset scenario.

The network's focus on the root of the nose and its surroundings occurs when training the network on the Deepfakes dataset. A similar focus is observed when training on the InsightFace dataset, but in this case, the center of focus is not the root of the nose; instead, it is somewhere on the edge (tip, left or right edge, or the top of the nose). In the case of the DiffFace dataset, the network focuses on the mouth, with a triangular area towards the nose. For all other datasets, the network focuses on the philtrum area, but they differ in the shape of the focus area. The Face2Face and NeuralTextures datasets have a circular area similar to Deepfakes, while the FaceShifter and FaceSwap datasets have areas that stretch upward on the face, with the former having an hourglass shape and the latter a truncated triangle. In the case of genuine images, the model is triggered in the eye area, regardless of the training dataset. These focus areas are illustrated in Figure 4.

Figure 4: Illustration of Grad-CAM results depicting the triggering regions of the last convolutional layer of the Xception network with an added feature-extracting pipeline: focus on the root of the nose (Deepfakes (a)) and on the edge of the nose (InsightFace (b)), triangular area with the center on the mouth (DiffFace (c)), circular focus on the philtrum (Face2Face (d) and NeuralTextures (g)), hourglass shape (FaceShifter (e)), truncated triangle (FaceSwap (f)), and focus on the eyes in genuine images (h). Best viewed in color.

From the results in Table 2, it is evident that the method trained on the Deepfakes, Face2Face, and NeuralTextures subsets also generalizes well across those specific deepfake types. Moreover, it is also noticeable that the triggering area of the method on these subsets is very similar, i.e., an approximately circular area around the focus center, with slight variations in the center's position (Figures 4a, 4d, 4g). However, it turns out that the method performs better among datasets where the intersection between the triggering areas of the network is larger. Thus, a model trained on datasets with a larger triggering area (FaceSwap (Figure 4f), FaceShifter (Figure 4e), and NeuralTextures (Figure 4g)) detects deepfakes of almost all types. In contrast, training on datasets with a small triggering area (DiffFace (Figure 4c)) results in very poor generalization. A special case is the InsightFace dataset, where the center and shape of the focus are not consistent. Different spatial/semantic areas in the images seem informative for the method in these types of deepfakes. Consequently, when recognizing forgeries of other types, we correctly detect only those images with a similar informative defect, which is evident in Grad-CAM heatmaps by the center and shape of the focus approximating the typical focusing area of this dataset. However, detection with these subsets also results in many false negatives, as in cases where the network focuses on the top of the nose, it closely resembles the focus of a genuine image (which typically lies in the eye area). Slightly better performance is achieved only when testing on the DiffFace dataset, as the samples of these two datasets are the most similar, which is why we often obtain a triangular area at the base of the nose that closely resembles the triggering area in the DiffFace dataset.
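Since Grad-CAM [26] is the main analysis tool throughout this section, a minimal sketch of how such a heatmap can be computed for the last convolutional layer of a binary detector is given below. A PyTorch model with two output logits is assumed and the choice of target layer is left to the caller; this is a generic Grad-CAM implementation, not the exact tooling used by the authors.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=1):
    """Grad-CAM heatmap of `target_layer` for class `class_idx` ('fake' assumed at index 1).
    x: (1, 3, H, W) input frame; returns an (H, W) map normalized to [0, 1]."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(x)
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        h1.remove(); h2.remove()
    a, g = acts[0], grads[0]                      # both (1, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]
```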
Self-Blended Images (SBI) Results. This method relies solely on pristine images from the training dataset, eliminating the need for deepfakes in the training set. To evaluate its performance, we conduct tests using a pre-trained model that was trained as described in the paper [27]. The results are summarized in Table 3. This technique utilizes only authentic images to generate pseudo-deepfakes for training the detector. This unique approach enables the direct determination of the specific artifacts on which the detector should focus. The authors of this approach categorize these artifacts into four groups: landmark mismatch, blending boundary, color mismatch, and frequency inconsistency.

Model   Deepfakes  DiffFace  Face2Face  FaceShifter  FaceSwap  InsightFace  NeuralTextures
SBI     0.9106     0.5708    0.8715     0.7922       0.7851    0.5892       0.8430
Table 3: Performance of EfficientNet-b4 fine-tuned using self-blended images, tested on deepfakes created with seven different approaches. Results are shown in terms of AUC.

Figure 5: Typical examples of artifacts that the SBI method successfully detects: obvious blending border (a), color mismatch (b), structural inconsistencies (e.g., partially deleted glasses (c)), poorly generated facial landmarks (e.g., nose (d)).

The results indicate that the method performs comparably well in recognizing forgeries of all types where the same artifacts that were synthesized on the training images are present. The method achieves its highest success rates on samples from the Deepfakes dataset (Figure 5a) and the Face2Face dataset (Figure 5b), where the injected artifacts are most conspicuous. The method is also effective in detecting structural inconsistencies (e.g., partially deleted glasses in Figure 5c) and poorly generated facial landmarks (e.g., nose in Figure 5d). However, the method's performance significantly declines when confronted with forgeries that do not contain the artifacts present in the training set. It notably struggles with forgeries from the DiffFace and InsightFace datasets. In the latter, the method focuses primarily on areas that appear to have been smoothed during the forgery process. However, this is not precise enough, leading to misclassification of many genuine images as deepfakes. Forgeries from the DiffFace dataset present a unique challenge, as they do not exhibit the typical errors due to a different generation approach. Consequently, a classifier trained on pseudo-deepfakes with typical artifacts faces difficulty distinguishing these forgeries. This approach successfully mitigates the problem of overfitting to a specific deepfake generation method. However, the issue of generalization is then shifted to the level of selecting transformations during the synthesis of training samples. This directly influences what the classifier will decide upon during classification, meaning that in the presence of new types of forgeries expressing different defects, the detector may not successfully identify them.
5. Conclusion

In this paper, we analyzed three face forgery detection methods, evaluating them in a cross-dataset scenario and assessing their generalization. Using Grad-CAM, we examined failure cases and observed that discriminative models like Xception generalize primarily among forgeries with similar textural artifacts, while models with a feature-extracting pipeline before the classifier demonstrated improved generalization when trained on datasets that induce larger focus areas in the final convolutional layer. Classifiers trained with pseudo-deepfakes proved effective only when the artifacts assumed during training sample generation also appeared in the forgeries. Future work will expand the analysis to a broader set of detectors, explore aspects such as the impact of image compression, investigate the characteristics of the detection techniques in the frequency domain, and assess the discriminativeness of the learned image representations.

References

[1] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen. Mesonet: a compact facial video forgery detection network. In 2018 IEEE international workshop on information forensics and security (WIFS), pages 1–7. IEEE, 2018. 2, 3
[2] Z. Akhtar and D. Dasgupta. A comparative evaluation of local feature descriptors for deepfakes detection. In 2019 IEEE International Symposium on Technologies for Homeland Security (HST), pages 1–5, 2019. 2
[3] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017. 2, 4
[4] T. Fernando, C. Fookes, S. Denman, and S. Sridharan. Exploiting human social cognition for the detection of fake and fraudulent faces via memory networks, 2019. 3
[5] J. Fridrich and J. Kodovsky. Rich models for steganalysis of digital images. IEEE Transactions on Information Forensics and Security, 7(3):868–882, 2012. 4
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. 1
[7] J. Guo, J. Deng, X. An, and J. Yu. Deepinsight/insightface: State-of-the-art 2d and 3d face analysis project. 5
[8] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 1
[9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. 2
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR, 2015. 4
[11] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 1
[12] H. Khalid and S. S. Woo. Oc-fakedect: Classifying deepfakes using one-class variational autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020. 2, 3, 6
[13] K. Kim, Y. Kim, S. Cho, J. Seo, J. Nam, K. Lee, S. Kim, and K. Lee. Diffface: Diffusion-based face swapping with facial guidance, 2022. 1, 5
[14] P. Korshunov, M. Halstead, D. Castan, M. Graciarena, M. McLaren, B. Burns, A. Lawson, and S. Marcel. Tampered speaker inconsistency detection with phonetically aware audio-visual features. In International conference on machine learning, number CONF, 2019. 2
[15] N. Larue, N.-S. Vu, V. Struc, P. Peer, and V. Christophides. Seeable: Soft discrepancies and bounded contrastive learning for exposing deepfakes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21011–21021, 2023. 2, 3, 6
[16] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019. 5
[17] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo. Face x-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 3
[18] Y. Li, M.-C. Chang, and S. Lyu. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In 2018 IEEE International workshop on information forensics and security (WIFS), pages 1–7. IEEE, 2018. 2
[19] Y. Li and S. Lyu. Exposing deepfake videos by detecting face warping artifacts. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019. 2
[20] Y. Luo, Y. Zhang, J. Yan, and W. Liu. Generalizing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021. 2, 3, 4
[21] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017. 1
[22] Y. Mirsky and W. Lee. The creation and detection of deepfakes: A survey. ACM Computing Surveys (CSUR), 54(1):1–41, 2021. 1, 2
[23] T. T. Nguyen, Q. V. H. Nguyen, D. T. Nguyen, D. T. Nguyen, T. Huynh-The, S. Nahavandi, T. T. Nguyen, Q.-V. Pham, and C. M. Nguyen. Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223:103525, 2022. 2, 6
[24] S. Pashine, S. Mandiya, P. Gupta, and R. Sheikh. Deep fake detection: Survey of facial manipulation detection solutions. arXiv preprint arXiv:2106.12605, 2021. 2, 3
[25] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 2, 5
[26] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017. 2, 6, 7
[27] K. Shiohara and T. Yamasaki. Detecting deepfakes with self-blended images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18720–18729, 2022. 2, 3, 4, 5, 8
[28] J. Straub. Using subject face brightness assessment to detect 'deep fakes' (conference presentation). In Real-Time Image Processing and Deep Learning 2019, volume 10996, page 109960H. SPIE, 2019. 2
[29] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017. 4
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015. 4
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016. 4
[32] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019. 5
[33] S. Tariq, S. Lee, H. Kim, Y. Shin, and S. S. Woo. Detecting both machine and human created fake face images in the wild. In Proceedings of the 2nd international workshop on multimedia privacy and security, pages 81–87, 2018. 2
[34] J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019. 5
[35] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2387–2395, 2016. 5
[36] R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, and J. Ortega-Garcia. Deepfakes and beyond: A survey of face manipulation and fake detection. Information Fusion, 64:131–148, 2020. 2
[37] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018. 4
[38] Y. Zhang, L. Zheng, and V. L. L. Thing. Automated face swapping and its detection. In 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP), pages 15–19, 2017. 2
[39] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, and W. Xia. Learning self-consistency for deepfake detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 15023–15033, 2021. 2
[40] W. Zhao, Y. Rao, W. Shi, Z. Liu, J. Zhou, and J. Lu. Diffswap: High-fidelity and controllable face swapping via 3d-aware masked diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8568–8577, June 2023. 1

Video shutter angle estimation using optical flow and linear blur

David Korčák          Jiří Matas
Faculty of Electrical Engineering, Czech Technical University in Prague
korcadav@fel.cvut.cz   matas@fel.cvut.cz

Abstract

We present a method for estimating the shutter angle, a.k.a. exposure fraction – the ratio of the exposure time and the reciprocal of frame rate – of videoclips containing motion. The approach exploits the relation of the exposure fraction, optical flow, and linear motion blur. Robustness is achieved by selecting image patches where both the optical flow and blur estimates are reliable, checking their consistency. The method was evaluated on the publicly available Beam-Splitter Dataset with a range of exposure fractions from 0.015 to 0.36. The best achieved mean absolute error of estimates was 0.039. We successfully test the suitability of the method for a forensic application of detection of video tampering by frame removal or insertion.
1. Introduction

The shutter angle, a.k.a. the exposure fraction, is the ratio of the exposure time, i.e. the time period a film or a sensor is exposed to light, and the time between two consecutive frames, i.e. the reciprocal of the frame rate. The shutter angle determines the relation between object motion and image blur and thus influences viewer perception. This has been used in film-making as part of artistic expression, and tutorials and websites are devoted to explaining the effect [6]. In computer vision and image processing, the exposure fraction affects methods for temporal super-resolution and video frame interpolation, since the inserted images need to both interpolate motions and reduce motion blur.

For analog cameras, the exposure fraction remains constant throughout the duration of the video. In digital cameras, the exposure time and thus the shutter angle may be set dynamically, according to illumination intensity. Nevertheless, for most recorded videos it stays constant. For global shutter cameras, it is the same for every pixel of a frame. For rolling shutter cameras, the same is true for horizontal motions; the exact modeling is more complex for vertical motions.

The exposure fraction provides a physics-based constraint that influences every pixel, and it thus has the potential for the detection of fake videos and local image edits. Violations of the constraint – the linear relationship between the local motion blur and the optical flow, with the shutter angle providing the scaling constant – are not immediately obvious to human observers, and thus might not be noticed by either the authors or the viewers of the altered or synthesized content. Moreover, many generators of synthetic content are often trained on sharp images corresponding to very short exposures or on graphics-generated data and thus might not represent motion blur as required by physics.

In the paper, we present a robust method for the estimation of the shutter angle that relies on explicitly running a state-of-the-art optical flow algorithm [11] and a linear blur kernel estimator [2]. We are not aware of any existing method for shutter angle estimation from unconstrained video sequences. Barber et al. [1] formulate the problem for sequences containing only specific types of blur. We are the first to address the problem for sequences containing general motion.

In summary, we make the following contributions. The proposed method is novel, exploiting recent progress in deep optical flow and linear blur estimation. Both of these estimates are dense, permitting us to achieve robustness by combining predictions from patches where both estimates are reliable and consistent. We show a forensic application of the method, considering detection of video tampering by frame removal or insertion.

2. Related work

The problem of estimating the shutter angle of a videoclip has been approached from the point of view of precisely measuring camera characteristics with the help of a bespoke setup. The method of Simon et al. [8] relies on special devices, such as turntables, CRT displays or arrays of LEDs lit in specific patterns. Barber et al. [1] addressed the problem for sequences containing blur induced by either zoom or camera rotation during exposure.

We formulate the problem of estimating the video shutter angle with the use of optical flow and linear blur kernel estimates from a general video clip containing motion. Methods for estimating linear blur parameters [2, 10, 13] often rely on deep neural networks trained on synthetically blurred images. Typically, they serve as an intermediate step towards image or video frame deblurring. While the details of particular implementations differ, the output is generally a set of estimated parameters of linear blur kernels.

Similarly, the topic of optical flow estimation has seen a lot of progress with methods such as Teed and Deng's RAFT [11]. Optical flow methods seek to estimate pixel displacements between consecutive frames and, as of recently, are based on various deep neural network architectures. As obtaining ground truth of optical flow on real-world data is difficult, these methods are trained on synthetic datasets.
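For illustration, dense optical flow of the kind used below can be obtained with a pretrained RAFT model. The paper runs the authors' RAFT [11] model pretrained on Sintel with 12 iterations; the sketch below instead uses the torchvision re-implementation, which is an assumption and not necessarily the checkpoint used in the experiments.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()          # rescales and normalizes both frames

@torch.no_grad()
def dense_flow(frame1: torch.Tensor, frame2: torch.Tensor) -> torch.Tensor:
    """frame1, frame2: (3, H, W) uint8 tensors, H and W assumed divisible by 8.
    Returns the (2, H, W) displacement field in pixels."""
    img1, img2 = preprocess(frame1.unsqueeze(0), frame2.unsqueeze(0))
    flows = model(img1, img2, num_flow_updates=12)   # 12 refinement iterations
    return flows[-1][0]                              # final iteration, batch item 0
```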
3. Method

The proposed method for shutter angle estimation relies on the calculation of dense optical flow, described in Sec. 3.1, and linear blur, described in Sec. 3.2. The shutter angle is the ratio of the lengths of the blur and optical flow vectors (Sec. 3.4) if the direction of motion does not change within the exposure time; thus, not all parts of the image are suitable for estimating this ratio. For instance, both the estimates of the blur and of the optical flow will be unreliable in parts of the image that contain the sky or similar smooth-texture surfaces. In Sec. 3.5 we describe our algorithm for the selection of patches suitable for shutter angle estimation.

3.1. Optical flow model

Optical flow is conventionally defined as per-pixel motion between video frames. Given 2 consecutive video frames F_i and F_{i+1}, the goal of optical flow estimation methods is to map the position (x_i, y_i) of a given pixel in frame F_i to its position (x_{i+1}, y_{i+1}) in frame F_{i+1}. This mapping can be modeled as a dense displacement field (f^1, f^2). The position of a given pixel in frame F_{i+1} can then be described as (x_i + f^1(x_i), y_i + f^2(y_i)) [11]. In this paper, we employ Teed and Deng's RAFT [11] for optical flow estimation, selected for its well-documented performance across various sequences [11] and strong benchmark results [12].

3.2. Linear blur model

The heterogeneous motion blur model commonly views the blurred (real) image Y as the product of a convolution of a sharp image X with an operator K and additive noise N [2]:

    Y = K ∗ X + N.    (1)

The motion blur kernel map K, in general, consists of different blur kernels for each pixel at position (x, y). The linear blur model assumes the kernels at (x, y) can be modeled as 1D, in the direction of the local motion. Such kernels can also be interpreted as two-dimensional motion vectors K_{x,y} = (k¹_{x,y}, k²_{x,y}). Linear blur kernels also characterize the motion of a pixel over the camera exposure time ε, as the blurring occurs by motion during camera light capture over the exposure period.

The assumption of linear blur is violated, e.g., for hand-held cameras that may undergo Brownian-like motions. In such cases, the estimation of both blur kernels and optical flow is difficult, and they present a challenge for our method. Since the exposure fraction is the same for all pixels in the image, it is sufficient to find a modest number of areas where the linear blur assumption holds, e.g. on a linearly moving object in the scene. In Sec. 3.5, we introduce techniques for identifying and selecting such areas of video frames.

In this paper, we apply the method of linear blur estimation by Gong et al. [2], which shows both good generalization ability and dataset performance. It is also one of the only methods that perform per-pixel estimates of blur kernels, i.e. the estimates are calculated, not interpolated from patches, in full resolution. This method, however, introduces a level of discretization error, as the deep neural network used for estimating blur kernels operates as a multi-class classifier with a discrete output space. We attempt to minimize the effect of such errors on our final estimate as described in Sec. 3.5. During testing, we also evaluated the performance of Zhang et al.'s method [13], as it operates with a real output space, but found that in our setup it performed worse than Gong et al.'s method [2].
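The linear blur model of Eq. (1) can be made concrete with a toy forward renderer: each blurred pixel is an average of samples of the sharp image taken along the local 1D kernel, i.e. along the motion during the exposure. This is only an illustration of the model, not the estimator of Gong et al. [2]; the sampling scheme and the omitted noise term are simplifications.

```python
import numpy as np

def apply_linear_blur(sharp: np.ndarray, kernels: np.ndarray, taps: int = 9) -> np.ndarray:
    """sharp: (H, W) grayscale image; kernels: (H, W, 2) per-pixel motion vectors
    K_{x,y} = (k1, k2). Returns the linearly blurred image (noise term N omitted)."""
    h, w = sharp.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    acc = np.zeros((h, w), dtype=np.float32)
    for t in np.linspace(-0.5, 0.5, taps):               # sample the exposure interval
        sx = np.clip(xs + t * kernels[..., 0], 0, w - 1).astype(int)
        sy = np.clip(ys + t * kernels[..., 1], 0, h - 1).astype(int)
        acc += sharp[sy, sx]                             # nearest-neighbour sampling
    return acc / taps
```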
3.3. Shutter angle from linear blur and optical flow

We consider a video camera with the following parameters. Let ε denote the exposure time of each frame, f the video framerate in frames per second, and θ the video shutter angle in degrees. For the i-th video frame, we define t_i as the time of exposure start and t_i + ε as the exposure end. The time difference between exposure starts of two consecutive frames, ∆t = t_{i+1} − t_i = 1/f = f⁻¹, is equal to the reciprocal of the frame rate f.

For the purpose of simplicity, we use the exposure fraction notation rather than the shutter angle, i.e. instead of 180° we speak of 0.5. The degree notation is widely used in cinematography, as it originates from the construction of historical cameras that utilized mechanical rotating shutters to set the exposure time ε. In many modern digital cameras, the exposure time ε can be set explicitly. We define the exposure fraction α as the ratio

    α = ε/∆t = θ/360°.    (2)

3.4. Estimating the exposure fraction

Consider the optical flow described in Sec. 3.1 and the linear blur kernel described in Sec. 3.2. The magnitude of the optical flow vector ∥f_{x,y}∥ is equivalent to the distance displaced by the pixel over the duration of a single frame, i.e. over the time interval of length ∆t. Similarly, the magnitude of the linear blur kernel ∥K_{x,y}∥ corresponds to the distance displaced by pixel (x, y) over the time interval ε. Here, we assume uniform motion, i.e. the pixel at (x, y) is traveling at a constant velocity v_{x,y} between consecutive frames. Such an assumption is reasonable, as the absolute frame duration ∆t of video clips shot at multiple frames per second is often negligible compared to camera motion or motion of common objects. Under this assumption, we may express the norms of the optical flow vector and of the linear blur kernel as the distance displaced by pixel (x, y) at constant velocity v_{x,y} over the respective time intervals:

    ∥f_{x,y}∥₂ = v_{x,y} · ∆t,    ∥K_{x,y}∥₂ = v_{x,y} · ε.    (3)

We substitute in Eq. (2) and obtain the following:

    α_{x,y} = ∥K_{x,y}∥₂ / ∥f_{x,y}∥₂ ;    (4)

i.e., given the magnitudes of the optical flow f_{x,y} and of the linear blur kernel K_{x,y} at pixel (x, y), we obtain the value of α at position (x, y) as a ratio of the magnitudes of the two vectors.
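A compact sketch of this estimator is given below, combining the per-pixel ratio of Eq. (4) with the validity constraints and the patch/median aggregation that Sec. 3.5 introduces next (Eqs. (5)–(9)). Dense flow and blur maps are assumed to be given; the sliding-window stride and the handling of frames without valid positions are simplifications not specified in the paper.

```python
import numpy as np

def estimate_alpha_frame(flow: np.ndarray, blur: np.ndarray,
                         D: int = 30, phi_deg: float = 5.0) -> float:
    """flow, blur: (H, W, 2) optical flow f_{x,y} and linear blur kernels K_{x,y}."""
    fn = np.linalg.norm(flow, axis=-1)
    kn = np.linalg.norm(blur, axis=-1)
    cos = np.abs((flow * blur).sum(-1)) / (fn * kn + 1e-8)
    valid = cos >= np.cos(np.radians(phi_deg))        # Eq. (5): near-collinear directions
    valid &= (kn <= fn)                               # Eq. (6): blur no longer than flow
    valid &= (kn > 1) & (fn > 1)                      # Eq. (7): discard tiny motions
    # Find the D x D patch with the most valid positions (coarse stride for brevity).
    best, best_count = (0, 0), -1
    h, w = valid.shape
    for y in range(0, h - D + 1, max(1, D // 2)):
        for x in range(0, w - D + 1, max(1, D // 2)):
            c = valid[y:y + D, x:x + D].sum()
            if c > best_count:
                best, best_count = (y, x), c
    y, x = best
    m = valid[y:y + D, x:x + D]
    if m.sum() == 0:
        return np.nan                                  # no reliable evidence in this frame
    k_mean = blur[y:y + D, x:x + D][m].mean(axis=0)    # Eq. (8): ratio of norms of means
    f_mean = flow[y:y + D, x:x + D][m].mean(axis=0)
    return float(np.linalg.norm(k_mean) / np.linalg.norm(f_mean))

def estimate_alpha_clip(per_frame_alphas) -> float:
    """Eq. (9): the clip-level estimate is the median of the per-frame estimates."""
    a = np.asarray([a for a in per_frame_alphas if np.isfinite(a)])
    return float(np.median(a))
```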
3.5. Computation

The proposed method for estimating the value of α builds on Sec. 3.4. As described in Sec. 1, modern video cameras operate either in a global shutter mode, where all pixels of the frame get exposed at the same point in time, or, more commonly, in a rolling shutter mode, where exposure is performed row-wise. In both cases, the exposure time ε remains constant for all pixels in the frame. Similarly, the time interval between exposures of pixels (both global and rolling shutter) remains constant. As a result of these physical constraints, the value of α must be consistent in an entire frame, and typically in the entire video clip. Therefore, the problem of estimating the value α_{x,y} pixel-wise is reduced to estimating one global value α_glob for the entire video clip.

As the sources of both optical flow and linear motion blur are not robust and are prone to errors in their estimates, and the condition of motion in a single direction during the exposure time may not be satisfied, the proposed method locates patches of pixels with the lowest error potential in both the linear blur kernels and the optical flow. We define multiple constraints and show that they are effective in filtering erroneous predictions. First, we discard estimates at all positions (x, y) where the angle between the linear blur kernel and the optical flow vector exceeds a threshold. The condition is evaluated in the cosine domain, which avoids wrap-around effects and also addresses the problem that the blur kernel is estimated modulo π, i.e. it does not have a direction:

    |⟨f_{x,y} | K_{x,y}⟩| / (∥f_{x,y}∥₂ · ∥K_{x,y}∥₂) ≥ cos φ,    (5)

where φ is the maximum angle threshold in degrees and ⟨·|·⟩ denotes the dot product.

Second, from Eq. (4) it follows that the norm of the ground-truth optical flow vector is always larger than the norm of the ground-truth linear blur kernel – a pixel cannot be physically captured for a longer period than the maximum inter-frame period ∆t. We therefore remove all values from positions (x, y) where

    ∥K_{x,y}∥₂ > ∥f_{x,y}∥₂.    (6)

The blur kernel estimation method [2] outputs values in a discrete domain, with the discretization error introduced in both the vertical and horizontal directions equal to 1. This renders predictions for small motions arbitrary, and we thus remove positions with small flow and blur magnitudes:

    ∥K_{x,y}∥₂ ≤ 1,    ∥f_{x,y}∥₂ ≤ 1.    (7)

Next, we find a D × D patch that contains the highest number of valid positions. The value of α for a given frame is estimated from this patch. We calculate the estimate α = α_patch for the current frame as the ratio of the norms of the means of the linear blur kernels and of the optical flow vectors,

    α_patch = ∥ (1/N) Σ_{i=1}^{N} K_{x_i,y_i} ∥₂ / ∥ (1/N) Σ_{i=1}^{N} f_{x_i,y_i} ∥₂,    (8)

where N is the number of valid positions (x, y) inside the selected patch. Finally, we calculate α_glob as the median of the estimates of all individual frames,

    α_glob = med{α_patch,1, α_patch,2, ..., α_patch,N},    (9)

where N is the number of frames in the video clip.

D    φ     ε=1 ms  2 ms    3 ms    8 ms    16 ms   24 ms   Average
10   3°    0.058   0.034   0.035   0.024   0.048   0.091   0.049
10   5°    0.054   0.030   0.033   0.025   0.052   0.095   0.048
10   7°    0.051   0.030   0.032   0.026   0.055   0.096   0.048
20   3°    0.054   0.033   0.031   0.023   0.041   0.080   0.044
20   5°    0.048   0.029   0.029   0.022   0.042   0.081   0.042
20   7°    0.046   0.029   0.029   0.026   0.043   0.082   0.042
30   3°    0.054   0.033   0.031   0.022   0.038   0.077   0.042
30   5°    0.047   0.028   0.028   0.020   0.035   0.075   0.039
30   7°    0.045   0.029   0.029   0.025   0.034   0.072   0.039
Table 2. BSD dataset – mean absolute error of exposure fraction α̂ estimates for a range of patch sizes D and tolerated angular differences φ between the optical flow and blur directions. The best results, in bold, were achieved for the largest window size and an angular tolerance of 5–7°.

ε (ms)   1       2       3       8       16      24
α        0.015   0.030   0.045   0.120   0.240   0.360
Table 1. Exposure fractions of the BSD subsets based on exposure time ε. All videoclips have a framerate f = 15 FPS, ∆t = 0.066 s.

We use this dataset for quantitative testing as well as for qualitative results on well-performing video clips and failure cases.

4. Experiments and evaluation
In this section, after presenting the values of the two parameters of the proposed method - the patch size D and the 4.3. Estimation of exposure fraction on the BSD angular threshold φ, we perform testing on a public dataset dataset and in-depth experiments on individual video clips. We investigate both well-performing video clips and failure cases We performed quantitative evaluation on all subsets of in an attempt to find the limitations of the proposed method. the BSD dataset for all parameter combinations mentioned in Sec. 4.1. We use Mean Absolute Error (MAE) as the 4.1. Parameter selection performance measure: We tested all configurations with patch sizes of D = 1 N X {10, 20, 30} and angular constraints, φ = {3◦, 5◦, 7◦}. MAE = |α − ˆ α N i| (10) For optical flow estimates, we utilized RAFT model [11] i=1 pretrained on the Sintel dataset with 12 iterations per two Results are presented in Tab. 2. We observe a mild pos-consecutive frames. The results are summarized in Tab. 2. itive relationship between method accuracy (MAE) and the 4.2. Evaluation datasets increasing size of selected patches D. Testing also shows that a more relaxed cosine constraint φ leads to a lower error Finding a suitable dataset was difficult, as the exposure for all fixed patch sizes. We attribute this to the property of time data is often erased from video clip metadata or not the adopted linear blur estimation method, which occasion-saved by the video camera at all. Many popular video ally produces results with a correct magnitude but incorrect datasets such as GoPro [7] or DeepVideoDeblurring Dataset orientation or vice versa, further amplified by its discrete [9] are either stripped of this information or are available as output space [2]. individual frames, converted post-capture with camera de-The estimated ˆα is less accurate for very small and very tails unavailable. large values of the true exposure fraction α. Analysis of the The largest public dataset containing exposure time data behavior is a part of our future work. We conjecture that for every video clip is the Beam-Splitter Dataset (BSD) for very low values of α, the discrete estimates of blur are [14]. The Beam-Splitter Dataset consists of pairs of identi-highly inaccurate. For large values of α, the optical flow, cal videoclips captured with different exposure settings by operating on blurred images, is possibly losing accuracy. 2 independently controlled cameras. Due to the framerate Fig. 1 and Fig. 2 show the distribution of ˆα from a test f being known, we are able to calculate the ground-truth with parameters φ = 5◦, D = 30. Larger levels of noise of α for each videoclip. We test on the full 600-videoclip are present in the estimates of α < 0.1. This supports our dataset. Exposure parameters for each subset are in Tab. 1. conjecture that the accuracy of linear blur kernel estimation There are 100 videoclips in each distinct subset. We used on very small values of α is rather low. For values α > 0.1, 60 ε = 0.001 s, α = 0.015 ε = 0.002 s, α = 0.030 ε = 0.003 s, α = 0.045 0.200 0.200 0.200 0.175 0.175 0.175 0.150 0.150 0.150 0.125 0.125 0.125 ˆ α ˆ α ˆ α 0.100 0.100 0.100 0.075 0.075 0.075 0.050 0.050 0.050 0.025 0.025 0.025 0.000 0.000 0.000 Figure 1. Histograms and box plots of ˆ α estimates on clips from the BSD dataset with ε = {0.001 s, 0.002 s, 0.003 s}. Estimation parameters φ = 5◦, D = 30. Note that only the (0, 0.2) range is displayed. 
ε = 0.008 s, α = 0.12 ε = 0.016 s, α = 0.24 ε = 0.024 s, α = 0.36 0.4 0.4 0.4 0.3 0.3 0.3 ˆ α ˆ α ˆ α 0.2 0.2 0.2 0.1 0.1 0.1 0.0 0.0 0.0 Figure 2. Histograms and box plots of ˆ α estimates on clips from the BSD dataset with ε = {0.008 s, 0.016 s, 0.024 s}. Estimation parameters φ = 5◦, D = 30. Note that only the (0, 0.4) range is displayed. we observe the majority of estimates within close intervals vectors, as well as relatively uniform magnitudes of both of ground truth values. The dependency of estimation accu-vectors. We attribute the good performance of both linear racy on the ground truth value of α is the largest limitation blur kernel estimation and optical flow to the largely linear, of the method. lateral motion of the camera and the presence of areas with blurred textured surfaces in the scene. Similar situations 4.4. Qualitative analysis of results on BSD dataset with camera motion are ideal for the utilized linear blur kernel estimator, as Gong et al.’s method [2] was trained on From the estimates performed with parameters φ = 5◦, synthetic data modeled as blur by camera motion. D = 30 detailed in Fig. 1, Fig. 2, we selected two videoclips for in-depth analysis in an attempt to further compare the We also analyze an example of a failure case in Fig. 4 values of linear blur kernel estimates and optical flow, and where the estimate ˆα is grossly erroneous (ground truth their effect on per-frame estimates of α. α = 0.015, estimate ˆ α = 0.065). Here, we observe an As an example of the ideal case, we selected videoclip incorrectly estimated magnitude of linear blur kernels, re-no. 16 from the BSD-16ms subset. The estimated value sulting in an overestimate of α. The method of [2] seems ˆ α = 0.22; ground truth α = 0.24. In Fig. 3, we display to fail in dark areas with low contrast and no pronounced frames F38, F39, the patch selected by the method, and de-textures. As discussed in Sec. 4.3, the linear blur kernel tailed visualization of both linear blur kernel estimates and estimator does not estimate low levels of motion blur accu-optical flow vectors. In this ideal case, we observe near-rately. In consequence, the method often fails to produce perfect collinearity of linear blur kernels and optical flow accurate estimates for ground truth values of α < 0.1. 61 (a) Video frame F38, selected patch highlighted in red. (b) Video frame F39 (c) Linear blur kernels inside and around the selected patch. (d) Optical flow between F38, F39 inside and around the selected patch. Figure 3. Example patch with nearly perfect agreement with the assumption expressed by Eq. (4). The blur kernel and optical flow estimates are collinear, ˆ αpatch = 0.26 and α = 0.24. The selected patch from frame F38, video clip no. 16 from BSD-16ms subset. Estimation parameters φ = 5◦, D = 30. Clip # k α ˆ α α′ ˆ α′ Abs. error portionally to the number of removed frames, yet the motion blur remains the same. If the time scale, i.e. the play-78 3 0.360 0.349 0.120 0.117 0.003 back frame rate, is edited or ignored by the player, the re-14 3 0.240 0.230 0.080 0.075 0.005 played video will appear to the viewer to have faster mo-115 2 0.120 0.121 0.060 0.054 0.006 tions. As a result of frame deletion, the estimated value of α will not be consistent with the original video clip; the Table 3. Detection of video clip subsampling by integer factor tampered section will have values of the α different from k. The ground truth α and estimated ˆ α on the original sequence and the rest of the video. 
the GT α′ and the estimated ˆ α′ exposure fractions on the tam- Frame deletion and insertion may be used for malicious pered video. In all cases, the value of ˆ α′ was estimated accurately. purposes in video clips where the speed of motion provides The test was performed on videoclips 78 (BSD-24ms), 14 (BSD-significant information value, such as video clips from auto-16ms), and 115 (BSD-8ms) containing traffic and moving vehi-motive dash cameras that could be used for speed measure-cles. Estimation parameters φ = 5◦, D = 30. ments. Similarly, it may be utilized to remove frames containing sensitive or identifying information, such as license plates or faces on footage from surveillance cameras. In the 4.5. Detection of video alteration by frame deletion case of dash and surveillance cameras, the capture framerate Video frame deletion is a form of video clip tampering f and ε are often available as camera metadata, allowing a that directly affects the temporal consistency introduced by comparison between altered video clips and ground truth α. the camera physical properties and its exposure mechanism. We selected three video clips of scenes containing traf-When consecutive video frames are deleted, objects in the fic and moving vehicles with α > 0.1. This choice was scene exhibit progressively larger inter-frame motions, pro-based on the fact that the method performs better on larger 62 (a) Video frame F52, selected patch highlighted in red. (b) Video frame F53 (c) Linear blur kernels inside and around the selected patch. (d) Optical flow between F52, F53 inside and around the selected patch. Figure 4. A failure case of linear blur kernel estimates in a dark, low contrast area with no texture; ˆ αpatch = 0.065, α = 0.015. The linear blur kernel estimator fails to accurately model the blur magnitudes, resulting in an inaccurate estimate of ˆ αpatch. Since the kernels still satisfy the orientation and magnitude constraints defined in Eq. (5), Eq. (6), Eq. (7), they are considered valid. Similar situations remain challenging for both the linear blur kernel estimator and the method. Frame F52, estimation parameters φ = 5◦, D = 30. Video clip no. 74 from BSD-1ms subset. values of α due to the limitations of the linear blur esti-Clip # Interpolation factor α ˆ α ˆ α′ mator. For each of the selected video clips, we performed frame subsampling with an integer factor k, i.e. every k-th 78 2x 0.360 0.349 0.668 frame was preserved; the intermediate frames were dis-4x – – 0.536 carded. The new apparent ground truth value of α′ = αk 14 2x 0.240 0.230 0.412 where α is the ground truth value of the source video clip. The results are displayed in Tab. 3. For all subsampled 4x – – 0.562 videoclips, the method produced estimates within a close margin of the apparent ground truth value. Note signifi-Table 4. Detection of video clip interpolation. The ground truth cantly lower error than on video clips where unmodified α and estimated ˆ α on the original sequence and the estimated ˆ α′ exposure fractions on the tampered video. In all cases, the value α = {0.015, 0.030, 0.045}. We attribute this to more accu-of ˆ α′ increased noticeably. The test was performed on videoclips rate estimates in the selected videoclips, as their unmodified 78 (BSD-24ms) and 14 (BSD-16ms) containing traffic and moving α = {0.120, 0.240, 0.360} results in larger motion blur vehicles. Estimation parameters φ = 5◦, D = 30 and therefore more accurate linear blur kernel estimates. 4.6. 
Detection of video alteration by frame interpolation It may be followed by a change of playback timescale, in which case it results in a slow-motion video. A videoclip Video frame interpolation is a technique of temporal al-altered in such a way can then be used as fake evidence of teration that synthesizes new intermediate frames in order to vehicle speed from a surveillance or dash camera. Tech-increase video framerate and for motion to appear smoother. niques derived from video frame interpolation may also be 63 used to add new content to videos or change their appear-veloping an improved method for the estimation of lin-ance, such as interpolation between still photographs of a ear blur kernels is a key part of our future work. Lastly, person’s face for the creation of so-called ”deepfakes” [3]. we presented a possible application of exposure fraction As described in Sec. 4.5, surveillance or dash cameras of-estimation for video tampering detection, specifically of ten save the value of ε and f by directly imprinting it on frame deletion and frame insertion. The implementation video frames or by saving it to the metadata. This provides is available at https://github.com/edavidk7/ the ground-truth value of α for comparison with the method exposure_fraction_estimation. estimate. Acknowledgement. The authors were supported Based on the definition of exposure fraction α (Sec. 3.3), by the Research Center for Informatics project it is expected its value to increase if the newly-synthesized CZ.02.1.01/0.0/0.0/16 019/0000765 of OP VVV MEYS. intermediate frames reduce the inter-frame motion of objects without affecting the amount of motion blur, i.e. the References newly synthesized frames appear as blurry as the source [1] Alastair Barber, Matthew Brown, Paul Hogbin, and Darren frames. It is, however, not possible to compute the value Cosker. Inferring changes in intrinsic parameters from mo-of α′ of interpolated videoclips accurately, as modern deep tion blur. Computers Graphics, 52:155–170, 2015. 1 neural network-based interpolation methods such as RIFE [2] Dong Gong, Jie Yang, Lingqiao Liu, Yanning Zhang, Ian [4] do not perform parametrized blurring or deblurring. Reid, Chunhua Shen, Anton van den Hengel, and Qinfeng We performed interpolation on videoclips no. 78 and Shi. From motion blur to motion flow: A deep learning 14 from the experiment in Sec. 4.5 with the state-of-the-art solution for removing heterogeneous motion blur. In The RIFE method [4]. For each videoclip, we tested 2x interpo-IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 1, 2, 3, 4, 5 lation and 4x interpolation. Results are presented in Tab. 4. We observe an increase in estimates of [3] David Güera and Edward J. Delp. Deepfake video detection α′ in all cases of in-using recurrent neural networks. In 2018 15th IEEE Inter-terpolation, pointing to the method synthesizing new frames national Conference on Advanced Video and Signal Based with similar amounts of motion blur as the source. Surveillance (AVSS), pages 1–6, 2018. 8 [4] Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and 5. Further applications Shuchang Zhou. Real-time intermediate flow estimation for video frame interpolation. In Proceedings of the European In applications concerning video frame interpolation and Conference on Computer Vision (ECCV), 2022. 
8 video frame deblurring, the value of α might help parame- [5] Xiang Ji, Zhixiang Wang, Zhihang Zhong, and Yinqiang terize blur for more accurate deblurring or motion model-Zheng. Rethinking video frame interpolation from shutter ing. The synthetization of new frames from a blurry source mode induced degradation. In Proceedings of the IEEE/CVF remains a challenge for modern interpolation methods, and International Conference on Computer Vision (ICCV), pages accurate blur modeling might provide the necessary infor-12259–12268, October 2023. 8 mation for performance improvements [5]. In the case of [6] RED Digital Cinema LLC. Shutter angles and creative con-linear blur estimates and optical flow, the estimate of α is trol. 1 useful as a complement in computing values for positions [7] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep (x, y) where one of the methods failed (under the assump-multi-scale convolutional neural network for dynamic scene tion that is it possible to estimate the value of α from other deblurring. In CVPR, July 2017. 4 frames and positions in the videoclip). This might be useful [8] Gyula Simon, Gergely Vakulya, and Márk Rátosi. The way for the creation of new datasets for linear blur kernel esti-to modern shutter speed measurement methods: A historical mation or optical flow, as the parameters may be estimated overview. Sensors, 22(5), 2022. 1 from existing datasets and computed for entire frames or [9] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video videoclips. deblurring for hand-held cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-6. Conclusion nition, pages 1279–1288, 2017. 4 [10] Jian Sun, Wenfei Cao, Zongben Xu, and Jean Ponce. Learn-We proposed a novel method for estimating the expo-ing a convolutional neural network for non-uniform motion sure fraction based on dense optical flow and linear blur blur removal, 2015. 2 estimates. The method was evaluated on the publicly avail- [11] Zachary Teed and Jia Deng. RAFT: recurrent all-pairs field able BSD Dataset. The mean absolute error was 0.039; transforms for optical flow. CoRR, abs/2003.12039, 2020. 1, the method performed best in the range (0.12, 0.36). We 2, 4 observed reduced accuracy for ground truth values below [12] Mingliang Zhai, Xuezhi Xiang, Ning Lv, and Xiangdong 0.1, leading us to conjecture that the use of discrete lin-Kong. Optical flow and scene flow estimation: A survey. ear blur kernel estimates may be a limiting factor. De-Pattern Recognition, 114:107861, 2021. 2 64 [13] Youjian Zhang, Chaoyue Wang, Stephen J. Maybank, and Dacheng Tao. Exposure trajectory recovery from motion blur. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7490–7504, 2022. 2 [14] Zhihang Zhong, Ye Gao, Yinqiang Zheng, and Bo Zheng. Efficient spatio-temporal recurrent neural network for video deblurring. In European Conference on Computer Vision, pages 191–207. Springer, 2020. 4 65 27th Computer Vision Winter Workshop Terme Olimia, Slovenia, February 14±16, 2024 Measuring the Speed of Periodic Events with an Event Camera Jakub KolÂař, Radim ŠpetlÂık, JiřÂı Matas Visual Recognition Group, Faculty of Electrical Engineering Czech Technical University in Prague kolarj55@fel.cvut.cz Abstract. We introduce a novel method for measuring the speed of periodic events by an event camera, a device asynchronously reporting brightness changes at independently operating pixels. 
The approach assumes that for fast periodic events, in any spatial window where it occurs, a very similar set of events is generated at the time difference corresponding to the frequency of the motion. To estimate the frequency, we compute correlations of spatio-temporal windows in the event space. The period is calculated from the time differences between the peaks of the correlation responses. The method is contactless, eliminating the need for markers, and does not need distinguishable landmarks. We evaluate the proposed method on three instances of periodic events: (i) light flashes, (ii) vibration, and Figure 1: The proposed method: (i) data captured (iii) rotational speed. In all experiments, our method from an event camera is aggregated into N non-achieves a relative error lower than overlapping arrays along the time axis, (ii) a Re- ±0.04%, which is within the error margin of ground truth measure-gion of Interest and a template are selected, (iii) 2D ments. correlation of the template with arrays is computed, (iv) and the frequency is calculated from the average 1. Introduction of time deltas measured between correlation peaks. The measurement of properties of periodic events has a wide applicability in diverse real-world do-In contrast to contact measurement devices, laser mains. For example, precise quantification of rota-devices offer highly accurate [7], less invasive mea-tional speed is important in many fields, ranging from surements. However, reflective material (e.g., a sport analysis to the assessment of rotating compo-sticker) must be placed on the target, reflecting the nents in machinery and mechanical systems across laser into the sensor while measuring. Under certain industries such as aviation (especially drones [5]), conditions, this limits the application of laser devices energy production using wind turbines [6], and mo-since it might not be convenient or even feasible to at-tor speed testing. tach labels to particular objects or in confined spaces Commercial devices for measuring periodic event of the observed machinery. Another disadvantage is properties, like traditional contact tachometers [4] or that the device operator must aim the laser precisely rotary encoders used for measuring rotation speed, at the target, as missing the reflective material pass-necessitate direct contact with the observed object. through results in an inaccurate measurement. These approaches interfere with the target’s move-We propose a method that allows for non-contact ment, as additional equipment must be in contact measurement of the frequency of any periodic event. with the observed object. 66 (a) flashing LED (b) vibrating speaker (c) a felt disc with (d) a velcro disc, (e) a velcro disc captured diaphragm a high contrast mark fronto-parallel view through a glass sheet at a 45° camera angle Figure 2: Experimental setup with visualisation of event camera output. Top: physical setups, bottom: events from a 250 ms window visualised in spatio-temporal space. The proposed method computes correlations of rotation speed. However, this direct physical connec-spatio-temporal windows in the event space, as- tion introduces inaccuracies due to the mass and fric-suming that the period of the periodic motion tion of the tachometer. Electrostatic sensors detect corresponds to the time differences between the changes in the electromagnetic field caused by a shaft peaks of the correlation responses (see Fig. 1). 
The bearing fixed on the target, estimating the rotation method is validated on experiments with periodic speed based on the frequency of these changes. Op-events, i.e. flashing light and vibration, and periodic tical encoder tachometers utilise a photoelectric sen-motion, i.e. rotation. Our method achieves accuracy sor to detect light passing through a disc between a with a relative error of ±0.04% in all our experi-light source and the sensor. The disc contains opaque ments, which falls within the margins of error of the and transparent segments that allow for the estima-ground truth. tion of rotation speed based on the frequency of light changes detected by the sensor. Laser tachome-2. Related work ters measure rotation speed by using the frequency In this section, we explore existing approaches and of laser light bounces to its sensor from small and technologies in the domain of rotation speed mea-lightweight reflective labels that must be affixed to surement, as it is the most common periodic event. the target’s surface. Firstly, we delve into commercially available rota-2.2. Camera-based Rotation Speed Measurement tion speed measuring devices with contact and con-Methods tactless options. Subsequently, we explore camera-based rotation speed measurement methods. Lastly, Wang et al. [8] created a rotational speed measure-we investigate event-based rotation speed measurement system based on a low-cost imaging device rement methods, examining approaches that utilise quiring a simple marker on the target. The method event cameras for accurate rotational speed estima-involves pre-processing sequential images by denois-tion. Each subsection provides notes on the strengths ing, histogram equalisation, and circle Hough trans-and limitations inherent in each approach. form. Subsequently, these processed images undergo a similarity assessment method. The rotational speed 2.1. Commercially Available Rotation Speed Mea-is calculated by applying the Chirp-Z transform to suring Devices the restructured signals, and the method achieves Commercially available devices offer either con-valid measurements with a relative error of ± 1% in tact, e.g. traditional mechanical tachometers, or the speed range of 300 to 900 revolutions per minute contact-less rotation speed measuring, e.g. electro- (RPM). static and optical encoder tachometers, including An alternative approach [9] involved the computa-laser tachometers. tion of structural similarity and two-dimensional cor-Mechanical tachometers are physically attached to relation between consecutive frames. Subsequently, the target’s shaft and rotate with it to determine the similarity parameters were utilised to reconstruct a 67 continuous and periodic time-series signal. The fast Fourier transform was then applied to determine the period of the signal, providing the maximum relative error of ± 1% over a speed range of 0 to 700 RPM. Camera-based rotation speed measurement meth- t ∈ [50, 51) t ∈ [57, 58) t ∈ [59, 60) t ∈ [61, 62) ods offer the advantage of non-contact measure- ments, eliminating physical attachments to the rotat-Figure 3: Aggregated events in a fixed time intering object and often providing a cost-effective solu-val of one millisecond for a selected Region of Inter-tion. However, the frame rate of standard cameras is est. Positive events ± white colour, negative events ± relatively low, which can constrain the range of ob-bright blue. 
2.3. Event-based Rotation Speed Measurement Methods

Hylton et al. [2] introduced a technique for computing the optical flow of a moving object within an event stream, demonstrating its application in estimating the rotational speed of a disc with a black-and-white pattern. However, the algorithm's design lacked the sophistication required to deal with the non-structural and noisy event stream and to obtain accurate measurements of high-speed rotation.

The EV-Tach method [10] starts by eliminating event outliers: it estimates the median distance from events to their centroid and flags events with distances surpassing a specified threshold as outliers. Subsequently, it identifies rotating objects characterised by centrosymmetric shapes and proceeds to track specific features, such as propeller blades.

Event-based rotation speed measurement methods offer the advantage of high temporal resolution, enabling precise tracking of rapid rotational motion. However, these methods may face challenges in scenarios where clear observable landmarks or markers on the rotating target are absent, limiting their applicability in specific environments and necessitating well-defined visual features for accurate measurements or knowledge of the centre of rotation.

3. Proposed method

In this section, we introduce our method. Put simply, our method first aggregates the outputs of an event camera¹ along the time axis, then computes a 2D correlation of a selected template with the aggregated data in a selected region of interest, and outputs the average duration between peaks of the correlation responses. A detailed description follows.

The Region of Interest (RoI) is a two-dimensional square area represented by four coordinates defining its top-left and bottom-right corners. We aggregate events within fixed time intervals, with spatial coordinates within the selected RoI, into two-dimensional arrays we call event-aggregation arrays.

The aggregation procedure starts by creating a two-dimensional array with the same size as the RoI, filled with zeros. We go through the list of events with spatial coordinates within the chosen RoI and timestamps within a specified time interval. For each of these events, we modify an element in the array. The position of this element in the array corresponds to the spatial coordinates of the event relative to the RoI. The polarity of the event determines the new value in the array.

We choose one of these event-aggregation arrays to calculate the correlation with all event-aggregation arrays. We refer to this selected array as a template. Fig. 3 visualises four selected event-aggregation arrays, showcasing the spatial distribution of events within one-millisecond time intervals.

Next, we calculate the correlation responses between the template and all event-aggregation arrays. As expected, the responses have periodic peaks. When a peak is reached, it signifies the completion of one period of the event. An example of periodic peaks in the correlation responses is shown in Fig. 1.

Subsequently, we compute N delta times ∆t_i by measuring the temporal differences between successive event-aggregation arrays that exhibit peaks in their correlation responses. Each ∆t_i represents the microseconds it takes for the observed object to complete one revolution or cycle of states.

¹ The data acquired from the event camera are represented as a list of tuples (x, y, t, p), where x and y denote the spatial coordinates of the event, t the timestamp of the event, and p the polarity of the brightness change. The p value is −1 in the case of a brightness decrease, 1 in the case of a brightness increase, and 0 in the case of no brightness change larger than the defined threshold. The list contains events with ascending timestamps.
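The following NumPy sketch illustrates the aggregation-and-correlation pipeline described above. It is not the authors' implementation: the function names, the assumption that the events are given as a NumPy array of (x, y, t, p) rows with t in microseconds, the summation of polarities into the arrays, the zero-lag correlation of same-sized arrays, and the simple threshold-based peak picking are all assumptions made for illustration.

import numpy as np

def aggregate_events(events, roi, t_start, t_len_us):
    # Sum event polarities into a 2D array for one time interval inside the RoI.
    x0, y0, x1, y1 = roi
    arr = np.zeros((y1 - y0, x1 - x0), dtype=np.float32)
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    mask = ((t >= t_start) & (t < t_start + t_len_us) &
            (x >= x0) & (x < x1) & (y >= y0) & (y < y1))
    np.add.at(arr, (y[mask].astype(int) - y0, x[mask].astype(int) - x0), p[mask])
    return arr

def correlation_peak_deltas(events, roi, t_len_us, template_idx=0):
    # Correlate one template array with every event-aggregation array and
    # return the time differences (in microseconds) between correlation peaks.
    t_min, t_max = events[:, 2].min(), events[:, 2].max()
    arrays = [aggregate_events(events, roi, t0, t_len_us)
              for t0 in np.arange(t_min, t_max, t_len_us)]
    template = arrays[template_idx]
    resp = np.array([float(np.sum(a * template)) for a in arrays])
    thr = resp.mean() + resp.std()
    peaks = [i for i in range(1, len(resp) - 1)
             if resp[i] > thr and resp[i] >= resp[i - 1] and resp[i] >= resp[i + 1]]
    return np.diff(np.array(peaks, dtype=np.float64)) * t_len_us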
The RPM value based on a single revolution is subsequently calculated using the following formula for the experiments on measuring the speed of rotating objects:

RPM_i = (10^6 / ∆t_i) × 60,   i = 1, 2, ..., N   (1)

Ultimately, we calculate the average RPM value for each second of data as

RPM = (1/M) Σ_{i=1}^{M} RPM_i   (2)

where M is the number of samples in one second of data.

For the other experiments (4.1, 4.2), the frequency ν expressed in hertz (Hz) is computed for each ∆t_i as

ν_i = 10^6 / ∆t_i,   i = 1, 2, ..., N   (3)

We then calculate the arithmetic mean frequency of the periodic movement. An overview of the method can be seen in Fig. 1.

The σ value in Tab. 5–10 is computed as

σ = sqrt(σ_s² / M)   (4)

where σ_s is the standard deviation of the individual measurements, M is the count of measurements during the respective time interval of 1 second, and the resulting σ represents the standard deviation of the average measured value. We assume the measurements are independently and identically distributed and drawn from a normal distribution. We chose a confidence interval of 95.4%, by which our point estimate of the mean should be less than 2σ away from the true mean, and our point estimate of the standard deviation should likewise be within 2σ.
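As a hedged illustration of equations (1)–(4), the helper functions below compute the per-cycle RPM and frequency values from the measured delta times and the reported uncertainty; the use of the sample variance (ddof=1) is an assumption, since the paper does not state it.

import numpy as np

def rpm_from_deltas(delta_t_us):
    # Eq. (1)-(2): per-revolution RPM values and their average for the interval.
    rpm_i = (1e6 / np.asarray(delta_t_us, dtype=np.float64)) * 60.0
    return rpm_i, rpm_i.mean()

def hz_from_deltas(delta_t_us):
    # Eq. (3): per-cycle frequency in hertz.
    return 1e6 / np.asarray(delta_t_us, dtype=np.float64)

def sigma_of_mean(samples):
    # Eq. (4): standard deviation of the average value over M samples.
    samples = np.asarray(samples, dtype=np.float64)
    return np.sqrt(samples.var(ddof=1) / samples.size)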
Our proposed method requires parameters that need to be selected by the user. These parameters are the event-aggregation duration, the position and size of the RoI, and which event-aggregation array to use as a template for calculating the correlation responses.

4. Experiments

First, we present two frequency measurement experiments in this section. In the first one, we measure the frequency of a flashing diode. In the second experiment, we estimate the frequency of a vibration. Then, we present three rotational speed measurement experiments. In the first experiment, we measure the speed of a felt disc with a high contrast mark. In the second experiment, we measure a disc covered by a uniform velcro material, where any pattern is hardly observable. In these two experiments, the event camera is in a fronto-parallel position relative to the disc. In the third experiment, we show that our method is accurate when the camera axis is not collinear with the axis of rotation and points on the rotating surface are moving along elliptical orbits in the 2D space. Moreover, the accuracy is not degraded when data are captured through a material transparent to visible light. For the physical setups, see Fig. 2. For each experiment, we maintain the same lighting conditions and a stationary position of the sensors and the observed object.

Before we dive into presenting the results of our method, we describe both the event camera and the ground-truth laser tachometer.

Event camera. In our experiments, we used the Prophesee EVK4 HD event camera. The camera's resolution is 1280×720 pixels and it can capture up to 1066 million events per second [1]. The behaviour of the camera is adjustable with five biases [3], namely two contrast sensitivity threshold biases (bias_diff_on, bias_diff_off), two bandwidth biases (bias_fo, bias_hpf), and the dead time bias (bias_refr). The contrast sensitivity threshold biases regulate the contrast threshold, influencing the sensor's sensitivity to changes in illumination. The bias_diff_on adjusts the ON contrast threshold, which is the factor by which a pixel must get brighter before an ON event occurs for that pixel, while bias_diff_off determines the OFF contrast threshold, which is the factor by which a pixel must get darker before an OFF event occurs for that pixel. The bandwidth biases control low-pass and high-pass filters, with bias_fo adjusting a low-pass filter to filter rapidly fluctuating light and bias_hpf adjusting the high-pass filter to filter slow illumination changes. The dead-time bias (bias_refr) regulates the pixel's refractory period, determining the duration of non-responsiveness of each pixel after each event. In our experiments, the camera contrast sensitivity threshold biases were set to 20. The high-pass filter bias was set to 50, and we kept the other biases at their default values.

Laser tachometer. To capture the ground truth (GT) rotation speed data, the Uni-Trend UT372 laser tachometer was used. The tachometer range is 10 to 99,999 RPM with a relative error of ±0.04%. We chose the lowest available measurement output rate of 0.5 seconds and captured the measurements via a USB cable. It is worth mentioning that the optical tachometer outputs only 3 to 5 samples per second, while our method produces a measurement for each period of the observed periodic event. The GT frequency is known for the other experiments, as it was manually set beforehand.

In the following subsections, we present the mentioned experiments and the results of our method. In each subsection, the selection of the RoI and the duration of the event aggregation are discussed, as we hypothesised that these two parameters influence our method the most.

4.1. Measuring periodic light flashes

In this experiment, we used a simple circuit with a diode and a Raspberry Pi controlling it (see Fig. 2a). We used private software to precisely set the flashing frequency and the portion of the frequency period (duty cycle) during which the diode should emit light. We opted for 2000 Hz and a duty cycle of 50%.

Selection of RoI. We selected three distinct Regions of Interest (RoIs) with varying positions and sizes. The first RoI covers the entirety of the flashing diode, while the second focuses solely on its upper half. Lastly, the third RoI is set to a smaller area within the upper portion of the diode. The results (Tab. 1) indicate that the precision slightly decreases when the RoI becomes too small to cover a reasonable area.

Selection of aggregation duration. We fixed the 125×125 pixel RoI mentioned in the preceding paragraph and conducted experiments by adjusting the duration of the event-aggregation window. Consistent with our expectations, the method fails with a window duration exceeding 0.25 milliseconds, as the period of the captured frequency is shorter than the aggregation duration (Tab. 2). Consequently, this leads to nearly identical event-aggregation arrays, as events generated from multiple LED flashes are aggregated into a single array.

Figure 4: Setup of experiment 4.1: (a) selected Regions of Interest, (b) templates (125×125 px, 45×45 px, 20×20 px) with an aggregation duration of 0.1 ms, (c) a template as a function of the duration t of the aggregation time interval (see Tab. 1, 2).

Table 1: Frequency (Hz) ± 2σ (4) as a function of the size of the Region of Interest (see Fig. 2a, 4b).
t(s)   | Ground truth | 125×125 px | 45×45 px | 20×20 px
[0, 4) | 2000         | 2000 ± 0   | 2000 ± 0 | 2000.54 ± 0.82

Table 2: Frequency (Hz) ± 2σ (4) as a function of the aggregation time interval (see Fig. 2a, 4c).
t(ms) | Ground truth | Our method
0.1   | 2000         | 2000 ± 0
0.25  | 2000         | 2000 ± 0
0.5   | 2000         | 757.79 ± 17.28
1.0   | 2000         | 361.81 ± 12.74
4.2. Measuring vibrations

In this experiment, we used a Bluetooth speaker with two large diaphragms responsible for playing low frequencies and an Android application allowing us to play a specified frequency. We opted for 98 Hz, which is equivalent to the tone G2 with classic tuning (A4 = 440 Hz), and captured one of those diaphragms for four seconds with the event camera (see Fig. 2b).
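For completeness, the chosen 98 Hz indeed corresponds to G2 under equal temperament, since G2 lies 26 semitones below A4:

f_{G2} = 440 · 2^(−26/12) ≈ 98.0 Hz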
Selection of RoI. We experimented with different RoI positions and sizes to find the smallest RoI size still producing accurate results. We picked three of them for demonstration purposes (see Fig. 5a,b), with the duration of the event aggregation set to 0.25 milliseconds. The results for four seconds of data are presented in Tab. 3. In these four seconds, 406 individual vibrations were captured. We can see that decreasing the RoI size impacts the precision only slightly and that the method remains accurate even with a small RoI.

Figure 5: Setup of experiment 4.2: (a) selected Regions of Interest, (b) templates (60×60 px, 35×35 px, 10×10 px) with an aggregation duration of 0.1 ms, (c) a template as a function of the duration t of the aggregation time interval (see Tab. 3, 4).

Table 3: Frequency (Hz) ± 2σ (4) as a function of the size of the Region of Interest (see Fig. 2b, 5b).
t(s)   | Ground truth | 60×60 px    | 35×35 px     | 10×10 px
[0, 4) | 98           | 98.1 ± 0.63 | 98.21 ± 0.73 | 98.48 ± 0.9

Selection of aggregation duration. We fixed the RoI size to 60 × 60 px and set the position as in the previous paragraph's experiment. We present results from one second of data with various durations of aggregation ranging from 0.1 milliseconds (ms) to 1 ms. From Tab. 4 we can see that the best performing aggregation duration was 0.5 ms and that the accuracy generally increases with the aggregation duration in this scenario.

Table 4: Frequency (Hz) ± 2σ (4) as a function of the aggregation time interval (see Fig. 2b, 5c).
t(ms) | Ground truth | Our method
0.1   | 98           | 98.39 ± 1.5
0.25  | 98           | 98.15 ± 1.13
0.5   | 98           | 98.01 ± 0.89
1.0   | 98           | 98.05 ± 0.88

4.3. Measuring rotation speed

In the three following subsections, we present rotational speed experiments. In these experiments, we used a power drill to spin the observed objects. The power drill is secured to a flat surface to prevent injuries. We begin capturing the event data and the data from the optical tachometer simultaneously. By capturing the data for 4 seconds, we measure at least 80 revolutions of the observed object when the power drill is set to its lowest speed. Since the rotation speed of the power drill is not constant, we compare the measured data from the tachometer and our method for each second of the data independently when experimenting with RoI sizes (e.g. Tab. 5), as we believe that the rotation speed changes very little during such a time interval.

4.3.1 Felt disc with a high-contrast mark

This subsection presents experiments in the high contrast mark setup (Fig. 2c).

Selection of RoI. We experimented with different RoI positions and sizes to find the smallest RoI size still producing accurate results. We picked three of them for demonstration purposes (see Fig. 6a,b), with the duration of the event aggregation fixed and set to 0.1 milliseconds. The results are presented in Tab. 5 alongside measurements obtained from the laser tachometer. The results show that even the smallest RoI of size 20 by 20 pixels (px) produces errors of the same order as the larger RoI sizes.

Selection of aggregation duration. In this experiment with the duration of the event aggregation, we fixed the RoI to a size of 100 × 100 px. We present results from one second of data with various durations of aggregation ranging from 0.1 millisecond (ms) to 1 ms. For the templates for each duration of the aggregation, see Fig. 6c. The results are shown in Tab. 6. Increasing the aggregation duration marginally increases the standard deviation of the average RPM value. We believe that this is caused by the fact that the mark produces a very distinctive pattern.

Figure 6: Setup of experiment 4.3.1: (a) selected Regions of Interest, (b) templates (655×655 px, 100×100 px, 20×20 px) with an aggregation duration of 0.1 ms, (c) a template as a function of the duration t of the aggregation time interval (see Tab. 5, 6).

Table 5: Revolutions per minute ± 2σ (4) as a function of the size of the Region of Interest (see Fig. 2c, 6b).

Table 6: Revolutions per minute ± 2σ (4) as a function of the duration of the event aggregation (see Fig. 2c, 6c).

Figure 7: The fronto-parallel felt disc with a high-contrast mark experiment.

4.3.2 Fronto-parallel velcro disc

Here, we present experiments in the velcro disc setup (see Fig. 2d) with a fronto-parallel camera position.

Selection of RoI. For demonstration purposes, we picked three Regions of Interest of sizes 100 × 100 px, 60 × 60 px and 40 × 40 px (see Fig. 8a,b). As shown in Tab. 7, when a distinguishable pattern is not present in the template, the smallest RoI of size 40 × 40 px produces errors two orders of magnitude larger than in the case of the larger Regions of Interest.

Figure 8: Setup of experiment 4.3.2: (a) selected Regions of Interest, (b) templates (100×100 px, 60×60 px, 40×40 px) with an aggregation duration of 0.1 ms, (c) a template as a function of the duration t of the aggregation time interval (see Tab. 7, 8).

Table 7: Revolutions per minute ± 2σ (4) as a function of the size of the Region of Interest (see Fig. 2d, 8b).

Table 8: Revolutions per minute ± 2σ (4) as a function of the duration of the event aggregation (see Fig. 2d, 8c).

Figure 9: The fronto-parallel velcro disc experiment.
Selection of aggregation duration. We fixed the RoI size to 120 × 120 px and aligned it with the object's centre of rotation. From Tab. 8, we see that the average RPM values remain close to those measured by the tachometer. The standard deviation of the average RPM increases as the duration of the event aggregation is prolonged, which is expected, as longer time intervals of event aggregation reduce accuracy.

4.3.3 Velcro disc with non-frontal camera behind a glass sheet

In this subsection, we present experiments with a velcro disc observed by the camera at a 45° angle that captures data through a sheet of glass.

Selection of RoI. We experiment with RoI sizes ranging from 200 × 200 px to 35 × 35 px. For the RoI positions, see Fig. 10a. We present results for three selected sizes in Tab. 9. As shown in this table, the performance of our method degrades significantly in the case of the smallest 35 × 35 px RoI. We believe that this is caused by the fact that there are not enough distinctive events in such a small RoI.

Selection of aggregation duration. From Tab. 10, it is clear that with the 120 × 120 px RoI, all aggregation lengths yield measurements comparable to the ground truth device measurements. With the longest event-aggregation interval of 1 ms, the 2σ is approximately two times larger than with the other lengths and the ground truth data.

Figure 10: Setup of experiment 4.3.3: (a) selected Regions of Interest, (b) templates (120×120 px, 80×80 px, 35×35 px) with an aggregation duration of 0.1 ms, (c) a template as a function of the duration t of the aggregation time interval (see Tab. 9, 10).

Table 9: Revolutions per minute ± 2σ (4) as a function of the size of the Region of Interest (see Fig. 2e, 10b).

Table 10: Revolutions per minute ± 2σ (4) as a function of the duration of the event aggregation (see Fig. 2e, 10c).

Figure 11: The non-frontal velcro disc experiment.
4.4. Discussion

Based on the presented experiments, we conclude that (i) a small RoI is feasible without degraded accuracy when a distinguishable pattern is present, (ii) the best results are achieved when the RoI covers the area with the highest density of events and the template captures a distinctive pattern emerging periodically, and (iii) the event-aggregation duration of 0.25 ms is preferred, as it provides a good balance between a low number of event-aggregation arrays, resulting in faster computations, and a relatively low standard deviation of the averaged results. We could not find parameters breaking our method in the fronto-parallel high contrast mark experiment. We believe the mark, static camera, static lighting, and fixed power drill contributed to this.

Limitations. The presented method does not consider automatic detection of a suitable RoI and its respective template. Also, no centrosymmetric objects were tested; their symmetries might produce spurious peaks.

5. Conclusion

In this paper, we proposed a novel contactless measurement method for periodic events with an event camera. The method only assumes that the observed object periodically produces a similar set of events by returning to a known state or position. We evaluated the proposed method on the task of measuring the frequency of periodic events and rotational speed, achieving a relative error lower than ±0.04%, which is within the error margin of the ground-truth measurement. The precision is maintained while measuring frequencies ranging from 20 hertz (equivalent to 1200 RPM) up to 2 kilohertz (equivalent to 120,000 RPM). We demonstrated robustness against changes in camera angles.

Acknowledgement. The authors acknowledge the Grant Agency of the Czech Technical University in Prague, grant No. SGS23/173/OHK3/3T/13.

References
[1] T. Finateu, A. Niwa, D. Matolin, K. Tsuchimoto, A. Mascheroni, E. Reynaud, P. Mostafalu, F. Brady, L. Chotard, F. LeGoff, H. Takahashi, H. Wakabayashi, Y. Oike, and C. Posch. 5.10 A 1280×720 back-illuminated stacked temporal contrast event-based vision sensor with 4.86µm pixels, 1.066GEPS readout, programmable event-rate controller and compressive data-formatting pipeline. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pages 112–114, 2020.
[2] K. W. Hylton, P. Mitchell, B. Van Hoy, and T. P. Karnowski. Experiments and analysis for measuring mechanical motion with event cameras. Electronic Imaging, 33(6):333-1–333-8, Jan. 2021.
[3] Prophesee S.A. Biases – Metavision SDK Docs 4.5.0 documentation, Dec. 2022.
[4] RS Components Ltd. Tachometers – A complete guide, Jan. 2023.
[5] A. Singh and Y. Kim. Accurate measurement of drone's blade length and rotation rate using pattern analysis with W-band radar. Electronics Letters, 54(8):523–525, Apr. 2018.
[6] Thunder Said Energy. How is the power of a wind turbine calculated?, Nov. 2022.
[7] UNI-TREND TECHNOLOGY (CHINA) CO., LTD. UT370 series tachometers – UNI-T Meters | Test & Measurement Tools and Solutions, Aug. 2022.
[8] T. Wang, Y. Yan, L. Wang, and Y. Hu. Rotational speed measurement through image similarity evaluation and spectral analysis. IEEE Access, 6:46718–46730, 2018.
[9] Y. Wang, L. Wang, and Y. Yan. Rotational speed measurement through digital imaging and image processing. In 2017 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), pages 1–6, 2017.
[10] G. Zhao, Y. Shen, N. Chen, P. Hu, L. Liu, and H. Wen. High speed rotation estimation with dynamic vision sensors, Sept. 2022. arXiv:2209.02205 [cs, eess].

27th Computer Vision Winter Workshop, Terme Olimia, Slovenia, February 14–16, 2024

Weather-Condition Style Transfer Evaluation for Dataset Augmentation

Emir Mujić, Janez Perš (University of Ljubljana, Faculty of Electrical Engineering, Tržaška 25, 1000 Ljubljana; janez.pers@fe.uni-lj.si, em4593@student.uni-lj.si)
Darko Štern (AVL List GmbH, Hans-List-Platz 1, 8020 Graz; darko.stern@avl.com)

Abstract.
In this paper, we introduce a framework for evaluating style transfer methods that simulate desired target weather conditions from source images acquired in fair weather. The resulting images can be used for targeted augmentation of datasets geared toward object detection. Our approach diverges from traditional measures that focus on human perception only and, importantly, does not rely on annotated datasets. Instead, we operate on the statistical distribution of outcomes of the inference process (in our case, object detections). The proposed evaluation measure effectively penalizes methods that preserve features and consistencies in object detection, and rewards those which generate challenging cases more similar to the target style. This is counteracted by the requirement that the generated images remain similar to the images acquired in target weather conditions. This shift enables a more relevant and computationally practical assessment of style transfer techniques in the context of weather condition generation. By reducing the dependency on annotated datasets, our methodology offers a more streamlined and accessible approach to evaluation.

1. Introduction

In the rapidly evolving field of computer vision, the enhancement and adaptation of datasets through style transfer, particularly under varying weather conditions, is of paramount importance [20]. The research presented in this paper is primarily motivated by the desire to improve the performance of advanced driver assistance systems (ADAS) through targeted learning of hard examples from challenging weather conditions. Importantly, the method itself does not directly enhance ADAS; rather, it provides a quantitative measure of the quality of synthetic samples in the context of ADAS tasks, which could potentially be used to refine and improve already existing algorithms.

Figure 1: Successful style transfer mimicking real rain's impact on vehicle detection. Top: input image; middle: simulated rainy image; bottom: style reference (rainy weather). Vehicle detection using YOLOv8 [11] shows a significant performance drop from the top (clear) to the middle (rainy) image.
In our tasks we look at weather transfer is by visual quality of generated images condition translation and first to create a bespoke net-while keeping the statistics of object detection sim-work for this are Li et al. [12]. They employed ilar to the targeted (conditional) style. Our proposed attention and segmentation modules to the genera-method should help object detection by determining tor. More recently Piazzati et al. in [18], found that if a dataset of images is challenging enough and at using physics-informed network to guide the effects the same time has similar core characteristics to tar-of weather proved to be state-of-the-art in weather get style, for it to be used to improve object detection translation. in specific weather conditions. The approach of this paper is tailored towards ADAS applications. Ex-Evaluation measures. The first, and one of the ample result of successful transfer where detection is most used, measures for quantitative score of GANs hindered by the added style in the same way it would is Inception Score (IS) [21]. It uses a deep net-be by the style of a rainy image is shown in Figure work Inception v3 [22] pre-trained on ImageNet [5] 1. Additionally, if we are in possession of annotated to extract relevant features from generated images fair weather images, then the style transfer preserves and calculates the average KL-Divergence between the annotations since they show the same scene. This the conditional label distribution and generated sam-significantly simplifies the process of obtaining exples distribution. It shows correlation with human amples difficult for object detection since we can an-scoring on CIFAR-10 dataset. Barratt et al. in [1] alyze and extract those examples and based on the showed that IS has issues with both theory and use statistical difference of annotations and detections. in practice. More modern measure is Fréchet inception distance (FID), introduced by Heusel et al. in 2. Related Work [7]. Similar to IS, FID uses Inception v3 network pre-trained on ImageNet, but now the generated and We split our related work section into three parts: real samples are embedded to tensors before calculatin the first (i) we look into type of generative mod-ing the statistical distance, in this case 2-Wasserstein, els we use in our paper, the second (ii) covers the between them. Chong et al. in [4] prove that both work on the most popular evaluation measures and IS and FID are functions of the generator and the the third (iii) the practical computer vision applica-number of samples, therefore we can’t fairly com-tions we will focus on. pare two generators and even the same generators Generative models. Ever since the introduction evaluated on different number of images. They pro-of generative adversarial networks (GANs) [6], the pose a new measure of effectively unbiased FID and idea of translating images from one domain to an-IS called FID∞ and IS∞ respectfully. Besides the other has been a keen topic of research. Early works bias issue FID also assumes a Gaussian distribution in this were done by Isola et al. in [9] where they of samples which is not necessarily true; to solve this, showed it was possible to preform image-to-image Binkowski et al. [3] introduce kernel inception dis- (I2I) translation using conditional GANs. For evalu-tance (KID) where the kernel can be customized to ation measures they used Amazon Mechanical Turk accommodate different tensor distributions. Betzalel (AMT) and “semantic interpretability“ [24] of the et al. 
Betzalel et al. in [2] state that the same problems that IS has with Inception v3 being trained on ImageNet are present in FID as well, and suggest using a CLIP [19] basis instead of the Inception model. They also state that evaluation measures in general might benefit from combining multiple measures, such as FID∞ + KID.

Applications. One of the practical applications of computer vision is in the world of advanced driver assistance systems (ADAS). Nidamanuri et al. in [16] state that the camera sensor is useful for multiple functions such as object detection, blind spot monitoring, parking assist, lane keeping and traffic sign recognition, with moderate accuracy. Liu et al. [14] state that classical object detectors often fail when faced with adverse weather conditions. Our research is similar to Li et al. [12], where they employ a weather classifier trained on real weather images to check if images generated by their model are good enough to fool the classifier into giving the image a label of the target condition.

3. Methods

Our framework is comprised of multiple elements. The first is the ADAS algorithm (e.g. object detection) that we wish to improve by additional learning on target weather examples. The second component is a trained generator of target weather conditions, which takes a non-annotated fair-weather image and transforms it into a target-weather image. The third component is an image quality assessment measure, which guarantees the similarity of the generated images to real-world target weather images. The fourth and critical component of our proposed framework is an evaluation measure of the performance degradation of the chosen ADAS algorithm that does not rely on image annotations.

3.1. Object detection with YOLOv8

Since we don't have access to actual ADAS algorithms used on vehicles, we believe that YOLOv8, as an example of a state-of-the-art object detector, is a good approximation. YOLOv8 is a deep neural network [11], developed as an upgrade on the YOLOv5 [10] architecture in both speed and performance. In this paper we use it as a default object detector. It is trained on the COCO dataset [13] with 272 categories. For our use case (simulation of ADAS), many of these categories are not interesting; hence we filter all results and look only for a few categories. In no particular order these are: car, truck, bus, train, person and bicycle. COCO has additional categories associated with driving, such as van and motorcycle, but we treat these as cars in the case of motorcycles and trucks in the case of vans. These particular categories were chosen based on the most common vehicles found in our custom driving dataset. As a detection result, YOLOv8 returns a bounding box and a label (category). We use this to compare to the ground truth. The ground truth was created by manually labeling bounding boxes and object categories (the same ones as taken in YOLO) on a test set of 92 images from both dry and rainy weather conditions.

3.2. Image quality assessment with FID

Despite its issues with bias, most generative models are evaluated using the FID measure for image assessment. A few methods have been developed after FID; however, it remains the most used measure in practice. This is due to the fact that it is relatively easy to calculate and there are numerous implementations. To calculate it, we run inference on a pre-trained Inception v3 model and calculate the 2-Wasserstein distance from the N × 2048 dimensional vector we get as a result of the inference. N is the number of images from each label (real or generated). FID assumes a Gaussian distribution on the feature vector with mean µ and covariance matrix Σ. The FID score is calculated as the square of the 2-Wasserstein distance (1) between tensors X and Y:

FID_{X,Y} = ||µ_X − µ_Y||² + tr(Σ_X + Σ_Y − 2(Σ_X Σ_Y)^{1/2})   (1)

Since FID assumes a Gaussian distribution, µ_X and µ_Y are the means of tensors X and Y, and Σ_X and Σ_Y their respective covariance matrices. FID is known to be biased ([2], [4]); however, it is the most used measure of generative model quality, even in tasks such as conditioned style transfer ([8], [12], [18]). Despite objectively better measures existing, such as FID∞ [4], KID [3] and CLIP [19], in this paper we decided to use FID due to its ease of implementation and popularity. We will discuss the possible issues with this choice in section 4.
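A minimal sketch of equation (1), assuming the Inception v3 features of the two image sets are already available as N × 2048 matrices; the use of scipy.linalg.sqrtm for the matrix square root and the handling of numerical imaginary residue are implementation choices, not something the paper prescribes.

import numpy as np
from scipy.linalg import sqrtm

def fid(feats_x, feats_y):
    # Frechet inception distance (eq. 1) between two N x 2048 feature matrices.
    mu_x, mu_y = feats_x.mean(axis=0), feats_y.mean(axis=0)
    cov_x = np.cov(feats_x, rowvar=False)
    cov_y = np.cov(feats_y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))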
3.3. Quality of detection measure

To measure the quality of detections without annotated images, we rely on the statistics of results from YOLOv8 on the entire test set. The assumption underpinning this approach is that, statistically, YOLOv8 detections should have a similar distribution shape to the actual annotations on the same driving route, regardless of the weather. This statement will be backed up in section 4.1. Detection is run on all images included in the test: the input image, the conditional style image and the generated image. For consistency with our use case we refer to these as "dry", "rainy" and "generated" (images generated from dry to have the style of rainy), respectively. From the detection results on an image we take two values: the horizontal and vertical coordinates of the bounding box, and the size of the bounding box in pixels in both the horizontal and vertical directions. From these we create 2D discrete histograms which, when computed for the whole set of a single image style, show us the statistical feature that we want to emphasize during translation. In our case this is object detection. We can compare histograms using an adequate measure like the Bhattacharyya distance, which tells us how "far" two discrete data distributions are from one another. Before computing it, all histograms are normalized to the range [0, 1]. This makes our method invariant to the actual number of detected objects and focuses only on positions. Currently, this is an advantage, since we expect the number of vehicles, for example, to be different during data collection in different weather conditions. However, with careful data acquisition (ensuring that the streets are equally busy), the absolute frequencies could be another feature to include. Computing the Bhattacharyya distance is quite straightforward; in this case we follow equation (2), where P and Q are discrete data distributions and p_i and q_i are their respective bins:

d_B(P, Q) = −ln( Σ_{i=1}^{n} √(p_i q_i) )   (2)

For computing equation (2) on 2D histograms, we simply transform the n × m matrix into a 1 × (n × m) vector. This approach, along with equation (2), measures the effectiveness of YOLOv8 in object detection across various images. A requirement for using this measure is that dry and rainy images need to contain similar image content across the dataset, but for tasks such as style transfer this is fulfilled in most cases.

3.4. Combining the measures

As a result of FID we get a single number that should indicate the "visual distance" between two images, consistent with human evaluation. With detection this is more complicated, and we propose the method described in section 3.3. We compute Bhattacharyya distances between histograms for both position and size to get a sense of how close the distributions of detected objects are for dry, rainy and generated images. Normalizing all of the calculated values for both FID and the Bhattacharyya distance so that the smallest is 0 and the largest is 1, we can combine them into a weighted sum to give us a score, shown in equation (3):

s(X, Y) = −α · FID_{X,Y} + β · d_B(H_size(X), H_size(Y)) + γ · d_B(H_position(X), H_position(Y))   (3)

where s(X, Y) is the measure score between sets of images X and Y, FID_{X,Y} is the FID score between those two sets, H_size(X), H_size(Y) and H_position(X), H_position(Y) are the histograms computed for the sizes and positions of detections for the sets of images X and Y, and d_B(H(X), H(Y)) is the Bhattacharyya distance between the corresponding histograms. The parameters α, β and γ are hyperparameter weights (α, β, γ ≥ 0) for FID, d_B(H_size(X), H_size(Y)) and d_B(H_position(X), H_position(Y)), respectively, that describe their contribution to the total measure score.
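A short sketch of equations (2) and (3), assuming the 2D histograms have already been computed and normalized; the small epsilon that guards against log(0) and the rescaling of the FID and Bhattacharyya terms to [0, 1] across epochs (left to the caller) are assumptions, not part of the paper's definition.

import numpy as np

def bhattacharyya(hist_p, hist_q):
    # Eq. (2): Bhattacharyya distance between two (normalized) 2D histograms.
    p = np.asarray(hist_p, dtype=np.float64).ravel()  # n x m  ->  1 x (n*m)
    q = np.asarray(hist_q, dtype=np.float64).ravel()
    return -np.log(np.sum(np.sqrt(p * q)) + 1e-12)

def combined_score(fid_xy, db_size, db_pos, alpha=1.0, beta=1.0, gamma=1.0):
    # Eq. (3): weighted combination of the rescaled FID and Bhattacharyya terms.
    return -alpha * fid_xy + beta * db_size + gamma * db_pos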
3.5. Dry-to-rainy translation: QS-Attn Model

In this paper, we utilized the query-selected attention (QS-Attn) model [8] for the I2I translation task. QS-Attn enhances contrastive learning [17] by selectively focusing on significant anchor points within images. The model employs an attention mechanism that prioritizes important queries in the source domain, creating a condensed attention matrix. This matrix is pivotal in routing features across both the source and target domains, ensuring that relational structures from the source are retained in the translated images.

4. Experiments

4.1. Dataset and training

For our dataset, we recorded 1242 images on a route in dry weather and the same amount in rainy weather. This makes our dataset weakly-paired, meaning exact pairs of images do not exist, since the recording environment (the road) is dynamic, but the images are still similar enough to be considered "location pairs". Examples of this are shown in Figure 2. Recording the dataset like this, twice on the same route with the camera fixed in the same place on the windscreen, ties in with the discussion of whether the statistics of detections are the same. These statistics are very route specific, and we chose a route that has main streets with a good flow of traffic as well as residential areas with less active traffic and more passive traffic, such as parked vehicles, to try to cover what most vehicles see in day-to-day city driving. Of the complete dataset, 1140 images are training images, 10 are validation images and 92 are test images. The images in the dataset cover urban driving scenes in central European cities, in dry conditions and in rain, as shown in Figure 2. All images are resized from the original 3840×2160 pixel resolution to 400×400 pixels to accommodate the model input and to make it possible to train the model on a single Nvidia RTX3090 GPU.

For our model, we used the official implementation of the GAN described in section 3.5 (https://github.com/sapphire497/query-selected-attention). Training hyperparameters are the defaults, except for "QS Mode", which is set to global, and the "crop size" and "load size" hyperparameters, which are set to 400 to accommodate hardware limitations. The model was trained for 400 total epochs, of which the first 200 use the default learning rate (the "n epochs" hyperparameter) and the latter 200 use linear learning rate decay (the "n epochs decay" hyperparameter).
4.2. YOLOv8 detection on our data

To benchmark YOLOv8, we annotated our test data with bounding boxes and labels for objects of interest. Annotation statistics are shown in Table 1; the numbers represent the number of bounding boxes for each label, in absolute value and as a percentage of all labels. Results on 92 test images per weather condition are shown in Table 2. Looking at the normalized histograms of detections for both dry and rainy conditions in Figures 3 and 4, we can get a sense of how well YOLOv8 performs in a more practical sense. Since the histograms in Figures 3 and 4 are normalized to the range [0, 1], the shape of the distribution is for now much more relevant for us than the values at any particular point. Figures 3 and 4 show the distributions we are basing our evaluation on. We can see that they are similar in shape. This result needs to be additionally verified with more annotated images for both dry and rainy conditions.

Figure 2: Sampled images from our weakly-paired dataset depicting scenes of urban driving.

Table 1: Annotation results for our dataset.
        | Dry          | Rain
Car     | 364 (78.45%) | 322 (82.11%)
Person  | 69 (14.78%)  | 28 (7.05%)
Bicycle | 13 (2.8%)    | 17 (4.28%)
Bus     | 9 (1.94%)    | 10 (2.52%)
Truck   | 9 (1.94%)    | 20 (5.04%)

Table 2: YOLOv8 benchmark results for our dataset.
          | Dry   | Rain
Precision | 0.734 | 0.377
Recall    | 0.492 | 0.153
F1 Score  | 0.589 | 0.218

Figure 3: 2D histogram comparison of detections on dry images.

Figure 4: 2D histogram of detections on rainy images.

4.3. Evaluation procedure

Our evaluation relies on the fact that during training the model creates more and more realistic rainy images as the epochs tend towards the final one. We can sample the training weights at certain points during training to obtain a sub-optimal model and run inference on the test image set to obtain intermediate results. We then follow the method described in section 3.4 and test our combined evaluation measure. One thing we need to make sure of is to correctly sample results from the dry and generated sets. The reason for this is that, because of the nature of style-transfer tasks, there is a possibility of high correlation between detection scores from the dry and generated images, since they depict the exact same scene, only in different weather. To get an accurate measure of performance over different samples, we take every even-numbered image from the test set of dry samples and every odd-numbered sample from the generated set, to make sure the results are correctly calculated. Rainy images contain different scenes, so sampling can be done either way. From this we create histograms for all sets of images: dry, rainy and generated. An example of histograms from the last training epoch is shown in Figure 5. Normalization to the range [0, 1] is done, and finally we compute the Bhattacharyya distance according to eq. (2) between the dry and generated, and the dry and rainy histograms. FID is then computed between rainy and fake images to give us a FID score. We note that we used FID primarily for its ease of computation; implementations of other measures, such as KID and FID∞, are less common.

For our tests, following the described method, our measure rewards generated samples that have a similar (according to eq. (2)) histogram distribution to rainy samples and a low FID score between rainy and generated samples. For clarity, in our experiments we compared dry to rainy and dry to generated samples to show that the measure for generated images goes from being more similar to dry towards being more similar to rainy. Theoretically, the best score a model can achieve is 2. This is because we need to make sure all values are scaled to the same size; therefore we normalize both the FID and Bhattacharyya distances to [0, 1]. Setting all of the weights in eq. (3) to α, β, γ = 1 gives us a maximum score of 2.

Figure 5: 2D histogram of samples taken at the last training epoch.
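The sketch below shows how the detection statistics used above could be gathered: running a COCO-pretrained YOLOv8 model, keeping only the categories listed in section 3.1, and binning the normalized bounding-box centres and sizes into 2D histograms. The ultralytics calls, the 16-bin grid and the max-normalization are assumptions made for illustration; the paper does not specify these implementation details.

import numpy as np
from ultralytics import YOLO

KEEP = {"car", "truck", "bus", "train", "person", "bicycle"}

def detection_histograms(image_paths, bins=16, weights="yolov8n.pt"):
    # Returns normalized 2D histograms of detection positions and sizes.
    model = YOLO(weights)
    centres, sizes = [], []
    for result in model(image_paths, verbose=False):
        for box in result.boxes:
            if result.names[int(box.cls)] in KEEP:
                x, y, w, h = box.xywhn[0].tolist()  # normalized centre and size
                centres.append((x, y))
                sizes.append((w, h))
    centres, sizes = np.array(centres), np.array(sizes)
    h_pos, _, _ = np.histogram2d(centres[:, 0], centres[:, 1], bins=bins, range=[[0, 1], [0, 1]])
    h_size, _, _ = np.histogram2d(sizes[:, 0], sizes[:, 1], bins=bins, range=[[0, 1], [0, 1]])
    # Normalize each histogram to [0, 1], as described in section 3.3.
    return h_pos / (h_pos.max() + 1e-12), h_size / (h_size.max() + 1e-12)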
4.4. Results

We sample the model at every 5th epoch and evaluate the results according to our method. We first look at the graphs for all influential measures over the sampled epochs separately and not normalized. In Figure 6, we can see that the measure for the similarity of histograms between dry and fake samples drifts quite rapidly from values closer to dry vs. dry towards dry vs. rainy at the beginning of the training process. The spikes in Figure 6 are due to the random nature of training generative models, and we can interpret them as follows: the model suddenly learns how to represent a new feature from the rainy set, such as the windscreen wipers found on training images, so suddenly, on all generated images from a certain epoch, there are simulated wipers represented as a black line across the screen. An example of this is in Figure 10. These interfere with possible detections and make the histogram of generated samples dissimilar to that of dry samples. From section 4.3 we know our measure has a theoretical best value, so going over this is not wanted, just as much as not reaching this value in the first place. Spikes can tell us that something drastically changed during training and needs to be visually examined. Windscreen wipers are actually a valid distortion of our images if the camera is placed in a way that it is occasionally covered by the wipers, and we can use such an epoch to obtain difficult training samples that simulate wipers, if that is our goal.

Looking at size comparisons, things are more difficult to assess. The graph showing the comparisons over epochs is shown in Figure 7. Detection sizes are noisy over epochs and a general trend is difficult to see. This is due to the fact that over different epochs images go through various phases of added artifacts and effects by the model, making the detections that are present inconsistent.

Analyzing the graph of the FID score over epochs in Figure 8, we get a sense of how well the translation works.

Figure 6: Bhattacharyya distance comparison for positional histograms over training epochs.

Figure 7: Bhattacharyya distance comparison for size histograms over training epochs.

Figure 8: FID score over training epochs.
We can see that the score comparing dry and generated samples trends up towards the value of dry vs. rainy and, more relevant for us, the score comparing rainy and generated trends down over epochs. This, to a certain degree, assures us that the style transfer seems to be working correctly.

Normalizing these values and summing them according to equation (3) gives us our measure of how good the style transfer is. The measure is shown in Figure 9. We also fitted a trend line to the results using least squares to get a better trend estimate. We can see that the measure value goes up with training epochs, reassuring us that the model is doing style transfer correctly according to FID (as a proxy for human perception) while at the same time making the images challenging for an object detector in a direction similar to that of a rainy image. Different weights would emphasize different aspects of style transfer and therefore give us different-looking graphs for a model, depending on which component is most important for a given task.

Figure 9: Our model evaluation measure over training epochs. Hyperparameters are set to α, β, γ = 1. s(X, Y) represents the score between two sets of images.

Figure 10: Example images generated by QS-Attn [8] from the test dataset at epoch numbers 130, 230, 270, 360 (roughly corresponding to local maxima of the proposed measure) and 400 (final epoch), in order from top to bottom. In the first three, an attempt to add wipers is clearly visible.

One important fact to mention is that all of the presented results were obtained on our test dataset, which has 92 images from each weather condition. This means that looking at the raw values of the measure is not reliable enough, since we cannot state with certainty that the results of even the Bhattacharyya distance are unbiased, let alone FID. Therefore, for the current results we propose looking only at the trend, whether it is rising or falling, and based on that determining whether the model is working in the wanted "direction".

5. Conclusion

This study developed a framework for evaluating style transfer in weather-conditioned image generation, addressing the challenge of maintaining key features for object detection while accurately simulating weather conditions. This has implications for dataset augmentation in fields like ADAS. Future goals include testing with a larger dataset, both for training and evaluation, and further research on the statistical consistency of the proposed measure. Plans also include adapting this measure as a loss function for training style transfer models for specific computer vision tasks. Additionally, alternatives to FID and other histogram distances for image similarity will be explored.

Acknowledgement. This work was financed by the Slovenian Research Agency (ARIS), research program [P2-0095] and research project [J2-2506].

References
[1] S. Barratt and R. Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
[2] E. Betzalel, C. Penso, A. Navon, and E. Fetaya. A study on the evaluation of generative models. arXiv preprint arXiv:2206.10935, 2022.
[3] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
[4] M. J. Chong and D. Forsyth. Effectively unbiased FID and Inception Score and where to find them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6070–6079, 2020.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[7] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[8] X. Hu, X. Zhou, Q. Huang, Z. Shi, L. Sun, and Q. Li. QS-Attn: Query-selected attention for contrastive learning in I2I translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18291–18300, 2022.
[9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[10] G. Jocher. Ultralytics YOLOv5. https://github.com/ultralytics/yolov5, 2020.
[11] G. Jocher, A. Chaurasia, and J. Qiu. YOLO by Ultralytics. https://github.com/ultralytics/ultralytics, Jan. 2023.
[12] X. Li, K. Kou, and B. Zhao. Weather GAN: Multi-domain weather translation using generative adversarial networks. arXiv preprint arXiv:2103.05422, 2021.
[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
[14] W. Liu, G. Ren, R. Yu, S. Guo, J. Zhu, and L. Zhang. Image-adaptive YOLO for object detection in adverse weather conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1792–1800, 2022.
[15] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[16] J. Nidamanuri, C. Nibhanupudi, R. Assfalg, and H. Venkataraman. A progressive review: Emerging technologies for ADAS driven solutions. IEEE Transactions on Intelligent Vehicles, 7(2):326–341, 2021.
[17] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu. Contrastive learning for unpaired image-to-image translation. In Computer Vision – ECCV 2020, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX, pages 319–345. Springer, 2020.
[18] F. Pizzati, P. Cerri, and R. de Charette. Physics-informed guided disentanglement in generative networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[19] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[20] C.-G. Roh, J. Kim, and I.-J. Im. Analysis of impact of rain conditions on ADAS. Sensors, 20(23):6720, 2020.
[21] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
[22] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[23] Z. Wang, L. Zhao, H. Chen, Z. Zuo, A. Li, W. Xing, and D. Lu. Evaluate and improve the quality of neural style transfer. Computer Vision and Image Understanding, 207:103203, 2021.
[24] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In Computer Vision – ECCV 2016, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III, pages 649–666. Springer, 2016.
[25] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.