Informatica: An International Journal of Computing and Informatics. Special Issue: Soft Computing in Multimedia Processing. Guest Editors: Karen Egiazarian, Aboul Ella Hassanien. The Slovene Society Informatika, Ljubljana, Slovenia. EDITORIAL BOARDS, PUBLISHING COUNCIL. Informatica is a journal primarily covering the European computer science and informatics community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations. Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Editorial Board or referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors. The necessary coordination is made through the Executive Editors, who examine the reviews, sort the accepted articles and maintain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is partially supported by the Slovenian Ministry of Science and Technology. Each author is guaranteed to receive the reviews of his article.
When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article. Executive Editor - Editor in Chief Anton P. Železnikar Volariceva 8, Ljubljana, Slovenia s51em@lea.hamradio.si http://lea.hamradio.si/~s51em/ Executive Associate Editor (Contact Person) Matjaž Gams, Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 219 385 matjaz.gams@ijs.si http://ai.ijs.si/mezi/matjaz.html Deputy Managing Editor Mitja Luštrek, Jožef Stefan Institute mitja.lustrek@ijs.si Editorial Board Suad Alagic (USA) Anders Ardo (Sweden) Vladimir Bajic (South Africa) Vladimir Batagelj (Slovenia) Francesco Bergadano (Italy) Marco Botta (Italy) Pavel Brazdil (Portugal) Andrej Brodnik (Slovenia) Ivan Bruha (Canada) Wray Buntine (Finland) Hubert L. Dreyfus (USA) Jozo Dujmovic (USA) Johann Eder (Austria) Vladimir A. Fomichov (Russia) Janez Grad (Slovenia) Hiroaki Kitano (Japan) Igor Kononenko (Slovenia) Miroslav Kubat (USA) Ante Lauc (Croatia) Jadran Lenarcic (Slovenia) Huan Liu (USA) Suzana Loskovska (Macedonia) Ramon L. de Mantaras (Spain) Angelo Montanari (Italy) Pavol Návrat (Slovakia) Jerzy R. Nawrocki (Poland) Franc Novak (Slovenia) Marcin Paprzycki (USA/Poland) Gert S. Pedersen (Denmark) Karl H.
Pribram (USA) Luc De Raedt (Germany) Dejan Rakovic (Serbia and Montenegro) Jean Ramaekers (Belgium) Wilhelm Rossak (Germany) Ivan Rozman (Slovenia) Sugata Sanyal (India) Walter Schempp (Germany) Johannes Schwinn (Germany) Zhongzhi Shi (China) Oliviero Stock (Italy) Robert Trappl (Austria) Terry Winograd (USA) Stefan Wrobel (Germany) Xindong Wu (USA) Executive Associate Editor (Technical Editor) Drago Torkar, Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 219 385 drago.torkar@ijs.si Publishing Council: Tomaž Banovec, Ciril Baškovic, Andrej Jerman-Blažic, Jožko Čuk, Vladislav Rajkovic Board of Advisors: Ivan Bratko, Marko Jagodic, Tomaž Pisanski, Stanko Strmcnik Editorial: Special Issue on Soft Computing in Multimedia Processing I. Introduction Soft Computing (SC) is an emerging field that consists of complementary elements of fuzzy logic, neural computing, evolutionary computation, machine learning and probabilistic reasoning. Due to their strong learning and cognitive abilities and good tolerance of uncertainty and imprecision, soft computing techniques have found wide application. Multimedia data (video, image, audio, text, color, etc.) constitute one such application domain. Multimedia processing is an important scientific research domain with a broadening range of applications. The development of new insights and applications results both from fundamental scientific research and from the development of new technologies. One of these emerging technologies is soft computing, a generic term for a specific collection of tools to model uncertainty, imprecision, evolutionary behavior and complex models. This special issue is devoted to recent developments in the application of soft computing (SC) techniques to multimedia processing. We received 16 papers, of which 6 were accepted for publication.
The topics covered in this issue span a wide range of research areas of soft computing in multimedia processing, including video sequences, color quantization, image retrieval, meeting videos, document image analysis, image segmentation and biometric applications. II. Scanning through the issue Effective and efficient representation of video sequences is an important yet challenging multimedia analysis task for video retrieval and browsing applications. The paper by Lang Congyan, Xu De and Yang Xu introduces a new approach for prominent region detection from the viewpoint of human perception, intending to construct a good pattern for content representation of video sequences. It starts by partitioning each frame into homogeneous regions using a technique based on a non-parametric clustering algorithm, then extracts a number of different mise-en-scene-based perceptual features which influence human visual attention in order to automatically determine the prominent importance of the different homogeneous regions in a frame. Finally, a modified Fuzzy Inference Neural Network is used to detect prominent regions in the video sequence due to its simple structure and superior performance in automatic fuzzy rule extraction. The extracted prominent regions can serve as a good pattern to bridge the semantic gap between low-level features and semantic understanding. Experimental results show the excellent performance of the approach. The World Wide Web has emerged as the largest repository of multimedia data in the world. One form of information that is very popular on the web today is the digital color image. This includes both single- and multi-frame (video) images. While many forms of information and data can be transferred quickly using the web, the transfer of digital images can be very time consuming due to their inherent size.
To speed up this process, images are commonly compressed before being stored at the local site or transmitted across the Internet. But the compression of digital images is not a straightforward process. Color image quantization, or simply color quantization, is a form of image compression which reduces the number of colors used in an image while maintaining, as much as possible, the appearance of the original. This type of compression is lossy, however: the original image cannot be exactly reproduced from the compressed image. The optimal goal of the color quantization process is to produce an image which cannot be distinguished from the original. Thus, a color quantization algorithm attempts to approximate this optimal solution. The paper by Mahamed Omran, Andries Engelbrecht and Ayed Salem deals with the color image quantization problem using the Particle Swarm Optimization (PSO) algorithm. The proposed algorithm randomly initializes each particle in the swarm to contain K centroids. The K-means clustering algorithm is then applied to each particle, at a user-specified probability, to refine the chosen centroids. Each pixel is then assigned to the cluster with the closest centroid, and the PSO is applied to refine the centroids obtained from the K-means algorithm. The proposed algorithm is applied to commonly used images, and the conducted experiments show that it generally results in a significant improvement of image quality compared to other well-known approaches. With the development of the Internet and database techniques, information retrieval (IR) has become very popular. As a powerful form of delivering information, multimedia data is frequently used in many domain applications. Techniques for effectively managing multimedia databases are useful and in demand. Dianhui Wang and Xiaohang Ma developed a hybrid scheme for intelligent image retrieval using neural nets.
Each item in an image database is indexed by a visual feature vector, which is extracted using color moments and discrete cosine transform coefficients. A query is characterized by a set of semantic labels, which are predefined by system designers and associated with domain concerns. The system utilizes the image content features as its input, and the semantic labels as its output. To compensate for the deficiency of semantics modeling, on-line user relevance feedback is applied to improve the retrieval performance of the hybrid intelligent retrieval system. The neural net acts like a pattern association memory bank that maps the low-level feature vectors into their corresponding semantic labels. During the retrieval process, the weights of the neural net are updated by an interactive user relevance feedback technique, where the feedback signal comprises the neural net's actual output, the semantic labels provided by users and the given query. A prototype hybrid intelligent retrieval system was implemented and evaluated using an artificial image database. Meeting videos are important multimedia documents consisting of meetings captured in specialized smart room environments. Research activities cover, for instance, the recording, representation, and browsing of meeting videos. Speech can be a very useful cue in indexing videos, but precise speech recognition in meeting rooms remains a challenging task because of extensive vocabulary topics, speech styles and so on. The sound cue can also be used in teleconferencing scenarios to identify the speaker and to improve tracking performance. Indexing videos using visual content is also a challenging task. On the basis of visual cues it is possible to recognize what individual participants are doing throughout the meeting. The paper by Bogdan Kwolek deals with action recognition in meeting videos using the head trajectory and fuzzy color histogram, where the knowledge is extracted from such video.
The tracking of the head is done using a particle filter built on cues such as color, gradient and shape. The head is represented by an ellipse with a fuzzy color histogram in its interior and an intensity gradient along the ellipse boundary. By comparing pixels in entry zones to a model of the background, the entry of a person can be detected quickly and reliably. The fuzzy color histogram is constructed in the interior of an ellipse best fitting the oval shape of the head. When a new person appears in the scene, a new trajectory is initialized. The recognition of actions is performed using kernel histograms built on head positions as well as segmented trajectories that are related to the layout of the room. Document analysis or, more precisely, document image analysis is the process that performs the overall interpretation of document images. Document image processing is now an established field within the electronic imaging world. It is becoming even more prevalent in areas where paper documents need to be transformed into electronic format for long-term storage, backup, multiple access and retrieval. The process of extracting information from often poor quality images of documents is a topic of active research. In a multimedia environment where sound, moving images and graphics can be part of a compound document, the role of image processing becomes even more important. The paper by Andras Barta and Istvan Vajk presents a hierarchical object recognition system for document image processing. It is based on a spatial tree structure representation and a Bayesian network framework. The image components are built up from lower-level image components stored in a library. The tree representations of the objects are assembled from these components. A probabilistic framework is used in order to obtain robust behaviour. The method is able to decompose general circuit diagrams into their components and store them in a hierarchical data structure.
The paper presents simulations for extracting the components of sample circuit diagrams. The utilization of digital techniques in the creation, editing and distribution of multimedia data offers a number of opportunities to a pirate user, such as high-fidelity copying. Furthermore, the widespread usage of the Internet is providing additional channels for a pirate to quickly and easily distribute copyrighted digital content without the fear of being tracked. As a result, the protection of multimedia content (image, video, audio, etc.) is now receiving a substantial amount of attention. Digital fingerprinting is an emerging technology to protect multimedia from unauthorized redistribution. The paper by Mohamed Mostafa deals with the problem of authentication. It presents a novel and fast fingerprint identification technique, which uses a novel clustering algorithm to detect similar feature groups from multiple template images generated from the same finger and to create the cluster core set. It is based on a new supervised recurrent neural network. A quick response was achieved by manipulating the search order inside the experimental databases. The experimental results demonstrate that the similarity search approach with neural networks is suitable for one-to-many matching of fingerprints on large databases. Acknowledgments. The team of Guest Editors would like to take this opportunity to thank all the authors who submitted papers and all of the reviewers who took such care in reviewing these papers, as well as the Editor-in-Chief, Professor Anton P. Železnikar, and the Managing Editor, Professor Matjaž Gams, who gave us permission to edit this issue and provided support and guidance throughout its publication. As Guest Editors, we hope that the papers in this issue will stimulate further progress in this direction. We believe that the best is yet to come.
Karen Egiazarian and Aboul Ella Hassanien Perception-Oriented Prominent Region Detection in Video Sequences Lang Congyan, Xu De and Yang Xu School of Computer Science & Information Technology, Beijing Jiaotong University, Beijing, 100044, China E-mail: gltree@263.net, xd@comput.njtu.edu.cn Keywords: prominent region, image segmentation, feature extraction, video content representation Received: October 21, 2004 Effective and efficient representation of video sequences is an important yet challenging task for video retrieval and browsing. In this paper, we propose a new approach for prominent region detection from the viewpoint of human perception, intending to construct a good pattern for content representation of video sequences. Firstly, we partition each frame into homogeneous regions using a technique based on a non-parametric clustering algorithm. Then, in order to automatically determine the prominent importance of the different homogeneous regions in a frame, we extract a number of different mise-en-scene-based perceptual features which influence human visual attention. Finally, a modified Fuzzy Inference Neural Network is used to detect prominent regions in the video sequence due to its simple structure and superior performance in automatic fuzzy rule extraction. The extracted prominent regions can be used as a good pattern to bridge the semantic gap between low-level features and semantic understanding. Experimental results show the excellent performance of the approach. Povzetek: Predstavljen je nov postopek za zaznavanje regij. (A new procedure for region detection is presented.) 1 Introduction With the current advance of video database techniques, efficient video retrieval and browsing have become crucially important, especially with the development of video content description standards such as MPEG-7. Unfortunately, current approaches to video processing suffer from one or more shortcomings that stem from the semantic gap between low-level features and high-level semantic concepts.
To bridge the semantic gap, most previous works select semantic video objects [1-3] as the underlying video patterns for video content representation and feature extraction. However, the major problem with using semantic video objects as video patterns is that automatic semantic video object extraction in general still requires human interaction at the current stage. Faced with these problems, an increasing number of researchers are now exploring intermediate-level processing, shifting the focus of attention away from purely local and pixel-based indicators to more global measures that seem to provide a stepping stone towards more robust high-level processing [18]. Many image and video processing applications could be made both more efficient and more effective if a number of salient regions were first segmented. Studies of visual attention and eye movements [4,5] have shown that humans generally attend to only a few areas in an image. Even when given unlimited viewing time, subjects will continue to focus on these few areas rather than scan the whole image. Motivated by this fact, many research efforts have been devoted to detecting salient regions in images, intending to overcome the limitations of semantic object extraction. A considerable amount of research has addressed the salient region detection problem by clustering-based methods; for instance, in Ref. [18], the authors first map an image into an appropriate feature space and then detect salient regions by a nonparametric clustering method. Hang Fai Lau et al. [19] identify a small number of regions in an image using low-level features, which works well on colour images for image retrieval. On the other hand, most existing approaches [6,7] aim at detecting salient regions in images, and are mainly based on the construction of a saliency map modeled as an integration of different measurable low-level image features such as color, orientation, depth information, etc.
The purpose of the saliency map is to represent the conspicuity of each location of the visual field; that is, the salient regions extracted have higher prominent importance than the other regions. In [11], the authors use motion information to construct a saliency map for video sequences, which gives superior performance for moving region analysis. A salient region detection and tracking method is presented in [15], which extracts salient regions based on color and orientation maps followed by a tracking mechanism. Salient region extraction based on a saliency map provides a good starting point for semantic-sensitive content representation. However, perceived salient region extraction for image or video is still an unsolved problem. One reason is that a video sequence has more context information than a single image; hence, low-level features are often not enough to classify some regions unambiguously without the incorporation of high-level and human perceptual information into the classification process. Another reason is perception subjectivity. Different people can differ in their perception of high-level concepts; thus a closely related problem is that the uncertainty or ambiguity of classification in some regions cannot be resolved completely based on measurement methods. A basic difference between perceptions and measurements is that, in general, measurements are crisp whereas perceptions are fuzzy [17]. In this paper, we propose a new method for prominent region extraction in video sequences in order to remove the limitations explained above. For each frame, a pre-segmentation composed of homogeneous regions is produced, and then the segmented image is analyzed by a number of perceptual attributes based on the mise-en-scene principles.
As a set of techniques, mise-en-scene helps compose the film shot in space and time [8]; it is used by filmmakers to guide our attention across the screen, shaping our sense of the space that is represented and emphasizing certain parts of it. It is known that fuzzy logic can provide a flexible and vague mapping from linguistic terms to represent human perception, and neural networks have a superior learning ability for extracting fuzzy rules automatically. Hence, to alleviate the semantic gap and the perception subjectivity problems, our method for automatically determining the perceptual importance of regions is constructed based on fuzzy inference neural networks (FINNs). While most existing work focuses on the detection of salient regions, our approach for the extraction of perceptually prominent regions is distinctive, with several important advantages: (1) according to the mise-en-scene principles, perceptual features, rather than low-level features, are extracted for homogeneous regions, so as to provide a more encouraging pattern to the classifier; (2) the prominent importance of regions is assigned through soft decisions. Experiments show the effectiveness and robustness of our method on different types of video. The rest of the paper is organized as follows: the pre-segmentation process of image frames is described in the next section. In Sect. 3, the perceptual feature extraction for primitive homogeneous regions is implemented. Then, prominent region detection based on FINNs is presented in Sect. 4. The effectiveness of the proposed approach is validated by experiments over real-world video clips in Sect. 5. Concluding remarks are given in Sect. 6. 2 Pre-segmentation of Image Frames As stated in Ref. [19], non-parametric density estimation techniques are well suited to the task of segmenting coherent regions in an image.
Therefore, each frame is initially segmented into homogeneous regions based on the mean shift algorithm; in this section, color and texture information is incorporated into the mean shift segmenter [9]. To segment an image, we first partition it into 4*4 blocks and extract a 6-D feature vector for each block. In particular, three of the features are the average color components computed in the CIE LUV color space, chosen for its perceptually uniform derivation of the standard CIE XYZ space. The other three represent energy in the high-frequency bands of the wavelet transform [14]: the 3 wavelet features are computed as the square root of the 2nd-order moment of the wavelet coefficients in the HL, LH, and HH frequency bands. The wavelet image decomposition provides a representation that is easy to interpret. Every subimage contains information of a specific scale and orientation, spatial information is retained within the subimages, and the coefficients in different frequency bands show variations in different directions. We use the Daubechies discrete wavelet transform to decompose the image data into wavelet coefficients. After extracting the color and texture features, they must be combined to form a single feature vector. Considering the dynamic range of each feature and its relative importance, all features must be normalized and weighted. Thus, an integrated feature vector is formed as follows: f_block(i) = { w_color * V_C, w_texture * V_T } (1) where V_C = {C1, C2, C3} and V_T = {T1, T2, T3} are the extracted color and texture features, respectively, and w_color and w_texture are the weights for color and texture, selected experientially. After the 6-D feature vector is extracted for all 4*4 blocks, we apply the mean shift clustering approach [9] to segment the image into homogeneous regions. Fig. 1 shows the results of partitioned homogeneous regions for two frames.
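To make the block-feature step concrete, the following is a minimal Python sketch of the 6-D feature extraction of Eq. (1). It is not the authors' implementation: a one-level Haar transform stands in for the Daubechies wavelet, the function names and default weights are illustrative, and the input is assumed to already be a LUV image plus a grayscale channel with sides divisible by 4.

```python
import numpy as np

def haar_energies(block):
    # One-level 2-D Haar-style decomposition of a grayscale block; returns the
    # square root of the 2nd-order moment of the HL, LH and HH coefficients.
    a = (block[0::2] + block[1::2]) / 2.0   # vertical average
    d = (block[0::2] - block[1::2]) / 2.0   # vertical detail
    HL = (a[:, 0::2] - a[:, 1::2]) / 2.0
    LH = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return [np.sqrt(np.mean(c ** 2)) for c in (HL, LH, HH)]

def block_features(luv_image, gray_image, w_color=1.0, w_texture=1.0):
    # One 6-D vector per 4x4 block: 3 mean LUV components plus 3 wavelet
    # energies, weighted as in Eq. (1).
    H, W, _ = luv_image.shape
    feats = []
    for y in range(0, H, 4):
        for x in range(0, W, 4):
            color = luv_image[y:y+4, x:x+4].reshape(-1, 3).mean(axis=0)
            texture = np.asarray(haar_energies(gray_image[y:y+4, x:x+4]))
            feats.append(np.concatenate([w_color * color, w_texture * texture]))
    return np.asarray(feats)
```

The resulting feature matrix (one row per block) is what a mean shift clusterer would then group into homogeneous regions.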
The level of accuracy in this initial step is important for the overall performance of the proposed prominent region detection, as these pre-segmented homogeneous regions will constitute the basic elements of the extracted perceptual prominent regions. For that reason, the parameters of this initial step are selected so that the frames are over-segmented rather than under-segmented. Fig. 1. Homogeneous region segmentation examples: (a) original images; (b) respective homogeneous region contours. 3 Region Feature Description Human selective attention makes us distinguish important features of an input stimulus from others and focus on a critical feature of the input. Most basically, our visual system is attuned to perceiving change both in time and space, since one of the main goals of the human visual system is to minimize uncertainty [5]. This is also in agreement with Gestalt organization theories. By taking advantage of these facts, the filmmaker uses the arrangement of the mise-en-scene to attract our attention by means of changes in light, shape, movement and other aspects of the image [8]. Thus, in order to get a suitable content pattern of regions in video sequences, we extract a number of perceptual features described below. 1) Contrast of region with surroundings (CSr) Regions which have a high contrast with their surroundings are likely to be of greater visual importance and attract more attention. The filmmaker can exploit principles of color contrast to shape our sense of screen space. For instance, bright colors set against a more subdued background are likely to draw the eye [8]. The contrast importance CSr(Ri) of a region Ri is calculated as: CSr(Ri) = sum_{m=1..M} | I*(Ri) - I*(Ri-neighbours,m) | (2) where I*(Ri) is the mean intensity of region Ri, and I*(Ri-neighbours,m) is the mean intensity of the m-th neighboring region of Ri.
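The contrast measure of Eq. (2) can be sketched as follows. This is an assumed reading, not the authors' code: regions are represented by an integer label map, adjacency is taken as 4-neighbourhood, and the function name is illustrative.

```python
import numpy as np

def contrast_importance(labels, intensity):
    # CSr(Ri): sum of absolute differences between the mean intensity of
    # region Ri and the mean intensities of its adjacent regions (Eq. 2).
    regions = np.unique(labels)
    mean_i = {r: intensity[labels == r].mean() for r in regions}
    # Collect adjacent region pairs from horizontal and vertical label changes.
    neighbours = {r: set() for r in regions}
    h = labels[:, :-1] != labels[:, 1:]
    v = labels[:-1, :] != labels[1:, :]
    for a, b in zip(labels[:, :-1][h], labels[:, 1:][h]):
        neighbours[a].add(b); neighbours[b].add(a)
    for a, b in zip(labels[:-1, :][v], labels[1:, :][v]):
        neighbours[a].add(b); neighbours[b].add(a)
    return {r: sum(abs(mean_i[r] - mean_i[n]) for n in neighbours[r])
            for r in regions}
```

For a frame with a bright region next to a dark background, both regions receive a high CSr, which matches the intent that high-contrast areas draw the eye.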
2) Orientation Conspicuity of Region (OCr) Gabor filtering allows us to get information about local orientation in the image; thus the orientation map computed by Gabor filtering is an important recognition cue, chosen on the basis of evidence from the study of human attention [10]. Here, this technique is employed to describe the orientational importance of a region. Local orientations O_g are obtained by applying Gabor filters to the original images at particular orientations g = {0°, 45°, 90°, 135°}. The four oriented images are added up to generate the desired orientation map, and then a normalized orientation map I_ori is obtained by using a traditional normalization operator. OCr(Ri) = sum_{p in Ri} I_ori(p) / Npixel(Ri) (3) where Npixel(Ri) denotes the number of pixels in the region Ri. 3) Shape Indicator of Region (SIr) The shape indicator of a region can be calculated as: SIr(Ri) = Nedge(Ri) / Npixel(Ri) (4) where Nedge(Ri) is the number of pixels in the region Ri which border other regions. Generally speaking, a large SIr(Ri) signifies a long, thin region, which further implies a high prominent importance, while for rounder regions it will be lower. 4) Compositional Balance Indicator of Region (CIr) Compositional balance is an important factor of the mise-en-scene used to guide our attention and shape our viewing of the image; it can be interpreted as the extent to which the areas of screen space have equally distributed masses and points of interest. Generally the filmmakers often center the frame on the character's body, which is the simplest way to achieve compositional balance [8]. However, no existing works pay attention to this information. Based on the above observations, the compositional balance indicator is determined according to the following measure: CIr(Ri) = CSr(Ri) / ||gc(Ri) - gc(R)||, if gc(R) in Ri; CIr(Ri) = CSr(Ri) / ( |CSr(Ri) - CSr(Rj)| + ||gc(Rj) - gc(R)|| ), if gc(R) not in Ri (5) where gc(Ri) is the gravitational center of the region Ri, the mean of the gravitational centers of all regions in the image is denoted gc(R), and Rj is the region whose gravitational center is the nearest neighbor of the symmetrical point of gc(Ri) with respect to the midline of the frame. If the character's body is located in the frame center, then the larger CSr is and the nearer the region's gravitational center is to gc(R), the larger CIr is, meaning a higher possibility that the region is a character portion of the frame. For the second case, in the same sense, a higher CIr indicates that the frame may balance two or more elements, encouraging our eye to move between these regions. 5) Motion Prominent Indicator (MIr) of region Almost invariably, a moving item draws our attention more quickly than a static item does. Motion information is an important cue for humans to perceive video content. In our work, block motion vectors are used for motion analysis. Since the magnitude of a motion vector reflects the motion velocity of the block, we define IM(Ri) as the mean of the magnitudes of the motion vectors in the region Ri. An approach similar to [11] is adopted to describe the motion consistency of a region, denoted MC(Ri); the details can be consulted in [11]. Specifically, the motion consistency is calculated based on entropy, where the estimation of the probability is obtained by summation over rows and columns of an eight-bin phase histogram of the region Ri. Then we define the motion prominent importance as follows: MIr(Ri) = IM(Ri) * MC(Ri) (6) 4 Prominent Region Detection based on FINNs After the region features are extracted, perceptual prominence must be assigned to each region. Fuzzy logic has the capability of modeling perception vagueness and uncertainty, and can support human-like reasoning.
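Before moving on, the shape and motion indicators of the previous section can be sketched in Python. This is a hypothetical reading of Eqs. (4) and (6): the label-map representation and function names are assumptions, and the entropy-based consistency is simplified to a normalized entropy of a plain eight-bin phase histogram rather than the exact formulation of [11].

```python
import numpy as np

def shape_indicator(labels, region):
    # SIr = border pixels / total pixels of the region (Eq. 4).
    mask = labels == region
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when all four 4-neighbours belong to the region.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:]) & mask
    n_edge = mask.sum() - interior.sum()
    return n_edge / mask.sum()

def motion_prominence(motion_vectors):
    # MIr = IM * MC (Eq. 6): mean motion-vector magnitude times an
    # entropy-based consistency of the 8-bin phase histogram.
    mv = np.asarray(motion_vectors, dtype=float)
    im = np.hypot(mv[:, 0], mv[:, 1]).mean()
    phase = np.arctan2(mv[:, 1], mv[:, 0])
    hist, _ = np.histogram(phase, bins=8, range=(-np.pi, np.pi))
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    mc = 1.0 - entropy / np.log2(8)   # 1 = fully consistent motion
    return im * mc
```

A region whose blocks all move in the same direction gets MC close to 1, so its MIr is driven by the motion magnitude alone; scattered motion directions drive MC, and hence MIr, toward 0.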
On the other hand, studies of fuzzy neural networks that combine the advantages of fuzzy systems with the learning ability of neural networks have been carried out. These techniques can alleviate the difficulty of fuzzy modeling through the learning ability of neural networks [12, 13]. FINNs is ideal for our problem as it can extract and use rules automatically, with a simple structure and superior performance. Fig. 2. The structure of FINNs and its membership function. 4.1 Architecture of FINNs Fig. 2 shows the basic structure of FINNs [12]. It consists of two layers. One is the input-output (I/O) layer and the other is the rule layer. The I/O layer consists of the input part and the output part. Each node in the rule layer represents one fuzzy rule. Weights from the input part to the rule layer and those from the rule layer to the output part are fully connected, and they store fuzzy if-then rules. The number of neurons in the input part is equal to the dimension N1 of the input data, the number of rules is N2, and N3 is the number of output nodes. For the prominent region detection problem, each segmented region is described using a set of five features, comprising the perceptual features defined in Section 3. Let F = (F1, F2, F3, F4, F5) denote the perceptual features CSr, OCr, SIr, CIr and MIr of region Ri. For every region, FINNs receives the 5-dimensional input data and outputs the class label: PR or NPR. FINNs adjusts the center value and width of its fuzzy membership functions automatically during the learning phase. The bell-shaped membership function represents the if-part of fuzzy rules and is placed between the input node i and the node j on the rule layer. The membership function is expressed as mu_ij = exp( -(F_i - w_ij)^2 / (2 * sigma_ij^2) ), i = 1, 2, ..., N1; j = 1, 2, ..., N2 (7) where w_ij is the center value of the membership function and sigma_ij indicates the width of the membership function, both adjusted by the learning process in FINNs.
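Combining the bell-shaped memberships above with the rule-degree and output computations of FINNs [12] gives the following forward-pass sketch; array shapes, argument names and the vectorized style are our own assumptions, not the reference implementation.

```python
import numpy as np

def finn_forward(F, w_in, sigma, w_out):
    # F: (N1,) input features; w_in, sigma: (N1, N2) membership centers/widths;
    # w_out: (N2, N3) rule-to-output weights.
    mu = np.exp(-((F[:, None] - w_in) ** 2) / (2.0 * sigma ** 2))  # memberships
    rho = mu.min(axis=0)                     # rule degree: min over the inputs
    y = (w_out * rho[:, None]).sum(axis=0) / rho.sum()  # normalized weighted sum
    return mu, rho, y
```

With a rule centered exactly on the input, all memberships are 1, the rule fires fully, and the output equals that rule's then-part weight.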
In the rule layer, the firing degree P_j of the j-th rule is computed as

P_j = min(μ_1j, μ_2j, ..., μ_N1,j) (8)

The estimated value y_k of the k-th output node is then calculated as

y_k = Σ_j (w_jk P_j) / Σ_j P_j (9)

where w_jk is the weight between the j-th node in the rule layer and the k-th output node. The logical form of the fuzzy if-then rules is: if F_1 is W_1j and F_2 is W_2j and ... and F_N1 is W_N1,j, then y_k is w_jk, where W_ij means a value near w_ij, depending on σ_ij.

4.2 Learning process of FINNs

The learning process of FINNs consists of three phases. First, the center values of the membership functions, which correspond to the if-part, and the estimated values, which correspond to the then-part, are determined provisionally in the self-organizing phase. Then, the system merges similar rules in the rule-extracting phase. Finally, Least Mean Square (LMS) learning is executed to reduce the total mean-square error of the network, finely adjusting the weights and the shapes of the membership functions.

4.2.1 Self-organizing Learning Phase

In this phase, Kohonen's self-organizing algorithm is applied for two purposes. The first is to estimate the centers of the membership functions of the precondition part and the estimated value of the j-th rule. The second is to construct fuzzy if-then rules. In our implementation, the self-organizing learning phase and the LMS phase are almost the same as those of FINNs in [12].

4.2.2 Rule-Extracting Phase

To obtain better generalization ability, we employ a relative entropy-based approach to measure the similarity between two rules, described below. For two probability density functions p_k and q_k, the Kullback-Leibler divergence is defined as

D(p‖q) = Σ_k p_k log(p_k / q_k) (10)

The Kullback-Leibler divergence indicates how distinguishable p_k is from q_k by maximum likelihood hypothesis testing when the actual data obeys p_k. It is well known that D(p‖q) is nonnegative and additive, but not symmetric.
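A minimal sketch of the inference of eqs. (7)-(9) and the divergence of eq. (10); the names and array shapes are our assumptions, not the authors' implementation:

```python
import numpy as np

def finn_forward(F, W, Sigma, W_out):
    """FINN inference sketch. F: (N1,); W, Sigma: (N1, N2); W_out: (N2, N3)."""
    mu = np.exp(-((np.asarray(F, float)[:, None] - W) ** 2) / (2.0 * Sigma ** 2))
    P = mu.min(axis=0)                                  # eq. (8): rule firing degree
    return (W_out * P[:, None]).sum(axis=0) / P.sum()   # eq. (9): weighted output

def kl_divergence(p, q):
    """Eq. (10): D(p||q) = sum_k p_k log(p_k / q_k), for strictly positive densities."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))
```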
To obtain a symmetric measure, one can define the similarity measure

SW(p_k, q_k) = D(p_k‖q_k) + D(q_k‖p_k) (11)

Then, for each weight vector w_j (j = 1, ..., N2), we calculate a six-bin (N_h = 6) weight histogram Hw(j, h) (h = 1, ..., N_h), so the weight probability distribution is estimated as

p_j(h) = Hw(j, h) / Σ_{h=1}^{N_h} Hw(j, h) (12)

Since small values indicate that the densities are 'close', we merge two rules when SW is smaller than a threshold δ_w, which is selected by the experiment described in Fig.3. In this way FINNs can combine rules to extract generalized rules, so as to improve generalization performance on pattern classification.

4.2.3 LMS Learning Phase

The goal of the LMS learning phase is to minimize the mean square error between the outputs of the network and the desired signals, which is achieved by adjusting parameters such as those determining the shape and the center of the membership functions explained before. For a single-output FINN, the mean square error function to be minimized is expressed as

E = Σ_s E_s (13)

E_s = (y − ŷ)^2 (14)

where y is the desired output, ŷ is the output inferred by the FINN, and s indexes the learning patterns. According to the LMS learning principle, the estimated value of the j-th node in the rule layer is updated as

w_jo(t+1) = w_jo(t) + ε_LMS (y − ŷ) P_j / Σ_k P_k (15)

The center and the width of the membership functions are updated as

w_ij(t+1) = w_ij(t) + ε (y − ŷ) ((w_jo Σ_k P_k − Σ_k w_ko P_k) / (Σ_k P_k)^2) q_ij (F_i − w_ij) / σ_ij^2 (16)

and σ_ij(t+1) = σ_ij(t) + ε (y − ŷ) ((w_jo Σ_k P_k − Σ_k w_ko P_k) / (Σ_k P_k)^2) q_ij (F_i − w_ij)^2 / σ_ij^3,

respectively, where q_ij = 1 if μ_ij = P_j, and 0 otherwise.

5 Experimental Results

The proposed algorithm was integrated into a system that was tested using several video sequences. Table 1 summarizes the structural parameters of FINNs.

Table 1. Structure of FINNs: N1 = 5, N3 = 2, ε(t = 0) = 0.5, ε_LMS = 0.001, σ(t = 0) = 4.

In order to perform training, we randomly selected three video sequences comprising 1152 frames. Rather than processing every frame of the video clip, our technique analyses only one out of every N (N = 10) frames, since there is typically little difference among consecutive frames. For each sampled frame, we labeled the ground truth manually; hence, the training data is composed of 117 sampled frames. As the result of the learning, the FINNs extracted 36 fuzzy inference rules. Since no perceptual theory exists regarding how to select the threshold δ_w in the rule-extracting phase, the parameter was determined by the experimental tests illustrated in Fig.3. The results were obtained from three digital video clips taken from a sports video (Clip1) and two movies, "Season I of Friends" (Clip2) and "Great Forrest Gump" (Clip3). Due to the very strong subjectivity of human visual attention, the precision of our method was subjectively examined by ten testers and averaged. Figure 3 shows the average precision curve for different thresholds δ_w; as we can see, the most reasonable results of our method are achieved by setting δ_w = 0.15.

Fig.3. Average precision related to the estimated threshold δ_w.

Examples of rules obtained from the proposed system are shown in Table 2. These extracted rules are natural and are considered to be correct. The width of each membership function corresponds to the diversity of the input of the fuzzy rule. When the width of a membership function is narrow, the input value is sensitive and has a large effect on the results. On the other hand, when the width is large, the input is not very important. Therefore, we can estimate the importance of each input.

Table 2. Examples of extracted rules (w_ij, σ_ij)
R1: F1: 0.80, 0.14; F2: 0.67, 0.06; F3: 0.82, 0.21; F4: 0.33, 0.29; F5: 0.56, 0.12
R2: F1: 0.66, 0.12; F2: 0.83, 0.06; F3: 0.67, 0.20; F4: 0.42, 0.15; F5: 0.67, 0.09
R3: F1: 0.99, 0.04; F2: 0.74, 0.08; F3: 0.82, 0.16; F4: 0.52, 0.12; F5: 0.79, 0.08

The experiments show that the input features CSr, OCr and MIr are more important than the other features, which is considered to be sound.
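The rule-merging criterion of Section 4.2.2 (eqs. (10)-(12)), with the experimentally selected threshold δ_w = 0.15, can be sketched as follows; the small smoothing constant is our assumption, added so empty histogram bins do not produce log(0):

```python
import numpy as np

def weight_pdf(w, bins=6, eps=1e-9):
    """Eq. (12): normalized six-bin histogram of one rule's weight vector."""
    hist, _ = np.histogram(w, bins=bins, range=(0.0, 1.0))
    p = hist.astype(float) + eps   # smoothing (our assumption) avoids log(0)
    return p / p.sum()

def symmetric_kl(p, q):
    """Eq. (11): SW(p, q) = D(p||q) + D(q||p)."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def should_merge(w_j, w_k, delta_w=0.15):
    """Merge two rules when their weight distributions are close enough."""
    return symmetric_kl(weight_pdf(w_j), weight_pdf(w_k)) < delta_w
```

Identical weight vectors give SW = 0 and are merged; rules whose weights concentrate in different bins give a large SW and are kept apart.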
Colour and motion have been found to be two of the strongest influences on visual attention [20]; in particular, a strong influence occurs when the colour of a region is distinct from the colour of its background, and our peripheral vision is highly tuned to detecting changes in motion.

Fig.4. Prominent region detection in a video sequence from Season I of Friends: (a) original frames; (b) homogeneous region segmentation; (c) prominent region detection results.

Fig.4 shows the results for two successive sampled frames taken from the movie Season I of Friends. Specifically, Fig.4(a) shows the original frames, (b) gives the corresponding results of homogeneous region segmentation, and (c) shows the prominent region detection results. Prominent regions are located and used to obtain a binary image in which white and black pixels represent prominent and non-prominent regions, respectively. As shown in Fig.4, one of the background regions not only has high color contrast but is also located near the center of the image, so both this region and the character Chandler draw more attention and have high perceptual prominence; both are correctly extracted.

Although a direct precision comparison with other systems is not possible due to the lack of standard system setups and video sequences, we compared our results for a sports video clip with the results of the algorithm using low-level information (luminance, color, texture) described in [16]. Fig.5 shows the comparison results. As we can see, some noisy areas in Fig.5(b) are removed correctly. Our results also accord with the fact of human selective attention, namely that when viewing the video clip, humans pay little attention to these noisy areas, even though they have a high contrast. This differs from viewing a single image.

Fig.5. Prominent region detection results for two frames taken from a sports video clip: (a) original image; (b) detection results using low-level information; (c) our detection results.

In our experiments, several limitations were found. One major problem is caused by noise areas that have the same contrast and similar motion consistency as the prominent regions. As demonstrated in Fig.6, one of the background regions is mistakenly assigned a high perceptual prominence. We expect this drawback can be alleviated by using spatial relations analysis, which is one of our future works. The shadow region of an object is the other limitation of our system, which is difficult to handle because, in many cases, a shadow region may be located at the center of an image, meaning that it has higher values of compositional balance and color contrast. In Fig.6, the shadow region between the two characters is mistakenly regarded as a prominent region.

Fig.6. Prominent region detection in a video sequence from "Great Forrest Gump".

6 Conclusions

A perception-oriented approach for identifying prominent regions in video sequences has been presented in this paper. We extract a number of different perceptual features by taking advantage of the mise-en-scene principles, in contrast to many previous researchers who have used only low-level features. Furthermore, considering the subjectivity and imprecision of human perception, a modified fuzzy inference neural network is well suited to classifying prominent regions, as it combines the learning ability of neural networks with the rule-processing ability of fuzzy logic. We provide a mechanism for prominent region detection through soft decisions. This framework can adequately capture the subjectivity involved in the perceptual prominence of regions. Moreover, the technique has been designed to easily accommodate application-specific requirements. Although the experimental results show encouraging performance, the conducted research also shows that there is plenty of room for improvement. Future work will focus on handling the limitations of the approach and improving the results.
Additional research work includes a suitable pattern description for the video sequences with the use of the extracted prominent regions.

References
[1] J. Fan, W.G. Aref, A.K. Elmagarmid, M.S. Hacid, M.S. Marzouk, and X. Zhu: MultiView: Multi-level Video Content Representation and Retrieval. J. Electron. Imaging, special issue on multimedia database, 10(4) (2001) 895-908
[2] Y. Deng and B.S. Manjunath: NeTra-V: Toward an Object-based Video Representation. IEEE Trans. Circuits Syst. Video Technol., 8 (1998) 616-627
[3] S.F. Chang, W. Chen, H.J. Meng, H. Sundaram and D. Zhong: A Fully Automatic Content-based Video Search Engine Supporting Spatiotemporal Queries. IEEE Trans. Circuits Syst. Video Technol., 8 (1998) 602-615
[4] J. Senders: Distribution of Attention in Static and Dynamic Scenes. In: Proceedings SPIE 3026 (1997) 186-194
[5] A. Yarbus: Eye Movements and Vision. Plenum Press, New York, NY (1967)
[6] L. Itti, C. Koch: Feature Combination Strategies for Saliency-based Visual Attention Systems. Journal of Electronic Imaging, 10(1) (2001) 161-169
[7] D. Parkhurst, K. Law, and E. Niebur: Modeling the Role of Salience in the Allocation of Overt Visual Attention. In: Proceedings ICIP (2003)
[8] David Bordwell, Kristin Thompson: Film Art: An Introduction. McGraw-Hill Higher Education (2001)
[9] D. Comaniciu, P. Meer: Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 24 (2002) 603-619
[10] S. Marcelja: Mathematical Description of the Responses of Simple Cortical Cells. Journal of the Optical Society of America, 70 (1980) 1169-1179
[11] Y.F. Ma, H.J. Zhang: A Model of Motion Attention for Video Skimming. In: Proceedings ICIP (2002) 22-25
[12] T. Nishina, M. Hagiwara: Fuzzy Inference Neural Network. Neurocomputing, 14 (1997) 223-239
[13] H. Iyatomi, M. Hagiwara: Scenery Image Recognition and Interpretation Using Fuzzy Inference Neural Networks.
Pattern Recognition, 35(8) (2002) 1793-1806
[14] Jia Li, James Z. Wang and G. Wiederhold: SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture Libraries. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(9) (2001)
[15] Ying Li, Y.F. Ma and H.J. Zhang: Salient Region Detection and Tracking in Video. In: Proceedings of ICME (2003) 269-272
[16] Alexander Dimai: Unsupervised Extraction of Salient Region-Descriptors for Content Based Image Retrieval. In: Proceedings ICIAP (1999) 686-672
[17] Lotfi A. Zadeh: A Note on Web Intelligence, World Knowledge and Fuzzy Logic. Data & Knowledge Engineering, 50 (2004) 291-304
[18] E.J. Pauwels, G. Frederix: Finding Salient Regions in Images: Nonparametric Clustering for Image Segmentation and Grouping. Computer Vision and Image Understanding, 75 (1999)
[19] Hang Fai Lau, Martin D. Levine: Finding a Small Number of Regions in an Image Using Low-level Features. Pattern Recognition, 35 (2002)
[20] E. Niebur and C. Koch: Computational Architectures for Attention. In: R. Parasuraman (ed.), The Attentive Brain. MIT Press (1997)

A Color Image Quantization Algorithm Based on Particle Swarm Optimization

Mahamed G. Omran and Andries P. Engelbrecht
Department of Computer Science, University of Pretoria, Pretoria 0002, SOUTH AFRICA
E-mail: mjomran@engineer.com, engel@cs.up.ac.za

Ayed Salman
Department of Computer Engineering, Kuwait University, KUWAIT
Phone: +965-4811188-5833, Fax: +965-4817451
E-mail: ayed@eng.kuniv.edu.kw

Keywords: Color image quantization, K-means clustering algorithm, particle swarm optimization, post-clustering quantization approaches, pre-clustering quantization approaches

Received: February 6, 2005

A color image quantization algorithm based on Particle Swarm Optimization (PSO) is developed in this paper. PSO is a population-based optimization algorithm modeled after the simulation of social behavior of bird flocks and follows similar steps as evolutionary algorithms to find near-optimal solutions.
The proposed algorithm randomly initializes each particle in the swarm to contain K centroids (i.e. color triplets). The K-means clustering algorithm is then applied to each particle at a user-specified probability to refine the chosen centroids. Each pixel is then assigned to the cluster with the closest centroid. The PSO is then applied to refine the centroids obtained from the K-means algorithm. The proposed algorithm is then applied to commonly used images. It is shown from the conducted experiments that the proposed algorithm generally results in a significant improvement of image quality compared to other well-known approaches. The influence of different values of the algorithm control parameters is studied. Furthermore, the performance of different versions of PSO is also investigated.

Povzetek: Evolucijski algoritem na osnovi jate ptičev je uporabljen za barvno obdelavo slik.

1 Introduction

Color image quantization is the process of reducing the number of colors presented in a digital color image [2]. It can be formally defined as follows [27]: given a set S' of N_S' colors, where S' is a subset of the N_d-dimensional data space, color quantization is a map f_q : S' → S'', where S'' is a set of N_S'' colors such that S'' ⊂ S' and N_S'' < N_S'. The objective is to minimize the quantization error resulting from replacing a color c ∈ S' with its quantized value f_q(c) ∈ S''.

Color image quantization consists of two major steps:
• Creating a colormap (or palette), where a small set of colors (typically 8-256 [20]) is chosen from the 2^24 possible combinations of red, green and blue (RGB).
• Mapping each color pixel in the color image to one of the colors in the colormap.

Therefore, the main objective of color image quantization is to map the set of colors in the original color image to a much smaller set of colors in the quantized image [32]. Furthermore, this mapping, as already mentioned, should minimize the difference between the original and the quantized images [8].

Color image quantization is an important problem in the fields of image processing and computer graphics [27]:
• It can be used in lossy compression techniques [27];
• It is suitable for mobile and hand-held devices where memory is usually small [18];
• It is suitable for low-cost color display and printing devices where only a small number of colors can be displayed or printed simultaneously [20];
• Most graphics hardware use color lookup tables with a limited number of colors [8].

The color quantization problem is known to be NP-complete [30]. This means that it is not feasible to find the global optimal solution, because this would require a prohibitive amount of time. To address this problem, several approximation techniques have been used. One popular approximation method is the use of a standard local search strategy such as K-means, which has already been applied to the color image quantization problem [22], [3]. However, K-means is a greedy algorithm whose result depends on the initial conditions, which may cause the algorithm to converge to suboptimal solutions. This drawback is magnified by the fact that the distribution of local optima is expected to be broad in the color image quantization problem due to the three-dimensional color space. In addition, this local optimality is expected to affect the visual image quality. The local optimality issue can be addressed by using stochastic optimization schemes.

In this paper, a new color image quantization algorithm based on Particle Swarm Optimization (PSO) is proposed. PSO is a population-based stochastic optimization algorithm modeled after the simulation of the social behavior of bird flocks and follows similar steps as evolutionary algorithms to find near-optimal solutions.
PSO and other evolutionary algorithms that depend on heuristics to find 'soft' solutions are considered to be soft computing algorithms. This population-based search approach reduces the effect of the initial conditions, compared to K-means (especially if the size of the population is relatively large). The feasibility of the approach is demonstrated by applying it to commonly used images. The results show that, in general, the proposed approach performs better than state-of-the-art color image quantization approaches.

The rest of the paper is organized as follows. Section 2 surveys related work in the field of color image quantization. An overview of PSO is given in Section 3. The proposed algorithm is presented in Section 4, while an experimental evaluation of the algorithm is provided in Section 5. Finally, Section 6 concludes the paper and provides guidelines for future research.

2 Related Work

Several heuristic techniques for color image quantization have been proposed in the literature. These techniques can be categorized into two main categories: pre-clustering and post-clustering. The next subsections discuss each of these categories.

2.1 Pre-clustering approaches

Pre-clustering approaches divide the color space into disjoint regions of similar colors. A representative color is then determined from each region. These representatives form the colormap. There are many fast algorithms in this category which are commonly used. The median cut algorithm (MCA) [10] is often used in image applications because of its simplicity [8]. MCA divides the color space repeatedly along the median into rectangular boxes until the desired number of colors is obtained. The variance-based algorithm (VBA) [28] also divides the color space into rectangular boxes. However, in VBA the box with the largest mean squared error between the colors in the box and their mean is split.
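A minimal sketch of the median cut idea described above (our simplification: the box with the widest channel range is split at the median; VBA would instead split the box with the largest squared error):

```python
import numpy as np

def median_cut(pixels, k):
    """Split color boxes at the median of their widest channel until k boxes
    exist, then return the mean color of each box as the palette."""
    boxes = [np.asarray(pixels, dtype=float)]
    while len(boxes) < k:
        # Pick the box with the widest per-channel range.
        i = max(range(len(boxes)), key=lambda b: np.ptp(boxes[b], axis=0).max())
        box = boxes.pop(i)
        ch = np.ptp(box, axis=0).argmax()        # widest channel
        box = box[box[:, ch].argsort()]          # sort along that channel
        mid = len(box) // 2
        boxes += [box[:mid], box[mid:]]          # split at the median
    return np.array([b.mean(axis=0) for b in boxes])
```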
The octree quantization algorithm [9] repeatedly subdivides a cube into eight smaller cubes in a tree structure of degree eight. Then adjacent cubes with the least number of pixels are merged. This is repeated until the required number of colors is obtained [5]. Octree produces results similar to MCA, but with higher speed and smaller memory requirements [8]. Xiang and Joy [32] proposed an agglomerative clustering method which starts with each image color as a separate cluster. Small clusters are then repeatedly clustered into larger clusters in a hierarchical way until the required number of colors is obtained. The abandoning of the fixed hierarchical division of the color space is a significant improvement over the octree approach [32]. A similar approach called Color Image Quantization by Pairwise Clustering was proposed by [27]. In this approach, a relatively large set of colors is chosen. An image histogram is then created. Two clusters that minimize the quantization error are then selected and merged together. This process is repeated until the required number of colors is obtained. According to [27], this approach performed better than MCA, VBA, octree, K-means and other popular quantization algorithms when applied to the two colored images used in their experiments. Xiang [31] proposed a color image quantization algorithm that minimizes the maximum distance between color pixels in each cluster (i.e. the intra-cluster distance). The algorithm starts by assigning all the pixels into one cluster. A pixel is then randomly chosen as the head of the cluster. A pixel that is the most distant from its cluster head is chosen as the head of a new cluster. Then, pixels nearer to the head of the new cluster move towards the new head forming the new cluster. This procedure is repeated until the desired number of clusters is obtained. The set of cluster heads forms the colormap. 
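The cluster-head selection of [31] described above resembles a farthest-point heuristic; a minimal sketch, assuming we pick the first pixel as the initial head instead of a random one:

```python
import numpy as np

def farthest_point_heads(pixels, k):
    """Repeatedly take the pixel farthest from its nearest head as a new head."""
    pixels = np.asarray(pixels, dtype=float)
    heads = [pixels[0]]                                   # initial cluster head
    d = ((pixels - heads[0]) ** 2).sum(axis=1)            # distance to nearest head
    while len(heads) < k:
        new = pixels[d.argmax()]                          # most distant pixel
        heads.append(new)
        d = np.minimum(d, ((pixels - new) ** 2).sum(axis=1))
    return np.array(heads)
```

The k heads then form the colormap; this greedy choice is a natural way to approximate minimizing the maximum intra-cluster distance.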
A hybrid competitive learning (HCL) approach combining competitive learning and splitting of the color space was proposed by [19]. HCL starts by randomly choosing a pixel as a cluster centroid. Competitive learning is then applied, resulting in assigning all the image pixels to one cluster surrounding the centroid. A splitting process is then conducted by creating another copy of the centroid; competitive learning is then applied on both centroids. This process is repeated until the desired number of clusters is obtained. According to [19], HCL is fast, completely independent of initial conditions and can obtain near-global optimal results. When applied to commonly used images, HCL outperformed MCA, VBA and K-means, and performed comparably with competitive learning [19], [20]. Braquelaire and Brun [2] compared the various pre-clustering heuristics and suggested some optimizations of the algorithms and data structures used. Furthermore, they proposed a new color space called H1 H2 H3 and argued that it improves the quantization heuristics. Finally, they proposed a new method which divides each cluster along the axis H1, H2 or H3 of greatest variance. According to [2], the proposed approach generates images with comparable quality to that obtained from better but slower methods in this category. Recently, Cheng and Yang [4] proposed a color image quantization algorithm based on color space dimensionality reduction. The algorithm repeatedly subdivides the color histogram into smaller classes. The colors of each class are projected onto a line. This line is defined by the mean color vector and the color most distant from the mean color. For each class, the vector generated from the projection is then used to cluster the colors into two representative palette colors. This process is repeated until the desired number of representative colors is obtained. All color vectors in each class are then represented by their class mean.
Finally, all these representative colors form the colormap. According to [4], this algorithm performed better than MCA, and performed comparably to SOM when applied on commonly used images.

2.2 Post-clustering approaches

The main disadvantage of the pre-clustering approaches is that they only work with color spaces of simple geometric characteristics. On the other hand, post-clustering approaches can work with arbitrarily shaped clusters. Post-clustering approaches perform clustering of the color space [4]. A post-clustering algorithm starts with an initial colormap. It then iteratively modifies the colormap to improve the approximation. The major disadvantage of post-clustering algorithms is that they are time consuming [8].

The K-means algorithm is one of the most popular post-clustering algorithms. It starts with an initial set of colors (i.e. an initial colormap). Then, each color pixel is assigned to the closest color in the colormap. The colors in the colormap are then recomputed as the centroids of the resulting clusters. This process is repeated until convergence. The K-means algorithm has been proven to converge to a local optimum [8]. As previously mentioned, a major disadvantage of K-means is its dependency on initial conditions. FCM [1] and Learning Vector Quantization [16] have also been used for color image quantization. Scheunders and De Backer [21] proposed a joint approach using both competitive learning and a dithering process to overcome the problem of contouring effects when using small colormaps. Fiume and Quellette [7] proposed an approach which uses simulated annealing for color image segmentation; pre-clustering approaches were used to initialize the colormap. Self-Organizing Maps (SOMs) [15] were used by [5] to quantize color images. The approach selects an initial colormap, and then modifies the colors in the colormap by moving them in the direction of the image color pixels.
However, to reduce the execution time, only samples of the colors in the image are used. According to [5], the algorithm performs better than MCA and octree. Rui et al. [18] presented an initialization and training method for SOM that reduces the computational load of SOM and at the same time generates reasonably good results.

A hybrid approach combining evolutionary algorithms with K-means has been proposed by [8]. A population of individuals, each representing a colormap, is arbitrarily initialized. Then, after each generation, the K-means algorithm (using a few iterations) is applied to each individual in the population. The standard error function of the Euclidean distance is chosen to be the fitness function of each individual. Based on the experiments conducted by [8], this hybrid approach outperformed both the MCA and octree algorithms. The Genetic C-means algorithm (GCMA) uses a similar idea: a hybrid approach combining a genetic algorithm with K-means was proposed by [20]. The fitness function of each individual in the population is set to be the mean square error (MSE), defined as

MSE = ( Σ_{k=1}^{K} Σ_{z_p ∈ C_k} (z_p − m_k)^2 ) / N_p (1)

where K is the number of clusters, C_k is the k-th cluster with centroid m_k, z_p is a pixel color, and N_p is the number of pixels in the image. As in [8], each chromosome represents a colormap. GCMA starts with a population of arbitrarily initialized chromosomes. K-means is then applied to all the chromosomes to reduce the search space. A single-point crossover is then applied. This is followed by the application of mutation, which randomly decides if a value of one is added to (or subtracted from) the gene's value (i.e. mutating the gene's value by ±1). All the chromosomes are then pairwise compared, and the chromosome with the lowest MSE replaces the other chromosome. This process is repeated until a stopping criterion is satisfied. A faster version of this approach can be obtained by applying K-means only to the best chromosome in each generation.
For the remaining chromosomes, an approximation of K-means is used, where a single iteration of K-means is applied on a randomly chosen subset of pixels. This process is repeated a user-specified number of times using different subsets. GCMA outperformed MCA, VBA, K-means, competitive learning and HCL when applied on commonly used images [19], [20]. However, GCMA is computationally expensive.

3 Particle Swarm Optimization

Particle swarm optimizers are population-based optimization algorithms modeled after the simulation of social behavior of bird flocks [12], [13]. PSO is generally considered to be an evolutionary computation (EC) paradigm. Other EC paradigms include genetic algorithms (GA), genetic programming (GP), evolutionary strategies (ES), and evolutionary programming (EP) [6]. These approaches simulate biological evolution and are population-based. In a PSO system, a swarm of individuals (called particles) fly through the search space. Each particle represents a candidate solution to the optimization problem. The position of a particle is influenced by the best position visited by itself (i.e. its own experience) and the position of the best particle in its neighborhood (i.e. the experience of neighboring particles). When the neighborhood of a particle is the entire swarm, the best position in the neighborhood is referred to as the global best particle, and the resulting algorithm is referred to as the gbest PSO. When smaller neighborhoods are used, the algorithm is generally referred to as the lbest PSO [24]. The performance of each particle (i.e. how close the particle is to the global optimum) is measured using a fitness function that varies depending on the optimization problem.

Each particle in the swarm is represented by the following characteristics:
x_i: the current position of the particle;
v_i: the current velocity of the particle;
y_i: the personal best position of the particle.

The personal best position of particle i is the best position (i.e.
one resulting in the best fitness value) visited by particle i so far. Let f denote the objective function. Then, for a minimization problem, the personal best of particle i at time step t is updated as

y_i(t+1) = y_i(t) if f(x_i(t+1)) ≥ f(y_i(t)),
y_i(t+1) = x_i(t+1) if f(x_i(t+1)) < f(y_i(t)).
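A minimal gbest PSO sketch for a minimization problem, using the personal-best update above; the inertia weight and acceleration coefficients (w = 0.72, c1 = c2 = 1.49) and the search range are common choices in the PSO literature, not values taken from this paper:

```python
import numpy as np

def gbest_pso(f, dim, n_particles=20, iters=100, w=0.72, c1=1.49, c2=1.49, seed=0):
    """Minimize f over [-5, 5]^dim with a global-best particle swarm."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))     # current positions x_i
    v = np.zeros_like(x)                           # current velocities v_i
    pbest = x.copy()                               # personal best positions y_i
    pbest_f = np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()             # global best position
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f                      # the personal-best update rule
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[pbest_f.argmin()].copy()
    return g, float(pbest_f.min())
```

On a simple convex function such as the sphere function, this sketch rapidly contracts the swarm around the optimum.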