Original scientific paper

MDE-based Rapid DSE of Multi-core Embedded Systems: The H.264 Decoder Case Study

Manel Ammar1, Mouna Baklouti1, Maxime Pelcat2, Karol Desnos2 and Mohamed Abid1

1CES Laboratory, National Engineering School of Sfax, Sfax, Tunisia
2IETR, INSA Rennes, CNRS UMR 6164, UEB, Rennes, France

Abstract: Recent advances in the Unified Modeling Language (UML) constitute a valuable milestone for its application to the design space exploration of modern embedded systems. It is essential to remember, however, that UML does not by itself solve the difficulties of embedded systems analysis; it only provides standard modeling means. A reliable Design Space Exploration (DSE) process suited to the peculiarities of complex embedded systems design is therefore needed to complement the use of UML for design space exploration. In this article, we propose a Model Driven Engineering (MDE) based co-design flow that combines high-level analysis of data-intensive applications with rapid prototyping. To specify the embedded system, our methodology relies on the Modeling and Analysis of Real-Time and Embedded Systems (MARTE) UML profile. Moreover, the present contribution uses the Parameterized and Interfaced Synchronous Dataflow (πSDF) Model of Computation (MoC) and a model based on the IP-XACT standard as intermediate levels of abstraction that facilitate the analysis step in the co-design flow. The rapid prototyping process relies on the πSDF graph of the application and a system-level description of the architecture. This paper presents our Hw/Sw co-specification methodology, including its support for gradual refinement of the high-level models towards lower levels of abstraction for design space exploration purposes.

Keywords: Co-Design; MP2SoC; MDE; MARTE; πSDF; S-LAM; PREESM; SoC

* Corresponding Author's e-mail: manel.ammar@ceslab.org

Journal of Microelectronics, Electronic Components and Materials, Vol. 46, No. 4 (2016), 219-228
1 Introduction

At the present time, Massively Parallel Multi-Processor Systems-on-Chip (MP2SoC) are commonly dedicated to data-intensive processing applications, in which huge amounts of data are handled in a regular way by means of repetitive computations. As performance is a key feature of emerging MP2SoCs, the design of such systems should meet strict time-to-market and cost constraints while holding the promise of increased performance through parallelism.

Performance depends on a diverse set of factors (granularity of the application, model of the architecture, partitioning and allocation choices, etc.) and parameters (number of processing units, memory sizes, etc.). Design Space Exploration (DSE) means adjusting these factors and parameters, while taking into account a set of metrics (execution time, latency, throughput, energy, etc.), to find the best combination of MP2SoC architecture and data-intensive processing application at an early phase of the system design. The DSE of complex embedded systems raises three issues:
- The modeling effort, which depends on the specification methodology
- The evaluation effort, which depends on the performance estimation techniques and tools
- The accuracy of the results, which depends on the exploration strategies that prune the vast design space while still reaching accurate performance numbers

Research on the DSE of modern applications running on complex Systems-on-Chip (SoC) is still emerging. Several design frameworks have been suggested that enable high-level system specification. Based on the Model Driven Engineering (MDE) guidelines, the Unified Modeling Language (UML) [1] semantics and the Modeling and Analysis of Real-Time and Embedded Systems (MARTE) [2] profile annotations, these frameworks provide a model-based specification methodology that stresses the use of models throughout the embedded systems development life cycle and advocates automation via meta-modeling, model transformation and code generation techniques.

In addition, state-of-the-art DSE frameworks rely on different evaluation techniques and exploration strategies. A common practice for embedded systems performance estimation in these approaches [3, 4, 5] is simulation. Although simulation is more accurate, it is often too time-consuming to be placed inside the design space exploration loop. Moreover, it requires well-defined, rich input models and therefore extensive specification effort. The COMPLEX framework [3], for example, uses MDE foundations for the co-design of embedded systems, but it directly generates executable files of the system to run a simulation-based DSE process. Providing such an executable model at an early phase of the design process may introduce an unjustified burden when making early design decisions. On the contrary, analytical design space exploration approaches [6, 7, 8] do not depend on simulators or on running code on real hardware. They rather take a high-level specification of the embedded application, combine it with a high-level model of the architecture and perform a static analysis to obtain performance measurements for this combination. For this reason, we propose a purely analytical approach based on the high-level analysis of the embedded system. While building a simulation model is computationally costly, analytical estimations can be used to accelerate the design process.
In this paper, we propose an automatic approach that takes advantage of MDE and MARTE and defines two levels of abstraction that ease the analysis and generation of data-intensive processing applications running on multi-processor architectures. The first level is based on a recent extension of the well-known Synchronous Data Flow (SDF) [9] Model of Computation (MoC), the Parameterized and Interfaced Synchronous Dataflow (πSDF) [10] model. A second level is introduced in our platform-based co-design flow to facilitate IP integration, architecture generation and system analysis. This level complies with a model based on the IP-XACT standard [11] named the System-Level Architecture Model (S-LAM) [12]. A high-level MARTE-based specification of the parallel architecture can thus be refined in an MDE-based process to produce the S-LAM description of the platform. These two abstraction levels were integrated with the PREESM [13] system-level rapid prototyping tool in an MDE-based co-design flow for complex embedded systems design.

In previous work [14], the UML/MARTE methodology for modeling the data-parallel application and the automatic generation of the πSDF specification were presented. In [15], the automatic generation of the S-LAM description of the architecture from the UML/MARTE specification was explained. In this paper, a complete overview of the proposed DSE flow is given in Section 2. Moreover, the paper contributes new features not addressed in previous work, namely the integration of the PREESM rapid prototyping tool inside the DSE framework and the validation of the proposed DSE methodology on a more complex case study (the H.264 decoder), which is addressed in Section 3.

2 The proposed co-design flow

Accurate performance numbers can be reached at the cost of very detailed modeling. Conversely, a moderate modeling effort leads to a high-level evaluation task, but accuracy is lost. In the proposed co-design framework, a balanced tradeoff has been struck between exploration speed and accuracy, allowing extremely rapid system-level analysis while still yielding reliable estimations. Our framework (Figure 1) is a complete and automatic Computer-Aided Design (CAD) tool for the co-specification, design space exploration and code generation of MP2SoC systems that relies entirely on MDE techniques. Being based on the Eclipse framework, the front-end, the transformation chains and the back-end tools are grouped together in a fully integrated flow.

Figure 1: Our proposed co-design flow

2.1 UML/MARTE front-end: modeling concepts

The proposed co-specification methodology supports the description of the architecture, the application, the allocation and the deployment within a unified UML model. The architecture, the application and the allocation are described by means of UML classes (class diagram and composite structure diagram) annotated with MARTE stereotypes (Table 1). The deployment of software and hardware IPs, which describes the implementation details of application tasks and architecture components, is expressed with the UML deployment diagram decorated with stereotypes of our proposed deployment profile, as shown in Table 1.

Table 1: Used MARTE subset for the architecture (Hw), the application (Sw) and the allocation models

Concept                  | Stereotype               | Package     | Model
-------------------------|--------------------------|-------------|-----------
Processing resource      | HwProcessor              | MARTE:HRM   | Hw
Storage resource         | HwMemory                 | MARTE:HRM   | Hw
Communication resource   | HwCommunicationResource  | MARTE:HRM   | Hw
Task                     | SwSchedulableResource    | MARTE:SRM   | Sw
Communication port       | FlowPort                 | MARTE:GCM   | Hw, Sw
Repetitive component     | Shaped                   | MARTE:RSM   | Hw, Sw
Complex link topology    | Tiler                    | MARTE:RSM   | Hw, Sw
Complex link topology    | Reshape                  | MARTE:RSM   | Hw, Sw
Simple allocation        | Allocate                 | MARTE:Alloc | Allocation
Repetitive allocation    | Distribute               | MARTE:Alloc | Allocation
Hardware component       | HwResource               | MARTE:HRM   | Hw
Hardware IP              | HwIP                     | Deployment  | Deployment
Software IP              | SwIP                     | Deployment  | Deployment

2.2 Refinements and abstraction levels inside the proposed flow

Two MoCs were introduced in the proposed co-design flow to prune the design space exploration step: πSDF and S-LAM.
2.2.1 The πSDF MoC

In general, MoCs can be evaluated based on their expressiveness and analyzability. The SDF MoC has proved to be a very successful means, providing a good degree of expressiveness while offering a lot of analysis potential [16] [17]. This combination makes the MoC very attractive in the domain of multimedia applications for embedded systems, since throughput, storage requirements and latency can easily be estimated using analysis methods. Being able to specify complex hierarchical and parametric dataflow applications, the πSDF MoC extends the SDF MoC while preserving its expressiveness and analyzability. This MoC promotes rapid design space exploration and reconfigurable resource allocation for heterogeneous multicore systems.

A πSDF graph is a directed graph represented by a tuple G = (A, F, I, π, Δ), where A is a set of actors and F is a set of FIFOs. The hierarchical compositionality mechanism is based on the set of hierarchical interfaces I. Furthermore, dynamism in πSDF relies on π and Δ, which describe respectively the set of parameters and their dependencies.
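To make this tuple concrete, the following sketch encodes a πSDF graph as a small Python data structure, reusing the actor, interface and parameter names of the case study in Section 3 (EnDec_IQT, MC, MC_Row, DF, X, Y, nbrow). It is only a didactic illustration under our own assumptions; it is neither the PREESM .pi format nor the formal semantics of [10].

# Illustrative encoding of a piSDF graph G = (A, F, I, pi, Delta).
# Names follow the case study of Section 3; this is not the PREESM .pi format.
from dataclasses import dataclass, field

@dataclass
class Actor:
    name: str
    subgraph: "PiSDFGraph | None" = None   # hierarchical actors refine a subgraph

@dataclass
class Fifo:
    src: str    # producing actor or source interface
    dst: str    # consuming actor or sink interface
    prod: str   # production rate (may be a parameter expression)
    cons: str   # consumption rate (may be a parameter expression)

@dataclass
class PiSDFGraph:
    actors: dict = field(default_factory=dict)        # A
    fifos: list = field(default_factory=list)         # F
    interfaces: list = field(default_factory=list)    # I
    params: dict = field(default_factory=dict)        # pi
    param_deps: list = field(default_factory=list)    # Delta

# Top-level graph: entropy decoding + IQT, hierarchical MC, deblocking filter.
top = PiSDFGraph(params={"X": "22", "Y": "18", "nbrow": "18"})
top.actors = {n: Actor(n) for n in ("EnDec_IQT", "MC", "DF")}
top.fifos = [Fifo("EnDec_IQT", "MC", prod="X*Y", cons="X*Y"),
             Fifo("MC", "DF", prod="X*Y", cons="X*Y")]

# Subgraph refining MC: nbrow data-parallel repetitions of MC_Row.
mc = PiSDFGraph(interfaces=["Frm_in", "Frm_out"],
                params={"X": "22", "Y": "18", "nbrow": "18"})
mc.actors = {"MC_Row": Actor("MC_Row")}
mc.fifos = [Fifo("Frm_in", "MC_Row", prod="X*Y", cons="X*(Y/nbrow)"),
            Fifo("MC_Row", "Frm_out", prod="X*(Y/nbrow)", cons="X*Y")]
mc.param_deps = [("nbrow", "MC_Row repetition count")]  # Delta: rates depend on nbrow
top.actors["MC"].subgraph = mc

With nbrow = 18, the repetition count of MC_Row derived from the rates is X·Y divided by X·(Y/nbrow), i.e. nbrow, which is exactly the data-parallelism exploited later during scheduling.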
2.2.2 The S-LAM MoC

To be properly analyzed and prototyped, the hardware architecture of a given embedded system needs to be described at system level. The S-LAM MoC, which facilitates such specifications, allows a simple and expressive description while enabling rapid simulations. Although compatible with the IP-XACT model, the S-LAM meta-model does not use the entire IP-XACT meta-model; it exploits a subset of concepts that capture the information needed for the exploration phase [15]. There are two main motivations behind the use of S-LAM as an intermediate abstraction level:
- Simplicity: S-LAM ignores the implementation details of each component of the hardware architecture while capturing its primary properties.
- Compositionality: S-LAM makes a hierarchical description of the system possible, which facilitates the specification of the complex structure of massively parallel architectures.

2.2.3 Refinements using transformation chains

Three transformation chains were defined in our co-design flow. The first transformation chain generates the πSDF description of the data-parallel application, the second generates an S-LAM-compliant description of the parallel architecture, and the third generates a scenario file, the third design entry of the rapid prototyping tool. The implementation of a transformation flow in the MDE approach relies on the definition of ad-hoc meta-models for each abstraction level. For this reason, three meta-models were proposed: the MARTE and Deployment meta-model, the πSDF meta-model and the S-LAM meta-model.

In addition, model-to-model and model-to-text transformations were defined inside the transformation chains, as depicted in Figure 2. In our approach, model-to-model transformation rules are written in the QVTO language [18] and model-to-text transformation rules are described using the Acceleo tool. Following the MDE principles, an automatic transformation was developed to generate a MARTE-compliant model from the UML-based specification. This first model-to-model transformation produces generic models of the application, the architecture and the allocation that conform to the MARTE meta-model.

The MARTE to πSDF transformation chain [14] takes as input the generic application model resulting from the first transformation and produces a πSDF specification that conforms to the πSDF meta-model. The S-LAM transformation chain [15] generates a model that conforms to the S-LAM meta-model, taking the generic architecture model as its entry point. The generated πSDF and S-LAM models are then processed by two model-to-text transformations to produce the .pi and .slam files of the application and the architecture. The generation of a scenario file aims at separating the algorithm and architecture constraints from the system-level models.

Figure 2: Refinements and abstraction levels inside the proposed flow

2.3 Rapid prototyping with PREESM

Based on the previously described steps (co-specification and successive refinements), we have defined a multi-level design space exploration methodology that relies on high-level models and refinement chains to speed up the analysis of high-performance embedded systems. The final step in the proposed approach is the rapid prototyping of the πSDF/S-LAM combination using PREESM, which is described in this section.

The flexible rapid prototyping process in PREESM consists of exploring the design tradeoffs at system level while taking into account the system constraints and objectives present in the scenario file. The central feature of the rapid prototyping method is the multi-core scheduler. Before starting the scheduling phase, PREESM performs two transformations that expose the parallelism of the application and of the architecture. On the one hand, the πSDF graph is transformed into a hierarchical SDF graph, then into a homogeneous SDF graph and finally into a Directed Acyclic Graph (DAG), which is processed by the scheduler. On the other hand, a route model is generated from the S-LAM model to facilitate the allocation task. The NP-complete scheduling process in PREESM consists of two separate operations:
- Assignment: relies on the DAG model of the application to assign actors to operators
- Cost evaluation: relies on the route model and the scenario to estimate the cost of the proposed solution

Such a scheduling process must satisfy both the data dependencies between the tasks of the application and the execution constraints imposed by the execution platform. It also increases predictability and allows precise performance estimations. At the end of the scheduling process, a Gantt chart plotting the best schedule found is displayed. Memory storage requirements and speedup values are also estimated and plotted in separate charts.
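To illustrate how the assignment and cost evaluation steps cooperate on the DAG, the following simplified list-scheduling sketch (our own Python illustration with invented timings, not the actual PREESM scheduler nor its FAST heuristic) greedily assigns each ready actor to the operator that minimizes its finish time:

# Simplified list scheduling over a DAG: assignment + cost evaluation.
# Illustrative only: timings are invented and this is not the PREESM scheduler.

def list_schedule(dag, timings, operators):
    """dag: {actor: [predecessors]}, timings: {actor: duration},
    operators: list of operator names. Returns (assignment, makespan)."""
    finish = {}                               # actor -> finish time
    op_free = {op: 0 for op in operators}     # operator -> next free time
    assignment = {}
    remaining = set(dag)
    while remaining:
        # actors whose predecessors have all been scheduled are ready
        ready = sorted(a for a in remaining if all(p in finish for p in dag[a]))
        for actor in ready:
            best = None
            for op in operators:              # assignment step: try every operator
                start = max(op_free[op],
                            max((finish[p] for p in dag[actor]), default=0))
                end = start + timings[actor]  # cost evaluation step
                if best is None or end < best[0]:
                    best = (end, op)
            end, op = best
            assignment[actor], finish[actor], op_free[op] = op, end, end
            remaining.remove(actor)
    return assignment, max(finish.values())

# Toy DAG: EnDec_IQT feeds two MC_Row repetitions, which feed DF.
dag = {"EnDec_IQT": [], "MC_Row_0": ["EnDec_IQT"],
       "MC_Row_1": ["EnDec_IQT"], "DF": ["MC_Row_0", "MC_Row_1"]}
timings = {"EnDec_IQT": 4, "MC_Row_0": 6, "MC_Row_1": 6, "DF": 3}
print(list_schedule(dag, timings, ["PE0", "PE1"]))

Running the sketch on this toy DAG assigns the two MC_Row repetitions to different operators and reports a makespan of 13 time units. PREESM performs the same kind of evaluation, but additionally accounts for communication costs through the route model and uses the timings supplied by the scenario.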
3 Experimental results

Our objective is to illustrate the effectiveness of the proposed co-design framework in terms of the rapidity and accuracy of the exploration results. The H.264 decoder application [19], a typical data-intensive signal processing application, is chosen to demonstrate the efficiency of the proposed exploration tools. We mainly focus on a coarse-grain parallelization technique from the literature [20] and try to predict the benefit of running such a complex application on massively parallel architectures.

3.1 The H.264/AVC decoder

Among the numerous video compression standards, H.264 is very effective in terms of compression and quality. Providing a compression efficiency gain of 50% compared to previous standards, the H.264 codec has proved its effectiveness in high-definition systems as well as in low-resolution devices. The H.264/AVC decoder splits each frame of a given video sequence into macroblocks (blocks of 16 × 16 pixels). These macroblocks are decoded in raster scan order using intra-prediction, inter-prediction and a deblocking filter.

With the relentless growth of video resolutions, the processing time of this decoder keeps increasing. Executing such a complex application on parallel cores should solve this problem. However, the dependencies, data coherency and synchronization involved in the intra-prediction, inter-prediction and deblocking filter kernels make the parallelization task very hard.

In recent years, coarse-grain and fine-grain parallelization techniques have been proposed. Coarse-grain methods decode groups of pictures, frames or slices in parallel. Fine-grain techniques work on smaller units, the macroblocks. In this case study, we focus on the motion compensation parallelization technique [20]. This technique divides the frame into rows of independent macroblocks, as shown in Figure 3, and the motion compensation stage is processed for each row of the frame. The decoding process begins with entropy decoding. Then, dequantization and inverse transformation are executed on the resulting data. Afterwards, every row of macroblocks of a frame is inter-predicted (motion compensation task). Finally, the deblocking filter is applied.

Figure 3: Parallel motion compensation application block diagram

3.2 The target model of the architecture

MP2SoC, as presented in Figure 4, is composed of a parametric number of processing elements (PE) grouped into two clusters. The clusters communicate via a global interconnection network. The first cluster, Cluster_0, contains one processing element connected to its local memory and can act as the global controller of the architecture. Inside the second cluster, each processing element is connected to its local memory and can communicate with the other processors via a local network. The presented architecture [21] is parametric and configurable so as to suit a wide range of systematic signal processing applications. Its design is based on an IP assembly approach.

Figure 4: MP2SoC architecture

3.3 Modeling the H.264 decoder

In Figure 5, the shaped stereotype associated with the instance of the MC_Row class denotes the data-parallelism of the application. Each repetition of the motion compensation task consumes one input pattern and produces one output pattern. A pattern corresponds to a row of macroblocks.

Figure 5: Parallel motion compensation in UML
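The structure exploited by this decomposition can be summarized by the following self-contained Python sketch (our own illustration; the kernels are trivial stand-ins, not the real H.264 kernels of [20]): a sequential front-end, a motion compensation stage applied in parallel to nbrow independent groups of macroblock rows, and a sequential deblocking filter.

# Toy illustration of the row-level coarse-grain parallelism of Sections 3.1/3.3.
# All kernels are trivial stand-ins for the actual decoder stages.
from concurrent.futures import ThreadPoolExecutor

X, Y = 22, 18                                  # CIF: 22 x 18 macroblocks

def entropy_decode_iqt(bitstream):
    # stand-in front-end: Y rows of X "macroblocks" (plain integers here)
    return [[r * X + c for c in range(X)] for r in range(Y)]

def motion_compensate_rows(group):
    # stand-in for inter prediction of one group of macroblock rows
    return [[mb + 1 for mb in row] for row in group]

def deblocking_filter(frame):
    return frame                               # stand-in back-end

def decode_frame(bitstream, nbrow=18, workers=8):
    rows = entropy_decode_iqt(bitstream)       # sequential
    chunk = len(rows) // nbrow                 # macroblock rows per group
    groups = [rows[i * chunk:(i + 1) * chunk] for i in range(nbrow)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = pool.map(motion_compensate_rows, groups)   # parallel MC stage
    frame = [row for group in done for row in group]
    return deblocking_filter(frame)

print(len(decode_frame(b"", nbrow=9)), "macroblock rows decoded")  # 18

Lowering nbrow enlarges each group (two macroblock rows per group when nbrow = 9 for CIF), which is exactly the trade-off explored in Section 3.8.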
Table 2 summarizes the multiplicities of the consumed and produced patterns inside the Decode_Frm and MC classes, where X and Y denote the number of macroblocks in the horizontal and vertical directions, and nbrow specifies the number of rows processed in parallel. To guarantee the accuracy of the exploration results, the deadlineElements attribute of the SwSchedulableResource stereotype is used to specify how much program execution time each elementary task consumes, as seen in Figure 5.

Table 2: Multiplicities inside the parallel motion compensation application

Class        | EnDec_IQT | MC      | MC      | DF    | MC_Row          | MC_Row
Port         | Frm       | Frm_in  | Frm_out | Frm   | row_in          | row_out
Multiplicity | X × Y     | X × Y   | X × Y   | X × Y | X × (Y ÷ nbrow) | X × (Y ÷ nbrow)

3.4 Modeling the MP2SoC architecture

Figure 6 shows the UML specification of the MP2SoC architecture. The main component of the architecture, named MP2SoC_Architecture, is composed of two clusters connected via a global network. The Cluster_1 class encloses a parametric number of processing units (PU) specified using the Shaped repetition concept. The Tiler connector (whose attributes are not shown in this figure for simplicity) between the global_net port of the PU and the global_net port of Cluster_1 specifies how the processing units are regularly connected to the global network.

Figure 6: MP2SoC UML models

3.5 Partial allocation view of the case study

Data-parallel splitting of the motion compensation process, while leaving parts of the application sequential, requires a constrained allocation view that guides the rapid prototyping process. Figure 7 displays the main component of the H.264 decoder and the main component of the hardware architecture. Since parallel processing is not needed for the EnDec_IQT and DF tasks, they are entirely mapped onto the processing unit of Cluster_0 via Allocate links. The data-parallel splitting of the MC_Row task imposes the distribution of the repetitions of this task onto the processing units of Cluster_1 using the Distribute stereotype.

Figure 7: Allocation for the motion compensation parallelization

3.6 Deployment modeling

Figure 8 presents the deployment of the PE elementary component onto the PE_IP artifact, stereotyped hwIP. This class is deployed on the Nios II processor. The Nios II processor IP is provided with the hardware library associated with our framework, which contains processor, memory and communication network IPs. While the filePath attribute facilitates the generation of the MP2SoC source code, the vlnv attribute guides the S-LAM generation process, as it gathers the required IP properties.

Our framework integrates a source code generator that produces the implementation of a given MP2SoC architecture. Currently, Nios II-based systems can be directly generated from a specification of the architecture parameterized in terms of the number of processors.

Figure 8: Deployment of the PE

3.7 Executing transformation chains within the case study

3.7.1 Generation of πSDF files

Executing the πSDF transformation chain on the parallel motion compensation application generates one πSDF file for each hierarchical class: the Decode_Frm.pi (Figure 9(a)) and MC.pi (Figure 9(b)) files. The hierarchical structure of the application is preserved during the transformation.

Figure 9: .pi files of the parallel motion compensation application
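The rates in these .pi files follow directly from the multiplicities of Table 2. As a quick consistency check (the arithmetic here is ours, not taken from the exploration results): each frame carries X × Y macroblock tokens, every repetition of MC_Row consumes X × (Y ÷ nbrow) of them, so the nbrow repetitions together consume nbrow × X × (Y ÷ nbrow) = X × Y tokens, exactly the amount produced by EnDec_IQT and expected by DF. For the CIF sequence used in Section 3.8 (X = 22, Y = 18) this amounts to 396 macroblocks per frame, i.e. 22 macroblocks per MC_Row repetition when nbrow = 18 and 44 when nbrow = 9.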
3.7.2 Generation of S-LAM files

Different configurations of MP2SoC were generated by varying the number of processing units (by changing the shape value of the PU class). For the rapid prototyping of the parallel motion compensation application, four architectures were produced, containing 2, 4, 8 and 16 processing units in Cluster_1. The generation of a complete MP2SoC architecture containing 8 processing units is illustrated in Figure 10.

Figure 10: S-LAM files of the MP2SoC architecture

3.8 Exploration results in the case study

In the parallel motion compensation approach, the motion compensation task of each row of a P-frame is executed in parallel on different cores. We evaluate this parallelization method using the CIF (352 × 288) resolution, for which each frame has 22 horizontal and 18 vertical macroblocks.

Figure 11: Speedup of the parallel motion compensation parallelization method

Figure 11 shows the average speedup of the motion compensation task for different numbers of rows. For the CIF resolution, the maximum speedup of 4.75 is reached using 16 processing units and 18 rows. For the same number of processing units, the speedup decreases as the number of rows decreases: decreasing the row number increases the number of macroblocks inside each row, which slows down the execution. Conversely, increasing the row number improves the speedup. For example, doubling the row number (from 9 to 18) improves the speedup by 45% on an architecture with eight processing units. The reason is that the scheduler distributes a set of small rows on the processing units and, once the parallel execution of these rows has completed, it can rapidly distribute another batch of rows.

The speedup for 18 rows is around 4 when eight processing units are used. Doubling the number of processing units raises the speedup to 4.75, which is less efficient than expected, since only a modest improvement of 18% is gained. The main reasons are the extra time needed for synchronization between processing units and the large data transfer overhead, which lengthens the execution time.
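To put these figures in perspective (the following arithmetic is ours): going from eight to sixteen processing units raises the 18-row speedup from roughly 4 to 4.75, a ratio of 4.75 ÷ 4 ≈ 1.19, which corresponds to the quoted gain of about 18%, whereas perfect scaling would have doubled the speedup. Likewise, the 45% improvement obtained by moving from 9 to 18 rows on eight processing units implies a speedup of roughly 4 ÷ 1.45 ≈ 2.8 with 9 rows. The widening gap to linear speedup is consistent with the synchronization and data transfer overheads mentioned above.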
4 Conclusion

The starting point of our study was to adapt a methodology for the co-design of complex embedded systems. Previous research in the co-design domain focuses on simulation for system analysis. While some other works promote the elevation of design abstraction levels, they do not benefit from the advantages offered by the MARTE profile and the novel πSDF MoC. The contribution of this paper is the definition of an MDE-based flow that takes as input UML diagrams specified with the MARTE profile and transforms them into intermediate models corresponding to the πSDF and S-LAM models. These intermediate models add semantics and analysis techniques, with the goal of analyzing the application, exploring the design space of possible implementations and generating the system implementation.

A component-based approach provides the means to decompose a complex system into simpler components. As we have seen in this paper, our framework takes advantage of this approach for the specification of the data-intensive application and of the massively parallel architecture. The complex structure of data-intensive applications makes them well suited to compositional or hierarchical design. Compositional design of MP2SoC architectures can be achieved using hardware components consisting of elementary or composite classes arranged in a hierarchical manner. In Section 3, we described a compositional specification technique based entirely on composite structure diagram concepts, in which a complex application is divided into simpler tasks and a given MP2SoC architecture is specified in a bottom-up manner by assembling elementary components into a hierarchy.

To fully benefit from component-based design for MP2SoC systems, the scheduler must also be adapted to high-performance applications running on clusters. The static scheduling algorithms implemented within the PREESM scheduler assign tasks to computing resources before the application is executed. At compilation time, task execution times and communication times are assumed to be known and specified, as discussed in Section 3. These algorithms, including the list scheduling and FAST algorithms, are mainly dedicated to scheduling tasks on multi-core systems. Prototyping the H.264 decoder with the scheduling kernel of PREESM revealed some limitations, including:
- Limitations in code generation: the current PREESM code generation supports only static πSDF graphs. A new code generation approach based on a runtime system named Spider, supporting all πSDF features, is currently under study.
- Lack of energy and area cost estimation: performance is evaluated based on two metrics, throughput and latency. Although the optimization of these constraints is vital when dealing with high-performance applications, power consumption and area occupation become increasingly important with the ever-growing number of cores inside MP2SoC systems.

Mapping task graph nodes onto clusters amounts to clustering. The task graph clustering approach [22] for scheduling massively parallel tasks on cluster-based architectures seems to be effective for MP2SoC systems. Research in this field tries to combine clustering algorithms with power consumption reduction [23, 24]. These efforts use emerging power reduction techniques, such as Dynamic Voltage and Frequency Scaling (DVFS) [25], and try to adapt them to cluster-based systems. Integrating such techniques into the PREESM scheduler seems a promising direction to support the compositional nature of the architecture and the application. Another motivating point is that these techniques are based on a DAG description of the application [26], which is the entry point of the PREESM scheduler.

The PREESM scheduler, as seen in Section 2, divides the assignment and cost evaluation tasks into two sub-modules. One advantage of this approach is that additional heuristics, such as power and area, can easily be integrated within the cost evaluation kernel. Since the S-LAM model of the architecture and the scenario file enclosing the system constraints are the inputs of the cost evaluation task, the additional values needed for power and area estimation have to be generated from the high-level UML models and encapsulated into these files. These values include the processor frequency, the hardware resource occupation area, the power constraints, etc. Attributes associated with stereotypes of the MARTE profile will be used to specify these values.

5 References

1. Object Management Group, Unified Modeling Language Specification, version 2.1.
2. Object Management Group, UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, version 1.0.
3. K. Grüttner, P. A. Hartmann, K. Hylla, S. Rosinger, W. Nebel, F. Herrera, et al., "The COMPLEX reference framework for HW/SW co-design and power management supporting platform-based design-space exploration," Microprocessors and Microsystems, vol. 37, no. 8, pp. 966-980, 2013.
4. C. Silvano, et al., "Multicube: Multi-objective design space exploration of multi-core architectures," IEEE Computer Society Annual Symposium on VLSI, pp. 488-493, 2010.
5. A. Pimentel, C. Erbas, S. Polstra, "A systematic approach to exploring embedded system architectures at multiple abstraction levels," IEEE Transactions on Computers, vol. 55, no. 2, pp. 99-112, 2006.
6. F. A. M. do Nascimento, M. F. S. Oliveira, F. R. Wagner, "A model-driven engineering framework for embedded systems design," Innovations in Systems and Software Engineering, vol. 8, pp. 19-33, March 2012.
7. B. Schatz, F. Holzl, T. Lundkvist, "Design-space exploration through constraint-based model-transformation," 17th IEEE International Conference and Workshops on Engineering of Computer Based Systems, ECBS 2010, pp. 173-182, 2010.
8. L. G. Murillo, M. Mura and M. Prevostini, "MDE support for HW/SW codesign: A UML-based design flow," in Advances in Design Methods from Modeling Languages for Embedded Systems and SoC's, Springer Netherlands, pp. 19-37, 2010.
9. E. Lee and D. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, vol. 75, no. 9, pp. 1235-1245, September 1987.
10. K. Desnos, M. Pelcat, J.-F. Nezan, S. Bhattacharyya and S. Aridhi, "PiMM: Parameterized and Interfaced Dataflow Meta-Model for MPSoCs Runtime Reconfiguration," International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS XIII), Samos, Greece, July 2013.
11. IEEE Standard for IP-XACT, Standard Structure for Packaging, Integrating, and Reusing IP within Tool Flows, IEEE Std 1685-2009, Feb. 2010, pp. C1-360.
12. M. Pelcat, J.-F. Nezan, J. Piat, J. Croizer, and S. Aridhi, "A system-level architecture model for rapid prototyping of heterogeneous multicore embedded systems," in Proceedings of the DASIP Conference, September 2009.
13. M. Pelcat, K. Desnos, J. Heulot, C. Guy, J.-F. Nezan, and S. Aridhi, "PREESM: A dataflow-based rapid prototyping framework for simplifying multicore DSP programming," in 6th European Embedded Design in Education and Research Conference, EDERC 2014, pp. 36-40, 2014.
14. M. Ammar, M. Baklouti, M. Pelcat, K. Desnos and M. Abid, "MARTE to πSDF transformation for data-intensive applications analysis," in Conference on Design and Architectures for Signal and Image Processing, DASIP, October 2014.
15. M. Ammar, M. Baklouti, M. Pelcat, K. Desnos and M. Abid, "Automatic generation of S-LAM descriptions from UML/MARTE for the DSE of massively parallel embedded systems," Software Engineering Research, Management and Applications Conference, SERA 2015 (to appear).
16. A.-H. Ghamarian, M. C. W. Geilen, S. Stuijk, T. Basten, A. J. M. Moonen, M. J. G. Bekooij, B. D. Theelen, M. R. Mousavi, "Throughput analysis of synchronous data flow graphs," Sixth International Conference on Application of Concurrency to System Design, ACSD 2006, pp. 25-36, June 2006.
17. K. Desnos, M. Pelcat, J.-F. Nezan, S. Aridhi, "Memory bounds for the distributed execution of a hierarchical synchronous data-flow graph," International Conference on Embedded Computer Systems, SAMOS 2012, pp. 160-167, July 2012.
18. P. Guduric, A. Puder, R. Todtenhofer, "A comparison between relational and operational QVT mappings," in 6th International Conference on Information Technology: New Generations, ITNG '09, pp. 266-271, April 2009.
19. Joint Video Team, "Advanced video coding for generic audiovisual services," ITU-T Recommendation H.264 and ISO/IEC 14496-10.
20. E. Baaklini, S. Rethinagiri, H. Sbeity, et al., "Scalable row-based parallel H.264 decoder on embedded multicore processors," Signal, Image and Video Processing, pp. 1-15, 2014.
21. M. Baklouti and M. Abid, "Multi-softcore architecture on FPGA," International Journal of Reconfigurable Computing, 2014.
22. X. Qin and H. Jiang, "A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters," Journal of Parallel and Distributed Computing, vol. 65, no. 8, pp. 885-900, 2005.
23. G. L. Valentini, W. Lassonde, S. U. Khan, et al., "An overview of energy efficiency techniques in cluster computing systems," Cluster Computing, vol. 16, no. 1, pp. 3-15, 2013.
24. Z. Zong, A. Manzanares, X. Ruan, et al., "EAD and PEBD: two energy-aware duplication scheduling algorithms for parallel tasks on homogeneous clusters," IEEE Transactions on Computers, vol. 60, no. 3, pp. 360-374, 2011.
25. L. Wang, G. von Laszewski, J. Dayal, et al., "Towards energy aware scheduling for precedence constrained parallel tasks in a cluster with DVFS," in 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, CCGrid 2010, pp. 368-377, 2010.
26. L. Wang, S. U. Khan, D. Chen, et al., "Energy-aware parallel task scheduling in a cluster," Future Generation Computer Systems, vol. 29, no. 7, pp. 1661-1670, 2013.

Arrived: 25. 04. 2016
Accepted: 13. 09. 2016