Informatica 36 (2012) 287-295

Hardware-Software Co-design for Reconfigurable Field Programmable Gate Arrays Using Mixed-Integer Programming

Faridah M. Ali
Department of Computer Engineering, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait
E-mail: faridah@puc.edu.kw

Helal Al-Hamadi
Department of Information Science, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait
E-mail: helal@cfw.kuniv.edu

Ahmed Ghoniem
Department of Finance and Operations Management, Isenberg School of Management, University of Massachusetts Amherst, Amherst, MA 01002, U.S.A.
E-mail: aghoniem@isenberg.umass.edu

Hanif D. Sherali
Grado Department of Industrial and Systems Engineering (0118), Virginia Tech, Blacksburg, VA 24061, U.S.A.
E-mail: hanifs@vt.edu

Keywords: hardware-software co-design, FPGAs, task partitioning, scheduling, mixed-integer 0-1 programming

Received: September 19, 2011

This paper presents a novel mixed-integer programming formulation for scheduling non-preemptive, aperiodic, hard real-time tasks with precedence constraints. It provides an integrated partitioning and scheduling co-synthesis approach. The problem formulation maps some n precedence-related, indivisible jobs having specified processing requirements, release times, and due-dates to a system involving a single Central Processing Unit (CPU) and up to m potential reconfigurable Field Programmable Gate Arrays (FPGAs). We provide a time-indexed mixed-integer 0-1 programming formulation that jointly assigns tasks to either the CPU or to one of the FPGAs, and determines the task sequence for each software or hardware component that is utilized, with the objective of minimizing a composite cost of task partitioning and scheduling. Computational experience is provided using randomly generated instances to demonstrate the applicability of the proposed methodology.

Povzetek: An algorithm is presented for task partitioning in the co-design of software and hardware.

1 Introduction and Motivation

The task partitioning and scheduling problem bears practical significance in software/hardware co-design of hard real-time applications that arise in a host of domains such as flight and defense control, telecommunication, or nuclear power plants, to name a few. Specifically, we consider the problem of partitioning and scheduling n indivisible (no preemption), aperiodic (which could be considered as the body of a looped system), precedence-related jobs that are characterized by specific processing requirements, release times, and due-dates (which are deadlines that cannot be violated). The system architecture is depicted in Figure 1, and involves a single Central Processing Unit (CPU) and a maximum of some m potential reconfigurable Field Programmable Gate Arrays (FPGAs) controlled by a single controller unit. The system components are connected with two explicit communication buses (channels). The first bus is the system bus, which is used for input/output (I/O) data transfers, whereas the second is used for FPGA reconfiguration transfers. The task processing effort is primarily carried out by the CPU, in general, and the FPGAs are incrementally utilized if the CPU alone cannot conform to all due-date restrictions.

Figure 1: System Architecture

In contrast with scheduling problems that arise in production and logistical systems, where it is often merely desirable to meet imposed due-dates, it is imperative to comply with the specified due-dates in the problem under investigation.
Another reason for the use of such dual systems in practice resides in the benefits accruing from the cooperation between software and hardware components. As a consequence, the co-design problem has gained increasing attention over the last decade (see [6], [7], [10], [12], [16], [19], and [20]). This process involves three main operations, namely, resource (software, hardware, and auxiliary components) allocation to the system, task partitioning among software and hardware processing components, and scheduling of the tasks over their assigned processing components (CPU or FPGA). Most works in the literature have addressed this joint problem via two-phase methods [14], where a task partitioning is achieved first, and tasks are subsequently scheduled over the relevant processing components. It is important, however, to note that the task partitioning and scheduling operations are intertwined [9], and need to be dealt with concurrently in order to determine optimal operational solutions. An algorithm that aims at partitioning and scheduling the tasks on two CPUs and several hardware components with the objective of minimizing the total execution time and the hardware cost was presented by Liu and Wong [13]. Arato et al. [2] designed a procedure for scheduling tasks on a software component, where tasks that violate their deadlines are partially assigned to an auxiliary hardware component. A mixed-integer 0-1 formulation for partitioning precedence-related tasks on a single-CPU-single-FPGA system, as well as a genetic algorithm to address larger problem instances, was also presented in [2]. Ali and Das [1] presented a heuristic algorithm that progressively assigns indivisible tasks to dynamically reconfigurable FPGAs when certain tasks cannot meet their deadlines on the CPU. In contrast, this challenging scheduling problem is tackled in the present paper using a mixed-integer 0-1 programming model with the objective of minimizing a composite cost of task partitioning and scheduling on a single CPU along with several potentially reconfigurable FPGAs. Jeong et al. [11] proposed a mixed-integer 0-1 formulation and a heuristic for hardware-software partitioning in systems consisting of a single CPU and a single FPGA, with the objective of minimizing the reconfiguration overhead. A mixed-integer programming model was developed by Niemann and Marwedel [14] that employs a two-phase method for hardware/software partitioning, where a tentative schedule is first proposed and is subsequently verified; if the timing constraints are violated, the partitioning step is repeated with timing constraints that are tighter than the estimated scheduling horizon length. Bender [3] discussed an alternative mixed-integer programming approach for mapping real-time precedence graphs into a system of Application Specific Integrated Circuits (ASICs) that are used as hardware components, and pipelined microprocessors that are used as software components. Initially, only a limited number of hardware components are used, and these are incrementally increased until a feasible solution is obtained. Recently, hardware/software partitioning for multimedia and wireless mobile applications has gained great importance. Brogioli et al. [4] proposed a set of criteria for partitioning real-time embedded multimedia applications between software-programmable Digital Signal Processors (DSPs) and hardware-based FPGA coprocessors.
Dasu and Panchanathan [5] investigated the design and development of a dynamically reconfigurable multimedia processor that involves an optimal hardware/software codesign methodology. Furthermore, a hardware/software partitioning approach for multimedia applications that utilizes process-level pipelining and a heuristic technique based on simulated annealing was presented by Juan et al. [12]. A design space exploration tool that supports both explicit communication and reconfigurable hardware was addressed by Haubelt et al. [10]. The developed algorithm strictly separates functionality from the architecture, and maps a process graph onto components such that data dependencies given by the process graph can be handled in the resulting implementation. The work presented in this paper bears some similarity to that approach in that we also use explicit communication channels and reconfigurable hardware in our system architecture. Observe that the problem at hand is also related to the challenging class of unrelated parallel machine scheduling problems (see [15]). It is characterized by the presence of release dates, imperative due-dates, precedence constraints, inter-task data communication, reconfiguration of hardware resources (FPGAs), and a composite objective function that minimizes the processing and resource utilization costs. Also, it is specially structured due to the fact that all processing components are identical, except for the CPU. The main contribution of the present paper is a novel formulation of a time-indexed mixed-integer 0-1 programming model for the co-synthesis hardware/software integrated partitioning and scheduling problem. The motivation for the proposed work is two-fold: first, it provides a tool for simultaneously partitioning (or mapping) and scheduling tasks onto hardware/software units, and second, it offers a design space exploration tool to ascertain the minimum number of FPGAs required for a particular application before a system is actually built.

The remainder of this paper is organized as follows. In Section 2, we formally describe the problem under investigation along with our notation. Thereafter, we introduce in Section 3 a mixed-integer 0-1 programming formulation that simultaneously captures the requirements pertaining to the partitioning and scheduling operations. Section 4 delineates our data generation scheme and reports our computational experience using a set of random test instances to demonstrate the effectiveness of the proposed solution approach. We close the paper in Section 5 with a summary of our findings.

2 Problem Description and Notation

We address the problem of scheduling some n precedence-related jobs having specified processing requirements, release times, and inviolable due-dates in a system involving a single Central Processing Unit (CPU) and a maximum of some m potential reconfigurable Field Programmable Gate Arrays (FPGAs). Each job can be processed either by the CPU itself, or it can be scheduled for processing on one of the m available FPGAs. In either case, no preemption is permitted, and each resource (CPU or FPGA) can process at most one job at any point in time. However, whenever an FPGA begins processing a job, it must be reconfigured to perform the required operation by a single available reconfiguration controller.
This reconfiguration process consumes a specified duration that is part of the total processing time required for performing the job on the associated FPGA. Note that while the controller is reconfiguring any FPGA to begin processing a job, it is occupied and cannot simultaneously reconfigure another FPGA. Again, no preemption is permitted in the reconfiguration process. Also, not all the m available FPGAs need be used; in fact, there is a fixed cost for using an FPGA that competes with the cost related to achieving scheduling efficiency. When an FPGA finishes executing any task, it becomes available for the next task if needed. This is described in more detail in the model formulation given in Section 3. In our analysis, the FPGAs cannot be preconfigured since it is not known in advance which task will be executed on an FPGA rather than on the CPU. Moreover, if the system has more than one FPGA, it is not known which FPGA will be used until scheduling is complete. The proposed system (Figure 1) is generic and can be used for executing any precedence-related jobs. However, for a given real-time set of jobs, once an optimal system configuration and scheduling decisions are determined, a partial runtime reconfiguration can be performed during implementation, where only the configuration bits of the particular task are transferred to the FPGA in order to reduce the configuration time.

Notation:

- $j = 1, \ldots, n$: Index for jobs.
- For establishing precedences, we define $P = \{(j_1, j_2) : j_1 \rightarrow j_2$, i.e., the processing of job $j_1$ must precede that of job $j_2\}$.
- $m$ = maximum potential number of FPGAs available for use.
- Index for time-slots: Let time be discretized so that the scheduling time-line for the CPU and each FPGA contains $s$ time-slots, where the end of time-slot $s$ is estimated to be the maximum allowable makespan duration, based on due-dates. We index the time-slots over this maximum makespan duration as $t = 1, \ldots, s$. (Note that the actual duration of the time-slots is arbitrary and rescalable, and typically ranges from 1 to 300 seconds in practice. Also, the time measurement (in discretized units) begins at time $t = 0$ at the beginning of slot 1.)
- Index for slots: These correspond to a sequential indexing of the foregoing time-slots over the resources, where the slots for the CPU are indexed as $k = 1, \ldots, s$, the slots for FPGA 1 are indexed as $k = s+1, \ldots, 2s$, the slots for FPGA 2 are indexed as $k = 2s+1, \ldots, 3s$, and so on, up to slots $k = ms+1, \ldots, (m+1)s$ for FPGA $m$. (Note that the time-slots are numbered $1, \ldots, s$, whereas the slots are indexed contiguously over the CPU and the $m$ FPGAs as $k = 1, \ldots, (m+1)s$.)
- Index for resources: $r = 0, 1, \ldots, m$, where $r = 0$ is the CPU and $r = 1, \ldots, m$ index the FPGAs.
- $r(k)$ = resource corresponding to slot $k$. (So $r(k) = 0$ for $k = 1, \ldots, s$, $r(k) = 1$ for $k = s+1, \ldots, 2s$, and so on.)
- $\delta_{jr}$ = processing time of job $j$ on resource $r$ (in integral time units that conform with the time-slot duration).
- $\pi_{jr}$ = reconfiguration time on FPGA $r$ to process job $j$ (in integral time units that conform with the time-slot duration). We assume that $\pi_{jr}$ is included within $\delta_{jr}$, $\forall j$, $\forall r \geq 1$.
- $a_j$ = release time of job $j$. That is, if a job $j$ has no predecessors, then the earliest time-slot to start its processing is $a_j + 1$.
- $d_j$ = due-date of job $j$.
- $lb_j$ = lower bound on the starting time-slot for job $j$. If a job has no predecessor, then $lb_j = a_j + 1$; otherwise we may simply take $lb_j = \max\{a_j + 1, \max_{j_1 : (j_1, j) \in P}\{lb_{j_1} + \min_r\{\delta_{j_1 r}\}\}\}$.
- $k \bmod^+ (s)$ = remainder of the division $k/s$, except that this is taken as $s$ if the remainder is zero.
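For concreteness, the slot-indexing conventions above can be expressed in a few lines of code. The following Python sketch is our own illustration (it is not part of the original paper) and assumes that jobs, slots, and resources are handled as plain integers and dictionaries; the assertions reuse the values s = 39 and slot 67 that appear later in the illustrative example of Section 4.2.

```python
def resource_of_slot(k: int, s: int) -> int:
    """r(k): resource owning slot k (0 = CPU, 1..m = FPGAs)."""
    return (k - 1) // s

def mod_plus(k: int, s: int) -> int:
    """k mod+(s): remainder of k/s, except that s is returned instead of 0."""
    rem = k % s
    return rem if rem != 0 else s

def lb_start(j, a, preds, delta):
    """lb_j: lower bound on the starting time-slot of job j.
    a[j] = release time, preds[j] = set of immediate predecessors,
    delta[j1] = dict mapping resource r to the processing time of job j1 on r."""
    if not preds[j]:
        return a[j] + 1
    return max(a[j] + 1,
               max(lb_start(j1, a, preds, delta) + min(delta[j1].values())
                   for j1 in preds[j]))

# Example: with s = 39 time-slots per resource, slot 67 is time-slot 28 on FPGA 1,
# whereas slot 28 is time-slot 28 on the CPU (cf. the illustrative example in Section 4.2).
assert resource_of_slot(67, 39) == 1 and mod_plus(67, 39) == 28
assert resource_of_slot(28, 39) == 0 and mod_plus(28, 39) == 28
```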
Principal Decision Variables: The principal decision variables are defined below:

$$x_{jk} = \begin{cases} 1 & \text{if job } j \text{ is assigned to start at slot } k \\ 0 & \text{otherwise,} \end{cases} \qquad \forall j, k.$$

$$y_r = \begin{cases} 1 & \text{if FPGA } r \text{ is utilized} \\ 0 & \text{otherwise,} \end{cases} \qquad r = 1, \ldots, m.$$

Auxiliary Decision Variables: The following auxiliary variables are defined based on the $x_{jk}$-variables:

- $s_j$ = time-slot at which the processing of job $j$ starts.
- $f_j$ = time-slot at which the processing of job $j$ ends.

Key Sets: We define certain key sets based on individual job processing times (which could, in general, be CPU- and FPGA-dependent), reconfiguration times (which could again be FPGA-dependent), job release/availability times, and job due-dates:

- $S_j$ = {slots $k$ : $x_{jk} = 1$ is a possible decision based on release, due-date, processing, and reconfiguration times}, $\forall j = 1, \ldots, n$. Observe that we may express $S_j$ as
$$S_j = \{k : k \in \{1, \ldots, (m+1)s\},\; lb_j \leq k \bmod^+(s) \leq d_j - \delta_{jr(k)} + 1\}, \quad \forall j = 1, \ldots, n.$$
- $S_{jr} = \{k \in S_j : \text{slot } k \text{ is associated with resource } r\}$, $\forall j = 1, \ldots, n$, $r = 0, \ldots, m$.
- $J_k$ = {$(j, l)$ : $j \in \{1, \ldots, n\}$, $l \in S_j$, and $x_{jl} = 1$ would imply that slot $k$ would be occupied by the reconfiguration/processing of job $j$}, $\forall k = 1, \ldots, (m+1)s$. Note that $J_k$ can be expressed as
$$J_k = \{(j, l) : j \in \{1, \ldots, n\},\; l \in S_j,\; r(l) = r(k),\; l \bmod^+(s) \leq k \bmod^+(s) < l \bmod^+(s) + \delta_{jr(l)}\}.$$
- $R_t$ = {$(j, l)$ : $j \in \{1, \ldots, n\}$, $l \in S_j$, and $x_{jl} = 1$ would imply that during time-slot $t$, the controller is busy performing a reconfiguration}, $\forall t = 1, \ldots, s - \Delta$, where $\Delta = \min_{j,r}\{\delta_{jr} - \pi_{jr}\}$. We can formally state $R_t$ as
$$R_t = \{(j, l) : j \in \{1, \ldots, n\},\; l \in S_j,\; l \geq s+1,\; l \bmod^+(s) \leq t < l \bmod^+(s) + \pi_{jr(l)}\}.$$
Note that $R_t = \emptyset$ for $t = s - \Delta + 1, \ldots, s$.

Cost Parameters:

- $c_{jk}$ = cost of commencing the operation of job $j$ at the duration corresponding to slot $k$.
- $A$ = cost per FPGA used.

Remark 1. The cost of resources (number of FPGAs used) and efficiency (as predicated by the term $\sum_{j=1}^{n} \sum_{k \in S_j} c_{jk} x_{jk}$) compete in the objective function of the mathematical program formulated in Section 3. In addition, observe that it might be desirable to preclude alternative optimal solutions that allow idleness on the available resources. To this end, we may require the hierarchy of cost parameters $c_{jk}$ associated with any job $j$ to be strictly increasing with respect to the time-slot $k \bmod^+(s)$. □
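To make the set definitions concrete, the following Python sketch (an illustration under assumed dictionary-based data structures, not the authors' code) materializes $S_j$, $S_{jr}$, $J_k$, and $R_t$ from the problem data. Because each resource's slots are contiguous and $S_j$ never allows a job to run past time-slot $s$, the condition $r(l) = r(k)$ in the definition of $J_k$ is satisfied automatically by enumerating the $\delta_{jr(l)}$ consecutive slots starting at $l$.

```python
from collections import defaultdict

# Assumed (hypothetical) inputs: delta[j][r] = processing time of job j on resource r
# (reconfiguration included for r >= 1), pi[j][r] = reconfiguration time of job j on
# FPGA r, lb[j] = lower bound on the starting time-slot, d[j] = due-date.
def build_sets(n, m, s, delta, pi, lb, d):
    r_of = lambda k: (k - 1) // s          # r(k)
    mod_plus = lambda k: k % s or s        # k mod+(s)

    # S_j: admissible starting slots for job j
    S = {j: [k for k in range(1, (m + 1) * s + 1)
             if lb[j] <= mod_plus(k) <= d[j] - delta[j][r_of(k)] + 1]
         for j in range(1, n + 1)}
    # S_jr: admissible starting slots for job j on resource r
    S_r = {(j, r): [k for k in S[j] if r_of(k) == r]
           for j in range(1, n + 1) for r in range(m + 1)}

    # J_k: (j, l) pairs such that x_jl = 1 would occupy slot k
    J = defaultdict(list)
    for j in range(1, n + 1):
        for l in S[j]:
            for k in range(l, l + delta[j][r_of(l)]):
                J[k].append((j, l))

    # R_t: (j, l) pairs such that x_jl = 1 would keep the controller busy in time-slot t
    R = defaultdict(list)
    for j in range(1, n + 1):
        for l in S[j]:
            if l > s:                      # only FPGA slots trigger a reconfiguration
                for t in range(mod_plus(l), mod_plus(l) + pi[j][r_of(l)]):
                    R[t].append((j, l))
    return S, S_r, J, R
```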
3 Mathematical Programming Formulation

We present below our proposed mixed-integer 0-1 programming formulation, denoted by HWSW, which ascertains the task partitioning and scheduling decisions in order to minimize the total processing and resource costs.

HWSW:
$$\text{Minimize} \quad \sum_{j=1}^{n} \sum_{k \in S_j} c_{jk}\, x_{jk} + A \sum_{r=1}^{m} y_r \qquad (1a)$$

subject to

$$\sum_{k \in S_j} x_{jk} = 1, \quad \forall j = 1, \ldots, n \qquad (1b)$$
$$\sum_{(j,l) \in J_k} x_{jl} \leq 1, \quad \forall k = 1, \ldots, (m+1)s \qquad (1c)$$
$$\sum_{(j,l) \in R_t} x_{jl} \leq 1, \quad \forall t = 1, \ldots, s - \Delta \qquad (1d)$$
$$s_j = \sum_{k \in S_j} [k \bmod^+(s)]\, x_{jk}, \quad \forall j = 1, \ldots, n \qquad (1e)$$
$$f_j = \sum_{k \in S_j} [k \bmod^+(s) + \delta_{jr(k)} - 1]\, x_{jk}, \quad \forall j = 1, \ldots, n \qquad (1f)$$
$$f_{j_1} + 1 \leq s_{j_2}, \quad \forall (j_1, j_2) \in P \qquad (1g)$$
$$y_r \geq \sum_{k \in S_{jr}} x_{jk}, \quad \forall j = 1, \ldots, n, \; \forall r = 1, \ldots, m \qquad (1h)$$
$$1 \geq y_1 \geq y_2 \geq \cdots \geq y_m \geq 0 \qquad (1i)$$
$$\sum_{j=1}^{n} \sum_{k \in S_{jr}} x_{jk} \geq \sum_{j=1}^{n} \sum_{k \in S_{j,r+1}} x_{jk}, \quad \forall r = 1, \ldots, m-1 \qquad (1j)$$
$$x \text{ binary}, \; y \text{ continuous}. \qquad (1k)$$

The objective function (1a) seeks to minimize the total processing and resource costs. Constraint (1b) requires each job to be feasibly scheduled on either the CPU or on an FPGA. Constraint (1c) asserts that no resource can be processing more than one job simultaneously during any associated slot. Likewise, Constraint (1d) enforces the restriction that the controller can be reconfiguring at most one job at any point in time. Note that whenever $x_{jk} = 1$ for some job $j \in \{1, \ldots, n\}$ and $k \in S_j$, where slot $k$ corresponds to FPGA $r$, say, then it is assumed that job $j$ starts its reconfiguration by the controller on FPGA $r$ at the time corresponding to the beginning of slot $k$, after which it immediately proceeds to be processed by FPGA $r$. Constraints (1e) and (1f) state the definitional identities for the start and finish time-slots for each job $j$ in terms of the $x$-variables, and Constraint (1g) represents the precedence relationships. Constraint (1h), along with the second objective term, ensures that $y_r = 1$ if and only if some job is processed on FPGA $r$, and is zero otherwise, even when restricted to be a continuous variable on [0, 1]. Constraints (1i) and (1j) attempt to defeat the inherent symmetry in the problem with respect to the FPGAs, assuming that the FPGAs are identical with respect to processing times. (Note that if there are subgroups of identical FPGAs, then these types of constraints can be incorporated within each such subgroup.) Specifically, Constraint (1i) requires that the lower-indexed FPGAs be utilized first, and more importantly, Constraint (1j) attempts to impart an identity to the utilized FPGAs by imposing the hierarchy that FPGA $r$ should process at least as many jobs as FPGA $r+1$. Without such hierarchical constraints, the inherent symmetry in the problem can hopelessly mire the solution process by requiring it to search among symmetric reflections of essentially the same sets of solutions (see Sherali and Smith [17]). Finally, (1k) represents the logical restrictions on the variables, where the $y$-variables would automatically turn out to be binary-valued at optimality, even when permitted to be continuous variables on the interval [0, 1]. The model is a linear mixed-integer 0-1 program (MIP), which can be solved using a commercial solver to any desired percentage of optimality.

Remark 2. It is possible to accommodate different alternative objective functions of practical interest within this modeling framework, involving the makespan, resource usage (number of FPGAs, durations of usage of FPGAs, etc.), and the completion times of jobs, as desired. □
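The authors implemented Model HWSW in AMPL and solved it with CPLEX 10.1 (see Section 4). Purely as an illustration, the sketch below restates (1a)-(1k) in Python with the open-source PuLP package, reusing the sets built in the previous sketch; the auxiliary variables $s_j$ and $f_j$ of (1e)-(1f) are folded in as linear expressions rather than declared as explicit variables, which yields an equivalent model. All function and parameter names here are our own assumptions, not the authors' code.

```python
import pulp

def build_hwsw(n, m, s, delta, c, A, P, S, S_r, J, R, Delta):
    """Sketch of Model HWSW.  S, S_r, J, R are the key sets of Section 2;
    Delta = min over j and r >= 1 of (delta[j][r] - pi[j][r])."""
    r_of = lambda k: (k - 1) // s
    mod_plus = lambda k: k % s or s

    prob = pulp.LpProblem("HWSW", pulp.LpMinimize)
    x = {(j, k): pulp.LpVariable(f"x_{j}_{k}", cat=pulp.LpBinary)
         for j in range(1, n + 1) for k in S[j]}
    y = {r: pulp.LpVariable(f"y_{r}", lowBound=0, upBound=1) for r in range(1, m + 1)}

    # (1a) total processing cost plus the fixed cost of the FPGAs that are used
    prob += (pulp.lpSum(c[j][k] * x[j, k] for j in range(1, n + 1) for k in S[j])
             + A * pulp.lpSum(y.values()))

    for j in range(1, n + 1):                      # (1b) each job is scheduled exactly once
        prob += pulp.lpSum(x[j, k] for k in S[j]) == 1
    for k in range(1, (m + 1) * s + 1):            # (1c) at most one job per resource slot
        prob += pulp.lpSum(x[j, l] for (j, l) in J.get(k, [])) <= 1
    for t in range(1, s - Delta + 1):              # (1d) one reconfiguration at a time
        prob += pulp.lpSum(x[j, l] for (j, l) in R.get(t, [])) <= 1

    # (1e)-(1f) start/finish time-slots, folded in as linear expressions
    start = {j: pulp.lpSum(mod_plus(k) * x[j, k] for k in S[j]) for j in range(1, n + 1)}
    finish = {j: pulp.lpSum((mod_plus(k) + delta[j][r_of(k)] - 1) * x[j, k] for k in S[j])
              for j in range(1, n + 1)}
    for (j1, j2) in P:                             # (1g) precedence
        prob += finish[j1] + 1 <= start[j2]

    for j in range(1, n + 1):                      # (1h) y_r = 1 whenever FPGA r processes a job
        for r in range(1, m + 1):
            prob += y[r] >= pulp.lpSum(x[j, k] for k in S_r[j, r])
    for r in range(1, m):                          # (1i)-(1j) symmetry-breaking hierarchy
        prob += y[r] >= y[r + 1]
        prob += (pulp.lpSum(x[j, k] for j in range(1, n + 1) for k in S_r[j, r])
                 >= pulp.lpSum(x[j, k] for j in range(1, n + 1) for k in S_r[j, r + 1]))
    return prob, x, y
```

Calling prob.solve() then invokes the CBC solver bundled with PuLP (the paper used CPLEX 10.1); the slot chosen for job j is the unique k in S[j] whose variable takes a value close to 1, from which the resource r(k) and the starting time-slot k mod+(s) follow.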
4 Computational Experience

In this section, we begin by delineating the data generation scheme for constructing random, small- to moderately-sized test instances. Next, we present our computational experience to test the efficiency of the proposed mathematical programming formulation. Our proposed formulation was coded in AMPL and solved using CPLEX 10.1 on a Dell Precision 650 workstation having a Xeon(TM) CPU 2.40 GHz processor and 1.50 GB of RAM.

4.1 Data generation

To demonstrate the usefulness of the optimization scheduling model, we have used randomly simulated, realistic graphs along with the associated data. Although the resulting test cases do not pertain to actual hardware data, they simulate what one might expect in practice. In our test-bed, the number of jobs n was selected to be 10, 20, or 30, and the number of potential FPGAs, m, was specified to ensure the feasibility of the resulting instance upon generating the different processing times and key sets.

Random Parameters:

- The precedence relationships between the tasks were randomly generated according to the following scheme. Given $j_2 \in \{2, \ldots, n\}$, and for all $j_1 \in \{1, \ldots, j_2 - 1\}$, we generated $p_{j_1 j_2}$ using a uniform distribution over the range [0, 1]. For some threshold $\rho$, if $p_{j_1 j_2} > \rho$, then the arc $(j_1, j_2)$ was added to $P$, that is, task $j_1$ was required to be a predecessor of task $j_2$. In our scheme, we took $\rho = 0.75$, which induced a desired density of the precedence arcs in the task graph. Also, redundant arcs were suppressed from the set $P$ by invoking transitivity in the precedence relationships. That is, if the arc $(j, j')$ was generated while there also exists an alternative path from $j$ to $j'$ in the precedence graph, then the direct arc $(j, j')$ is redundant, and was consequently deleted from the set $P$.
- The $\delta_{j0}$-parameters and the release dates, $a_j$, were generated using a uniform discrete distribution over the sets $\{1, \ldots, 10\}$ and $\{0, \ldots, 15\}$, respectively.
- Following a scheme similar to that employed by Ali and Das [1], we took the reconfiguration times $\pi_{jr}$ to be a fixed (small) fraction of $\zeta_j$, rounded down to an integer value, where $\zeta_j$ was randomly generated using a uniform distribution over the range [0, 200].
- We set $d_j = \lfloor 1.3\, lb_j + \theta_j \rfloor$, where $\theta_j$ is a randomly generated value using a uniform distribution over the interval $[\bar{\delta} - \tau - \Delta/2,\; \bar{\delta} - \tau + \Delta/2]$, where $\bar{\delta}$ is the average processing time over the CPU, and $\tau$ and $\Delta$ are parameters that influence the tightness of the due-dates. This interval is based on Fisher's method [8]; we took $\tau = 0.2$ and $\Delta = 1$ in our experiments.
- The processing costs were computed as $c_{jk} = \lfloor \alpha_j \rfloor + k \bmod^+(s)$, where $\alpha_j$ was generated using a uniform distribution over the interval [0, 5], and where $s$ was computed as noted below.

Additional Deduced Parameters:

- $\delta_{jr} = \lceil 0.3\, \delta_{j0} \rceil + \pi_{jr}$, $\forall j$, $\forall r \geq 1$.
- $s = \max_{j=1,\ldots,n} \{d_j\}$.
- $A = 4 \max_{j=1,\ldots,n} \{d_j\}$.

Remark 3. If the reconfiguration times, as well as certain processing times on FPGAs, are fractional, then all time-related parameters may be suitably rescaled to achieve data integrality. This process, however, could entail a significant growth in the size of the problem. An alternative approach would be to round up all fractional processing times (and to round down due-dates, as appropriate). The marginal time amounts that are introduced by this rounding process may be viewed as idleness-buffers on the relevant hardware or software components. By solving the problem instance with such integerized data, we would obtain a heuristic solution to the original problem. Further improvements can be achieved via a routine that shifts operations to the left to eliminate the marginal idleness that has been introduced by this rounding process. □
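As an illustration of the generation scheme (our own sketch, not the authors' code), the following Python fragment draws the random precedence arcs with threshold rho = 0.75, removes transitively redundant arcs, and samples the CPU processing times and release dates; the due-date, cost, and reconfiguration-time parameters can be generated analogously from the quantities defined above.

```python
import random

def generate_instance(n, rho=0.75, seed=None):
    rng = random.Random(seed)

    # Candidate arcs (j1, j2) with j1 < j2 are kept when a U[0, 1] draw exceeds rho.
    P = {(j1, j2) for j2 in range(2, n + 1) for j1 in range(1, j2)
         if rng.random() > rho}

    # Suppress redundant arcs: drop (a, b) if an alternative path from a to b exists.
    def path_exists(a, b, arcs):
        stack, seen = [a], set()
        while stack:
            u = stack.pop()
            for (v, w) in arcs:
                if v == u and w not in seen:
                    if w == b:
                        return True
                    seen.add(w)
                    stack.append(w)
        return False

    for arc in sorted(P):
        if path_exists(arc[0], arc[1], P - {arc}):
            P.discard(arc)

    delta0 = {j: rng.randint(1, 10) for j in range(1, n + 1)}   # CPU processing times
    a = {j: rng.randint(0, 15) for j in range(1, n + 1)}        # release times
    return P, delta0, a
```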
4.2 Illustrative example

As a prelude, we present below an example to illustrate the problem under investigation and to gain insights into the proposed formulation. Consider the problem instance where the jobs to be processed are related via the precedence graph depicted in Figure 2, and where the associated parameters are provided in Table 1 (with the remaining data being generated as prescribed in Section 4.1). The available resources include a single CPU and one FPGA for potential use. The discrete, time-indexed scheduling horizon has a projected length of s = 39 time-slots. The solution produced by Model HWSW is summarized in Table 1, and is depicted in the Gantt chart in Figure 3. This small-sized problem instance was solved to optimality in 0.04 seconds.

Observe that the slot values $k$ for which $x_{jk} = 1$, as specified in Table 1, indicate indirectly when operations start their processing, as well as the processing components on which these operations are scheduled (by observing the time-slot ranges attributed to each software/hardware component). For instance, both jobs 9 and 10 start at the beginning of time-slot 28 and are completed at the end of time-slot 31. However, since $x_{9,28} = x_{10,67} = 1$, job 9 is scheduled on the CPU, whereas job 10 is processed on FPGA 1.

Job $j$ | $a_j$ | $\delta_{j0}$ | $\pi_{j1}$ | $d_j$ | $k : x_{jk} = 1$ | $s_j$ | $f_j$
1  | 4  | 1  | 1 | 11 | 5  | 5  | 5
2  | 1  | 9  | 1 | 9  | 45 | 6  | 9
3  | 5  | 4  | 3 | 14 | 10 | 10 | 13
4  | 0  | 6  | 1 | 23 | 14 | 14 | 19
5  | 3  | 9  | 2 | 24 | 53 | 14 | 18
6  | 5  | 2  | 3 | 29 | 20 | 20 | 21
7  | 14 | 10 | 3 | 29 | 61 | 22 | 27
8  | 1  | 7  | 3 | 39 | 71 | 32 | 37
9  | 2  | 4  | 2 | 38 | 28 | 28 | 31
10 | 4  | 7  | 1 | 39 | 67 | 28 | 31

Table 1: Data and results for an illustrative example

4.3 Computational results

Table 2 summarizes the results obtained for instances having n = 10 and 20. The first column of this table specifies the instance number as well as the number of jobs and the maximum number of FPGAs involved. The second column provides the length of the scheduling horizon (number of time-slots). In the next three columns, we present the results obtained for solving the continuous or linear programming (LP) relaxation of Model HWSW, denoted $\overline{\text{HWSW}}$, by allowing the x-variables to assume continuous values between 0 and 1. The optimal objective value of the continuous relaxation, $v(\overline{\text{HWSW}})$, the ensuing computational time in seconds, and the % optimality gap are reported for each instance. We define the % Gap as
$$\%\,\text{Gap} = 100\, \frac{v(\text{HWSW}) - v(\overline{\text{HWSW}})}{v(\text{HWSW})}.$$
The final two columns relate to solving Problem HWSW to optimality. Table 2 reveals that for these instances having up to 20 jobs, optimal solutions were obtained within manageable times.

Table 3 provides the results for the more challenging 30-job problem instances. Here, in addition to solving the problem to optimality, we demonstrate the effectiveness of employing two heuristics to derive good quality feasible solutions in a relatively timely fashion. In Table 3, in addition to the LP solution, we report the first MIP solution produced by CPLEX during its branch-and-bound (B&B) exploration, as well as the best available MIP solution that the solver could obtain within a specified computational limit of 300 CPU seconds. We compare these two heuristic approaches against solving the problem to optimality by reporting the percentage deviation (% Dev.) of the heuristic solution value from the optimal solution value. Whereas the LP relaxation objective values for n = 10 were particularly tight (within an average optimality gap of 0.7% as seen in Table 2), the LP relaxation for larger problem instances (n ∈ {20, 30} over Tables 2 and 3) exhibited an average optimality gap of 19.5%. As a consequence, these larger problem instances required a significant amount of branching operations within the B&B algorithm employed by the solver. However, it is worthwhile mentioning that by considering the first MIP solution obtained by the solver, or by imposing a time-bound (of 300 CPU seconds) on the computational effort, the resulting heuristic solutions respectively sacrificed only 2% and 0.18% of optimality on average for n = 30, while achieving average computational savings of 98.6% and 93.8%, respectively, as compared with determining optimal solutions. This indicates that near-optimal solutions are typically identified at early stages of the B&B exploration and, hence, motivates the use of such B&B-based heuristic approaches for larger problem instances. It is also important to highlight that the efficiency of such B&B-based heuristic approaches is predicated on the integration of a rounding heuristic scheme and a relaxation-induced neighborhood search (RINS) heuristic that is implemented within CPLEX. At a frequency that the user may control (based on the difficulty and the size of the test instance under investigation), the solver triggers the rounding scheme and/or the RINS heuristic at any node of the B&B tree in an attempt to identify a good quality MIP solution. In our quest for the first MIP solution and the best MIP solution within 300 CPU seconds, the rounding scheme and the RINS heuristic were triggered by the solver at every node explored in the B&B tree.
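For completeness, the gap and deviation measures used in Tables 2 and 3 are straightforward to compute; the small sketch below (our reading of the definitions, with the deviation taken relative to the optimal objective value) reproduces the 1.5% gap reported for instance 1 of Table 2.

```python
def pct_gap(v_mip: float, v_lp: float) -> float:
    """% Gap = 100 * (v(HWSW) - v(LP relaxation)) / v(HWSW)."""
    return 100.0 * (v_mip - v_lp) / v_mip

def pct_dev(v_heur: float, v_opt: float) -> float:
    """% Dev. of a heuristic (first / time-limited) MIP solution from the optimum."""
    return 100.0 * (v_heur - v_opt) / v_opt

# Instance 1 of Table 2: v(LP) = 387, v(MIP) = 393, giving a gap of about 1.5%.
print(round(pct_gap(393, 387), 1))   # -> 1.5
```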
It is also important to highlight that the efficiency of such B&B-based heuristic approaches is predicated on the integration of a rounding heuristic scheme and a relaxation-induced neighborhood search (RINS) heuristic that is implemented within CPLEX. At a frequency that the user may control (based on the difficulty and the size of the test instance under investigation), the solver triggers the rounding scheme and/or the RINS heuristic at any node of the B&B tree in an attempt to identify a good quality MIP solution. In our quest for the first MIP solution and the best MIP solution within 300 CPU seconds, the rounding scheme and the RINS heuristic HARDWARE-SOFTWARE CO-DESIGN Informatica 36 (2012) 287-295 293 were triggered by the solver at every node explored in the B&B tree. 5 Conclusions We have proposed a novel formulation for partitioning and scheduling precedence-related jobs on both hardware and software components over a time-indexed horizon. Our model effectively captures the nonpreemption assumption, the precedence constraints, the inviolable due-date restrictions, and the processing times required over hardware and software components. Computational experience reported using randomly generated test instances reveals that optimal solutions can be computed (for n < 20) within about 20 seconds. For n = 30, we demonstrated the effectiveness of two branch-and-bound-based heuristic approaches in producing near-optimal solutions. In particular, using the branch-and-bound algorithm of CPLEX 10.1 to output the first MIP solution and the best MIP solution within a timelimit of 300 CPU seconds, the resulting heuristic solutions respectively sacrificed only 2% and 0.18% of optimal-ity on average, while achieving an average computational savings of 98.6% and 93.8%, respectively, as compared with determining optimal solutions. Thus, we recommend the use of such heuristic approaches for large instances in order to obtain near-optimal solutions with manageable effort. Also, we have focused our attention on minimizing the total processing and resource costs. However, the proposed model is flexible enough to accommodate different alternative objective functions of practical interest within this same modeling framework, which involve the makespan, resource usage (number of FPGAs, durations of usage of FPGAs, etc.), and the completion times of jobs, as desired. Acknowledgement This work has been partially supported by the National Science Foundation under Grant CMMI-0552676. References [1] Ali, F. M. and Das A. S. (2004), Hardware-software co-synthesis of hard real-time systems with reconfigurable FPGAs, Computers and Electrical Engineering, 471-489. [2] Arato P., Juhasz S., Mann Z. A., Orban A. and Papp, D. (2003), Hardware-software partitioning in embedded system design, IEEE International Symposium on Intelligent Signal Processing, 197-202. [3] Bender, A. (1996), Design of an optimal loosely coupled heterogeneous multiprocessor system, Proceedings of European Design and Test Conference, 275281. [4] Brogioli, M., Radosavljevic, P. and Cavallaro, J. R. (2006), A general hardware/software co-design methodology for embedded signal processing and multimedia workloads, Fortieth Asilomar Conference on Signals, Systems and Computers, 1486-1490. [5] Dasu, A. and Panchanathan, S. (2001), Reconfigurable media processing, Proceedings of International Conference on Information Technology: Coding and Computing. 2-4 April 2001, 300 - 304. [6] Dittmann, F., Gotz, M. and Rettberg, A. 
[6] Dittmann, F., Gotz, M. and Rettberg, A. (2007), Model and methodology for the synthesis of heterogeneous and partially reconfigurable systems, Parallel and Distributed Processing Symposium (IPDPS 2007), IEEE International, 26-30 March 2007, 1-8.
[7] Edwards, M. D. and Forrest, J. (1995), Hardware/software partitioning for performance enhancement, IEEE Colloquium on Partitioning in Hardware-Software Codesigns, 2, 1-5.
[8] Fisher, M. L. (1976), A dual algorithm for the one machine scheduling problem, Mathematical Programming, 11, 229-251.
[9] De Micheli, G. and Gupta, R. K. (1997), Hardware/software co-design, Proceedings of the IEEE, 85(3), 349-365.
[10] Haubelt, C., Otto, S., Grabbe, C. and Teich, J. (2005), A system-level approach to hardware reconfigurable systems, Proceedings of the Design Automation Conference ASP-DAC 2005, 298-301.
[11] Jeong, B., Yoo, S., Lee, S. and Choi, K. (2000), Hardware-software cosynthesis for run-time incrementally reconfigurable FPGAs, Proceedings of the Design Automation Conference ASP-DAC 2000, 169-174.
[12] Juan, P. C., David, S., Onassis, C. and Alvaro, S. (2000), Pipelining-based tradeoffs for hardware/software codesign of multimedia systems, 8th Euromicro Workshop on Parallel and Distributed Processing, 383-390.
[13] Liu, H. and Wong, D. F. (1998), Integrated partitioning and scheduling for hardware/software co-design, Proceedings of the International Conference on Computer Design, 609-614.
[14] Niemann, R. and Marwedel, P. (1997), An algorithm for hardware/software partitioning using mixed integer linear programming, Design Automation for Embedded Systems, 2(2), 165-193.
[15] Pinedo, M. (1995), Scheduling: Theory, Algorithms and Systems, Prentice-Hall, NJ.
[16] Saul, J. M. (1999), Hardware/software codesign for FPGA-based systems, Proceedings of the 32nd Annual Hawaii International Conference on System Sciences.
[17] Sherali, H. D. and Smith, J. C. (2001), Improving discrete model representations via symmetry considerations, Management Science, 47, 1396-1407.
[18] Shin, Y. and Choi, K. (1997), Enforcing schedulability of multi-task systems by hardware-software codesign, International Workshop on Hardware/Software Co-Design, 3-7.
[19] Sipper, M. and Sanchez, E. (2000), Configurable chips meld software and hardware, IEEE Computer, 33(1), 120-121.
[20] Takeuchi, Y., Shibata, K. and Kunieda, H. (1994), Codesign methodology on programmable hardware and software system, IEEE Asia-Pacific Conference on Circuits and Systems, 182-187.
Figure 2: Task precedence graph for the illustrative example involving 10 tasks

Figure 3: Gantt chart for an illustrative example involving 10 tasks and one FPGA

Instance #, (n, m) | s  | LP $v(\overline{\text{HWSW}})$ | LP time (s) | % Gap | MIP $v(\text{HWSW})$ | MIP time (s)
1, (10, 3)  | 42 | 387    | 0.05 | 1.5  | 393  | 0.35
2, (10, 3)  | 69 | 413    | 0.09 | 0    | 413  | 0.09
3, (10, 3)  | 59 | 363.8  | 0.11 | 1.6  | 370  | 0.20
4, (10, 3)  | 71 | 431    | 0.12 | 0.2  | 432  | 0.14
5, (10, 3)  | 62 | 396    | 0.09 | 0.2  | 397  | 0.12
6, (20, 2)  | 40 | 639.66 | 0.12 | 20.2 | 803  | 0.39
7, (20, 2)  | 44 | 649.30 | 0.11 | 22.4 | 838  | 1.45
8, (20, 2)  | 58 | 829.14 | 0.14 | 23.0 | 1079 | 17.12
9, (20, 2)  | 72 | 1008   | 0.10 | 22.2 | 1296 | 0.54
10, (20, 2) | 63 | 841    | 0.17 | 23.5 | 1100 | 20.93

Table 2: Performance of Model HWSW for instances having n = 10 and 20

Instance #, (n, m) | s  | LP $v(\overline{\text{HWSW}})$ | LP time (s) | % Gap | First MIP $v$ | Time (s) | % Dev. | MIP within 300 s $v$ | % Dev. | Optimal $v(\text{HWSW})$ | Time (s)
11, (30, 2) | 77 | 1444.7  | 0.18 | 17.9 | 1769 | 6.3   | 0.4 | 1761 | 0   | 1761 | 10.8
12, (30, 2) | 74 | 1228.11 | 0.21 | 20.0 | 1571 | 17.4  | 2.2 | 1537 | 0   | 1537 | 103.93
13, (30, 2) | 57 | 1166.49 | 0.23 | 17.7 | 1484 | 21.6  | 4.5 | 1433 | 0.9 | 1419 | 6647.9
14, (30, 2) | 92 | 1175.98 | 0.39 | 19.2 | 1485 | 165.6 | 1.9 | 1466 | 0   | 1456 | 2819.3
15, (30, 2) | 81 | 1639.24 | 0.39 | 9.6  | 1836 | 17.6  | 1.1 | 1815 | 0   | 1815 | 6943.7

Table 3: Performance of Model HWSW for instances having n = 30