Computers and Electrical Engineering 78 (2019) 242–258
An optimization framework for dynamic pipeline management in computing systems

Syed Rameez Naqvi a,∗, Anjum Zahid a, Lina Sawalha b, Syed Saud Naqvi c, Tallha Akram a, Sajjad Ali Haider a, Kumar Yelamarthi d, Maksim Jenihhin e

a Department of Electrical and Computer Engineering, COMSATS University Islamabad, G. T. Road, Wah Cantonment 47040, Pakistan
b Department of Electrical and Computer Engineering, Western Michigan University, Kalamazoo, MI 49008, USA
c Department of Electrical and Computer Engineering, COMSATS University Islamabad, Park Road, Tarlai Kalan, Islamabad 44000, Pakistan
d Department of Electrical and Computer Engineering, Central Michigan University, Mount Pleasant, MI 48859, USA
e Department of Computer Systems, Tallinn University of Technology, Ehitajate tee 5, Tallinn 19086, Estonia
Article info

Article history: Received 7 March 2019; Revised 23 July 2019; Accepted 23 July 2019

MSC: 05B05; 46N10; 68M07; 68M20; 90C56

Keywords: Computer pipelines; Nonlinear; Collision avoidance; Reservation table; Optimization; Genetic algorithm
Abstract

Dynamic computer pipelines use feedback loops for iterative computation, due to which data items traversing them often collide with one another. These pipelines therefore employ specialized mechanisms to detect and avoid such collisions in real time. Mostly, this is done by deciding when, and when not, to allow newer data items to enter the pipeline. We show that such decisions may lead to different throughputs, and that guaranteeing optimal performance is an optimization problem. We begin by building a mathematical model of dynamic pipelines, and make use of a genetic algorithm to yield near-optimal throughput. The proposed optimization technique accounts for the hardware overhead incurred by the extensions to the pipeline organization that it itself suggests. Our confidence in the results stems from simulation of 10,000 dynamic pipeline organizations, and for verification of the results we present a hardware implementation of one of them. The proposed framework will be specifically useful in embedded systems for digital signal processing applications. © 2019 Elsevier Ltd. All rights reserved.
1. Introduction

Linear computer pipelines comprise a set of function units connected in a cascaded manner, whereby each unit executes its task in parallel. They have been the core of various commercial microprocessors, since they usually promise massive throughputs at the expense of increased latencies [1]. Amdahl's law [2] suggests that, as the number of pipeline stages approaches infinity, the speedup approaches the number of data items to load (or the number of instructions in a program, in the case of a pipelined processor). However, deeper pipelines incur increased power consumption and larger area overheads, due to which the speedup-to-area ratio becomes the more relevant metric [3]. Optimizing this comparison metric, especially in the context of application-specific embedded systems design, has been a long-standing goal of researchers. These days, electronic design automation (EDA) tools effectively perform this optimization, thereby relieving designers from such complex analyses. However, there may be situations where the pipeline stages are interlocked through feedback loops;1 such pipelines are called nonlinear or dynamic pipelines [4]. They are typically seen in floating-point addition and multiplication [5], and various dynamic pipeline architectures have accordingly been proposed for digital signal processing during the past three decades [6]. In those situations, it is usual for data that are fed back to collide with the following (newer) data before they can access a shared resource. It is then up to the controller to arbitrate the resource among the competing data, while simultaneously controlling the backpressure (by not allowing further data to enter the pipeline) [7,8]. This problem is not new and has been addressed several times in the literature. In Section 3 we present a simple example of such a pipeline to demonstrate a method of automatically loading new data into the pipeline without causing any collisions. We also observe how different loading strategies (when to, and when not to, allow new data to enter the pipeline) lead to different throughputs. Finding the best among those strategies, with respect to the throughput-to-area ratio, is an architecture-level problem, and to the best of our knowledge, there is no generalized methodology that optimizes this metric for an arbitrary nonlinear pipeline. Proposing an intelligent framework that effectively does this job therefore makes the primary contribution of this work.

☆ This paper is for regular issues of CAEE. Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. Guillermo Botella.
∗ Corresponding author. E-mail address: [email protected] (S.R. Naqvi).
https://doi.org/10.1016/j.compeleceng.2019.07.013
In what follows, we enumerate the significant contributions and major constituents of the proposed framework:

1. Using the existing knowledge, we develop a mathematical model of dynamic pipelined organizations, which formally explains their management - namely, collision avoidance and throughput computation.
2. An optimization engine, based on the genetic algorithm (GA), is proposed, which, besides preventing collisions between data items, guarantees a (near-)optimal throughput-to-area ratio. The engine is primarily supposed to determine the time slots at which loading newer data items should yield optimal throughput.
3. We perform exhaustive testing of the proposed framework on numerous nonlinear pipelined organizations by means of MATLAB simulations.
4. Upon obtaining acceptable results, a few of the given nonlinear pipelined organizations are modified as suggested by the optimization engine. Finally, we implement the latter on an FPGA for verification of the results in terms of maximum operational frequency and resource utilization, and for a quantitative comparison with the original design.

The rest of the paper is organized as follows: the mathematical model of a complex pipeline, and its management in terms of improving throughput, are presented in Section 2. For better understanding of the model, an example of pipeline management is given in Section 3. Our problem statement and the proposed framework are presented in Section 4 and Section 5, respectively. The simulation results and analysis are given in Section 6, and we conclude the article in Section 7.

2. Mathematical modeling of a nonlinear pipelined organization

2.1. Assumptions and definitions

Let P = {P1, ..., Pn} be a set of n function units, Lij be a link from Pi to Pj, L = {Lij} ∀(i, j) ∈ N with i ≠ j, where N = {1, ..., n}, and dw be a datum (valid or dummy) being processed by some Pk at a given point in time, where w ∈ {1, ..., ∞}. Then:

• Linear pipelining is an arrangement of Pi, ∀i ∈ N and Pi ∈ P, in series, such that ∃!Lij ∀(i, j) ∈ N with j = i + 1. Also, P1 and Pn respectively have LE1 and LnE, links with the environment (E), as the input and output of the system. At any point in time, each Pi processes a unique dw, without interlocking Pi−1 and Pi+1.
• Latency (φPi) of Pi, where Pi ∈ P, is the total time elapsed (δt) from the point at which an input is applied (ti) to the point at which the corresponding output is available (to). Let φ = {φi} be the set of latencies, where φi = φPi, and, for simplicity, let φi = 1, ∀i ∈ N. Therefore δt = (to − ti) = 1, henceforth called a time slot (τ). Similarly, $\varphi_P = \sum_{i=1}^{n} \varphi_i$, i.e. the total number of time slots dw stays within the system.
• Throughput (Θ) of a system is the number of datums it delivers at the output per τ. Ideally, for a linear pipeline, Θlin = 1.

Nonlinear pipelines, besides requiring slight alterations to the above definitions, introduce a few new concepts:

• Nonlinear pipelines (PNL) are linear pipelines except that j and i are independent in ∃!Lij ∀(i, j) ∈ N, i.e. there may exist feedback paths. Also, ∃LEi and ∃LoE, where (i, o) ∈ N.
• Sequence (S) is the order of P which each dw must traverse - represented as dw → S. For linear pipelines, Slin = {P1, P2, ..., Pn}, i.e. (Pj ≺ Pj+1) ∀j ∈ N. In nonlinear pipelines, however, the sequence is arbitrary, and defined by S = {Pl, ..., Pm} ∀(l, m) ∈ N. An instantaneous traversal of Pk by dw arriving over Lrk at τ0 will be represented as $d_w^{L_{rk}} \xrightarrow{\tau_0} P_k$.

1 Feedforward loops may also exist; in this work, however, we limit our focus to feedback paths alone, for simplicity. The same framework may easily be extended to feedforward paths.
• Collision between dw and dx occurs at Pk when both attempt to traverse Pk, from Lrk and Lsk respectively, in the same τ0. This is mathematically represented as

$$(d_w^{L_{rk}}, d_x^{L_{sk}}) \xrightarrow{\tau_0} P_k,$$

where (r, s) ∈ {E, N} and (w, x) ∈ {1, ..., ∞}. Similarly, no collision will be represented by $(d_w^{L_{rk}}, d_x^{L_{sk}}) \nrightarrow_{\tau_0} P_k$.

Lemma 2.1. In nonlinear pipelines, for D = {d1, ..., dz}:

$$\exists L_{rs} \wedge [(r - s + 1) < z] \longrightarrow (\Theta < 1) \tag{1}$$

Proof. By contraposition, Eq. (1) may be written as:

$$\neg(\Theta < 1) \longrightarrow \neg\big(\exists L_{rs} \wedge [(r - s + 1) < z]\big)$$
$$\Rightarrow \neg(\Theta < 1) \longrightarrow \neg(\exists L_{rs}) \vee \neg[(r - s + 1) < z]$$

Since Θ ≤ 1, ¬(Θ < 1) → (Θ = 1), and therefore

$$\Rightarrow (\Theta = 1) \longrightarrow \neg(\exists L_{rs}) \vee \neg[(r - s + 1) < z]$$

For Θ = 1, the definition of throughput given above asserts that either there is no such Lrs, i.e. it must be a linear pipeline (no feedback paths), or $(d_w^{L_{rk}}, d_x^{L_{sk}}) \nrightarrow P_k$, i.e. there must be no collisions. Here ¬[(r − s + 1) < z] indicates that the number of datums entering the pipeline is too small to cause a collision (fewer than the length of the feedback path) - hence the proof. □

Although not very different, the definition of throughput given above may be modified to be more meaningful for nonlinear pipelines as follows.

• For D = {d1, ..., di, ..., dz}, (r, s) ∈ {E, N}, k ∈ N, (w, x) ∈ {1, ..., i}, and τ0 representing the first time slot:

$$\Theta_{nonlin} = \frac{\mathrm{length}\left(D(1:i)\right)}{\tau_c - \tau_0}, \tag{2}$$

where $(d_w^{L_{rk}}, d_x^{L_{sk}}) \nrightarrow_{\tau_0:\tau_c} P_k$, and τc+1 is the first time slot at which di+1 may enter without causing a collision. This is interpreted as follows: throughput is the number of data items that may enter the pipeline, without causing collisions, per time slot. From this point onward, Θ = Θnonlin unless mentioned otherwise.

• For a given PNL, there exists a collision vector $\vec{V}_c = \{v_{l-1}, v_{l-2}, \ldots, v_0\}$, where l = φP − 1, and vi is defined by the following rule:

$$v_i = \big(\exists P_r, P_s,\ \exists L_{rs} \mid r - s = i\big)\ \&\ \big(\forall d_w \in D,\ d_w^{L_{rs}} \rightarrow P_s\big)\ ?\ 1 : 0 \tag{3}$$

That is, vi = 1 if $(d_w^{L_{rk}}, d_x^{L_{sk}}) \xrightarrow{\tau_0:\tau_c} P_k$ for x = w + 1, and hence dx must not be allowed to enter the pipeline in time slot τ0 + (i + 1) if dw entered in τ0.
Finally, we are able to define a nonlinear pipelined organization as follows:

• A nonlinear pipelined organization (ONL) is a septuple {P, φP, L, D, S, V⃗c, Θ}, in which a set of function units (P) having a set of latencies (φP), connected over a set of links (L), processes a set of data (D) following a certain sequence (S) of P, in such a way that a few data items may collide with one another, as determined by a collision vector (V⃗c), and hence yields a restricted throughput (Θ).

2.2. Avoiding collisions

Note that one V⃗c, as defined above, is only capable of sensing collisions among consecutive datums; however, $(d_w^{L_{rk}}, d_y^{L_{sk}}) \xrightarrow{\tau_0:\tau_c} P_k$ for y > w + 1 is also possible. What is usually done is to associate an instance of the same V⃗c, call it V⃗dw, with each datum that enters the pipeline; besides checking for a collision with dw+1, it may be exploited to avoid other collisions with dy. To do so, assume a vector G⃗ = {gl−1, gl−2, ..., g0}, and initialize it as G⃗ = V⃗c. To be able to avoid collisions, during each τ, update G⃗ according to the following rule to detect a collision:

$$\vec{G} = (g_0 == 0)\ ?\ S_{RL}(\vec{G}) + \vec{V}_c : S_{RL}(\vec{G}), \tag{4}$$

where SRL and + are binary operators for shift-right-logical and OR, respectively. This rule states that if the rightmost bit of G⃗ is zero, it is safe to load (enter) a new data item, say dw+1, without causing collisions at the next consecutive time slot, say τ0+1. Loading or not at τ0+1 is still optional, and the decision may lead to different throughputs; handling this choice such that the throughput is optimal will be addressed later. In case a datum is loaded, g0 becomes irrelevant for dw, and hence may be dropped by means of SRL; another instance of V⃗c then needs to be brought into the system, which is done by the OR operator. This process continues for the entire length of V⃗c, or until the sequence starts repeating, and may be depicted by a finite state machine (FSM) having at least one state and a maximum of 2^length(G⃗) states; the former specifically refers to the case of a linear pipeline. In the other case, i.e. (g0 == 1), the new datum cannot be entered; therefore, simply drop g0 by means of SRL. From here onwards, application of this update rule on G⃗ will be indicated as UG⃗.
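The update rule of Eq. (4) is easy to exercise in software. The following sketch (written in Python purely for illustration; the authors' own implementation, described in Section 5, is in MATLAB) treats G⃗ as an integer bit-vector, applies the rule under the greedy choice of loading whenever g0 = 0, and detects the resulting cycle to report the throughput of the greedy strategy:

```python
def greedy_throughput(vc):
    """Simulate Eq. (4) under the greedy strategy: whenever the rightmost
    bit g0 of G is 0, load a new datum (shift right, then OR with Vc);
    otherwise only shift right. Returns loads per time slot over one cycle.
    vc: collision vector as an integer (bit i set means that a datum
    entering i + 1 slots after another would collide)."""
    state = vc                 # G is initialized to Vc when the first datum loads
    seen = {state: (0, 0)}     # state -> (time slot, loads so far)
    t = loads = 0
    while True:
        if state & 1 == 0:     # g0 == 0: safe to load a new datum
            state = (state >> 1) | vc
            loads += 1
        else:                  # g0 == 1: a collision would occur, only shift
            state >>= 1
        t += 1
        if state in seen:      # cycle closed: average loads per slot
            t0, l0 = seen[state]
            return (loads - l0) / (t - t0)
        seen[state] = (t, loads)

# Collision vector "00010110" of the Section 3 example (forbidden latencies 2, 3, 5)
print(greedy_throughput(0b00010110))   # 2/7 ~ 0.2857
# A linear pipeline has Vc = 0 and achieves the ideal throughput of 1
print(greedy_throughput(0))            # 1.0
```

For the collision vector of the example in Section 3, this reproduces the greedy throughput of 2/7; for a linear pipeline it yields the ideal Θ = 1.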
Fig. 1. Example of cycles between a & c, d & f, 0 & g, and Vc & h.
2.3. Determining maximum throughput

Let F = {f1, f2, ..., fm} be the FSM for a given V⃗c, with V⃗c ≠ 0, where fi ≠ fj ∀(i, j) ∈ {1, ..., m}, f1 = 0, f2 = V⃗c, and fi = G⃗i−2 for 2 < i ≤ m, where G⃗i−2 follows Eq. (4) and the choice of loading must also be accounted for. Then, from the above discussion, it follows that:

$$f_i(g_0) \longrightarrow \exists!\, f_j \mid f_j = S_{RL}(f_i) \tag{5}$$

$$f_i(\neg g_0) \longrightarrow \exists!\, f_j, f_k \mid f_j = S_{RL}(f_i) \wedge f_k = S_{RL}(f_i) + \vec{V}_c \tag{6}$$

It is usual to observe in F that iterative application of UG⃗ takes fi → fj with UG⃗(fj) = fi, i.e. a cycle - comprising v = j − i + 1 states - is formed between the two states. An example of such an observation is presented in Fig. 1. We represent this cycle as Ci:j = {fi, ..., fj}. If FL = {fk} and FN = {fj} from Eq. (6), then from Eq. (2):

$$\Theta_{C_{i:j}} = \frac{\mathrm{length}\left(C_{i:j} \cap F_L\right)}{v} \tag{7}$$

Lemma 2.2. Different loading strategies of data items may lead to different throughputs. For FSMs comprising multiple cycles, therefore:

$$\Theta_{max} = \max\left(\Theta_{C_{i:j}}, \ldots, \Theta_{C_{m:n}}\right) \tag{8}$$
This proposition may be proven by contradiction: if the statement were not true, then each cycle would yield the same throughput, which is negated by a counterexample, e.g. the one given in Section 3.

2.4. Increasing throughput beyond the maximum

Respectively, from the definitions given by Eqs. (2) and (7):

$$\Theta \uparrow \;\Rightarrow\; (\tau_c - \tau_0) \downarrow \;\vee\; \mathrm{length}\left(D(1:n)\right) \uparrow \tag{9}$$

$$\Theta \uparrow \;\Rightarrow\; v \downarrow \;\vee\; \mathrm{length}\left(C_{i:j} \cap F_L\right) \uparrow \tag{10}$$

The former term in each of Eqs. (9) and (10) constrains Θ, and therefore carries insignificant weight in improving Θ beyond a certain threshold. On the other hand, the more datums that enter the pipeline, the greater Θ shall be - this is represented by the latter term in each of the above equations.

Theorem 2.3. Increasing latency may increase throughput.

Proof. Let di ≺ dj, and $(d_i^{L_{mk}}, d_j^{L_{nk}}) \xrightarrow{\tau_x} P_k$. We can prevent this collision by forcing di to access a dummy register Rmy at τx, and allow the newer datum dj to enter and traverse the entire pipeline without collisions: $(d_i^{L_{my}} \xrightarrow{\tau_x} R_{my}) \longrightarrow ((d_i, d_j) \nrightarrow_{\tau_x} P_k)$, where Pupdated = {P1, ..., Pn, Rmy} and Lupdated = {..., Lmy, Lyk, ...}. Consequently, $\varphi_{P_{updated}} = \sum_{i=1}^{n} \varphi_i + \varphi_{R_{my}}$, i.e. the latency of the system increases, leading to the following:

$$\Theta_{updated} = \frac{\mathrm{length}\left(D(1:i+1)\right)}{(\tau_c + 1) - \tau_0} > \Theta \tag{11}$$

Hence the proof. □

Here Θupdated may be generalized for e newer data items that may be inserted at the expense of g additional time slots, given by Eq. (12):

$$\Theta = \Theta_{updated} = \frac{\mathrm{length}\left(D(1:i+e)\right)}{(\tau_c + g) - \tau_0} \tag{12}$$
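As a quick sanity check of Eq. (12), one may substitute the figures of the example worked out later in Section 3, where two datums enter within seven time slots (Θ = 2/7) and inserting one dummy register admits one additional datum at the cost of two extra time slots; the pairing e = 1, g = 2 is our reading of that example's figures:

```latex
\Theta_{updated} = \frac{\mathrm{length}\left(D(1:i+e)\right)}{(\tau_c + g) - \tau_0}
                 = \frac{2 + 1}{7 + 2} = \frac{3}{9} > \frac{2}{7} = \Theta
```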
3. Pipeline management - an example and related works

As stated earlier, dynamic pipelines are usually seen in application-specific systems - namely digital signal processors, where they are used in efficient floating-point multiplication, fast Fourier transform engines, and digital filter designs [9,10]. While all of these works agreed on improving throughput by increasing latency, to the best of our knowledge, none of them proposed a mechanism to optimize it. The closest to our framework was a mathematical model of dynamic pipelines proposed by Zargham [11], in which the throughput-to-area ratio was defined as the objective function. In what follows, we use a simple nonlinear pipelined organization as an example, and demonstrate how to avoid collisions, and how to determine and improve the throughput at the cost of increased latency. This example is mainly an extract of what has been discussed in the literature on various occasions. We motivate the requirement for our contribution by raising a few questions on the conventional methodology before concluding the section.

3.1. Example of nonlinear pipeline management

Consider ONL = {P, φP, L, D, S, V⃗c, Θ}, with:

P = {P1, P2, P3, P4}
φ = {1, 1, 1, 1}
L = {LE1, L12, L23, L32, L34, L42, L4E}
D = {d1, ..., d∞}
S = {P1, P2, P3, P2, P3, P4, P2, P3, P4}

Since, for the given S, di at P2 may arrive from multiple sources (P1, P3, or P4), the resulting ONL has a multiplexer, as shown in Fig. 2. Fig. 3 (left) shows a reservation table for this pipeline. Here d1 enters the pipeline in time slot 1 (τ1), and leaves it in τ9 according to S; hence φP = 9. It is understandable that this data item will collide with the following data items at P2, P3, and P4 after a few time slots. The corresponding V⃗c = {v7, ..., v0} will be "00010110". While v0 = 0 suggests that it is possible to load d2 in τ2, v1 = 1 indicates a collision if we tried to enter d3 in τ3. This may be confirmed by manually inspecting Fig. 3 (right). The same figure also confirms that τ8 is the first time slot in which d3 may be loaded without causing any collisions, followed by d4 in τ9, and so on. According to definition (2), therefore, Θ = 2/7.

Fig. 2. An example of a nonlinear pipelined organization.
Fig. 3. Reservation table: left (forming collision vector), right (loading multiple datums).

On the other hand, recall that it is up to the control unit to decide in real time whether loading a data item in a particular time slot would cause a collision, using rule (4). Luckily, there exists a simple mechanism, based on an array of OR-gates and a shift register, that does this job effectively; it is shown in Fig. 4. Each bit of the collision vector is hardcoded on one input of an OR-gate, and the shift register is initialized to all zeros. On every clock edge (clk), the contents of the register are right-shifted once (unconditionally - mimicking the data traversing the pipeline). However, the output of each OR-gate is loaded into the register, using the Load signal, only when it is safe to load new data.

Fig. 4. Hardware for real-time collision detection.
Fig. 5. State transition diagram.

Recall that whether to load a new datum at, e.g., τ2 was optional, and the choice leads to multiple cycles Ci:j, and hence to different throughputs. The FSM in Fig. 5 presents all the possible loading strategies (cycles); C1:3, i.e. the cycle between "00000001" and "00000011", is called the greedy strategy, in which a datum is loaded on every opportunity. Usually, the greedy strategy will result in the maximum throughput; however, this may not always be true. According to Eq. (7), ΘC1:3 = 2/7 (with dashes), ΘC1:2 = 1/6 (with solid lines), and ΘC11:13 = 1/4 (with dots and solid lines). By making use of Theorem 2.3, we may insert d3 in τ3, and prevent d1 from traversing P2 in τ4. This requires us to include R32 in the system, which increases φP to 10, as shown in Fig. 6 (left). The right table in the same figure verifies that d1 to d3 may now enter without collisions, yielding an increased Θ = 3/9, which indicates a 17.8% improvement. Note that V⃗c will be different for this mechanism, i.e. "00100100", and so will the FSM be - hence possibly resulting in various new cycles yielding different throughputs. However, this time we have incurred an overhead of one register, R32, on the overall area of the circuit. A few obvious questions that arise from this analysis are:

1. Why not insert another register on L32, and insert d4 in τ4?
Similarly, we may place another register and hence insert d5 in τ5, and so forth - thereby constantly increasing the throughput at the expense of additional registers. When should we stop?
2. How do we choose the path on which to place the registers, in order to maximize the throughput?
3. Is there a formal method to determine the optimal throughput, and therefore an optimal number of registers and the corresponding path?

Answering all these questions makes the core contribution of this work.

3.2. Other related works

Shar and Davidson [12], in a pioneering work, coined the concept of using pipelines to achieve multiprocessing. They proposed to break up each instruction phase into a number of minor cycles that could be performed in sequence. By making use of reservation tables for collision avoidance, they concluded that the pipelined type of architecture permitted the throughput to be doubled using less than twice the number of components. Furthermore, this technique also enabled each processing stage to be smaller, leading to reduced dynamic power consumption. Benoit and Robert [13] considered a simplified model for linear pipelines and fork graphs, and showed that several crucial criteria, including latency and throughput, should be optimized for such applications. They motivated to replicate
Fig. 6. Reservation table for improved throughput.
pipeline, and fork, stages to increase the throughput, by optimally mapping successive data items onto different processing elements. The authors also proposed to data-parallelize each pipeline stage, such that the computation of each stage was shared by multiple processing elements. This resulted in a decrease in latency, and in turn an increase in throughput. Similarly, Subhlok and Vondran [14] considered a linear chain of pipelined processing elements, called tasks, and identified the best mapping by determining optimal tradeoffs between latency and throughput against various constraints. They proposed an algorithm that primarily minimized the latency for a minimum throughput, and demonstrated that it could also be used for maximizing the throughput under a minimum-latency constraint. They used the Fast Fourier Transform (FFT) as the benchmark program, and implemented the proposed method as an automatic mapping tool for the Fx compiler. Among the more recent developments on performance optimization of data pipelines, the works carried out by Smirnov and Taubin [15] and Gill and Singh [16] have been the most prominent. Each of those works analyzed asynchronous pipelines, and proposed heuristic-based approaches for analyzing and optimizing their throughputs. The first of the two works focused on asynchronous pipelines with data-independent token flow, whose throughput is related to the number of tokens in the pipeline and the latency of each individual pipeline stage. In the second work, Gill and Singh came up with an analytical approach, based on canopy graphs, which could achieve a target throughput by systematically eliminating system-wide bottlenecks. The latter was achieved by a thorough tree search, where each node represented a solution (a throughput), which, if it satisfied the target in the presence of various cost functions, including area and energy, was considered the optimal solution.
Although each of these works proposed a novel solution towards optimizing data pipelines, they all considered the simpler case of linear pipelines. What makes our proposed framework unique is its ability to handle the intricacies involved with feedback loops, while exploiting a genetic algorithm to achieve near-optimal throughput in the presence of area constraints.

4. Problem and objective statement

The example presented above demonstrates that there may be multiple solutions for avoiding collisions - each resulting in a different throughput. Generally, each method requires the latency of a data item through the pipeline to be increased by placing storage elements (registers) on the execution paths. Yet, the covered example does not point out the most efficient solution in terms of throughput, since the more registers, the higher the throughput may become. It is therefore reasonable to believe that even an infinite number of examples will not be sufficient to determine the optimal loading strategy, especially since each register will cost some area on the silicon wafer once fabricated. We believe there is a need for a formal mechanism to determine which feedback path to place the storage elements on, and how many such elements will lead to the optimal throughput to area-on-the-die ratio. Our objective statement (OS) is given by Eq. (13):

$$OS: \max_{L,R}\ \frac{\Theta}{A} \tag{13}$$

subject to:
C1: A > 0
C2: 0 < length(L) ≤ lmax
C3: 0 ≤ length(R) ≤ rmax
C4: A % R = 0

Here L represents a vector of all feedback paths, i.e. L = {Lij} ∀i > j, and R is a vector of additional registers inserted on each Lij; there is thus a one-to-one mapping between L and R, and together they are referred to as the decision variables. C1 ... C4 are called constraints, and are interpreted as follows. C1 states that the area of the organization, pre- and post-optimization, can be neither zero nor negative. The constraints C2 and C3 limit the number of feedback paths within the organization, and the number of registers that may be placed on each path, to some integers lmax and rmax, respectively; otherwise there may be infinitely many solutions to our optimization problem. lmax and rmax are inputs to our system, which we shall define in Section 6. Note that length(L) cannot be zero, according to C2, since a nonlinear pipeline needs at least one feedback path. Also, R can be an empty set, ∅, according to C3, since the original organization, without any inserted register, could also yield the optimal result. C4, for simplicity, binds the area of the organization to be an integer multiple of the area of an individual register R. Our OS is interpreted as follows: we have to tune L and R in such a way that Θ/A is maximized, where Θ is given by Eq. (12), and A is the modified area given by A = Aoriginal + length(R).

5. Proposed research methodology

The proposed framework comprises two phases: Pipeline Management and Genetic Algorithm.

5.1. Phase 1: pipeline management

The first phase begins by automating the complete procedure described above. That is, a program is developed in MATLAB which, upon execution, asks the user to enter P and S. In response, it generates ONL, including V⃗c and Θ. Our model determines the feedback paths within ONL using S, and eliminates the redundant ones.
We assume that in a linear pipeline, Si+1 > Si; whenever this is violated by a pair {Pi, Pj}, it will be considered a feedback path. Algorithm 1 presents our methodology for determining the unique feedback paths. Note that the unique feedback paths will be fed to Phase 2, where the population will be generated. Meanwhile, V⃗c may be generated once the organization is known. Recall that V⃗c considers the time slots at which a certain function unit is iteratively accessed; the corresponding bits in the vector are turned on, and otherwise they remain zero. Our methodology is presented in Algorithm 2.

The V⃗c is subsequently fed to the state-machine generator given by Algorithm 3. The generator makes use of a depth-first strategy: the greedy strategy is computed first (a new data item is loaded upon each even number encountered), which yields a column vector called next_FSM. The latter is then iteratively accessed for each even number, where the choice of loading is applied (by not loading against the first even number). While handling each even number, another column vector is generated, and appended to the previous ones, given by current_FSM. In this manner we obtain two strategies for each even number, and the handled even number is preserved in a list called done_list, to avoid an indefinite loop - otherwise the generator would keep handling the same number repeatedly. The algorithm terminates once all the columns in current_FSM have been exhausted, and each even number is on the done_list. In order to apply the rule given by Eq. (5), the FSM generator exploits Algorithm 4. The inputs to this algorithm include a column vector, to which the rule is to be applied, and the entire state machine. Each column in the FSM has 0 and V⃗c
Algorithm 1: Determine unique feedback paths.
Input: P = {P1, P2, ..., Pn}, S = {P1, ..., Pn, ..., Pm}
Parameters: l ← length(S); fB ← [2 × l] matrix of feedback paths; Zi ← vector of nonzero indices ∈ fB; m ← length(Zi)
Output: UfB ← matrix of unique feedback paths

fB = zeros(2, l)
for i = 1 to (l − 1) do
    if S(i) > S(i + 1) then
        fB(1, i) = S(i)
        fB(2, i) = S(i + 1)
Zi = find(fB(1, :))          // find nonzero indices in fB
UfB = fB(:, Zi)
for j = 1 to (m − 1) do
    for k = (j + 1) to m do
        if fB(:, Zi(j)) == fB(:, Zi(k)) then
            UfB(:, k) = [ ]  // delete the redundant paths
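As a cross-check of Algorithm 1, the same computation can be expressed in a few executable lines (Python here, purely for illustration; the original is a MATLAB routine): a feedback path exists wherever the sequence steps backwards, and duplicate pairs are dropped.

```python
def unique_feedback_paths(S):
    """Return the unique feedback pairs (src, dst) implied by a sequence S
    of function-unit indices. A feedback path exists wherever the sequence
    steps backwards, i.e. S[i] > S[i + 1] (cf. Algorithm 1)."""
    paths = []
    for i in range(len(S) - 1):
        if S[i] > S[i + 1]:                # backward step => feedback path
            pair = (S[i], S[i + 1])
            if pair not in paths:          # keep only the unique pairs
                paths.append(pair)
    return paths

# Sequence of the Section 3 example: S = {P1, P2, P3, P2, P3, P4, P2, P3, P4}
print(unique_feedback_paths([1, 2, 3, 2, 3, 4, 2, 3, 4]))   # [(3, 2), (4, 2)]
```

For the Section 3 example, this recovers exactly the feedback links L32 and L42 of the given L.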
as its first two elements. The algorithm starts accessing the array from the second element, and checks whether the number it encounters is even. If so, the rule applied is SRL followed by logical OR; otherwise only SRL is applied. Upon obtaining a loading strategy by Algorithm 4, its corresponding throughput may be computed using Algorithm 5. It requires a vector V⃗, representing a loading strategy, and a point of return, R; for instance, 0, V⃗c, a, and d are points of return in Fig. 1. This algorithm counts even numbers (representing loads) between R and the last element of V⃗, and divides that count by the number of transitions between these two points to determine Θ, as given by Eq. (7).

5.2. Phase 2: genetic algorithm

In the second phase, the metaheuristic optimization technique of the genetic algorithm (GA) [17] is exploited, which is inspired by the process of natural selection. Mostly, this technique is exercised in situations where the solution search space is not well understood and relatively unstructured [18]. A GA typically requires a fitness function, which, as a single figure of merit, identifies how close a proposed design is to achieving the set objectives [19]. In our case, the fitness function is Θ/A, and the population comprises chromosomes, which are bit vectors encoding the target parameters to be optimized, i.e. a representation of all the paths in the pipeline and the number of registers. As stated earlier, since there can be infinitely many solutions to our OS, the GA suits us the most.
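To make the second phase concrete, the following is a minimal, self-contained sketch of such a GA (Python for illustration). The chromosome holds, for each feedback path, the number of inserted registers (0 ... rmax), and the fitness is Θ/A per Eq. (13). Since the full Phase 1 machinery (Algorithms 1-5) is not reproduced here, `evaluate_throughput` is a stand-in toy model seeded with the Section 3 numbers via the length(D(1:i+e))/(τc+g) form of Eq. (12); all identifiers, parameter values, and the toy fitness in this sketch are illustrative assumptions, not the authors' implementation.

```python
import random

RMAX, LPATHS, A_ORIG = 3, 2, 10   # illustrative bounds: rmax, #feedback paths, original area

def evaluate_throughput(chromosome):
    # Placeholder for Phase 1 (Algorithms 1-5): each register admits one
    # extra datum at the cost of two extra slots, per Eq. (12)'s form,
    # seeded with the Section 3 figures (2 datums in 7 slots).
    e = sum(chromosome)
    return (2 + e) / (7 + 2 * e)

def fitness(chromosome):
    area = A_ORIG + sum(chromosome)       # A = A_original + length(R); C1 holds
    return evaluate_throughput(chromosome) / area

def genetic_search(pop_size=20, generations=50, pmut=0.2, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, RMAX) for _ in range(LPATHS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]              # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, LPATHS) if LPATHS > 1 else 0
            child = a[:cut] + b[cut:]                 # one-point crossover
            if rng.random() < pmut:                   # point mutation in [0, rmax]
                child[rng.randrange(LPATHS)] = rng.randint(0, RMAX)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = genetic_search()
print(best, fitness(best))
```

With this toy fitness the search settles on a small number of additional registers, at the point where the marginal throughput gain stops paying for the extra area; with the real Phase 1 model plugged in, the same loop optimizes Θ/A under constraints C1-C4.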
Algorithm 2: Determine the collision vector Vc.

Input: S
Parameters: l ← length(S), lv ← length(Vc), tv ← temporary vector
Output: Vc ← collision vector

  lv = l − 1
  tv = zeros(1, lv)
  // check if the same Pi is accessed more than once
  for i = 1 to l do
    for j = (i + 1) to l do
      if S(i) == S(j) then
        tv(j − i) = 1
  Vc = rotate_lr(tv)  // flip tv in the left-to-right direction
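Algorithm 2 can be prototyped directly. The sketch below is in Python rather than the paper's MATLAB, with the usage sequence S given as a list of stage labels; the function name is ours:

```python
def collision_vector(S):
    """Algorithm 2: mark latency d as forbidden whenever the same
    stage Pi appears d cycles apart in the usage sequence S."""
    l = len(S)
    tv = [0] * (l - 1)              # tv[d-1] = 1 iff latency d collides
    for i in range(l):
        for j in range(i + 1, l):
            if S[i] == S[j]:
                tv[j - i - 1] = 1   # 0-based index for latency j - i
    return tv[::-1]                 # rotate_lr: flip left to right

# P1 recurs 5 cycles apart and P2 recurs 3 cycles apart
print(collision_vector(["P1", "P2", "P3", "P4", "P2", "P1"]))  # [1, 0, 1, 0, 0]
```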
Algorithm 3: Generate FSM and calculate Θ.

Input: Vc
Parameters: current_FSM, next_FSM, lv ← length(V), Θ ← vector throughput, done_list, col_in_FSM, row_in_FSM
Output: Θmat ← matrix of all throughputs

  V = binary_to_decimal(Vc)
  V = [0, V]ᵗ
  [Θ, next_FSM] = Compute(V, current_FSM)  // generate greedy strategy, and compute its Θ
  Θmat = Θ
  current_FSM = next_FSM
  done_list(1) = 1
  j = 1
  while (1) do
    V = current_FSM(:, j)
    for i = done_list(j) + 1 to lv do
      if ¬mod(V(i), 2) && V(i) ≠ 0 then
        VR = [V(1:i); SRL(V(i))]  // SRL ← shift right logical
        [Θ, next_FSM] = Compute(VR, current_FSM)
        Θmat = [Θmat, Θ]
        [col_in_FSM, row_in_FSM] = size(next_FSM)
        done_list(col_in_FSM) = i
        current_FSM = next_FSM
    if j < col_in_FSM then
      j++
    else
      break
Algorithm 4: Core computation and calculate Θ.

Subroutine: Compute (Eq. (4))
Input: V, current_FSM
Parameters: l ← length(V), term ← last element in V, R ← index of term
Output: Θ, next_FSM

  Vc = decimal_to_binary(V(2))
  while (1) do
    term = V(l)
    if ¬mod(term, 2) then
      term = binary_to_decimal(SRL(decimal_to_binary(term)) ∨ Vc)
    else
      term = binary_to_decimal(SRL(decimal_to_binary(term)))
    if term ∉ V then
      V = [V; term]
    else
      Θ = Determine(R, V)  // R ← index of term, if term ∈ V
      break
  next_FSM = [current_FSM, V]
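The core update inside Algorithm 4, SRL on an even term followed by a logical OR with Vc, and SRL alone on an odd term, becomes compact when states and collision vectors are encoded as integers. A minimal Python sketch under that assumed encoding (the function name is ours):

```python
def next_state(term, vc):
    """One transition of Algorithm 4's state update, integer-encoded.

    An even term (LSB 0) means the next latency is permitted, so a new
    item is loaded: shift right logical, then OR in the collision
    vector vc. An odd term is merely shifted.
    """
    if term % 2 == 0:
        return (term >> 1) | vc     # SRL followed by logical OR
    return term >> 1                # SRL only

# With Vc = 0b110: state 0b100 loads and becomes 0b110; 0b101 just shifts
print(next_state(0b100, 0b110), next_state(0b101, 0b110))  # 6 2
```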
Algorithm 5: Calculate Θ.

Subroutine: Determine
Input: R, V
Parameters: lv ← length(V), b2d ← binary to decimal, d2b ← decimal to binary
Output: Θ

  Θ = zeros(2, 1)
  Vc = d2b(V(2))
  for i = R to (lv − 1) do
    if ¬mod(V(i), 2) && V(i + 1) == b2d(d2b(V(i)/2) ∨ Vc) then
      Θ(1)++
  // Repeat the same check for V(lv)
  if V(R) == b2d(d2b(V(lv)/2) ∨ Vc) then
    Θ(1)++
  Θ(2) = lv − (R + 1)

Table 1
Chromosome for the organization with R32 inserted.

| Chromosome | R1 | R12 | R23 | R32 | R34 | R42 | R4 | Collision Vector |
| Binary equivalent | 000 | 000 | 000 | 001 | 000 | 000 | 000 | 000,000,000,100,100 |
Table 2
Chromosome for the organization with two registers on L32.

| Chromosome | R1 | R12 | R23 | R32 | R34 | R42 | R4 | Collision Vector |
| Binary equivalent | 000 | 000 | 000 | 010 | 000 | 000 | 000 | 000,000,001,001,100 |
In a GA, the initial population is randomly generated, and then in each round, the fitness of all chromosomes in the population is computed [20]. The chromosomes with the highest fitness, called the parents, are kept in the next population, and the rest are dropped. A set of genetic operations, crossover and mutation, is then performed on the parents to produce their offspring, which combine portions of their parents and are likely to carry higher fitness values [21]. These offspring, along with their parents, are then included in the next generation of the population, thereby keeping the number of individuals in the population the same as before. This process is repeated until the algorithm converges to a possible solution: at some point, Θ/A starts to saturate, and this is where the algorithm terminates. The chromosomes that could represent the system given by Fig. 2, with R32 inserted, and another organization having two registers on the same L32 path, are given in Tables 1 and 2, respectively. Note that b bits (here b = 3 as an example)
Table 3
Post-crossover offspring of the chromosomes of Tables 1 and 2.

| Chromosome | R1 | R12 | R23 | R32 | R34 | R42 | R4 | Collision Vector |
| Binary equivalent | 000 | 000 | 000 | 011 | 000 | 000 | 000 | 000,000,010,010,100 |
have been reserved for 2^b possible registers on each path in the organization, where R1 and R4 indicate the input to P1 and the path from P4 to the output, respectively. Since 2^b registers, if inserted, will add a delay of the same number of clock cycles, the length of the collision vector must be adjusted accordingly. Note that the collision vector depends upon the number of registers inserted; hence it is not randomized.

5.3. Arithmetic crossover

The first of the two specialized GA operators, called crossover [22], is defined as follows. It generates an offspring (F) by interleaving two chromosomes (X):

  F^i = (x^i_1, x^i_2, ..., x^i_j), ∀ i ∈ {1, 2}

where

  x^1_k = λX^1_k + (1 − λ)X^2_k,
  x^2_k = λX^2_k + (1 − λ)X^1_k.

λ is a constant called the crossover point, and we have assumed it to be a random variable, where λ ∈ [0, 1]; e.g., λ = 0.5 would mean interleaving half of the bits from each chromosome. The chromosomes are expressed as follows:

  X^1 = (X^1_1, X^1_2, ..., X^1_j),
  X^2 = (X^2_1, X^2_2, ..., X^2_j).

If the crossover were applied to the two chromosomes of Tables 1 and 2 on bit 11, it would have produced the offspring shown in Table 3. Note that the first 11 bits of the second chromosome replace those of the first chromosome. The resulting bit vector now reflects a total of three registers on L32.

5.4. Mutation

The other operator, called mutation [22], randomly flips a few bits in the chromosome, resulting in a genetically diverse offspring. Let Gmax be the maximum number of generations and G_p be a generation on which mutation is applied; then:
  X_i = X_i + δ(G_p, κ_i − X_i), if X_i = 0,
  X_i = X_i + δ(G_p, X_i − τ_i), if X_i = 1,    (14)
where (κ_i, τ_i) ∈ {0, 1}, with probability 10^−1. The crossover is likely to maintain diversity among individuals, while the mutation operation helps avoid local minima. Crossover and mutation are applied to the chromosomes with the highest fitness in every round of execution to produce better offspring, in the hope of maximizing the throughput-to-area ratio. Furthermore, we follow the rank-based selection method [23] to select, during each round of execution, the 50% of the population having the best fitness as survivors, while the other half is replaced by the set of newly generated offspring [24].

Fig. 7 presents the operation of the entire framework in a nutshell. While most of the aspects are self-explanatory, it is essential to point out that the two components, i.e., the GA Optimizer (GAO) and the Pipeline Management Unit (PMU), process and exchange data back and forth. From the generated population, a chromosome is selected and sent to the PMU. The latter is responsible for generating the collision vector and the FSM, and for computing the throughput of each loading strategy available within the FSM. The best amongst those is chosen and sent back to the GAO. This process continues until the population is exhausted. After that, rank-based selection is applied by the GAO, followed by the crossover and mutation operators, to yield better offspring. The entire process is repeated iteratively on the offspring for several generations.

6. Simulation results and analysis

In what follows, we present results of the proposed optimization algorithm, and post-layout simulations of a few selected pipelined organizations implemented on a Xilinx FPGA. Our optimization framework utilized a MATLAB implementation, which, in the first phase, randomly generated 10,000 nonlinear pipelined organizations. In the second phase, the developed algorithm was run for each organization, and the throughput was calculated.

6.1.
Randomly generated nonlinear pipelined organizations

Following are eight randomly selected organizations from the total of 10,000 different nonlinear pipelined organizations, comprising a few from the ISCAS High-Level Benchmark Models [25]. In the rest of the manuscript, we will keep our focus on these eight, and on the first one presented in Fig. 2. Note that organizations O5...O7 are given as samples in Fig. 8.

O1: S1 = {P1, P2, P4, P5, P3, P1, P6, P7, P4, P5, P6, P8, P9, P10, P7, P6, P8, P9, P10}
O2: S2 = {P1, P6, P4, P8, P10, P2, P3, P1, P5, P6, P7, P8, P9, P10, P6, P7, P8, P9, P10}
S.R. Naqvi, A. Zahid and L. Sawalha et al. / Computers and Electrical Engineering 78 (2019) 242–258
253
Fig. 7. Proposed Framework in Operation.
O3: S3 = {P1, P2, P3, P4, P5, P6, P7, P9, P10, P6, P7, P8, P9, P10, P7, P8, P9, P9, P10}
O4: S4 = {P1, P1, P2, P2, P3, P3, P4, P5, P6, P7, P5, P6, P7, P8, P9, P10, P10, P9, P10}
O8: S8 = {P2, P3, P1, P2, P4, P3, P4, P5}

6.2. Optimization results and discussion

In order to keep the complexity minimal, C2 and C3, from Section 4, have been utilized in the following manner: (i) lmax = 6, (ii) rmax = 8; beyond this point, the length of each chromosome would exceed 18, and therefore the complexity would be enormous, even for as few as 100 generations per organization. C1 and C4 are applied the same way they were defined. Figs. 9 and 10 present the convergence of the GA to the best values of the fitness function, for organizations O3 and O4, respectively. Part (a) in each figure corresponds to the case where the original organization costs the same, in terms of area utilization, as an individual register. This is the most pessimistic scenario², since circuits that are usually seen, let alone complete processors, may be hundreds of thousands of times larger than a register. Inserting registers in this scenario will badly affect the fitness function, since the denominator (A) will grow in multiples. In most such cases, the objective function will be smaller than that of the original organization, and as a result, our framework will point to the latter as the best solution. Similarly, parts (b), (c), and (d) correspond to the cases where a register to be inserted is 100, 1000, and 10,000 times smaller than the original organization, respectively. Increasing this difference in area utilization is realistic, and seems to favor our proposition of increasing latency to increase the throughput. However, there exists a threshold point beyond which the difference in area has no effect, and the fitness function starts to saturate. These findings are evident in Table 4.
The latter summarizes the fitness function for O1 to O8, where the area of each organization was varied between 1 and 10,000 times that of an individual register. It may be observed for each organization that, as the area exceeds 100 times that of a register, the fitness function shows only marginal improvement. Note that the values have been rounded to two decimal places; otherwise the values in the bottom row would be the largest, although only by a negligible amount.

² This case is in fact unrealistic, since the overall area of an organization comprising n pipelined stages, and therefore n − 1 pipeline registers, cannot be equal to that of a single register.
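One generation of the optimizer described in Sections 5.2 to 5.4 (rank-based selection of the fitter half, single-point crossover, and low-probability bit flips) can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB code; the fitness function passed in is a hypothetical stand-in for the Θ/A value returned by the PMU:

```python
import random

def one_generation(population, fitness, crossover_bit=11, p_mut=0.1):
    """Advance the GA by one generation on bit-list chromosomes.

    The fitter half survives (rank-based selection); offspring are built
    by single-point crossover of two random survivors and per-bit
    mutation with probability p_mut, keeping the population size fixed.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    offspring = []
    while len(survivors) + len(offspring) < len(population):
        a, b = random.sample(survivors, 2)
        child = b[:crossover_bit] + a[crossover_bit:]   # crossover at bit 11
        child = [g ^ 1 if random.random() < p_mut else g
                 for g in child]                        # bit-flip mutation
        offspring.append(child)
    return survivors + offspring

# Example: 18-bit chromosomes, as bounded in Section 6.2; fitness = number of ones
pop = [[random.randint(0, 1) for _ in range(18)] for _ in range(8)]
new_pop = one_generation(pop, fitness=sum)
```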
Fig. 8. Block diagrams for three randomly generated organizations O5...O7.

Table 4
Comparison of throughput-to-area ratio (Θ/A) for organizations O1 to O8.

| Area Variation | O1 | O2 | O3 | O4 | O5 | O6 | O7 | O8 |
| Original (Eq. (8)) | 0.30 | 0.26 | 0.17 | 0.25 | 0.25 | 0.28 | 0.42 | 0.33 |
| A_original = R | 0.15 | 0.05 | 0.03 | 0.06 | 0.08 | 0.14 | 0.21 | 0.16 |
| A_original = R × 10² | 0.33 | 0.30 | 0.24 | 0.32 | 0.33 | 0.50 | 0.50 | 0.50 |
| A_original = R × 10³ | 0.33 | 0.30 | 0.25 | 0.33 | 0.33 | 0.50 | 0.50 | 0.50 |
| A_original = R × 10⁴ | 0.33 | 0.30 | 0.25 | 0.33 | 0.33 | 0.50 | 0.50 | 0.50 |
Table 5 Number of feedback paths inserted for optimal /area. Feedback path 1 2 3 4 5 6
O1
O2
O3
O4
2 1 1
1
O5
O6
2
O7
O8
1 1
1
4 2 1
The values in Table 4 are computed as follows. Consider column O5 for reference. The original throughput that the framework computed was 0.25, and the optimizer placed two registers on the feedback path from P2 to P2 to yield the optimal Θ = 0.3333; see O5 in Fig. 8. For the row corresponding to the case A_original = R × 10², Θ/A = 0.3333/(100R + 2R). If we assume, for simplicity, A_original = 1, then R = 0.01. As a result, Θ/A = 0.3333/(1 + 2(0.01)) = 0.3267 ≈ 0.33. Similarly, we need to know the number of registers placed on the feedback paths to compute all other values in Table 4. Table 5 summarizes this information.
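The O5 computation above can be checked mechanically; the values below are taken directly from the text:

```python
theta = 0.3333              # optimized throughput of O5
R = 0.01                    # register area, with A_original normalized to 1
A = 1 + 2 * R               # original organization plus two inserted registers
print(round(theta / A, 2))  # 0.33, the O5 entry in Table 4
```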
6.3. Optimized hardware implementation

In order to validate the results of the proposed approach, and to demonstrate its effectiveness in real-world applications, we designed, simulated, and implemented all eight organizations (O1...O8) on a Virtex-5 Field Programmable Gate Array (FPGA)
Fig. 9. Optimization results for the organization O3 .
Table 6
Post-route comparison between the two organizations.

| Resource/Metric | Original organization | Optimized organization |
| Slice registers | 5 | 6 |
| Slice LUTs | 4 | 4 |
| LUT flip-flop pairs used | 9 | 10 |
| Max. operational frequency | 730.514 MHz | 924.30 MHz |
| Θ/Area | 0.0156 | 0.0165 |
from Xilinx. Note that some of these organizations (O1...O3) consist of ten pipeline stages, which is rarely seen even in larger designs such as crypto-systems. For the proof of concept, and due to the limitation of space, here we just use the same example that was presented in Section 3, and replace each Pi by a different logic gate. For simplicity in the design, we considered the data to be 2-bit wide, and logic level 1 (high) should always enter the pipeline as a new input via an AND gate. This is depicted in Fig. 11. Note that the register drawn with a bold dashed line on L32 is only considered in the optimized organization. The select line, Sel[3:0], is controlled by a different state machine for each of the original and the optimized organizations. The post-route simulations for the two organizations are given in Fig. 12a and b, respectively. It may be observed that Θ_original = 2/7 (Dout remains high for two successive clock cycles, and remains low for five). Similarly, Θ_optimized = 3/9; each result conforms with our theoretical observations in Section 3. In terms of the computed throughput, the optimized organization proved to be approximately 1.5 times more efficient than the original one, whereas the overhead, in terms of device utilization, may be observed in Table 6. The bottom row in the table verifies that the optimized organization is approximately 1.507 times more efficient than the original one in terms of throughput-to-area ratio. Naturally, this difference will be more pronounced for larger circuits.

Fig. 10. Optimization results for the organization O4.

Fig. 11. The nonlinear pipelined organization of Fig. 2 with the Pi replaced by logic gates.
Fig. 12. Post-route simulations for (a) original; (b) optimized organizations.
7. Conclusion

We have proposed a mathematical model for complex computer pipelines, and used it to demonstrate that data items in such pipelines are likely to collide with one another. Furthermore, we have provided ways of handling those collisions in real time, and have shown that the throughput that each solution yields is usually different from the others. Our framework for managing complex pipelines also comprises an optimization method based on evolutionary techniques, which finds the solution that yields maximum throughput, however at the expense of additional storage elements within the given pipeline organization. In order to constrain the overall impact of additional registers on the system's area to an acceptable level, the optimization method assumes the throughput-to-area ratio as the cost function. Our results are based on simulating 10,000 different pipeline organizations in MATLAB, and have been cross-verified using a Field-Programmable Gate Array implementation and post-route simulations.

Conflict of interest

The authors declare no conflict of interest.

Acknowledgments

This research work is partially sponsored by the Pakistan Science Foundation against project number PSF/Res/PCIIT/Engg (159).

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.compeleceng.2019.07.013.

References

[1] Vijay M, Punithavathani DS. Implementation of memory-efficient linear pipelined IPv6 lookup and its significance in smart cities. Comput Electr Eng 2018;67:1–14.
[2] Hill MD, Marty MR. Amdahl's law in the multicore era. Computer 2008;41(7).
[3] Zhang X, Parhi KK. High-speed VLSI architectures for the AES algorithm. IEEE Trans Very Large Scale Integr (VLSI) Syst 2004;12(9):957–67.
[4] Chao L-F, LaPaugh AS, Sha E-M. Rotation scheduling: a loop pipelining algorithm. IEEE Trans Comput Aided Des Integr Circuits Syst 1997;16(3):229–39.
[5] Ertl MA, Krall A. Instruction scheduling for complex pipelines. In: Proceedings of the international conference on compiler construction. Springer; 1992. p. 207–18.
[6] McKinney BC, El Guibaly F. A multiple-access pipeline architecture for digital signal processing. IEEE Trans Comput 1988;37(3):283–90.
[7] Najvirt R, Naqvi SR, Steininger A. Classifying virtual channel access control schemes for asynchronous NoCs. In: Proceedings of the IEEE 19th international symposium on asynchronous circuits and systems (ASYNC). IEEE; 2013. p. 115–23.
[8] Naqvi SR, Najvirt R, Steininger A. A multi-credit flow control scheme for asynchronous NoCs. In: IEEE 16th international symposium on design and diagnostics of electronic circuits & systems (DDECS). IEEE; 2013. p. 153–8.
[9] Kogge PM. The architecture of pipelined computers. CRC Press; 1981.
[10] Nguyen H, Khan S, Kim C-H, Kim J-M. A pipelined FFT processor using an optimal hybrid rotation scheme for complex multiplication: design, FPGA implementation and analysis. Electronics 2018;7(8):137.
[11] Zargham MR. Computer architecture: single and parallel systems. Prentice-Hall, Inc.; 1996.
[12] Shar LE, Davidson ES. A multiminiprocessor system implemented through pipelining. Computer 1974;7(2):42–51.
[13] Benoit A, Robert Y. Complexity results for throughput and latency optimization of replicated and data-parallel workflows. Algorithmica 2010;57(4):689–724.
[14] Subhlok J, Vondran G. Optimal latency-throughput tradeoffs for data parallel pipelines. In: Proceedings of the eighth annual ACM symposium on parallel algorithms and architectures. ACM; 1996. p. 62–71.
[15] Smirnov A, Taubin A. Heuristic based throughput analysis and optimization of asynchronous pipelines. In: Proceedings of the 15th IEEE symposium on asynchronous circuits and systems. IEEE; 2009. p. 162–72.
[16] Gill G, Singh M. Automated microarchitectural exploration for achieving throughput targets in pipelined asynchronous systems. In: Proceedings of the IEEE symposium on asynchronous circuits and systems. IEEE; 2010. p. 117–27.
[17] Zheng Z, Zheng Z. Towards an improved heuristic genetic algorithm for static content delivery in cloud storage. Comput Electr Eng 2018;69:422–34.
[18] Anderson-Cook CM. Practical genetic algorithms. J Am Stat Assoc 2005;100(471):1099.
[19] Mirjalili S. Genetic algorithm. In: Evolutionary algorithms and neural networks. Springer; 2019. p. 43–55.
[20] Naqvi SR, Akram T, Haider SA, Kamran M, Shahzad A, Khan W, et al. Precision modeling: application of metaheuristics on current-voltage curves of superconducting films. Electronics 2018;7(8).
[21] Yin B, Guo Z, Liang Z, Yue X. Improved gravitational search algorithm with crossover. Comput Electr Eng 2018;66:505–16.
[22] Mc Ginley B, Maher J, O'Riordan C, Morgan F. Maintaining healthy population diversity using adaptive crossover, mutation, and selection. IEEE Trans Evolut Comput 2011;15(5):692–714.
[23] Sylvester EV, Bentzen P, Bradbury IR, Clément M, Pearce J, Horne J, et al. Applications of random forest feature selection for fine-scale genetic population assignment. Evolut Appl 2018;11(2):153–65.
[24] Naqvi SR, Akram T, Iqbal S, Haider SA, Kamran M, Muhammad N. A dynamically reconfigurable logic cell: from artificial neural networks to quantum-dot cellular automata. Appl Nanosci 2018;8(1–2):89–103.
[25] Hansen MC, Yalcin H, Hayes JP. Unveiling the ISCAS-85 benchmarks: a case study in reverse engineering. IEEE Design Test Comput 1999;16(3):72–80.

Syed Rameez Naqvi received his MS in electronic engineering from The University of Sheffield, UK, in 2007, and his Ph.D. in computer engineering from Vienna University of Technology, Austria, in 2013. He is currently with the Department of Electrical and Computer Engineering, COMSATS University Islamabad, Wah Campus, Pakistan. His research interests include asynchronous logic, computer architecture, hardware optimization, and artificial intelligence.

Anjum Zahid recently completed his MS in Electrical Engineering at the Department of Electrical and Computer Engineering, COMSATS University Islamabad, Wah Campus, Pakistan.

Lina Sawalha is with the Department of Electrical and Computer Engineering at Western Michigan University. Her research interests include computer architecture, and high-performance and energy-efficient computing. She received her MS and Ph.D. in Electrical & Computer Engineering from the University of Oklahoma in 2009 and 2012, respectively, and her BS in Computer Engineering from Jordan University of Science and Technology in 2006.

Syed Saud Naqvi received his Ph.D. degree in 2016 from the School of Engineering and Computer Science at Victoria University of Wellington, New Zealand. He is currently with the Department of Electrical and Computer Engineering at COMSATS University Islamabad, Pakistan. His research interests include computer vision and machine learning.

Tallha Akram received his MS in embedded systems and control engineering from Leicester University, UK, in 2008, and his Ph.D. in computer vision and pattern recognition from Chongqing University, China, in 2014. He is currently with the Electrical and Computer Engineering Department, COMSATS University Islamabad, Wah Campus, Pakistan. His research interests include computer vision, machine learning, artificial intelligence, and applied optimization.
Sajjad Ali Haider is with the Department of Electrical and Computer Engineering, COMSATS University Islamabad, Wah Campus. He completed his MS in Embedded Systems and Control Engineering from Leicester University, UK, in 2007, and his Ph.D. from Chongqing University, China, in 2014. His research interests include modeling and simulation, control systems, and machine learning.

Kumar Yelamarthi received his Ph.D. and MS from Wright State University in 2008 and 2004, respectively. He is currently a Professor of Electrical & Computer Engineering in the School of Engineering and Technology at Central Michigan University. His research interests are in the areas of edge/fog computing, embedded systems, and engineering education. He has published over 120 articles in archival journals and conference proceedings.

Maksim Jenihhin received his Ph.D. and MS from Tallinn University of Technology in 2008 and 2004, respectively. He is currently with the Faculty of Information Technology at Tallinn University of Technology. His research interests include reliability and security of nanoelectronic systems and Internet-of-Things & edge computing. Besides having over 115 publications to his name, he has provided administrative services to numerous international journals and conferences.