Efficient hardware architecture based on generalized Hebbian algorithm for texture classification


Neurocomputing 74 (2011) 3248–3256


Shiow-Jyu Lin a,b, Yi-Tsan Hung a, Wen-Jyi Hwang a,*

a Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 116, Taiwan
b Department of Electronic Engineering, National Ilan University, Yilan 260, Taiwan

* Corresponding author. Tel.: +886 2 7734 6670; fax: +886 2 2932 2378. E-mail address: [email protected] (W.-J. Hwang).

doi:10.1016/j.neucom.2011.05.010

Article history: Received 16 October 2010; Received in revised form 13 February 2011; Accepted 3 May 2011; Available online 12 June 2011. Communicated by R. Capobianco Guido.

Keywords: Generalized Hebbian algorithm; Principal component analysis; System-on-programmable-chip

Abstract

The objective of this paper is to present an efficient hardware architecture for the generalized Hebbian algorithm (GHA). In the architecture, the principal component computation and weight vector updating of the GHA operate in parallel, so that the throughput of the circuit can be significantly enhanced. In addition, the weight vector updating process is separated into a number of stages for lowering area costs and increasing computational speed. To show the effectiveness of the circuit, a texture classification system based on the proposed architecture is designed. It is embedded in a system-on-programmable-chip (SOPC) platform for physical performance measurement. Experimental results show that the proposed architecture is an efficient design for attaining both high speed performance and low area costs.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The goal of principal component analysis (PCA) [1] is to obtain a compact and accurate representation of the data that reduces or eliminates statistically redundant components. The PCA transform is also known as the Karhunen–Loève transform (KLT) [2] for pattern recognition, classification, or data compression. Although PCA-based algorithms are widely used [3], many of them are not suited for real-time applications because of the high computational complexity incurred by high-dimensional feature vectors. To accelerate the computation of PCA, a number of algorithms [4–7] have been proposed. However, because most of these algorithms are implemented in software, only moderate acceleration can be achieved. Although hardware implementation of PCA and its variants is possible, it usually requires large hardware resource consumption and complicated circuit control management. PCA hardware implementation may therefore be feasible only for small dimensions [8,9]. An alternative PCA implementation is based on the PCA neural network, which can be realized by the generalized Hebbian algorithm (GHA) [10,11]. Nevertheless, slow convergence of the GHA is usually observed. A large number of iterations is therefore required, resulting in long computational time for many GHA-based algorithms. Although analog GHA hardware architectures have been proposed [12,13] to reduce the computational time, these architectures are difficult to use directly in digital devices.

In this paper, a digital GHA hardware architecture for fast PCA is presented. Although the GHA requires a large amount of arithmetic computation, the proposed architecture is able to achieve fast training with low area cost. The proposed architecture can be divided into three parts: the synaptic weight updating (SWU) unit, the principal components computing (PCC) unit, and the memory unit. The memory unit is the on-chip memory storing the synaptic weight vectors. Based on the synaptic weight vectors stored in the memory unit, the PCC and SWU units compute the principal components and update the synaptic weight vectors, respectively. In the SWU unit, the long datapath for updating the synaptic weight vectors is separated into a number of stages. The results of preceding stages are reused by subsequent stages, which expedites training and lowers the area cost. Moreover, in the PCC unit, the input vectors may be separated into smaller segments for delivery over a data bus of limited width. The SWU and PCC units can also operate concurrently to further enhance the throughput.

To demonstrate the effectiveness of the proposed architecture, a texture classification system on a system-on-programmable-chip (SOPC) platform is constructed. The system consists of the proposed architecture, a softcore NIOS II processor [14], a DMA controller, and an SDRAM. The proposed architecture is adopted for finding the PCA transform by GHA training, where the training vectors are stored in the SDRAM. The DMA controller is used for the DMA delivery of the training vectors. The softcore processor is used only for coordinating the SOPC system; it does not participate in the GHA training process. As compared with its software counterpart running on a Pentium IV CPU, our system attains significantly lower computational time for large training sets. All these facts demonstrate the effectiveness of the proposed architecture.

Fig. 1. The neural model for the GHA.

2. Preliminaries

Fig. 1 shows the neural model for the GHA, where $x(n) = [x_1(n), \ldots, x_m(n)]^T$ and $y(n) = [y_1(n), \ldots, y_l(n)]^T$ are the input and output vectors of the GHA model, respectively. The output vector $y(n)$ is related to the input vector $x(n)$ by

$y_j(n) = \sum_{i=1}^{m} w_{ji}(n) x_i(n),$    (1)

where $w_{ji}(n)$ stands for the weight from the $i$th synapse to the $j$th neuron at time $n$. Each synaptic weight vector $w_j(n)$ is adapted by the Hebbian learning rule

$w_{ji}(n+1) = w_{ji}(n) + \eta \left[ y_j(n) x_i(n) - y_j(n) \sum_{k=1}^{j} w_{ki}(n) y_k(n) \right],$    (2)

where $\eta$ denotes the learning rate. After a large number of iterative computations and adaptations, $w_j(n)$ asymptotically approaches the eigenvector associated with the $j$th principal component $\lambda_j$ of the input vector, where $\lambda_1 > \lambda_2 > \cdots > \lambda_l$. To reduce the computational complexity of the implementation, Eq. (2) can be rewritten as

$w_{ji}(n+1) = w_{ji}(n) + \eta y_j(n) \left[ x_i(n) - \sum_{k=1}^{j} w_{ki}(n) y_k(n) \right].$    (3)

A more detailed discussion of the GHA can be found in [10,11].

3. The proposed GHA architecture

As shown in Fig. 2, the proposed GHA architecture consists of three functional units: the memory unit, the synaptic weight updating (SWU) unit, and the principal components computing (PCC) unit. The memory unit is used for storing the current synaptic weight vectors. Assume the current synaptic weight vectors $w_j(n)$, $j = 1, \ldots, l$, are stored in the memory unit and that the input vector $x(n)$ is available. Based on $x(n)$ and $w_j(n)$, $j = 1, \ldots, l$, the goal of the PCC unit is to compute the output vector $y(n)$. Using $x(n)$, $y(n)$ and $w_j(n)$, $j = 1, \ldots, l$, the SWU unit produces the new synaptic weight vectors $w_j(n+1)$, $j = 1, \ldots, l$. It can be observed from Fig. 2 that the new synaptic weight vectors are stored back to the memory unit for subsequent training.

Fig. 2. The proposed GHA architecture.

3.1. Definition of area complexities and latency of the GHA architecture

Two types of performance are considered in this paper: area complexity and latency. Because adders, multipliers and registers are the basic building blocks of the GHA architecture, the area complexities are separated into three categories: the number of adders, the number of multipliers and the number of registers. Given the current synaptic weight vectors $w_j(n)$, $j = 1, \ldots, l$, the latency of the proposed GHA architecture is defined as the time required to produce the new synaptic weight vectors $w_j(n+1)$, $j = 1, \ldots, l$. The latency of the PCC unit is defined as the time required to compute $y(n)$ given the current synaptic weight vectors $w_j(n)$, $j = 1, \ldots, l$. Finally, the latency of the SWU unit is the time required to compute $w_j(n+1)$, $j = 1, \ldots, l$, given $y(n)$, $x(n)$ and $w_j(n)$, $j = 1, \ldots, l$.

3.2. SWU unit

The design of the SWU unit is based on Eq. (3). Although a direct implementation of Eq. (3) is possible, it would consume large hardware resources. To elaborate on this fact, first observe from Eq. (3) that the computations of $w_{ji}(n+1)$ and $w_{ri}(n+1)$ share the same term $\sum_{k=1}^{r} w_{ki}(n) y_k(n)$ when $r \le j$. Consequently, implementing $w_{ji}(n+1)$ and $w_{ri}(n+1)$ independently in hardware using Eq. (3) would result in a large hardware resource overhead. To reduce the resource consumption, we first define a vector $z_{ji}(n)$ as

$z_{ji}(n) = x_i(n) - \sum_{k=1}^{j} w_{ki}(n) y_k(n), \quad j = 1, \ldots, l,$    (4)

and $z_j(n) = [z_{j1}(n), \ldots, z_{jm}(n)]^T$. Integrating Eqs. (3) and (4), we obtain

$w_{ji}(n+1) = w_{ji}(n) + \eta y_j(n) z_{ji}(n),$    (5)

where $z_{ji}(n)$ can be obtained from $z_{(j-1)i}(n)$ by

$z_{ji}(n) = z_{(j-1)i}(n) - w_{ji}(n) y_j(n), \quad j = 2, \ldots, l.$    (6)

When $j = 1$, from Eqs. (4) and (6), it follows that

$z_{0i}(n) = x_i(n).$    (7)
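To make the staged recursion of Eqs. (4)-(7) concrete, the following Python fragment is offered as a purely illustrative software reference model (it is not part of the original design, which is implemented in hardware). It performs one GHA training step through the recursion and checks that the result matches the direct update of Eq. (3). Each loop iteration corresponds to one stage of the SWU unit, with z carrying the shared partial sum so that stage j+1 does not recompute it.

```python
import numpy as np

def gha_update(W, x, eta):
    """One GHA training step using the staged recursion of Eqs. (4)-(7).

    W   : (l, m) array; row j holds the synaptic weight vector w_j(n)
    x   : (m,) input vector x(n)
    eta : learning rate
    Returns the updated weights W(n+1) and the outputs y(n).
    """
    y = W @ x                    # Eq. (1): y_j(n) = sum_i w_ji(n) x_i(n)
    W_new = W.copy()
    z = x.copy()                 # Eq. (7): z_0i(n) = x_i(n)
    for j in range(W.shape[0]):  # stage j of the SWU unit
        z = z - y[j] * W[j]               # Eq. (6): z_ji(n) = z_(j-1)i(n) - w_ji(n) y_j(n)
        W_new[j] = W[j] + eta * y[j] * z  # Eq. (5): w_ji(n+1) = w_ji(n) + eta y_j(n) z_ji(n)
    return W_new, y

# Consistency check against the direct update of Eq. (3).
rng = np.random.default_rng(0)
m, l, eta = 16, 4, 0.01
W = rng.normal(size=(l, m))
x = rng.normal(size=m)
W_staged, y = gha_update(W, x, eta)
W_direct = W + eta * y[:, None] * (x[None, :] - np.tril(np.ones((l, l))) @ (y[:, None] * W))
assert np.allclose(W_staged, W_direct)
```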

The architecture for implementing the SWU unit, depicted in Fig. 3, is derived from Eqs. (5) and (6). It consists of $l$ stages, where each stage $j$ produces $w_j(n+1)$ and $z_j(n)$. Note that the $z_j(n)$ produced by stage $j$ will be used at subsequent stages for lowering the area


cost of those stages. There are $m$ modules at each stage, as shown in Fig. 4. Each module $ji$ produces $w_{ji}(n+1)$ and $z_{ji}(n)$. Fig. 5 shows the architecture of each module $ji$, which consists of adders, multipliers and registers. The computation at each module $ji$ can be separated further into two sub-stages: the sub-stage for computing $z_{ji}(n)$ and the sub-stage for computing $w_{ji}(n+1)$. As shown in Fig. 6, given an index $i$, the operations at the second sub-stage for computing $w_{ji}(n+1)$ in module $ji$ can be overlapped with those at the first sub-stage for computing $z_{(j+1)i}(n)$ in module $(j+1)i$, where $j = 1, \ldots, l-1$. The parallel architecture therefore requires only $l+1$ clocks for computing the new synaptic weight vectors $w_j(n+1)$, $j = 1, \ldots, l$. The latency of the SWU unit is then $l+1$. Because all the modules in the SWU unit implement Eqs. (5) and (6), they have the same architecture, regardless of $j$ and $i$. Since each stage contains $m$ modules, and there are $l$ stages as shown in Fig. 3, the SWU unit consists of $m \times l$ modules. The adder, multiplier and register area complexities of the SWU unit are therefore $O(ml)$, growing linearly with $m$ and $l$.

Fig. 3. The architecture of the SWU unit.

Fig. 4. The architecture of stage $j$ of the SWU unit.

Fig. 5. The architecture of the module $ji$.

Fig. 6. The parallel operations of the SWU architecture.

3.3. PCC unit

Next we consider the design of the PCC unit. A simple implementation of the PCC unit is based on Eq. (1), as shown in Fig. 7. As $m$ becomes large, it may be difficult to obtain $x_1(n), \ldots, x_m(n)$ at the same time for computing $y_j(n)$ because of the width limitation of the data bus. One way to solve this problem is to separate $x_1(n), \ldots, x_m(n)$ into $q$ segments, where each segment contains $b$ elements (pixels), and

$q = m/b.$    (8)

The PCC unit then fetches one segment at a time for the computation of $y_j(n)$. The operation of the PCC unit can then be described as follows:

$y_j(n) = \sum_{i=1}^{m} w_{ji}(n) x_i(n) = \sum_{p=0}^{q-1} \sum_{d=1}^{b} w_{j,d+bp}(n) x_{d+bp}(n).$    (9)
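As a quick software cross-check (an illustrative sketch, not taken from the paper), the segmented form of Eq. (9) can be verified to reproduce the full inner product of Eq. (1); the loop below accumulates one b-element segment per iteration, mirroring the q = m/b bus transfers assumed by Eq. (8).

```python
import numpy as np

def pcc_segmented(w_j, x, b):
    """Compute y_j(n) by accumulating q = m/b segments, as in Eq. (9)."""
    m = len(x)
    assert m % b == 0, "Eq. (8) assumes b divides m"
    acc = 0.0
    for p in range(m // b):                 # one segment per clock cycle
        seg = slice(p * b, (p + 1) * b)
        acc += np.dot(w_j[seg], x[seg])     # inner sum over d = 1, ..., b
    return acc

rng = np.random.default_rng(1)
m, b = 16, 4
w_j, x = rng.normal(size=m), rng.normal(size=m)
assert np.isclose(pcc_segmented(w_j, x, b), np.dot(w_j, x))  # matches Eq. (1)
```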


Based on Eq. (9), a modified PCC architecture, termed the PCC-I architecture, was presented in [15]. As shown in Fig. 8, the architecture consists, for each $y_j(n)$, $j = 1, \ldots, l$, of $b$ multipliers, one adder, and one accumulator, where each accumulator consists of one adder and one register. Therefore, the adder, multiplier and register area complexities of PCC-I are $O(l)$, $O(bl)$ and $O(l)$, respectively. It takes $q$ clock cycles to complete the computation of $y_j(n)$ in Eq. (1). The latency of the PCC-I architecture is then $q = m/b$ clocks.

It can be observed from Fig. 8 that each segment of $x(n)$ is used for the computation of $y(n)$ immediately when that segment enters the PCC unit. Therefore, all the synaptic weight vectors must be available at that time. This implies that the PCC and SWU units cannot operate concurrently, as shown in Fig. 9. We can then see from Fig. 9 that the latency of the GHA based on PCC-I is $q + l + 1$ clock cycles.

To allow concurrent operation of the PCC and SWU units for reducing the latency, a novel PCC architecture, termed the PCC-II architecture, is proposed. It removes the accumulator and employs a $q$-stage shift register for the collection of the segments of an input vector $x(n)$, as shown in Fig. 10. The segments enter the shift register one at a time, and are not used for the computation of $y(n)$ until all the segments have been received. In this way, the SWU unit is allowed to update the synaptic weight vectors while the PCC unit is fetching segments of $x(n)$. The latency can therefore be significantly reduced. Fig. 11 shows the corresponding timing diagram for the PCC-II architecture. Because the PCC and SWU units now operate concurrently, it can be observed from Fig. 11 that the latency of the GHA based on PCC-II is $\max(q+1, l+1)$.

The shift register in PCC-II should contain the entire $x(n)$ and therefore consists of $m$ cells. Moreover, when $x(n)$ and all the synaptic weight vectors $w_j(n)$, $j = 1, \ldots, l$, are available, the computation of $y(n)$ is completed in one clock cycle. Therefore, $m \times l$ multipliers are required. The adder, multiplier and register area complexities of PCC-II are $O(l)$, $O(ml)$ and $O(m+l)$, respectively.

3.4. Memory unit

Fig. 7. The basic architecture of PCC unit.

The goal of the memory unit is to store the synaptic weight vectors produced by the SWU unit, and then broadcast them to the PCC and SWU units for subsequent training operations. Fig. 12 depicts the architecture of the memory unit. As shown in the figure, there are $l$ registers in the memory unit, where each register consists of $m$ cells and holds one synaptic weight vector. The register area complexity of the memory unit is then $O(ml)$.

3.5. Analysis of area complexities and latency of the proposed GHA architecture

Fig. 8. The PCC-I architecture.

Table 1 summarizes the area complexity and speed performance of the proposed architecture, as discussed in the previous subsections. It can be observed from Table 1 that all area complexities of the PCC and SWU units grow linearly with $l$. Moreover, all the area complexities of the SWU unit grow linearly with $m$. The register area complexity of the memory unit is also $O(ml)$. Therefore, the hardware resource consumption of the proposed GHA architecture increases linearly with the vector dimension $m$ and the number of principal components $l$. In addition, because $b < m$, the PCC-II unit has a higher multiplier area complexity than the PCC-I unit. Nevertheless, PCC-II can operate in parallel with the SWU unit. Therefore, the operation of the PCC-II unit with the SWU unit has a smaller latency than the operation of the PCC-I unit with the SWU unit, as shown in Table 1.
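As a back-of-the-envelope illustration of the latency expressions in Table 1 (a sketch based only on the formulas above, not a cycle-accurate model), the helper below evaluates the per-vector latency of both designs. The sample values m = 16 and b = 4 correspond to the experimental setting of Section 4.

```python
def latency_pcc1(m, b, l):
    """GHA with PCC-I: sequential PCC and SWU, q + l + 1 clocks (Table 1)."""
    return m // b + l + 1

def latency_pcc2(m, b, l):
    """GHA with PCC-II: concurrent PCC and SWU, max(q + 1, l + 1) clocks (Table 1)."""
    return max(m // b + 1, l + 1)

m, b, l = 16, 4, 4
print(latency_pcc1(m, b, l))  # 9 clocks per training vector
print(latency_pcc2(m, b, l))  # 5 clocks per training vector
```

For this setting the concurrent design roughly halves the per-vector latency, and the advantage accumulates over many training vectors and epochs, which is consistent with the CPU-time gap reported in Section 4.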

Fig. 9. The timing diagram of PCC and SWU units based on PCC-I architecture.


Fig. 10. The PCC-II architecture.

3.6. SOPC-based GHA training system

The proposed architecture is used as custom user logic in an SOPC system consisting of a softcore NIOS CPU [14], a DMA controller, and an SDRAM, as depicted in Fig. 13. All training vectors are stored in the SDRAM and then transported to the proposed circuit via the DMA bus. The softcore NIOS CPU runs simple software to coordinate the different components of the SOPC, including the proposed custom circuit. The proposed circuit operates as a hardware accelerator for GHA training. The resulting SOPC system is able to perform efficient on-chip training for GHA-based applications.

4. Experimental results

This section presents experimental results of the proposed architecture applied to texture classification. Fig. 14 shows the five textures considered in this paper. The dimension of the training vectors is $4 \times 4$ (i.e., $m = 16$). There are 32 000 training vectors in the training set. Because of the data bus limitation, four pixels of each training vector are conveyed into the proposed circuit at a time (i.e., $b = 4$). The target FPGA device for all the experiments in this paper is the Altera Cyclone III [16].

Table 2 shows the hardware resource consumption of the NIOS-based SOPC system with the proposed GHA architectures embedded as custom user logic. Two types of GHA architectures are considered: the GHA with the PCC-I architecture [15] and the GHA with the PCC-II architecture. Three different area costs are reported in the table: logic elements (LEs), embedded memory bits, and embedded multipliers. The LEs are used for the implementation of the adders and registers in the proposed GHA architecture. The embedded memory bits are mainly used for implementing the on-chip memory of the SOPC system. Both the LEs and embedded memory bits are also used for the implementation of the NIOS CPU of the SOPC system. The embedded multipliers are used for the implementation of the multipliers of the proposed GHA architecture.

It can be observed from Table 2 that the area costs grow linearly with $l$, which is consistent with the analytical results shown in Table 1. In addition, given the same $l$, the GHA with the PCC-II architecture consumes slightly more hardware resources than the GHA with the PCC-I architecture. However, because both GHA architectures share the same SWU and memory units, the difference in hardware resource consumption is small.


Fig. 11. The timing diagram of PCC and SWU units based on the PCC-II architecture: (a) $q < l+1$ and (b) $q > l+1$.

Fig. 12. The architecture of memory unit.

Table 1
Area complexity and speed performance of the proposed GHA architecture.

                          Add/sub   Multipliers   Registers   Latency
PCC-I                     O(l)      O(bl)         O(l)        m/b
PCC-II                    O(l)      O(ml)         O(m+l)      m/b + 1
SWU                       O(ml)     O(ml)         O(ml)       l + 1
Memory unit               O(1)      O(1)          O(ml)       -
GHA arch. with PCC-I      O(ml)     O(ml)         O(ml)       m/b + l + 1
GHA arch. with PCC-II     O(ml)     O(ml)         O(ml)       max(m/b + 1, l + 1)

To demonstrate the effectiveness of the proposed architecture, the principal-component-based k-nearest-neighbor (PC-kNN) rule is adopted. Two steps are involved in the PC-kNN rule. In the first step, the GHA is applied to the input vectors to transform the m-dimensional data into l principal components. The synaptic weight vectors obtained after the convergence of the GHA training are adopted to span the linear transformation matrix. In the second step, the kNN method is applied in the principal subspace for texture classification.

Fig. 15 shows the CPU time of the NIOS-based SOPC system for various numbers of training epochs with $l = 4$. It can be observed from the figure that the GHA with the PCC-II architecture has a lower CPU time than the GHA with the PCC-I architecture. In addition, the gap in CPU time widens as the number of epochs increases. The GHA with the PCC-II architecture has superior speed performance because of the parallel operation of the PCC and SWU units. By contrast, the GHA with the PCC-I architecture requires sequential operation of the PCC and SWU units, which results in a longer latency, as shown in Table 1.

The CPU time of the software counterpart of our hardware GHA with PCC-II implementation is depicted in Fig. 16. The software training is based on a general-purpose 3.0-GHz Pentium IV CPU. It can be clearly observed from Fig. 16 that the proposed hardware architecture attains a high speedup over its software counterpart. In particular, when the number of training epochs reaches 1800, the CPU time of the proposed SOPC system is 986.73 ms, whereas the CPU time of the Pentium IV is 31 849.13 ms. The speedup of the proposed architecture over its software counterpart is therefore 32.28.

Table 3 compares the area complexity and speed performance of the proposed architecture with the existing circuit of [17]. It can be observed from Table 3 that the proposed architecture has a lower latency because it is able to perform the principal component computation and the synaptic weight updating concurrently. The architecture in [17] is based on sequential operations, so that its area complexity can be reduced. However, the sequential


operations may result in long computational time when the vector dimension m and/or the number of principal components l are large.

Fig. 13. The SOPC system for implementing GHA.

Fig. 14. Textures considered in the experiments.

The classification success rates of the NIOS-based SOPC system for various numbers of principal components are shown in Fig. 17. Define $\mathcal{T}$ as the set of test vectors for the classification success rate measurement, and let $t$ be the total number of test vectors in $\mathcal{T}$. For any test vector $x \in \mathcal{T}$, let $R(x)$ be the actual texture to which $x$ belongs, and let $C(x)$ be the texture assigned to $x$ by the proposed classifier. If $R(x) = C(x)$, then the classification of $x$ is successful. Let $I(x) = 1$ when $R(x) = C(x)$, and $I(x) = 0$ otherwise. The classification success rate is then defined as

$CR = \frac{\sum_{x \in \mathcal{T}} I(x)}{t}.$    (10)

That is, the classification success rate is the number of test vectors that are successfully classified divided by the total number of test vectors. We can see from Fig. 17 that the classification success rate improves as $l$ increases. At $l = 7$, the classification success rate attains 89.82%. All these facts demonstrate the effectiveness of the proposed architecture.
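For completeness, the PC-kNN rule and the success-rate measure of Eq. (10) can be rendered as a short software sketch. This is an illustrative reconstruction rather than the SOPC implementation; in particular, the Euclidean distance and the majority vote are assumptions, since the paper does not spell out the kNN details.

```python
import numpy as np

def pc_knn_success_rate(W, train_x, train_labels, test_x, test_labels, k=1):
    """PC-kNN classification and the success rate CR of Eq. (10).

    W                         : (l, m) converged GHA synaptic weight vectors (PCA transform)
    train_x, test_x           : (N, m) arrays of texture feature vectors
    train_labels, test_labels : texture indices R(x) for each vector
    """
    train_labels = np.asarray(train_labels)
    train_pc = train_x @ W.T            # step 1: project onto the l principal components
    test_pc = test_x @ W.T
    correct = 0
    for v, label in zip(test_pc, test_labels):
        d = np.linalg.norm(train_pc - v, axis=1)        # step 2: kNN in the principal subspace
        nearest = train_labels[np.argsort(d)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        if values[np.argmax(counts)] == label:          # I(x) = 1 when C(x) = R(x)
            correct += 1
    return correct / len(test_labels)                   # Eq. (10): CR = sum I(x) / t
```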


Table 2
Hardware resource consumption of the SOPC system using the proposed GHA architectures as custom user logic.

     GHA with PCC-I [15]                               GHA with PCC-II
l    LEs      Memory bits   Embedded multipliers       LEs      Memory bits   Embedded multipliers
3    33 534   2 215 216     208                        33 723   1 600 944     244
4    41 204   2 420 016     276                        41 510   1 805 744     324
5    49 012   2 624 816     344                        49 487   2 010 544     404
6    57 004   2 829 616     412                        57 457   2 215 344     484
7    64 805   3 034 416     480                        65 384   2 420 144     564


Table 3
Area complexity and speed performance of various GHA architectures.

                                  Add/sub   Multipliers   Registers   Latency
Proposed GHA arch. with PCC-II    O(ml)     O(ml)         O(ml)       max(m/b + 1, l + 1)
GHA arch. in [17]                 O(l)      O(l)          O(ml)       2m + l - 1

Fig. 15. The CPU time of the proposed hardware architectures for different numbers of training epochs.

Fig. 17. The classification success rates of the NIOS-based SOPC system for various numbers of principal components.

Fig. 16. The CPU time of the proposed GHA with PCC-II and its software counterpart for different numbers of training epochs.

5. Concluding remarks

Experimental results reveal that the proposed GHA architecture with PCC-II has superior speed performance over its hardware and software counterparts. Specifically, the speedup of the architecture over its software counterpart running on a 3.0-GHz Pentium IV is 32.28. In addition, the architecture attains a classification success rate of nearly 90% for texture classification when $l = 7$. The architecture also has a low area cost for implementing the SWU unit, because an efficient architecture separates the weight vector updating process into $l$ stages. The classification success rate may be further improved by extending the vector dimension $m$ and/or increasing the number of principal components $l$, at the expense of larger area costs. Alternatively, when only limited hardware resources are available, the proposed architecture is able to perform real-time texture classification with best-effort classification accuracy.

References

[1] I.T. Jolliffe, Principal Component Analysis, second ed., Springer, Berlin, 2002.
[2] J. Karhunen, J. Joutsensalo, Generalization of principal component analysis, optimization problems, and neural networks, Neural Networks 8 (4) (1995) 549–562.


[3] K.I. Kim, M.O. Franz, B. Scholkopf, Iterative kernel principal component analysis for image modeling, IEEE Trans. Pattern Anal. Mach. Intell. 27 (9) (2005) 1351–1366.
[4] C.T. Lin, S.A. Chen, C.H. Huang, J.F. Chung, Cellular neural networks and PCA neural networks based rotation/scale invariant texture classification, in: Proceedings of International Joint Conference on Neural Networks, 2004, pp. 153–158.
[5] S. Gunter, N.N. Schraudolph, S.V.N. Vishwanathan, Fast iterative kernel principal component analysis, J. Mach. Learn. Res. 8 (2007) 1893–1918.
[6] I. Sajid, M.M. Ahmed, I. Taj, Design and implementation of a face recognition system using fast PCA, in: Proceedings of IEEE International Symposium on Computer Science and its Applications, 2008, pp. 126–130.
[7] A. Sharma, K.K. Paliwal, Fast principal component analysis using fixed-point algorithm, Pattern Recognition Lett. (2007) 1151–1155.
[8] W. Boonkumklao, Y. Miyanaga, K. Dejhan, Flexible PCA architecture realized on FPGA, in: Proceedings of International Symposium on Communications and Information Technologies, 2001, pp. 590–593.
[9] D. Chen, J.Q. Han, An FPGA-based face recognition using combined 5/3 DWT with PCA methods, J. Commun. Comput. 6 (10) (2009) 1–8.
[10] S. Haykin, Neural Networks and Learning Machines, third ed., Pearson, 2009.
[11] T.D. Sanger, Optimal unsupervised learning in a single-layer linear feedforward neural network, Neural Networks 2 (1989) 459–473.
[12] G. Carvajal, W. Valenzuela, M. Figueroa, Subspace-based face recognition in analog VLSI, in: Advances in Neural Information Processing Systems, vol. 20, MIT Press, Cambridge, MA, 2007, pp. 225–232.
[13] G. Carvajal, W. Valenzuela, M. Figueroa, Image recognition in analog VLSI with on-chip learning, in: Lecture Notes in Computer Science (ICANN 2009), vol. 5768, 2009, pp. 428–438.
[14] NIOS II Processor Reference Handbook, ver. 10.0, Altera Corporation, <http://www.altera.com/literature/lit-nio2.jsp>, 2010.
[15] S.J. Lin, Y.T. Hung, W.J. Hwang, Fast principal component analysis based on hardware architecture of generalized Hebbian algorithm, in: Lecture Notes in Computer Science (ISICA 2010), vol. 6382, 2010, pp. 505–515.
[16] Cyclone III Device Handbook, Altera Corporation, <http://www.altera.com/products/devices/cyclone3/cy3-index.jsp>, 2010.
[17] A.R. Mohan, N. Sudha, P.K. Meher, An embedded face recognition system on a VLSI array architecture and its FPGA implementation, in: Proceedings of IEEE Conference on Industrial Electronics, 2008, pp. 2432–2437.

Shiow-Jyu Lin received her B.Eng. degree in Control Engineering and M.S. degree in the college of EECS from National Chiao Tung University, Hsinchu, Taiwan. Since 1992, she has been with the faculty of the Department of Electronics Engineering at National Ilan University, Yilan, Taiwan. She is now working towards her Ph.D. degree in the Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei, Taiwan.

Yi-Tsan Hung obtained his B.Mgmt. degree in Information Management at Chung Shan Medical University, Taichung, Taiwan, in 2008. He is now working towards his M.S. degree in the Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei, Taiwan. His research interests include system-on-chip, VLSI design, and FPGA fast prototyping.

Wen-Jyi Hwang received his diploma in Electronics Engineering from National Taipei Institute of Technology, Taiwan, in 1987, and M.S.E.C.E. and Ph.D. degrees from the University of Massachusetts at Amherst in 1990 and 1993, respectively. From September 1993 until January 2003, he was with the Department of Electrical Engineering, Chung Yuan Christian University, Taiwan. In February 2003, he joined the Department of Computer Science and Information Engineering, National Taiwan Normal University, where he is now a Full Professor and the chairman of the department. Dr. Hwang is the recipient of the 2000 Outstanding Research Professor Award from Chung Yuan Christian University, the 2002 Outstanding Young Researcher Award from the Asia-Pacific Board of the IEEE Communication Society, and the 2002 Outstanding Young Electrical Engineer Award from the Chinese Institute of Electrical Engineering.