Fault detection and diagnosis based on C4.5 decision tree algorithm for grid connected PV system


Solar Energy 173 (2018) 610–634

Contents lists available at ScienceDirect

Solar Energy journal homepage: www.elsevier.com/locate/solener

Fault detection and diagnosis based on C4.5 decision tree algorithm for grid connected PV system Rabah Benkercha, Samir Moulahoum




Research Laboratory of Electrical Engineering and Automatic LREA, University of Médéa, Algeria

ABSTRACT

This paper proposes a new approach, based on a decision tree algorithm, to detect and diagnose faults in a grid connected photovoltaic system (GCPVS). A nonparametric model is learned to predict the state of the GCPVS from a data set collected by an acquisition system under several weather conditions. Three numerical attributes and two targets form the final data set. The attributes are the ambient temperature, the irradiation and the power ratio calculated from the measured and estimated power. The first target, used for detection, is either the healthy or the faulty state; the second, used for diagnosis, takes one of four class labels: free fault, string fault, short circuit fault or line-line fault. The Sandia model is applied to estimate the power generated by the GCPVS operating in the healthy state. The data set was divided into two parts: 66% was used for learning and the remainder for testing. Subsequently, new data were recorded over five days in order to evaluate the robustness, effectiveness and efficiency of both models. The test results indicate that the detection model has a high prediction accuracy, while the diagnosis model has an accuracy of 99.80%. Moreover, the evaluation on the five additional days confirms the prediction efficiency, with high accuracy for both detection and diagnosis and a correct classification rate of 99%.

1. Introduction

In recent years, photovoltaic (PV) system technologies have developed rapidly and have had a significant impact on electrical co-generation systems, where PV is considered one of the most important green energy sources. Currently, many works address cells (or modules) and PV systems to improve the electrical performance of photovoltaic panels and reduce energy losses in PV installations (Peled and Appelbaum, 2017). The most relevant PV system is the grid connected PV system (GCPVS), which is composed of three main parts: the photovoltaic array, the inverter and the grid. Several contributions seek to improve GCPVS performance by enhancing maximum power point tracking (Benkercha et al., 2017; Blaifi et al., 2018). However, during its operation the GCPVS can be subjected to different faults and anomalies that lead to decreased performance or total malfunction of the system. In most cases, a fault becomes a risk and reduces the productivity of the installation, in addition to the maintenance cost of restoring the system to normal conditions (Villarini et al., 2017). To reduce the maintenance cost and increase the system availability and productivity, early fault detection and diagnosis are required. Faults can be classified according to different aspects, in particular: the component which presents the defect, the causes, the effects and the system response to this defect. To perform fault classification, it is useful to divide the GCPVS into three sides: the AC side, the DC side and the inverter. Although the DC side has the lowest number of failures, it is the most complicated side in which to analyze and locate the origin of a defect, owing to the number of components involved and the nonlinear PV module characteristics. Distinguishing between the variety of faults on the DC side can be difficult. For this reason, several approaches have been proposed in the literature to find and distinguish the faults occurring on this side. These approaches fall into two categories: the first is based on static outdoor I-V measurement, while the second uses the current and the voltage at the maximum power point.

Under outdoor conditions, the I-V characteristic of the PV array is approximated by a nonlinear function. Several functions are used in the literature to simulate the I-V curve, such as the single or double diode model (Benkercha et al., 2016); the single diode model (SDM) has five unknown parameters. An optimization process is applied to extract the optimal parameter values, leading to a minimum error between the model output and the measured I-V curve; this model can help in fault diagnosis. Investigations and field inspections of PV systems based on I-V or P-V characteristics are elaborated in (Bonsignore et al., 2014; Ding et al., 2012), where a hybrid Neuro-Fuzzy method is used to estimate the six involved PV parameters, and the system status is then determined by comparing and evaluating the norms

Corresponding author. E-mail address: [email protected] (S. Moulahoum).

https://doi.org/10.1016/j.solener.2018.07.089 Received 30 April 2018; Received in revised form 10 July 2018; Accepted 28 July 2018 0038-092X/ © 2018 Elsevier Ltd. All rights reserved.


Nomenclature

PV          photovoltaic
OPD         overcurrent protection devices
MPP         maximum power point
MPPT        maximum power point tracker
GCPVS       grid connected PV system
FPA         flower pollination algorithm
Imp         current at MPP
Imp_STC     current at STC
Vmp         voltage at MPP
Vmp_STC     voltage at STC
Pmp         power at MPP
STC         standard test condition
G           irradiation
E0          nominal irradiation
Ee          effective irradiation
T           ambient temperature
Tc          cell temperature
TSTC        temperature at STC
TNOCT       nominal operating cell temperature
δ(Tc)       thermal voltage
αmp         temperature coefficient of Imp
βmp         temperature coefficient of Vmp
n           ideality factor
K           constant of Boltzmann
Ns          number of PV modules in series
Np          number of PV modules in parallel
DT          decision tree
C           the set classes
D           the set instances
Info        the entropy or the information
Xi          an attribute
Gain        the information gain
G_ratio     the gain ratio
Sj          a splitting point
a           a value of an attribute
REP         reduced-error pruning
N           number of instances reach a node or a leaf
E           number of instances misclassified
f           the observed error
e           the predicted error rate
z           the threshold confidence
c           the confidence factor

impact on these indicators, the status of the system is determined. Moreover, a detailed procedure is presented in (Silvestre et al., 2013): the simulated and measured yields are compared, and the analysis of the deviation errors in the current and the voltage defines the system status as either healthy or faulty and specifies the fault type. A monitoring system implemented in LabVIEW (Chouder et al., 2013) performs sophisticated modeling and supervision of the GCPVS; the goal of this monitoring system is to support and integrate the diagnosis approach. In addition, (Silvestre et al., 2014) describes an enhanced procedure based on revised indicators, namely the current and the voltage, with threshold values defined for each indicator. These thresholds are highly correlated with the system structure, and the fault is identified by comparing the calculated indicators with the thresholds. Another contribution proposes an automatic fault detection and localization strategy for PV plants, whose main idea is the deviation between the observed string current and the whole string current data. The deviation is evaluated by a modified local outlier factor algorithm, which assesses the fault degree and leads to fault determination and localization. In (Zhao et al., 2012), a decision model built by the decision tree learning algorithm is used to deduce the fault label. The learning set is obtained by a recording system incorporated in the PV array, and the operating data are saved with and without faults for two clear days. The test results reported in that paper indicate that the model is small and has a high classification accuracy. Further, the tree can be converted into decision rules based on if/else instructions. In Zhao et al. (2013a, 2013b), the purpose is to construct outlier rules: the 3-Sigma rule, the Hampel identifier and the Boxplot rule are proposed for fault detection without using meteorological data. Zhao et al. (2013a, 2013b) also focus the analysis on the line-line fault in various situations, in order to protect the PV array from damage and to safeguard operators from risk, where the fault source is a short circuit or a double ground fault. The paper also includes the effect of the overcurrent protection devices (OPDs) during the fault; the challenge to OPDs is examined under two different conditions: a fault arising at low irradiance, and a fault occurring at night that persists through the transition from night to day.

parameters with threshold values for the normal and faulty cases. In (Chine et al., 2016), an artificial neural network (ANN) is trained to isolate and identify the fault type from attributes calculated from the simulated and measured I-V curves of the PV string. The powers obtained from the simulation and the measurement are compared against a defined threshold for fault detection, whereas the ANN model outputs are used for fault isolation and implemented on an FPGA. Besides, (Chen et al., 2017) reach the diagnosis goal in three steps: the first step finds the key points and parameter values; the next step establishes an emerging kernel model using an extreme learning approach, with the Nelder-Mead Simplex algorithm used to search for the optimal parameters of the single diode model; in the final step, a simulation model is used to record fault data, where partial shading, open circuit, degradation and short circuit are treated. Das et al. (2018) studied popular degradation faults, such as open and short circuits, with the aim of identifying, locating and distinguishing between these faults under non-uniform distributions of both irradiance and temperature. An improved coded genetic algorithm, a metaheuristic optimization algorithm, is applied to predict most of the fault patterns and to estimate the power matching the measured power. In practice, fault diagnosis methods based on the I-V curve involve several difficulties: an optimization process is needed to find the I-V curve parameters for a healthy system, and this becomes harder for a faulty system. Hence, pattern techniques for distinguishing between faulty states are very complicated, and only a limited number of faults can be studied through I-V characteristics because the data obtained from faulty operation are insufficient. Owing to these complexities, other contributions use the MPP of the I-V curve in a dynamic method.

Some methods proposed in the literature classify the system state based on the dynamic variation of the PV system outputs such as the current, the voltage or the power (Solórzano and Egido, 2013). A PV array parametric model is simulated for the healthy state under different weather conditions (Roumpakias and Stamatelos, 2017), and this model is used to assess the gap between measured and simulated variables. Furthermore, many contributions use the monitoring system of PV installations to ease fault diagnosis (Triki-Lahiani et al., 2017). (Chouder and Silvestre, 2010) developed an approach for fault detection in the GCPVS through the analysis of power losses; four main indicators are adopted, namely the current ratio, the voltage ratio, and the thermal and miscellaneous capture losses. Depending on the fault


behaviors and is suitable for any PV technology, as well as for large numbers of associated PV arrays (De Soto et al., 2006). Moreover, owing to its simplicity, this model can predict the PV module performance during operation. Imp, Vmp and Pmp are expressed in Eqs. (1)–(5).

Among the undesirable features of the aforementioned contributions are poor reliability, high cost, or the complexity of building, using and implementing the model under real conditions. Other factors must be taken into consideration, such as prediction time, robustness and model generalization. In the present work, a new diagnosis model is constructed with the C4.5 algorithm (Quinlan, 1996), one of the most popular machine learning algorithms for classification problems. A data set is necessary for the learning process that builds the decision tree. Therefore, an acquisition system was built to record and store the climatic variables (temperature and irradiation) and the electrical variables, specifically the current, the voltage and the power at the MPP. Three attributes are then selected: the ambient temperature, the irradiation and a third attribute, called the power ratio, which is calculated from the power estimated by the Sandia model and the measured power produced by the GCPVS. The Sandia model is an empirical relationship used to estimate the power generated by the healthy system at the MPP from standard test condition (STC) data (Kratochvil et al., 2004). This model has unknown parameters, which are identified by the Flower Pollination Algorithm (FPA) to find the optimal parameter values corresponding to the minimum root mean square error between the Sandia estimate and the measured power. Each fault causes energy losses and decreases production performance; therefore, the power ratio is highly correlated with the system state. In each data instance, a nominal attribute called the target is defined as the class label, with the purpose of predicting these faults accurately. The building phase consists of two main steps. Firstly, a splitting criterion is applied to choose the best split attribute; the tree grows progressively as this procedure is applied recursively until all instances are classified or one of the stopping criteria is met. Secondly, once the tree model is obtained, a pruning process removes unnecessary sub-trees to avoid overfitting, which also decreases the model complexity by reducing the tree size. The decision tree algorithm has several advantages: it operates without hypotheses on the system, and it can handle numerical attributes with large data sizes. The obtained model is easy to implement and can be read by non-expert operators. Generalization, effectiveness, robustness and resistance to noisy data are among the prediction model features that led to the choice of this algorithm. The paper is organized as follows: the current section introduces the background of related contributions. The second section explains the PV array modeling by the Sandia model. The third section describes the GCPVS and the studied faults. The fourth section details the different aspects of the proposed algorithm, the experimental results are shown in the fifth section, and the conclusion is outlined in the last section.

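$$I_{mp} = I_{mp\_STC}\,\left(C_0 E_e + C_1 E_e^{2}\right)\left(1 + \alpha_{mp}\,(T_c - T_{STC})\right) \tag{1}$$

$$V_{mp} = V_{mp\_STC} + C_2 N_s\,\delta(T_c)\,\ln(E_e) + C_3 N_s\,\left(\delta(T_c)\,\ln(E_e)\right)^{2} + \beta_{mp}\,(T_c - T_{STC}) \tag{2}$$

$$E_e = \frac{G}{E_0} \tag{3}$$

$$\delta(T_c) = \frac{n\,k\,(T_c + 273.15)}{q} \tag{4}$$

$$P_{mp} = I_{mp}\,V_{mp} \tag{5}$$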

where Imp_STC, Vmp_STC and TSTC are the current at the MPP, the voltage at the MPP and the temperature at STC, respectively; k is the Boltzmann constant and q is the electron charge. Tc is the cell temperature and δ(Tc) is the thermal voltage. Ee, E0 and G are the effective, the nominal and the measured irradiation; αmp is the Imp temperature coefficient and βmp is the Vmp temperature coefficient. C0 to C3 and n are empirical parameters to be identified. The temperature recorded by our acquisition system is the ambient temperature, whereas the Sandia model needs the cell temperature to predict the power; the cell temperature must therefore be estimated. Several approaches have been proposed in the literature to estimate the cell temperature; the most popular is the Ross thermal model, which uses the ambient temperature and the solar irradiation (Olukan and Emziane). The Ross model is expressed in Eq. (6).

$$T_c = T + \frac{T_{NOCT} - 20\,^{\circ}\mathrm{C}}{800\ \mathrm{W/m^2}}\;G \tag{6}$$

where T is the ambient temperature and TNOCT is the nominal operating cell temperature, equal to 44 °C as given by the solar module manufacturer.

2.1. Parameters extraction

To improve the accuracy of the aforementioned Sandia model, its parameters were extracted using the FPA (Yang, 2012; Benkercha et al., 2016). The extraction is based on real measurements, over one day, of the current Idc and the voltage Vdc, which are used to calculate the power Pdc generated by the PV array; seven parameters were included in this identification. Basically, the process consists of minimizing the cost function expressed in Eqs. (7) and (8), which represents the root mean square error between the measured and the simulated power curves. The extracted parameters are listed in Table 1.

2. PV array modeling

Modeling of the PV array can follow several approaches: either a parametric model, such as the single and double diode models, or a nonparametric one, such as neural network and neuro-fuzzy based models (Rawat et al., 2016). Because of its good dynamic performance in terms of accuracy, the empirical model developed by Sandia National Laboratories is considered for this study (Kratochvil et al., 2004). This model takes into account all PV electrical, thermal and optical

$$RMSE = \sqrt{\frac{1}{k}\sum_{i=1}^{k} f\left(P_{dc}, P_{mp}(\theta)\right)^{2}} \tag{7}$$

$$f\left(P_{dc}, P_{mp}(\theta)\right) = P_{dc} - N_s\,N_p\,P_{mp}(\theta) \tag{8}$$

where Ns and Np are the numbers of modules in series and in parallel within the PV array, respectively; Pdc is the maximum power generated by the PV array, calculated as the product of Idc and Vdc, the current and voltage measured on the DC side at the MPP; k is the total number of measured samples; and Pmp(θ) is the power predicted by the Sandia model, which has a vector of unknown

Table 1
The extracted parameters.

Parameters   α_mp      β_mp      C0       C1       C2        C3        n
Values       −0.0001   −0.0600   0.9211   0.0096   −0.0009   −18.502   1.8624


Fig. 1. The curve of error variation during the extraction parameters process.

Fig. 2. Scatter plot of the measured VS predicted power.

parameters θ, thus θ = [α_mp β_mp C0 C1 C2 C3 n]. The error variation during the optimization process is shown in Fig. 1. The curve indicates a minimum RMSE of 8.82 W. The correlation between the two powers is plotted in Fig. 2, with a correlation coefficient of 0.9911, and Fig. 3 shows the match between the measured and the simulated power. In addition, the obtained Sandia model was tested on another daily profile recorded in a different season, as represented in Fig. 4. For this day, the RMSE is 10.44 W and the correlation coefficient is 0.9727. Fig. 4 shows the similarity between the measured and predicted powers, which indicates that the Sandia model has good reliability.
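As an illustration only, the Sandia model of Eqs. (1)–(5), the Ross cell temperature of Eq. (6) and the cost function of Eqs. (7) and (8) can be sketched in Python with the values of Tables 1 and 2. Several points are assumptions and not statements from the paper: E0 = 1000 W/m2 and TSTC = 25 °C are the usual STC values, the 800 W/m2 divisor in the Ross model is the standard NOCT irradiance, and the function names are hypothetical. The FPA search itself is not reproduced; the sketch only evaluates the cost for a given parameter vector θ.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C

# Extracted parameters (Table 1) and module data (Table 2).
THETA = {"alpha_mp": -0.0001, "beta_mp": -0.0600, "c0": 0.9211,
         "c1": 0.0096, "c2": -0.0009, "c3": -18.502, "n": 1.8624}
IMP_STC, VMP_STC = 3.04, 16.44   # A, V
NS, NP = 5, 2                    # modules in series / strings in parallel
T_NOCT = 44.0                    # °C, manufacturer value quoted in the text

# Assumed values, not given explicitly in the text.
E0 = 1000.0      # W/m2, nominal irradiation at STC
T_STC = 25.0     # °C
G_NOCT = 800.0   # W/m2, standard NOCT irradiance used in the Ross model

def cell_temperature(t_ambient, g):
    """Eq. (6): Ross model estimate of the cell temperature in °C."""
    return t_ambient + (T_NOCT - 20.0) / G_NOCT * g

def sandia_pmp(g, t_cell, theta=THETA):
    """Eqs. (1)-(5): power at the MPP (W) predicted by the Sandia model."""
    ee = g / E0                                                      # Eq. (3)
    delta = theta["n"] * K_BOLTZMANN * (t_cell + 273.15) / Q_ELECTRON  # Eq. (4)
    dt = t_cell - T_STC
    imp = IMP_STC * (theta["c0"] * ee + theta["c1"] * ee ** 2) \
        * (1.0 + theta["alpha_mp"] * dt)                             # Eq. (1)
    log_term = delta * math.log(ee)
    vmp = (VMP_STC + theta["c2"] * NS * log_term
           + theta["c3"] * NS * log_term ** 2
           + theta["beta_mp"] * dt)                                  # Eq. (2), Ns as defined in the nomenclature
    return imp * vmp                                                 # Eq. (5)

def rmse(samples, theta=THETA):
    """Eqs. (7)-(8) for samples given as (G, ambient T, measured Pdc)."""
    errors = [p_dc - NS * NP * sandia_pmp(g, cell_temperature(t, g), theta)
              for g, t, p_dc in samples]
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

An optimizer such as the FPA would call `rmse` repeatedly with candidate parameter vectors and keep the one giving the smallest value.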

and electrical grid. The system studied in this work is composed of 10 PV modules, whose characteristics are listed in Table 2, configured into two parallel strings of 5 modules in series each. The total system power is therefore 500 W, since each module provides 50 W at nominal conditions. The inverter used is a Ginlong Solis-mini 700, whose manufacturer data are given in Table 3; it contains a DC/DC boost converter for the MPPT and a DC/AC converter that allows the PV system to connect to and inject the generated power into the Algerian public electrical network. The power is injected into a single phase of the low voltage network, characterized by 230 V–50 Hz (Hadj Arab et al., 2005). This system belongs to the Research Laboratory of Electrical Engineering and Automatic (LREA) of Médéa University in Algeria. A data acquisition system (DAQs) was built to collect and record the measured data. An external card was built for measuring the electrical parameters and conditioning the signals of the four sensors, which measure both meteorological and electrical parameters during system operation; the chosen sensors and their calibration factors are given in Table 4. In addition, the NI_USB 6008 (DAQ) multiplexer card, which contains eight single-ended channels and four differential channels, is used for data acquisition. The recording process is performed by LabVIEW software.

3. PV system description and fault definition

This section takes a close look at the PV and monitoring systems, as well as the faults created in the PV array and their data.

3.1. PV system description

Generally, grid connected PV systems (GCPVS) are composed of four main parts, namely: PV array, DC/AC inverter, monitoring system


Fig. 3. A profile of clear sky day (a) temperature ambient; (b) Estimated cell temperature; (c) irradiation; (d) Predicted vs measured power of the GCPVS (20 July 2017).

3.2. Faults definition

The electrical variables, namely the current and the voltage, are measured before the inverter on the DC side; the sensors are therefore placed between the PV array and the inverter to acquire the electrical variables at the MPP. The monitoring system acquires and records the PV system input and output variations every 1 s. Fig. 5 presents an overview of both the GCPVS and the monitoring system.

Occasionally, the GCPVS can be subject to faults during its operation; these faults can be arranged into two types, namely major losses and minor losses (Madeti and Singh, 2017). The faults that cause major losses can severely damage the system and can also expose people to risk; they include the string fault, the short circuit fault and the line-line fault. Moreover, the minor losses are


3.3. Learning set

caused by accidental faults such as partial shading and MPPT error. In our work, three types of faults are considered, namely: string fault, short circuit fault and line-line fault.

In this section, the data sets used for learning and evaluation are described.

3.3.1. Electrical attributes

To elaborate the proposed approach, a data set was gathered from the GCPVS during its operation by the monitoring system. The recorded data contain both meteorological conditions (ambient temperature and irradiation) and electrical parameters (Idc, Vdc and Pdc). Since the photovoltaic system inputs vary randomly and independently, a good data set is one that covers as many scenarios of system operation as possible. Hence, a database of 471,518 samples was recorded over several days (cloudy and clear sky days), sweeping a large variation of temperature and irradiation. A preprocessing task was then applied to clean erroneous measurements, after which 464,092 samples remained. These 464,092 samples are divided into two subsets: the first was collected in the healthy operating state, while the second was obtained in the faulty operating state and contains the sets of the three faults. The fault sets comprise 126,639 samples of string fault, 93,886 samples of short circuit fault and 113,983 samples of line-line fault, which leaves 129,584 samples for the healthy state. Fig. 7 represents the histogram of the data set. The measured ambient temperature varies from 28 °C to 38 °C, whereas the irradiation range is [200 W/m2, 1000 W/m2]. This wealth of scenarios enables us to obtain a generalized decision model whose predictions match the real state. To avoid insignificant samples caused by measurement errors, a preprocessing task is introduced before the recording process. Fig. 8 shows the evolution of Pdc as a function of Vdc for normal operation and in the presence of faults under several weather conditions. It is visible that the data set presents overlap zones, which increase the complexity of classifying the different states. Consequently, other attributes are needed with which the proposed approach can distinguish and correctly predict the system states. Therefore, three attributes are used, namely: temperature, irradiation and the power ratio.

i. String fault

The string fault is one of the most common faults leading to large power losses in the GCPVS (Chine et al., 2014). Frequently, this fault appears following damage to protection elements such as an anti-return diode or a fuse, a disconnection between two successive modules in series, or a wire cut in the string.

ii. Short circuit fault

This fault takes place when a short circuit occurs between the conductors of the modules, or when the positive and negative poles of the same module are connected (Chao et al., 2008). If it persists for a significant time, this fault can cause damage or performance degradation.

iii. Line-line fault

The line-line fault is caused by a connection between two module wires at different points with different potentials, or between a module wire and the ground; the line-line fault can thus be seen as a short circuit fault with low resistance between PV panels (Zhao et al., 2013a, 2013b). Moreover, the second case can badly damage the inverter.

As presented in Fig. 6, three faults are created in the GCPVS experimental bench, namely: string fault (S_f), short circuit fault (SC_f) and line-line fault with a resistor of R = 7 Ω. The faults are artificially created in the PV array. The string fault is realized simply by disconnecting a string from the array, so that five PV modules in series are unplugged; this fault was kept in the system for six days: 25, 26, 27 and 28 July, and 10 and 17 August 2017. The short circuit fault is carried out by short-circuiting one PV module in a pre-selected string; this state was maintained for four days: 30 and 31 July, and 18 and 19 August 2017. Finally, the line-line fault is achieved by removing one PV module and replacing it with a 7 Ω resistor; the system remained with this fault for five days: 11 and 12 July, and 22, 23 and 24 August 2017. For normal operation, six days were selected: 19 and 20 July, and 2, 3, 26 and 27 August 2017.

3.3.2. Selected attributes

In the proposed approach, two further attributes are used, namely the temperature and the irradiation, because these two attributes are closely related to the system performance. In other words, the temperature impacts the PV system voltage and thermal losses, while the irradiation is tied directly to the generated current intensity


Fig. 4. A profile of cloudy sky day (a) temperature ambient; (b) Estimated cell temperature; (c) irradiation; (d) Predicted vs measured power of the GCPVS (e) Scatter plot of the correlation between the measured and predicted powers (4 October 2017).

latter, which already depends on the irradiation, as expressed in Eq. (6). The meteorological attributes alone are insufficient to classify the faults. For this purpose, another attribute, called the power ratio, is added to the learning set. The power ratio has a high correlation with

(Kratochvil et al., 2004). In addition, to build a good classifier model, it is preferable to choose independent attributes that give more information about the system and that have a high correlation with the target. Therefore, the ambient temperature is selected as an attribute instead of the cell temperature because of the information given by this


GCPVS in normal operation. Fig. 9 represents a 3D plot of these attributes. As Fig. 9 suggests, each instance of the learning set is composed of the three attributes and the class label, where the class is either the healthy or the faulty state for fault detection, or one of free fault, string fault, short circuit fault or line-line fault for fault diagnosis. The purpose is to build two models able to predict with high precision the system state and the fault type. Therefore, based on the data set, a nonparametric model is constructed using the C4.5 decision tree algorithm.
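The structure of one learning instance can be sketched as follows. This is a minimal illustration under stated assumptions: the power ratio is taken as measured power divided by Sandia-predicted power, the detection label is derived from the diagnosis label, and the names `Instance` and `make_instance` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass

# The four diagnosis class labels named in the text.
DIAGNOSIS_CLASSES = ("free fault", "string fault",
                     "short circuit fault", "line-line fault")

@dataclass(frozen=True)
class Instance:
    temperature: float   # ambient temperature, °C
    irradiation: float   # W/m2
    power_ratio: float   # assumed here: measured power / predicted power
    diagnosis: str       # one of DIAGNOSIS_CLASSES

    @property
    def detection(self) -> str:
        """Binary target for the detection model (assumed mapping)."""
        return "healthy" if self.diagnosis == "free fault" else "faulty"

def make_instance(t_ambient, g, p_measured, p_predicted, diagnosis):
    """Build one labelled instance from raw measurements."""
    if diagnosis not in DIAGNOSIS_CLASSES:
        raise ValueError(f"unknown class label: {diagnosis!r}")
    if p_predicted <= 0:
        raise ValueError("predicted power must be positive")
    return Instance(t_ambient, g, p_measured / p_predicted, diagnosis)
```

With this layout, the same three attributes feed both models and only the target column differs between detection and diagnosis.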

Table 2
Data PV module manufacturing.

Parameters   Voc (V)   Isc (A)   Vmpp (V)   Impp (A)   Pmpp (W)
Value        21.2      3.11      16.44      3.04       50

Table 3
Inverter manufacture characteristic.

Parameters               Value
Max input Power          0.9 kW
Max input Voltage        450 V
Max input Current        10 A
Start-up input voltage   60 V
MPPT voltage range       50–400 V
MPPT Efficiency          99.9
Safety/EMC standard      EN61000-6-1:2007; EN61000-6-3:2007 IEC62109-1/2; AS3100

4. The proposed algorithm

Among the popular methods used in the literature to detect and diagnose faults in the GCPVS are those that compare a prediction model of normal operation with the measured values and trigger an alarm signal when a significant gap appears (Chine et al., 2014). Nevertheless, these methods have shown limits in identifying the fault type accurately. In the present work, a new intelligent approach is proposed to detect and classify the faults with high prediction accuracy. Decision tree induction is one of the most widely used algorithms for solving classification problems and is considered one of the most sophisticated in this domain. A decision tree is obtained by divide and conquer methods which

system performance, since faults reduce the power by losing some of the produced energy; it is also taken into account to increase the prediction accuracy. This attribute is calculated from the power estimated by the Sandia model and the measured power, where the power predicted by the Sandia model is considered as the power produced by the


Table 4
The sensors description with their calibration value.

Category         Measured parameter           Symbol   Unit    Sensor type                 Sensor reference   Accuracy   Calibration value
Electrical       DC Current                   Idc      A       Hall Effect                 Allegro ACS712     ± 0.9%     1.453
Electrical       DC Voltage                   Vdc      V       Hall Effect                 LEM LV 25-P        1.5%       9.8
Meteorological   Ambient Temperature          T        °C      Semiconductor temperature   LM35               0.5 °C     4.95024
Meteorological   Global inclined Irradiance   G        W/m2    PV reference cell           SPEKTRON 200       ± 5%       139.86

Fig. 5. synoptic scheme of GCPVS with data acquisition system.

Fig. 6. Cases faults with causes in the GCPVS.

operate on the learning set; the obtained model is a representation of a decision procedure that determines the class of a given target. Generally, the model consists of several nodes connected to each other by branches, down to the terminal nodes, in a pyramid shape. A path from the top node to a terminal node therefore corresponds to a series of attributes (or questions) with their values (responses) (Han et al., 2011). The structure of the decision tree can be represented by equivalent rules expressed as “if… else” statements, so a prediction is made by following one of these paths. In addition, the decision rules are very similar to those naturally manipulated by humans.

Fig. 7. Histogram of the data set.
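The equivalence between a tree and a set of if/else rules can be illustrated with a small sketch. The tree below is purely hypothetical: its structure and thresholds (0.9 and 0.55 on the power ratio) are invented for illustration and are not results from the paper. Each root-to-leaf path becomes one rule, and classifying an instance amounts to following one path.

```python
# Hypothetical two-level tree; thresholds are illustrative only.
TREE = {
    "attribute": "power_ratio", "threshold": 0.9,
    "le": {
        "attribute": "power_ratio", "threshold": 0.55,
        "le": {"label": "string fault"},
        "gt": {"label": "line-line fault"},
    },
    "gt": {"label": "free fault"},
}

def classify(node, instance):
    """Follow one root-to-leaf path for an instance given as a dict."""
    while "label" not in node:
        branch = "le" if instance[node["attribute"]] <= node["threshold"] else "gt"
        node = node[branch]
    return node["label"]

def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into an 'IF ... THEN label' rule."""
    if "label" in node:
        return ["IF " + " AND ".join(conditions) + " THEN " + node["label"]]
    attr, thr = node["attribute"], node["threshold"]
    return (to_rules(node["le"], conditions + (f"{attr} <= {thr}",))
            + to_rules(node["gt"], conditions + (f"{attr} > {thr}",)))
```

Calling `to_rules(TREE)` lists one rule per leaf, which is the readable form that makes the model accessible to non-expert operators.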


Fig. 8. Characteristic P-V at MPP under various weather conditions.

Fig. 9. 3D Plot of the proposed inputs DT model.

Fig. 10. The representation of the decision Tree.


Fig. 12. Root node of the DT model example.

Fig. 11. The splitting representation of the set D by the attribute Xi.

Table 5 Learning set for C4.5 algorithm example.
Vdc (V) | Idc (A) | System state
80.65 | 5.15 | Healthy state
78.92 | 6.3 | Healthy state
79 | 5.08 | Healthy state
66.1 | 3.3 | Faulty state
79.9 | 1.3 | Faulty state
60.3 | 2.42 | Faulty state

Fig. 13. Root node with the left leaf.

Fig. 14. The final tree form.

Table 6 The frequency table of the class D.
System state (D) | Instances
Healthy state | 3
Faulty state | 3
Total instances | 6

4.1. Decision tree induction

The learning task consists of building a generalized tree as a non-parametric model in order to predict the class label, a nominal output (nominal class), from the data set, so that unseen data are correctly classified in most cases. Basically, the key idea of this algorithm is to find the attribute that carries the most representative information about the data. The data set is split into subsets by a well-defined criterion, and this process is applied recursively to grow the tree; thus, the data become gradually smaller at each iteration until a stopping criterion is met. Structurally, the decision tree contains three types of nodes: the root node, the internal nodes and the terminal nodes (or leaves). The root node is situated at the highest tree level; in other words, the first splitting attribute is taken as the root node according to the specified selection criterion. Moreover, adjacent nodes are connected to each other by branches. Fig. 10 briefly illustrates the DT form. In addition, the learning set is divided into two categories: the first contains the attributes, which in our case are the PV array inputs (the ambient temperature and the irradiation) and the ratio between the measured power and the power predicted by the Sandia model, whereas the second is the target, which is either healthy or faulty state for the detection and free fault, string fault, short circuit fault or line-line fault for the diagnosis.

Table 7 Calculating information for the attribute Vdc.
Vdc threshold (V) | Healthy state (≤ / >) | Faulty state (≤ / >) | Info (bits) | Gain | Split Info | G_ratio
60.3 | 0 / 3 | 1 / 2 | 0.8091 | 0.1909 | 0.65 | 0.2936
66.1 | 0 / 3 | 2 / 1 | 0.5409 | 0.4591 | 0.9183 | 0.5
78.92 | 1 / 2 | 2 / 1 | 0.9183 | 0.0817 | 1 | 0.0817
79 | 2 / 1 | 2 / 1 | 1 | 0 | 0.9183 | 0
79.9 | 2 / 1 | 3 / 0 | 0.8091 | 0.1909 | 0.65 | 0.2936
80.65 | 3 / 0 | 3 / 0 | 1 | 0 | 1 | 0

Table 8 Calculating information for the attribute Idc.
Idc threshold (A) | Healthy state (≤ / >) | Faulty state (≤ / >) | Info (bits) | Gain | Split Info | G_ratio
1.3 | 0 / 3 | 1 / 2 | 0.8091 | 0.1909 | 0.65 | 0.2936
2.42 | 0 / 3 | 2 / 1 | 0.5409 | 0.4591 | 0.9183 | 0.5
3.3 | 0 / 3 | 3 / 0 | 0 | 1 | 1 | 1
5.08 | 1 / 2 | 3 / 0 | 0.5409 | 0.4591 | 0.9183 | 0.5
5.15 | 2 / 1 | 3 / 0 | 0.8091 | 0.1909 | 0.65 | 0.2936
6.08 | 3 / 0 | 3 / 0 | 1 | 0 | 1 | 0

Table 9 Best threshold values for each attribute.
The attribute | Vdc (V) | Idc (A)
Threshold | 60.4 | 3.1
Gain ratio | 0.5 | 1

Fig. 15. A tree part from the DT model for the fault detection.

Fig. 16. A rules part from the DT model for the fault detection.

4.2. Splitting criteria

The growing of the tree is based on a splitting criterion that is applied recursively to the data set to find the model that gives the best decision for classifying the instances. The attribute used to split the data set is selected according to the amount of information it provides, and a specific criterion is used to calculate this amount by taking into account the purity of the obtained subsets. The criterion uses the information gain introduced in an algorithm called Iterative Dichotomiser 3 (ID3), which was developed by Ross Quinlan (Quinlan, 1986). Structurally, top-down induction is the basic process for building the decision tree model (DT model): the tree grows from the root node downward until the leaves are reached. At each step, a test on the current node is performed to choose which attribute is placed in the splitting position. In other words, the test consists of measuring the impurity by calculating the entropy given in Eq. (9) to determine this attribute.

$$\mathrm{Info}(D) = -\sum_{j=1}^{k} \frac{|C_j|}{|D|} \log_2\left(\frac{|C_j|}{|D|}\right) \tag{9}$$

where C represents the set of classes, which contains k classes, |Cj| is the number of instances that belong to the class Cj, and |D| is the cardinality of the set of instances D. To partition the learning set D into subsets Di, where i = 1, …, n, according to the values of an attribute Xi, the information extracted by this attribute is calculated using the conditional entropy, as illustrated in Fig. 11. The conditional entropy is the amount of information needed to identify the class of an element of D knowing the value of the attribute Xi, and is expressed by Eqs. (10) and (11).

$$\mathrm{Info}(X_i, D) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} \cdot \mathrm{Info}(D_i) \tag{10}$$

$$\text{where: } \mathrm{Info}(D_i) = -\sum_{j=1}^{k} \frac{|C_j|}{|D_i|} \cdot \log_2\left(\frac{|C_j|}{|D_i|}\right) \tag{11}$$

The information gain represents the difference between the information needed to identify an element of D and the information needed to identify an element of D knowing the value of the attribute Xi, and is given by Eq. (12).

$$\mathrm{Gain}(X_i, D) = \mathrm{Info}(D) - \mathrm{Info}(X_i, D) \tag{12}$$

Hence, the attribute with the maximum information gain is chosen for the split. The learning set is partitioned into subsets according to the possible attribute values, with one branch for each value of the attribute, and the process is repeated for each new node until a stopping criterion is met; a leaf is then placed in the tree with the label of the most probable class. However, the ID3 algorithm has several shortcomings:

– It does not consider numeric attributes;
– Missing values in attributes are not taken into account;
– The pruning process is not included in the algorithm;
– The algorithm does not handle high-dimensional data.

To address some of these shortcomings, Quinlan (1996, 2014) proposed the C4.5 algorithm, an enhancement of ID3; the improvements include the treatment of numerical attributes, the handling of missing values and the introduction of a pruning process. Moreover, the gain ratio, expressed in Eqs. (13) and (14), is used instead of the information gain, which gives a more accurate splitting.

$$\mathrm{G\_ratio}(X_i, D) = \frac{\mathrm{Gain}(X_i, D)}{\mathrm{Split\_Info}(X_i, D)} \tag{13}$$

$$\text{with } \mathrm{Split\_Info}(X_i, D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2\left(\frac{|D_i|}{|D|}\right) \tag{14}$$

The attribute that has the maximum gain ratio is used at the splitting point. Each numerical attribute with n values has n candidate thresholds, and each threshold a defines a binary split (≤ a or > a); in other words, each node has two edges corresponding to the chosen threshold. In addition, a discretization step must be done for each numeric attribute as follows: all values of the attribute are sorted in ascending order and, for each value Sj of the attribute, the learning set is divided into two categories:

– Those which have a value less than or equal to Sj;
– Those which have a value greater than Sj.


For each of these partitions, the gain ratio is calculated, and the partition that maximizes the gain ratio is selected.

Fig. 17. A tree part from the DT model for the fault diagnosis.

Fig. 18. A rules part from the DT model for the fault diagnosis.

Table 10 Confusion matrix of detection model.
Classified as | a | b
a = Healthy state | 44,123 | 92
b = Faulty state | 117 | 113,459

Table 11 Testing result of the detection model.
Parameters | Value
Time taken to build model | 12.64 s
Time taken to test model on training split | 0.25 s
Size of the tree | 171
Number of Leaves | 86
Correctly Classified Instances (157582) | 99.8675%
Incorrectly Classified Instances (209) | 0.1325%
Kappa statistic | 0.9967
Mean absolute error | 0.0022
Root mean squared error | 0.0341
Relative absolute error | 0.5495%
Root relative squared error | 7.5994%
Total Number of Instances | 157,791

Table 12 Confusion matrix of diagnostic model.
Classified as | a | b | c | d
a = Free fault | 44,117 | 0 | 80 | 18
b = String fault | 0 | 42,853 | 0 | 21
c = Short Circuit fault | 125 | 1 | 31,790 | 23
d = Line-line fault | 5 | 7 | 23 | 38,728

Table 13 Testing result of the diagnostic model.
Parameters | Value
Time taken to build model | 14.1 s
Time taken to test model on training split | 0.72 s
Size of the tree | 207
Number of Leaves | 104
Correctly Classified Instances (157488) | 99.808%
Incorrectly Classified Instances (303) | 0.192%
Kappa statistic | 0.9974
Mean absolute error | 0.0015
Root mean squared error | 0.0284
Relative absolute error | 0.409%
Root relative squared error | 6.5739%
Total Number of Instances | 157,791
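The splitting criterion of Section 4.2 can be illustrated numerically. The following is a minimal sketch, not the WEKA implementation: it evaluates Eqs. (9)–(14) for binary splits on the Vdc column of the learning set in Table 5. The names `SAMPLES`, `entropy`, `gain_ratio` and `best_threshold` are hypothetical, and skipping the largest value as a candidate threshold (because it leaves the ">" side empty) is an assumption of this sketch.

```python
from math import log2

# Vdc column of Table 5 with the system state of each instance.
SAMPLES = [(80.65, "Healthy"), (78.92, "Healthy"), (79.0, "Healthy"),
           (66.1, "Faulty"), (79.9, "Faulty"), (60.3, "Faulty")]


def entropy(labels):
    """Info(D) of Eq. (9) for a non-empty list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))


def gain_ratio(samples, threshold):
    """Gain ratio of the binary split 'value <= threshold' / 'value > threshold'."""
    n = len(samples)
    labels = [state for _, state in samples]
    parts = [[state for value, state in samples if value <= threshold],
             [state for value, state in samples if value > threshold]]
    info_x = sum(len(p) / n * entropy(p) for p in parts)             # Eqs. (10)-(11)
    gain = entropy(labels) - info_x                                  # Eq. (12)
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts)  # Eq. (14)
    return gain / split_info                                         # Eq. (13)


def best_threshold(samples):
    """Candidate threshold with the highest gain ratio.

    Assumption of this sketch: the largest value is not a candidate,
    since it would leave the '>' subset empty.
    """
    candidates = sorted(value for value, _ in samples)[:-1]
    return max(candidates, key=lambda t: gain_ratio(samples, t))
```

Under these assumptions, `gain_ratio(SAMPLES, 60.3)` is about 0.2936 and `best_threshold(SAMPLES)` returns 66.1, in agreement with the corresponding rows of Table 7.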

4.3. Stopping criteria

The process of building the DT model has several stopping criteria, which are listed below:

– All samples belong to the same class or there is no longer an attribute to use;
– The depth of the tree reaches a fixed limit;
– The number of leaves reaches a set maximum;
– The number of instances per node is less than a set threshold;
– The maximum information gain obtained is less than a fixed threshold;
– The accuracy of the tree no longer increases significantly, such that no attribute improves accuracy.

In most cases, these criteria lie within the pruning setting phase.

4.4. Pruning

The model is built explicitly on a particular learning set. In other words, the obtained model is generated only from these data. Therefore, the model can be very complicated and can lead to erroneous predictions for the unseen set. Consequently, to avoid overfitting, to increase the accuracy and to reduce the size of the obtained tree, the pruning process is necessary. This process removes the useless branches or sub-trees. Pruning is divided into two types: pre-pruning and post-pruning. Pre-pruning is performed during the growth of the tree; at each developed node, it decides whether to continue growing or to stop by placing a leaf that corresponds to the majority class. Post-pruning follows a backward strategy; in other words, it is executed once the final tree has been obtained, since this tree might contain some unnecessary sub-trees that can be replaced by leaves. The C4.5 algorithm uses the post-pruning strategy called reduced-error pruning (REP), which is expressed in Eq. (15) (Witten et al., 2016). The error rate is estimated in the sub-tree as well as in the leaf that would replace it, and the comparison between both error values decides whether to apply the pruning or not.

$$e = \frac{f + \dfrac{z^2}{2N} + z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}} \tag{15}$$

where N is the total number of instances reaching the node or the leaf, f = E/N represents the observed error and E is the number of misclassified instances. Quinlan used the confidence factor to estimate the predicted error rate e based on the probability function. In addition, the standard normal distribution is used to find the confidence threshold z. In the C4.5 algorithm the value of c is equal to 25%, which means z = 0.69. The following steps must be done to decide whether to prune a sub-tree or not:

– Calculate the probability of the expected error on each leaf of the current sub-tree, e(Nj, fj, α), where j is the index of the leaf;
– Calculate the number of predicted errors on each leaf, which is equal to the number of cases on this leaf multiplied by the predicted error rate, Nj × e(Nj, fj, α);
– Calculate the sum of the anticipated errors over all leaves in the sub-tree, ∑j Nj × e(Nj, fj, α);
– Calculate the expected error at the root node of these leaves, which is the number of samples multiplied by the error rate predicted at the current node, NT × e(NT, fT, α);
– Compare the values obtained from the two preceding steps: if the error given by the root node is less than the error given by the leaves, then the sub-tree is pruned.

The pruning algorithm is described by the following pseudo-code:

As long as there is a sub-tree that can be replaced by a leaf without increasing the estimated error, prune that sub-tree.

Fig. 19. Curves of (a) the ambient temperature, (b) Estimated cell temperature (c) the irradiation, (d) the current, (e) the voltage, (f) estimated and measured power and (g) the power ratio.

Table 14 Confusion matrix of detection model.
Classified as | a | b
a = Healthy state | 26,263 | 3
b = Faulty state | 0 | 0

Table 15 Testing result of the detection model.
Parameters | Value
Time taken to test model | 0.36 s
Correctly Classified Instances (26263) | 99.9886%
Incorrectly Classified Instances (3) | 0.0114%
Total Number of Instances | 26,266

Table 16 Confusion matrix of diagnostic model.
Classified as | a | b | c | d
a = Free fault | 26,266 | 0 | 0 | 0
b = String fault | 0 | 0 | 0 | 0
c = Short Circuit fault | 0 | 0 | 0 | 0
d = Line-line fault | 0 | 0 | 0 | 0

Table 17 Testing result of the diagnostic model.
Parameters | Value
Time taken to test model | 0.13 s
Correctly Classified Instances (26266) | 100%
Incorrectly Classified Instances (0) | 0%
Total Number of Instances | 26,266

In addition, an example is introduced to clarify the construction of a DT model by the C4.5 algorithm. The learning set of the example is presented in Table 5. The model is built by the following steps:

Step 1: The class entropy is calculated using the frequency table of the class label D (Table 6), where D is the system state in this case;

$$\mathrm{Info}(D) = \mathrm{Info}(3, 3) = -\frac{3}{6}\log_2\left(\frac{3}{6}\right) - \frac{3}{6}\log_2\left(\frac{3}{6}\right) = 1\ \text{bit}$$

Step 2: The values of each attribute are sorted in ascending order;

Step 3: Calculate the gain ratio for each attribute value separately to choose the best splitting attribute.

a. For the attribute Vdc:

Table 7 gives the gain ratio for each value of the attribute Vdc. Each Vdc value is considered as a threshold, the splitting criterion is applied to calculate the gain ratio for these thresholds, and the attribute value with the maximum gain ratio is then chosen. Furthermore, as mentioned in Section 4.2, the expressions (9)–(14) are used to show the model building process; the calculation for Vdc = 60.3 V is as follows:

$$\mathrm{Info}(V_{dc}, D) = \frac{1}{6} \times \mathrm{Info}[V_{dc} \le 60.3] + \frac{5}{6} \times \mathrm{Info}[V_{dc} > 60.3]$$

$$= \frac{1}{6}\left[-\frac{0}{1}\log_2\left(\frac{0}{1}\right) - \frac{1}{1}\log_2\left(\frac{1}{1}\right)\right] + \frac{5}{6}\left[-\frac{3}{5}\log_2\left(\frac{3}{5}\right) - \frac{2}{5}\log_2\left(\frac{2}{5}\right)\right] = 0.8091$$

$$\mathrm{Gain}(V_{dc}, D) = \mathrm{Info}(D) - \mathrm{Info}(V_{dc}, D) = 1 - 0.8091 = 0.1909$$

$$\mathrm{SplitInfo} = -\frac{1}{6}\log_2\left(\frac{1}{6}\right) - \frac{5}{6}\log_2\left(\frac{5}{6}\right) = 0.65$$

$$\mathrm{G\_ratio} = \frac{\mathrm{Gain}(V_{dc}, D)}{\mathrm{SplitInfo}(V_{dc}, D)} = \frac{0.1909}{0.65} = 0.2936$$

As described in Table 7, the threshold that has the greatest gain ratio is Vdc = 66.1 V, so this value is taken as the best split for the attribute Vdc; the same procedure is applied to the attribute Idc.

b. For the attribute Idc:

As seen in Table 8, this attribute has a perfect split threshold at Idc = 3.1 A, for which the gain ratio is equal to one.

Step 4: Choose the best splitting attribute

Table 9 summarizes the calculation for each attribute, where the presented thresholds have the greatest gain ratio compared with the other values. In this step, the attribute that has the maximum gain ratio among all attributes is selected as the root node of the tree. According to Table 9, the attribute with the highest gain ratio is Idc; therefore, this attribute forms the root node, as shown in Fig. 12. The root node has two branches, right and left, whose edges are respectively Idc > 3.1 A and Idc ≤ 3.1 A. A condition is applied to decide whether the next node is a leaf or an internal node: if the number of instances of the majority class at this node is greater than or equal to two, the node is considered as a leaf. The left branch has a majority class with more than 2 instances (see Table 8 for the Idc value 3.1 A), so this branch ends with the faulty state label in the leaf, as illustrated in Fig. 13. The same condition is applied to the right branch: it is clear from Table 8, for Idc > 3.1, that the majority class with more than 2 instances is the healthy state, so the right branch ends with this state label in the leaf (see Fig. 14). This model can be translated into decision rules: if Idc ≤ 3.1 then the decision is Faulty state, else (Idc > 3.1) the decision is Healthy state.

Step 5: The pruning process is applied after the construction of the tree model using the reduced-error pruning expression, following the procedure described in Section 4.4.

– The expected error for both leaves: for the right leaf, N = 3 and E = 0, so er(3, 0/3, α) = 0.14; for the left leaf, N = 3 and E = 0, so el(3, 0/3, α) = 0.14;
– The global expected error from these leaves is 3 × er + 3 × el = 0.84;
– The predicted error at the root node of this sub-tree is calculated by treating the root node as a leaf labeled with the majority class reaching this node: e(6, 3/6, α) = 0.64, so the global error predicted by this node is 6 × eg = 3.81.

Comparing both errors, the global error obtained at the root node is greater than the sum of the errors given by the leaves. Consequently, the tree is kept in its current form.

Fig. 20. Curves of (a) the ambient temperature, (b) Estimated cell temperature (c) the irradiation, (d) the current, (e) the voltage, (f) estimated and measured power and (g) the power ratio.

Table 18 Confusion matrix of detection model (String fault day).
Classified as | a | b
a = Healthy state | 0 | 0
b = Faulty state | 0 | 28,147

Table 19 Testing result of the detection model.
Parameters | Value
Time taken to test model | 0.16 s
Correctly Classified Instances (28147) | 100%
Incorrectly Classified Instances (0) | 0%
Total Number of Instances | 28,147

Table 20 Confusion matrix of diagnostic model.
Classified as | a | b | c | d
a = Free fault | 0 | 0 | 0 | 0
b = String fault | 0 | 28,147 | 0 | 0
c = Short Circuit fault | 0 | 0 | 0 | 0
d = Line-line fault | 0 | 0 | 0 | 0

Table 21 Testing result of the diagnostic model.
Parameters | Value
Time taken to test model | 0.36 s
Correctly Classified Instances (28147) | 100%
Incorrectly Classified Instances (0) | 0%
Total Number of Instances | 28,147

5. Experimental results and discussion

In this section, experimental results are presented in order to evaluate the obtained models. The data set is divided into two randomly chosen sub-sets: the first one, which contains 66% of the global data set, is used for the training, while the remaining data are used for the test. Another test was done to validate both models using a new, previously unseen data set.

Fig. 21. Curves of (a) the ambient temperature, (b) Estimated cell temperature (c) the irradiation, (d) the current, (e) the voltage, (f) estimated and measured power and (g) the power ratio.

5.1. Training and testing results

The data set was recorded in the healthy state (free fault) and in the faulty state; the latter contains three different faults, namely String fault (S_f), Short Circuit fault (SC_f) and Line-Line fault (LL_f). In order to classify the system state during its operation, two models have been constructed based on the C4.5 algorithm using the WEKA software, one for detection and one for diagnosis. The data set has been used in the training, testing and validation of the obtained models. The accuracy is calculated as the ratio between the number of correctly classified instances and the total number of instances used by the prediction model, as expressed in Eq. (16).

$$\mathrm{Accuracy} = \frac{\text{Number of instances correctly classified}}{\text{Total number of instances}} \tag{16}$$
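As an illustration of Eq. (16), the accuracy can be recomputed directly from a confusion matrix. The sketch below uses the detection matrix of Table 10 (rows taken as the actual class, columns as the predicted class) and also evaluates Cohen's kappa with its standard textbook formula, on the assumption that this is the "Kappa statistic" reported in Table 11. The names `TABLE_10`, `accuracy` and `kappa` are hypothetical.

```python
# Confusion matrix of Table 10: rows = actual class, columns = predicted class.
TABLE_10 = [[44123, 92],       # a = Healthy state
            [117, 113459]]     # b = Faulty state


def accuracy(cm):
    """Eq. (16): correctly classified instances over all instances."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    total = sum(sum(row) for row in cm)
    observed = accuracy(cm)
    # Chance agreement from the row (actual) and column (predicted) totals.
    expected = sum(sum(cm[i]) * sum(row[i] for row in cm)
                   for i in range(len(cm))) / total ** 2
    return (observed - expected) / (1 - expected)
```

With these values, `accuracy(TABLE_10)` is about 0.998675 and `kappa(TABLE_10)` about 0.9967, which is consistent with the figures listed in Table 11.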

Table 22 Confusion matrix of detection model (Short circuit fault day).
Classified as | a | b
a = Healthy state | 0 | 0
b = Faulty state | 101 | 26,914

Table 23 Testing result of the detection model.
Parameters | Value
Time taken to test model | 0.09 s
Correctly Classified Instances (26914) | 99.6261%
Incorrectly Classified Instances (101) | 0.3739%
Total Number of Instances | 27,015

Table 24 Confusion matrix of diagnostic model.
Classified as | a | b | c | d
a = Free fault | 0 | 0 | 0 | 0
b = String fault | 0 | 0 | 0 | 0
c = Short Circuit fault | 129 | 0 | 26,822 | 64
d = Line-line fault | 0 | 0 | 0 | 0

Table 25 Testing result of the diagnostic model.
Parameters | Value
Time taken to test model | 0.09 s
Correctly Classified Instances (26822) | 99.2856%
Incorrectly Classified Instances (193) | 0.7144%
Total Number of Instances | 27,015

To run the algorithm, some parameters must be initialized, such as the confidence factor, the minimum number of instances per leaf and the option to enable or disable the pruning process; in the proposed approach these parameters are set to 0.25, 40 instances and pruning enabled, respectively. Figs. 15 and 16 illustrate a part of the tree with the associated output script of the detection model. Fig. 15 shows a part of the model output by WEKA that is used for the fault detection. The constructed DT model is composed of several sub-trees whose connected nodes end with leaves; the tree branches in a binary way, where the edge of each branch carries a threshold. Moreover, the model involves many decision pathways, and each path between the root node and a leaf can be replaced by a rule, as presented in Fig. 15. Furthermore, the decision paths differ from each other: comparing them shows that there are short, average and long paths, and the decision rules have the same characteristic. Therefore, the prediction time is also related to the rule model. Fig. 16 shows the script output by WEKA for the tree part of Fig. 15; the values in brackets after each class label give the numbers of instances at the current leaf, where the first is the number of instances reaching the leaf and the second is the number of those instances predicted wrongly. For the DT model used for the diagnosis, a part of the model is shown in Fig. 17; as mentioned before, this model has four class labels and is based on binary splits. The same part has the script rules illustrated in Fig. 18; both figures are taken from the WEKA software output. The figures of the diagnosis model confirm that the model has several paths, each of a different length. The built model has been tested on the remaining data set, and Tables 10 and 11 show the results of this process. As highlighted in Tables 10 and 11, the constructed model agrees closely with the data according to parameters such as the mean absolute error, the root mean squared error, the relative absolute error and the root relative squared error. Moreover, the building stage took a relatively short computation time (12.64 s), which supports the use of this algorithm for real-time applications. The obtained models can be assessed from the detection and diagnostic confusion matrices, which contain the number of samples of each class, either for the detection of the fault or for the classification of the fault type.

The confusion matrix in Table 10 gives the number of correctly and incorrectly predicted samples, where the correctly classified samples lie on the diagonal. Thus, the accuracy given in Eq. (16), the ratio of correctly predicted instances, can be determined from this matrix. The construction of the DT model for the fault detection is based on a data set of 464,092 instances recorded by the acquisition system; 66% of the data is used for the training while the remaining data is used for the test. As shown in Table 11, the building time of the model is 12.64 s, and the obtained model has 171 nodes and 86 leaves. The building time shows that the C4.5 algorithm is very fast in the training stage. Furthermore, the size of the tree is acceptable, and the model can be made readable by converting it to “if/then” rules. In addition, a test stage is necessary after building the DT model; it is carried out on unseen data never included in the learning process in order to judge the model performance, and here the model has been tested on 34% of the data set, randomly chosen. The accuracy of the detection model obtained from the confusion matrix reached 99.8675%, which indicates good prediction of the instance class labels and of the system state.

The test results in Tables 12 and 13 show clearly that the classification model performs well in terms of accuracy, with a suitable tree size and a fast building time. The model was built in 14.1 s, and the tree is composed of 207 nodes, of which 104 are leaves. The model has a strong prediction precision: 157,488 test samples are correctly classified, whereas 303 samples are falsely predicted. This test shows clearly the effectiveness of the model, whose accuracy is 99.808%. In addition, the time taken to test the model is 0.72 s.
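To see how the 303 diagnosis errors are distributed over the fault types, the per-class recall can be read from the confusion matrix of Table 12. The following is a minimal sketch, assuming that each row of the matrix holds the instances of one actual class (the usual WEKA layout); `TABLE_12`, `LABELS` and `per_class_recall` are hypothetical names.

```python
LABELS = ["Free fault", "String fault", "Short Circuit fault", "Line-line fault"]

# Confusion matrix of Table 12: rows = actual class, columns = predicted class.
TABLE_12 = [[44117, 0, 80, 18],
            [0, 42853, 0, 21],
            [125, 1, 31790, 23],
            [5, 7, 23, 38728]]


def per_class_recall(cm, labels):
    """Fraction of each actual class that the model labels correctly."""
    return {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}


def overall_accuracy(cm):
    """Eq. (16) applied to the whole matrix."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)
```

Under this reading, every class is recovered with a recall above 99.5%, the short circuit fault being the lowest, and the overall accuracy matches the 99.808% of Table 13.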

5.2. Validation results

Although both models give good test results in terms of accuracy and tree size, a validation remains necessary to confirm their effectiveness in terms of accuracy, robustness and tree complexity. Validation tests are performed in which all system states are considered by both models to ensure precise results. For fault detection and diagnosis purposes, a data set is recorded over five days, where each day corresponds to one state. The test considers five main situations, namely: Free Fault Day, String Fault Day, Short Circuit Day, Line-Line Fault Day and Faults Day.

a. Free fault day

Generally, the GCPVS operates most of the time in the normal state and can be subjected to natural shading, which leads to low power production. For this reason, the obtained models should be tested under these climatic conditions. Fig. 19 represents the ambient temperature, the irradiation, the current, the voltage, the power predicted by the Sandia model together with the measured power, and the power ratio during the free fault day (1 July 2017). Fig. 19e shows the curves of the measured and the estimated power. One can see that both curves coincide, which confirms that the system works in a healthy state during this day. As stated in the previous section, the ratio between the two powers is used as an attribute, which clearly has a high correlation with the system state, as shown in Fig. 19e. The results of the detection model test are presented in Tables 14 and 15. The confusion matrix in Table 14 shows that the proposed model generalizes well, which allows the prediction of system states with high accuracy: the accuracy is 99.9886%, with 26,263 of 26,266 samples correctly predicted. In addition, only 0.36 s were sufficient for the model to classify all instances. The same data are used to validate the diagnosis model; Tables 16 and 17 are established from its predictions. According to the results in Tables 16 and 17, the diagnosis model predicted the healthy state day with 100% accuracy, taking only 0.13 s of execution time.

b. String fault day

The string fault is artificially created in the system by disconnecting one string of PV modules. Fig. 20 illustrates the model attributes and electrical variables (ambient temperature, irradiation, current, voltage and power) recorded on the string fault day, together with the derived power ratio attribute. As observed in Fig. 20e, there is a large power loss due to the string fault. Indeed, comparing the estimated and recorded power curves shows that the power is reduced by around 50% when one string is cut out. The predictive models are then used to deduce the class of each instance of this day's data. Tables 18 and 19 detail the prediction results for this day (7 July 2017). The detection results in Tables 18 and 19 show that the model properly classified all system state instances. Thus, the accuracy and prediction time are respectively 100% and 0.16 s, which confirms the effectiveness of the model. The diagnosis model is applied to the same data, and Tables 20 and 21 indicate that the model is consistent for this fault on this day. In addition, the diagnosis model classifies all instances correctly in a short execution time: the accuracy is 100% for 0.36 s of prediction time, as indicated in Table 21.

c. Short circuit day

On the short circuit day, the fault is created in one randomly chosen PV module of the system array. The data are collected during this fault day (29 July 2017), and the meteorological and electrical values are plotted in Fig. 21. As seen in Fig. 21e, the short circuit fault affects the power produced by the system, with a discrepancy between the estimated and measured power. A module from the first string is short-circuited, and consequently the system loses the power of this module. As described in Tables 22 and 23, the confusion matrix for this day shows that the model predicts most instances accurately and quickly: 26,914 of 27,015 instances are correctly predicted in 0.09 s. Consequently, the model accuracy is equal to 99.62%. The DT model of the diagnosis is used to predict the fault class; the model cannot avoid misclassifying some instances, but most instances must be correctly predicted for good robustness. Indeed, Tables 24 and 25 give the classification result, where the accuracy is equal to 99.29%. Moreover, the model keeps the prediction time short, at 0.09 s.

Fig. 22. Curves of (a) the ambient temperature, (b) Estimated cell temperature (c) the irradiation, (d) the current, (e) the voltage, (f) estimated and measured power and (g) the power ratio.

Table 26 Confusion matrix of detection model (Line-line fault day).
Classified as | a | b
a = Healthy state | 0 | 0
b = Faulty state | 0 | 24,121

Table 27 Testing result of the detection model.
Parameters | Value
Time taken to test model | 0.14 s
Correctly Classified Instances (24121) | 100%
Incorrectly Classified Instances (0) | 0%
Total Number of Instances | 24,121

Table 28 Confusion matrix of diagnostic model.
Classified as | a | b | c | d
a = Free fault | 0 | 0 | 0 | 0
b = String fault | 0 | 0 | 0 | 0
c = Short Circuit fault | 0 | 0 | 0 | 0
d = Line-line fault | 0 | 0 | 0 | 24,121

Table 29 Testing result of the diagnostic model.
Parameters | Value
Time taken to test model | 0.09 s
Correctly Classified Instances (24121) | 100%
Incorrectly Classified Instances (0) | 0%
Total Number of Instances | 24,121

d. Line-line fault day

Finally, the line-line fault is the last state studied in the system for model validation. Fig. 22 shows the variation of the model attributes during this day (28 August 2017).


Fig. 23. Curves of (a) the ambient temperature, (b) the estimated cell temperature, (c) the irradiation, (d) the estimated and measured power and (e) the power ratio.

Fig. 22 illustrates the three model inputs (the ambient temperature, the irradiation and the powers) together with the power estimated by the Sandia model. The results obtained from the detection model for this faulty state are given in Tables 26 and 27: all 24,121 instances are correctly classified, so the model accuracy is 100%, and the process takes 0.14 s. In addition, the diagnosis model is tested on the same set, and the resulting confusion matrix and testing result are given in Tables 28 and 29. The confusion matrix shows that the diagnosis model predicts well: all instances are classified as line-line fault, so the resulting accuracy is 100%, which means that the prediction of the system state is completely correct.

e. Faults day


On this day, the four states are applied to the PV system one after another in order to test the two classification models. First, a short circuit of a PV module is created and lasts 2 h 23 min 47 s; then a PV string is disconnected for 1 h 58 min 40 s; the system is then restored to its normal state, where it remains for 2 h 52 min 21 s; finally, a line-line fault is introduced for 1 h 4 min 18 s. The variation of the meteorological and electrical data during this day is shown in Fig. 23, and Fig. 23d shows the gap between the Sandia model output and the measured power, which represents the power loss for each fault. The classification results of the detection and diagnosis models at each sample are shown in Fig. 24, from which it can be seen that both models have a high accuracy. Tables 30 and 31 give the confusion matrix and the testing result obtained for this day. The detection model classifies most samples correctly even though the system was exposed to multiple faults during this day: 29,893 of 29,946 instances are correctly classified in 0.07 s, so the model accuracy is 99.823%. The results of the diagnosis model are given in Tables 32 and 33, which show the effectiveness of the prediction when the system is subject to multiple faults. In detail, the model correctly classifies 10,311 of 10,341 instances, 7093 of 7120 instances, 8611 of 8627 instances and 3813 of 3858 instances for free fault, string fault, short circuit fault and line-line fault, respectively. Consequently, the model accuracy for this day is 99.606%, and the model keeps its prediction time short.
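The durations quoted for this day can be cross-checked against the per-class instance counts. The sketch below converts each duration to seconds. Assuming one instance per second (an inference from the numbers, not a sampling rate stated in the text), the results coincide with the per-class totals of 8627, 7120, 10,341 and 3858 instances and sum to the 29,946 instances tested on this day.

```python
def to_seconds(hours, minutes, seconds):
    """Convert a duration given as h, min, s to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Duration of each state on the faults day, as given in the text (h, min, s).
durations = {
    "short circuit fault": (2, 23, 47),
    "string fault": (1, 58, 40),
    "free fault": (2, 52, 21),
    "line-line fault": (1, 4, 18),
}

SAMPLE_PERIOD_S = 1  # assumption: one recorded instance per second

samples = {
    state: to_seconds(*hms) // SAMPLE_PERIOD_S
    for state, hms in durations.items()
}
total = sum(samples.values())

for state, count in samples.items():
    print(f"{state}: {count} instances")
print(f"total: {total} instances")
```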

Fig. 24. The classification result of (a) the detection model and (b) the diagnosis model.

Table 30 Confusion matrix of detection model.

Classified as        a         b
a = Healthy state    10,318    23
b = Faulty state     30        19,575

Table 31 Testing result of the detection model.

Parameters                                Value
Time taken to test model                  0.07 s
Correctly Classified Instances (29893)    99.823%
Incorrectly Classified Instances (53)     0.177%
Total Number of Instances                 29,946

Table 33 Testing result of the diagnostic model.

Parameters                                Value
Time taken to test model                  0.1 s
Correctly Classified Instances (29828)    99.606%
Incorrectly Classified Instances (118)    0.394%
Total Number of Instances                 29,946

6. Conclusion

In this paper, the decision tree algorithm has been proposed for fault detection and diagnosis in the GCPVS. Two models are constructed with the C4.5 algorithm: the first for fault detection, with two class labels named healthy and faulty, and the second for fault diagnosis, with four classes: free fault, string fault, short circuit fault and line-line fault. First, the basic phases of the C4.5 learning process have been explained. Learning relies on a splitting criterion to choose the best splitting attribute: the attribute with the maximum gain ratio is placed at the splitting point, and this process is applied recursively to the data set until a stopping criterion is met. After modeling, a pruning technique is used to improve the model by deleting unnecessary sub-trees. The resulting tree contains three main types of node: the root node, the internal nodes and the leaves. Branches connect these nodes; the root and internal nodes represent the attributes, the threshold values of these attributes are placed on the branches leaving each node, and the leaves represent the class labels.

The data set has been acquired from the GCPVS while it operates in a healthy or faulty state, with three faults created in the system, namely string fault, short circuit fault and line-line fault. The data set has been divided into two parts: the first part consists of 66% of the global data and is used for learning, and the second part is the remaining data, used for testing. The testing results showed the effectiveness of the classification for both models, with accuracies of 99.86% and 99.80% for the detection and diagnosis models, respectively. The validation confirms the generalization ability of the models through an evaluation in which each state is created on a specific day; the prediction process executes quickly with a high accuracy (greater than 99%) for all days.
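The gain-ratio criterion recalled in the conclusion can be illustrated with a short sketch for a single numeric attribute and a binary threshold split. This is a simplified illustration rather than the authors' implementation: C4.5 itself applies further refinements for continuous attributes and pruning that are omitted here, and the function names and the power-ratio samples are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split: value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # degenerate split carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * log2(w_left) + w_right * log2(w_right))
    return info_gain / split_info


def best_threshold(values, labels):
    """Scan midpoints between consecutive distinct values; return (threshold, gain ratio)."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(((t, gain_ratio(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Hypothetical power-ratio samples and states, for illustration only.
ratios = [0.98, 0.97, 0.99, 0.52, 0.48, 0.75]
states = ["healthy", "healthy", "healthy", "faulty", "faulty", "faulty"]

threshold, score = best_threshold(ratios, states)
print(threshold, score)
```

On this toy data the selected threshold falls between the faulty and healthy ratios and separates the two classes completely, which is the kind of attribute-threshold pair a decision node would hold.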

Table 32 Confusion matrix of diagnostic model.


Classified as               a         b       c       d
a = Free fault              10,311    5       24      1
b = String fault            22        7093    5       0
c = Short Circuit fault     0         0       8611    16
d = Line-line fault         30        0       15      3813
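The summary figures reported for the diagnosis model on the faults day can be recomputed from the counts of Table 32. In the sketch below the matrix is arranged with rows as actual classes and columns as predicted classes, which is the orientation whose row sums match the per-class totals quoted in the text. The per-class recall is an additional metric computed here for illustration and is not reported in the paper.

```python
labels = ["free fault", "string fault", "short circuit fault", "line-line fault"]

# Rows: actual class; columns: predicted class (a, b, c, d), from Table 32.
confusion = [
    [10311, 5, 24, 1],
    [22, 7093, 5, 0],
    [0, 0, 8611, 16],
    [30, 0, 15, 3813],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = 100.0 * correct / total

# Recall per class: correctly classified instances over all instances of that class.
recall = {labels[i]: confusion[i][i] / sum(confusion[i]) for i in range(len(labels))}

print(f"{correct}/{total} correct, accuracy = {accuracy:.3f}%")
for name, value in recall.items():
    print(f"{name}: recall = {value:.4f}")
```

The first printed line reproduces the 29,828 correctly classified instances out of 29,946 and the 99.606% accuracy of Table 33.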
