Model for data codification in hierarchical classifications with application to the biodiversity domain


Journal of King Saud University – Computer and Information Sciences xxx (xxxx) xxx

Contents lists available at ScienceDirect

Journal of King Saud University – Computer and Information Sciences journal homepage: www.sciencedirect.com

Alaoui Altaf a,b,⇑, Dakki Mohamed b, Ziti Soumia c, Ettaki Badia d,a, Zerouaoui Jamal a

a Faculty of Sciences, Ibn Tofail University in Kenitra, SIMO: Laboratory of Engineering Sciences and Modeling, B.P. 133, Kenitra 14000, Morocco
b Scientific Institute, Univ. Mohammed V, Wetland Research Unit, Av. Ibn Battota, Rabat 10106, Morocco
c Faculty of Sciences Rabat, Intelligent Processing And Security of Systems, 4 Avenue Ibn Batouta, B.P. 1014, Rabat, Morocco
d Lyrica: Laboratory of Research in Computer Science, Data Sciences and Knowledge Engineering, Department of Data, Content and Knowledge Engineering School of Information Sciences Rabat, Morocco

Article info

Article history: Received 22 May 2019; Revised 3 July 2019; Accepted 22 August 2019; Available online xxxx

Keywords: Codification; Classification; Biodiversity; Models

Abstract

The design and handling of complex classifications or typologies (as hierarchical tables), mainly of components of nature, have long been a major concern for researchers. Several codification methods have therefore been developed to facilitate the management of such tables. Most of these methods use alphanumeric codes that remain specific to each case (type of entity), in the sense that their transposition to other cases is rarely feasible. This article proposes a new standardized codification method, applicable to various hierarchical schemas. Implemented in the development of a complex biodiversity data management system, this method consists in creating for each table a fixed number of levels, coded using the same type of strings, based on 'numeric characters'. It makes it possible to manage different tables with the same module and facilitates the automatic incrementation and updating of the code (= position). The new model makes it possible to generate, update or delete a code without disrupting the codes of the other elements, and it improves the response time of queries. In some cases (such as geo-databases), it also offers the prospect of merging different tables whose merger was not previously obvious.

© 2019 Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

⇑ Corresponding author at: Faculty of Sciences, Ibn Tofail University in Kenitra, SIMO: Laboratory of Engineering Sciences and Modeling, B.P. 133, Kenitra 14000, Morocco. E-mail address: [email protected] (A. Altaf). Peer review under responsibility of King Saud University.

1. Introduction

All existing elements and phenomena in the world can be organized in well-defined entities or data series, arranged so that each element can be identified by descriptors that take a specific combination of values for that element. In most of these data series, the elements are classified in a meaningful hierarchical model (classification or typology), where the elements are linked by a 'parent to children' relation. This hierarchical relationship can lead to complex entities, given the number of classification criteria and levels.

Commonly used in computer science, telecommunication and trade, classifications are also used in the management of data related to nature and people, where a large number of entities are naturally organized according to a hierarchical model. Good illustrative examples can be found among several components: human (genealogical trees and activities) (Valentini et al., 2000), biological (flora, fauna, microorganisms) (Li and Chen, 2010), ecological (ecosystems, habitats) (Davies and Moss, 2002; Davies et al., 2017; Gavish et al., 2018; Blasi et al., 2017), geographical (regions, sub-regions), hydrological (water systems and hierarchical watersheds) (Wang and Wang, 2013; Farinha, 0000; Verdin and Verdin, 1999; Jia and Wang, 2006; Hörhan, 2009), administrative (states, provinces, collectivities), etc.

The computerized management of these hierarchical data requires codification, a technical operation that consists in assigning an invariable identifier to each element of a data series (Wang and Wang, 2013; Lemos et al., 2018). This codification still encounters certain problems, mainly regarding the ability of the codes to reflect both

https://doi.org/10.1016/j.jksuci.2019.08.010
1319-1578/© 2019 Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article as: A. Altaf, D. Mohamed, Z. Soumia et al., Model for data codification in hierarchical classifications with application to the biodiversity domain, Journal of King Saud University – Computer and Information Sciences, https://doi.org/10.1016/j.jksuci.2019.08.010


A. Altaf et al. / Journal of King Saud University – Computer and Information Sciences xxx (xxxx) xxx

one single direction (upstream–downstream). One of the most widely used codification approaches, known as the Pfafstetter model (Verdin and Verdin, 1999; Jia and Wang, 2006; Farinha et al., 1996), consists in assigning to each branch a digit between one and nine, with zero reserved for the last (downstream) section of the network. For the subdivision of this network, each section admitting a bifurcation in its upper part is assigned an odd digit between one and nine, while the arteries that cannot be subdivided into tributaries are assigned even digits between two and eight. When fewer than nine divisions are used to code sub-basins, this method generates incomplete codes (Hörhan, 2009). To remedy this shortcoming, an alternative approach was adopted by Li et al. (Hörhan, 2009); it creates at each confluence (node) of two (exceptionally three) watercourses two or three son elements (upstream watercourses) from a parent element (downstream watercourse). It assigns both son elements the same hierarchical level, even if they are of very different sizes (flow, width). In this approach, the insertion or deletion of an element changes the codes of all the underlying elements, which must then be updated in all the documents and tables using them.

The classification of the living world is another area where codification is needed to manage taxonomic entities in databases (see Fig. 1). The conventional classification of these entities is based on a complex system comprising at least seven obligatory levels (kingdom, phylum, class, order, family, genus and species) and several intermediate levels (such as superorder and suborder). Most databases give numerical codes to taxa, but without specifying how these codes are constructed. For example, to manage its statistical data on living marine resources, FAO uses a 10-digit taxonomic code to identify taxa (Li and Chen, 2010). Wetlands International uses a 5-character identifier for water birds, formed by concatenating the first three letters of the genus name and the first two letters of the species name. This concatenation sometimes produces the same code for two different species, in which case the third letter of the species name is used instead of the second. This approach does not convey any hierarchy, and the scientific names of the species are subject to change.

Hierarchical classification is also applied to human activities and their impacts on nature. In the Natura 2000 program, the activities are hierarchically classified in three levels only (Valentini et al., 2000) and each level is assigned a three-digit code. Two-digit codes between '00' and '90' are attributed to the first level and three-digit codes between '010' and '990' are attributed to the second and third levels. The last-level elements are coded using an incremental step of '1', while this step becomes '10' for the first two levels (see Fig. 2). Two additional codes are reserved for the particular cases of 'other activity' (X90) and 'negligible activity' (XX). Impacts are classified in three levels; one letter is assigned to the first level, while the second and the third levels are coded with one letter concatenated to the parent code. For the third level, a dash (-) is added between its code and that of the second level (e.g. 'FC-A').
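The genus/species concatenation rule described above for waterbird identifiers can be sketched as follows. This is a minimal illustration, not the actual Wetlands International procedure: the function name `species_code`, the `taken` set and the two input names are assumptions chosen only to show how a collision arises and how the "third letter" fallback resolves it.

```python
def species_code(genus, species, taken):
    """Build a 5-character identifier: 3 genus letters + 2 species letters.

    `taken` is the set of identifiers already in use (hypothetical helper
    state, not part of the original description).
    """
    base = genus[:3].upper()
    code = base + species[:2].upper()
    if code in taken:
        # Collision: use the third letter of the species name
        # instead of the second one.
        code = base + (species[0] + species[2]).upper()
    if code in taken:
        raise ValueError("no free identifier for " + genus + " " + species)
    taken.add(code)
    return code


taken = set()
first = species_code("Anas", "platyrhynchos", taken)
second = species_code("Anas", "platalea", taken)  # collides with the first
print(first, second)  # ANAPL ANAPA
```

Neither identifier says anything about the order or family of the bird, which is the lack of hierarchy noted in the text.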

uniqueness and hierarchical position of all the elements. The approaches are far from being standardized, in the sense that several of them are often used in the same application.

This paper describes a codification model that can be generalized to any type of hierarchical entity, where each element is identifiable by a code that integrates the entire hierarchical path of the element as well as its exact position in relation to the other elements of its level. In addition to its simplicity, this approach was conceived with the objective of increasing the efficiency and reliability of data manipulation and optimizing storage space. Before presenting and implementing this model, a brief evaluation of the literature justifies the need for such a model for the codification of hierarchical data.

2. Current hierarchical data codifications: literature overview

2.1. Existing methods through examples

Several works on hierarchical data codification use alphanumeric codes. In habitat classifications, such as EUNIS (Valentini et al., 2000; Davies and Moss, 2002; Davies et al., 2017; Galparsoro et al., 2012; Greene et al., 1999; Commission, 2005; Evans, 2006; Louvel and Poncet, 2013; Borre et al., 2000) and CORINE Biotopes (Wyatt, 0000; Linden, 2001; Devillers and Corine, 1991; European Commission, 0000), data are organized according to a multi-level hierarchical system (see Table 1). The progression in the typology begins from the highest level, which represents the large natural landscapes, coded using a number in CORINE Biotopes and a letter in EUNIS. Each subordinate level is assigned a specific numerical code, preceded by the parent-level code. In addition, the code of the third level is separated from that of the second by a dot.

The Corine Land Cover typology (European Environment Agency, 0000) of land uses comprises only three levels, each of them containing a maximum of five categories/types. Each element is identified by a simple numerical code, obtained by concatenation of its own code and that of its parent element.

The MedWet model for the classification of Mediterranean wetland habitats (Farinha, 0000; Farinha et al., 1996), closely modeled on that of the US wetlands and deep-water habitats (Farinha, 0000; Evans, 2006), is made of four hierarchical levels (system, subsystem, class and subclass). Given that each of these levels has a maximum of seven 'son elements', the codification uses one letter per level, which gives a four-letter base code. These hierarchical criteria are completed by four independent parameters (water regime, salinity regime, artificiality and dominance), each allowing for a maximum of 14 cases. Each parameter is codified by a letter that holds a fixed position in the code (according to the order given above), which gives a complete code of eight letters for each habitat type.

In the European classification NATURA 2000 (Valentini et al., 2000), sites are identified by an alphanumeric code of up to nine characters; the first two characters refer to the country, while the site code can be assigned freely according to a logical and coherent system defined by national authorities. For example, the Messancy stream in Belgium bears the code 'BE34062'; in this example, the hierarchy is visible only at the level of the country.

Hydrographical networks give a good pedagogical model of hierarchical entities, where watercourses flow one into the other, in

2.2. Evaluation of existing methods

The examples presented above clearly show the crucial importance of codification in the organization and manipulation of data tied by hierarchical relationships. This codification is performed using a variety of methods, almost all of them seeking to integrate

Table 1. Habitat coding systems according to EUNIS and CORINE Biotopes.

Level              1    2    3      4       5        6
EUNIS              A    A5   A5.3   A5.35   A5.354   A5.3541
CORINE Biotopes    1    15   15.1   12.11   12.111   12.1111


Fig. 1. Diagram of organization of living entities (case of Fauna Kingdom): typical example of hierarchical classification, with stable levels.

Fig. 2. Codification of Natura 2000 for human activities.

this hierarchy. However, three main remarks can be made about these methods:

- the code structure and length are specific to each entity processed, and their transposition to other entities can often lead to saturation, in the sense that no sublevels can be created and/or no more records can be added to a given level;
- in order to address the data hierarchy (e.g. spatial data), the existing models often break down these data into as many tables as hierarchical levels, with son entities linked to parent entities by a key field. The first flaw of this approach is that it increases the number of tables in the database, which often complicates hierarchy-related processing, given that access to an element of a table requires passing through the parent tables. Another classical approach used in computing is to resort instead to the reflexive relationship (parent-son), without having to multiply tables. This is the case, for example, of the FAO codification of living marine resources (Li and Chen, 2010). In this case, the notion of hierarchical level loses meaning, given the impossibility of putting additional elements in one same level, as required by certain classifications (living entities, habitats, etc.);
- the method proposed by Li et al. (Hörhan, 2009) for the binary codification of hydrographic networks seems to remedy this shortcoming, since it identifies hierarchical levels. However, with this approach, updating (deleting, inserting or modifying) an element requires reshuffling the codes of its sibling elements as well as of all their son elements, which can cause inconvenience, especially if these codes are used in external reference reports.

3. A new model for the codification of hierarchical data

The need to standardize the codification methods emerged from long conceptual work on a 'Waterbirds and Wetlands of Morocco' database, used to manage data on the winter monitoring of waterbirds as well as the national wetlands inventory. This database contained a large number of tables whose elements are organized in hierarchical schemes (classifications or typologies). During development, these tables were often subjected to similar processing, but with a specific module for each table. This approach produces redundant, near-identical modules and needlessly lengthens the application.

To address the problems discussed in the evaluation section above, the challenge is to come up with a single processing module that can be applied to manage all hierarchical tables. This is possible only if the codes used in all such tables are built with the same method, which makes it possible to:

- identify an element by its level in the hierarchy and by its position among elements of the same level;
- support large numbers of levels and of elements within these levels;
- assign to each element one single identifier throughout the life of the database;
- process all the hierarchical levels in one same table, i.e. without having to break down the levels into separate entities (such as in GISs), which has the benefit of reducing the storage space, simplifying the formulation of queries and optimizing the response time in a relational database.

The new method proposes a codification based on a vertical decomposition of the table into a fixed number of levels, generally fewer than 10; in each level, all elements are of similar nature (homogeneous) and their number is not bounded in advance. The codification scheme must also incorporate the whole hierarchy, which is why the code of each element of a given level incorporates its parent code, via a simple concatenation. To allow this, the code is in a 'string' format but composed only of digits, which makes it possible to calculate and automatically increment these codes. This format also makes it possible to reserve a fixed length for the code fields.

The son elements of one same parent are codified using a field long enough to cover the maximum number of sons (records) that this parent is likely to have. For example (see Table 2), knowing that in the animal kingdom no taxon of the 'order' level can contain 100 families worldwide, we can retain for the 'families' level a code of '0000' format, whose length (4 characters) allows 999 cases to be inserted by automatic incrementation using a step of 10 (0010, 0020, ..., 9990). Assuming that the


Table 2. New codification system of hierarchical tables: first 11 elements of a 5-level birds classification.

SP_ID  SP_OR  SP_FA  SP_GE  SP_SP  SP_SU  SP_Lev  SP_CODTAX             SP_Name
0      0010   0000   0000   0000   0000   1       00100000000000000000  Tinamiformes
1      0010   0010   0000   0000   0000   2       00100010000000000000  Tinamidae
2      0010   0010   0010   0000   0000   3       00100010001000000000  Tinamus
3      0010   0010   0010   0010   0000   4       00100010001000100000  Tao
4      0010   0010   0010   0010   0010   5       00100010001000100010  Larensis
5      0010   0010   0010   0010   0020   5       00100010001000100020  Tao
6      0010   0010   0010   0010   0030   5       00100010001000100030  Kleei
7      0010   0010   0010   0020   0000   4       00100010001000200000  Solitarius
8      0010   0010   0010   0020   0010   5       00100010001000200010  pernambucensis
9      0010   0010   0010   0020   0020   5       00100010001000200020  Solitarius

Note: SP_ID: identifier of the bird; SP_OR: code of the first level (Order); SP_FA: code of the second level (Family); SP_GE: code of the third level (Genus); SP_SP: code of the fourth level (Species); SP_SU: code of the fifth level (Subspecies); SP_Lev: level of the element in the bird classification; SP_CODTAX: global code of the bird; SP_Name: name of the bird.

maximum size of this code was well estimated at the time of the database design, there is sometimes still a need to anticipate the risk, however small, that the code will have to be extended during system operation.

When there is no need to organize these elements in a given order, the new code can be generated simply by incrementing the last code created in the table. Otherwise, the new element has a defined position among existing sibling elements; in this case, sibling codes should be separated by an incremental step, which leaves room for the codes of new intermediate elements. Each new element should be inserted halfway between two successive sibling elements (see Table 3) and its code is then the integer part E of the mean of the two siblings' codes (Eq. (1)):

\[ \mathrm{Code}[i] = E\left( \frac{\mathrm{Code}[i-1] + \mathrm{Code}[i+1]}{2} \right) \qquad (1) \]

This insertion can lead to four different scenarios:

- Scenario 1: Code(i+1) - Code(i-1) > 1; apply Eq. (1);
- Scenario 2: Code(i+1) - Code(i-1) = 1 (saturation of codes); reinitialize the codes of the sibling elements in the current parent: for i = 1, Code(i) = value of the incremental step; for i = 2 to N, Code(i) = Code(i-1) + step;
- Scenario 3: the new element (i) is the first son of the current parent; Code(i) = (0 + Code(1))/2;
- Scenario 4: the new element (i) is the last son of the current parent; Code(i) = Code(i-1) + step.

The value of the incremental step depends on the number of new elements expected to be inserted between two successive existing elements; for example, a step of 10 makes it possible to insert nine new elements between two successive elements. Some reference tables (e.g. official classifications) are integrated into a database with definitive content, in the sense that they do not allow any new record. However, many other typologies are open to modifications and insertions, and an incremental step is necessary. As the code is generally not visible, this codification method allows codes to be reinitialized in case of saturation (impossibility of inserting new codes).

4. Algorithms of implementation of the new codification model

Before tackling these aspects, it should be recalled that:

- the codification system suggested in the present paper concerns hierarchical tables, in the sense that their records are organized in a hierarchical scheme; however, it remains applicable to tables with one level;
- the code designed with this new method is of high importance in organizing (sorting) and searching records, although these tasks can also be performed using other fields in the table;
- the code is generally not visible and, if needed, an additional visible code can be added to the table, using a specific design method;
- most of the existing hierarchical tables are used to organize reference lists (typologies), where only one field is generally displayed when manipulating a list. However, in addition to this element and to the codification fields, other fields could be useful, mainly to give a 'short meaning' and possibly a description of the displayed element;
- when a typology should contain fields other than these, it is preferable to create a specific table for them, with the code as the key field for the link.

These remarks mean that a hierarchical table can be designed in a standard format, where each record is characterized by the following fields: Code1 (level 1), Code2 (level 2) ... CodeN (level N), Code (concatenation of Code1 to CodeN), Title/Label, Short meaning, Visible Code, and Description. The last two fields are generally very useful, even though they are often considered as accessory information, while the field 'Short meaning' could be used as a tooltip ('info-bulle').
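The standard record format can be illustrated with a short sketch that builds the global code from the level codes and derives the level of an element. It assumes five levels of four digits each, as in Table 2; the function names `global_code` and `level_of` are illustrative and do not come from the module described in the paper.

```python
LEVEL_LENGTHS = [4, 4, 4, 4, 4]  # Table 2: five levels, four digits each


def global_code(level_codes, lengths=LEVEL_LENGTHS):
    """Concatenate Code1..CodeN into the global code (SP_CODTAX in Table 2)."""
    if len(level_codes) != len(lengths):
        raise ValueError("one code per level is required")
    for code, length in zip(level_codes, lengths):
        if len(code) != length or not code.isdigit():
            raise ValueError("bad level code: " + repr(code))
    return "".join(level_codes)


def level_of(level_codes):
    """Level of an element = number of leading non-zero level codes."""
    level = 0
    for code in level_codes:
        if int(code) == 0:
            break
        level += 1
    return level


# Row 8 of Table 2 (pernambucensis): level 5
row = ["0010", "0010", "0010", "0020", "0010"]
print(global_code(row), level_of(row))
# Row 1 of Table 2 (Tinamidae): level 2
print(level_of(["0010", "0010", "0000", "0000", "0000"]))
```

Because every level has a fixed width and contains only digits, sorting the global codes as plain strings lists each parent immediately before its descendants, which is consistent with the row order of Table 2.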

Table 3. Inserting a new element halfway between two sibling elements.

SP_ID  SP_OR  SP_FA  SP_GE  SP_SP  SP_SU  SP_Lev  SP_CODTAX             SP_Name
3      0010   0010   0010   0010   0000   4       00100010001000100000  tao
4      0010   0010   0010   0010   0010   5       00100010001000100010  larensis
10     0010   0010   0010   0010   0015   5       00100010001000100015  septentrionalis
5      0010   0010   0010   0010   0020   5       00100010001000100020  Tao
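The insertion shown in Table 3 follows Eq. (1) and the four scenarios of Section 3, which can be sketched as below. This is only one possible reading: the names `new_sibling_code` and `reinitialize` are hypothetical, the treatment of an empty parent is an assumption, and saturation is signalled by returning `None` so that the caller can reinitialize the sibling codes (whose descendants would then also need their prefixes updated).

```python
STEP = 10    # incremental step between sibling codes (value used in Table 3)
WIDTH = 4    # number of digits reserved for one level code


def new_sibling_code(prev_code, next_code, step=STEP, width=WIDTH):
    """Return the code of an element inserted between two siblings.

    prev_code / next_code are the level codes of the neighbours, or None
    when the new element becomes the first / last son of its parent.
    Returns None when no free code is left (saturation, Scenario 2).
    """
    if next_code is None:
        # Scenario 4 (last son); an empty parent is treated the same way,
        # starting from 0, which is an assumption of this sketch.
        value = (int(prev_code) if prev_code is not None else 0) + step
    else:
        low = int(prev_code) if prev_code is not None else 0  # Scenario 3
        high = int(next_code)
        if high - low <= 1:
            return None  # Scenario 2: the caller must reinitialize
        value = (low + high) // 2  # Scenario 1: integer part of the mean
    if value >= 10 ** width:
        raise ValueError("code does not fit in the reserved field length")
    return str(value).zfill(width)


def reinitialize(count, step=STEP, width=WIDTH):
    """Regenerate evenly spaced codes for `count` siblings (Scenario 2)."""
    return [str(step * (k + 1)).zfill(width) for k in range(count)]


print(new_sibling_code("0010", "0020"))  # 0015, as in Table 3
print(new_sibling_code(None, "0010"))    # 0005
print(new_sibling_code("0020", None))    # 0030
print(new_sibling_code("0015", "0016"))  # None (saturation)
print(reinitialize(3))                   # ['0010', '0020', '0030']
```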


In the following section, a standardized processing module for hierarchical tables is presented; it was developed on the '.Net' platform, using UML for modelling and MySQL for implementation. To illustrate these approaches, an experimental dataset is used; it was extracted from the aforementioned database (Waterbirds and Wetlands of Morocco).

4.1. Structuring hierarchical tables

In practice, one same database can contain multiple typological tables, each generally containing a number of hierarchical levels, and each level allowing for a maximum number of records. The developer usually designs one management interface for each of these tables, which leads to considerable redundancy in the development. With the new codification system, it is possible to design a standardized management module for all these typologies, while still accommodating their specificities.

To display each table (entity) in an updating form, the following information is necessary: the name of the entity (e.g. fauna, flora, watersheds, etc.) and the following indications on each hierarchy level: appellation, position in the hierarchy, code length, code incrementation step and description (general meaning). These features are stored in a modifiable metadata table, Level_Typ (see Table 4 and Fig. 3). To process a hierarchical table, the standardized module uses the metadata for both labels and hierarchy management. These labels are modifiable without any need to manipulate the source code, but the number and hierarchical order of the levels are considered invariable parameters for the time being.

4.2. Creating a code for a new element in a typology

In a typology, the insertion of a new element (record) is done within a given hierarchical level, and starts with the creation of its code. The procedure contains four steps (see Fig. 4 and Fig. 5): select the level in the entity, identify the existing element used as reference (after or before) for the position of the new element, calculate the code (Eq. (1)) and insert the new element.

4.3. Deleting the code when a record is removed

The deletion of an element (record) in a typology causes the automatic deletion of all the underlying records (son elements) integrating its code, without affecting the sibling elements. This property is a strong point of this codification method, given that in some tree design approaches (binary or multiple), this deletion may require re-coding all elements affiliated with the deleted element.

4.4. Changing the position of an element in a typology

To move an element within one same level, the procedure starts by determining its destination position and generating the new code as shown above (see Figs. 6 and 7). This newly assigned code (Fig. 7) replaces the older one (Fig. 6), including in the codes of its son elements. In the following example (see Figs. 6 and 7), the moved element is in level 4, from current parent in level 3

Table 4. Example of typology tables.

Level_Typ_Id  Level_Typ_NameTab  Level_Typ_Label  Level_Typ_Num  Level_Typ_Length  Level_Typ_Desc
1             Fauna              Kingdom          1              3                 Description
2             Fauna              Phylum           2              3                 Description
3             Fauna              Class            3              4                 Description
4             Fauna              Order            4              4                 Description
5             Fauna              Family           5              4                 Description
6             Fauna              Genus            6              4                 Description
7             Fauna              Species          7              4                 Description
8             Flora              Kingdom          1              3                 Description
9             Flora              Phylum           2              3                 Description
10            Flora              Class            3              4                 Description
11            Flora              Order            4              4                 Description
12            Flora              Family           5              4                 Description
13            Flora              Genus            6              4                 Description
14            Flora              Species          7              4                 Description

Fig. 3. Modelling of typology tables.
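As a rough illustration of how a single module could rely on the Level_Typ metadata, the sketch below splits a global code into labelled level codes. It assumes that the Fauna levels have code lengths 3, 3, 4, 4, 4, 4, 4 (as the codes of the worked example in Section 4.4 suggest); the dictionary keys `table`, `label`, `num` and `length` are shortened, hypothetical stand-ins for the Level_Typ fields, and `split_by_metadata` is not a function of the original module.

```python
# Hypothetical in-memory copy of the Fauna rows of the metadata table.
LEVEL_TYP = [
    {"table": "Fauna", "label": "Kingdom", "num": 1, "length": 3},
    {"table": "Fauna", "label": "Phylum", "num": 2, "length": 3},
    {"table": "Fauna", "label": "Class", "num": 3, "length": 4},
    {"table": "Fauna", "label": "Order", "num": 4, "length": 4},
    {"table": "Fauna", "label": "Family", "num": 5, "length": 4},
    {"table": "Fauna", "label": "Genus", "num": 6, "length": 4},
    {"table": "Fauna", "label": "Species", "num": 7, "length": 4},
]


def split_by_metadata(code, table, metadata=LEVEL_TYP):
    """Split a global code into {level label: level code} using metadata."""
    levels = sorted(
        (row for row in metadata if row["table"] == table),
        key=lambda row: row["num"],
    )
    if len(code) != sum(row["length"] for row in levels):
        raise ValueError("code length does not match the metadata")
    parts, pos = {}, 0
    for row in levels:
        parts[row["label"]] = code[pos:pos + row["length"]]
        pos += row["length"]
    return parts


example = "010" + "010" + "0010" + "0020" + "0000" + "0000" + "0000"
parts = split_by_metadata(example, "Fauna")
print(parts["Kingdom"], parts["Order"], parts["Species"])  # 010 0020 0000
```

Since labels and lengths are read from data rather than hard-coded, the same function would serve any other typology once its rows are added to the metadata.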


Fig. 4. Procedure for creating hierarchical code when inserting an element.

Fig. 5. Procedure for calculating a new hierarchical code.


Fig. 6. First location of an element before changing its position.

Fig. 7. New location of an element after changing its position.

(010 010 0010 0000 0000 0000) to the new parent in the same level 3 (010 030 0010 0000 0000 0000) in the second position after its new brother (010 030 0010 0010 0000 0000).

Before the move:
current parent (Level 3)               010 010 0010 0000 0000 0000
current code (Level 4)                 010 010 0010 0020 0000 0000
current codes of the sons (Level 5)    010 010 0010 0020 0010 0000
                                       010 010 0010 0020 0020 0000

After the move:
new parent (Level 3)                          010 030 0010 0000 0000 0000
new place (after its first brother) (Level 4) 010 030 0010 0010 0000 0000
new code (Level 4)                            010 030 0010 0015 0000 0000
new codes of the sons (Level 5)               010 030 0010 0015 0010 0000
                                              010 030 0010 0015 0020 0000
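The deletion (Section 4.3) and move (Section 4.4) operations both reduce to prefix handling on the concatenated code. The sketch below reproduces the example above under the assumption that codes are stored without spaces and that the six displayed groups have lengths 3, 3, 4, 4, 4, 4; the helper names (`pack`, `prefix_of`, `delete_subtree`, `move_subtree`) are illustrative, not taken from the original module.

```python
LENGTHS = [3, 3, 4, 4, 4, 4]  # group lengths of the codes in the example


def pack(spaced):
    """Remove the spaces used for display: '010 010 ...' -> '010010...'."""
    return spaced.replace(" ", "")


def prefix_of(code, level, lengths=LENGTHS):
    """Part of the code that identifies an element of the given level."""
    return code[:sum(lengths[:level])]


def delete_subtree(codes, code, level):
    """Drop an element and every descendant sharing its prefix."""
    prefix = prefix_of(code, level)
    return [c for c in codes if not c.startswith(prefix)]


def move_subtree(codes, old_code, new_code, level):
    """Give an element its new code and propagate it to its descendants."""
    old_p, new_p = prefix_of(old_code, level), prefix_of(new_code, level)
    return [new_p + c[len(old_p):] if c.startswith(old_p) else c for c in codes]


parent = pack("010 010 0010 0000 0000 0000")
element = pack("010 010 0010 0020 0000 0000")
sons = [pack("010 010 0010 0020 0010 0000"), pack("010 010 0010 0020 0020 0000")]
codes = [parent, element] + sons

moved = move_subtree(codes, element, pack("010 030 0010 0015 0000 0000"), 4)
print(moved[1:])                          # element and sons under the new parent
print(delete_subtree(codes, element, 4))  # only the former parent remains
```

Fixed-width level codes are what make the prefix test safe: a sibling coded 0021 can never be mistaken for a descendant of 0020.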

5. Discussion

The model proposed in this paper for the codification of hierarchical classifications was built upon existing classifications used in the field of biodiversity. Some of these are still incomplete, in the sense that they are expected to admit new elements (e.g. undiscovered species or habitat types). In addition, in such classifications an element can be moved from one level and position to another (e.g. species combinations in the flora or fauna classifications). The computerized management of these modifications is not an easy task, especially with large relational databases.

Various methods address the codification of the elements of hierarchical classification tables, but they all show weaknesses when it comes to generalizing them to all cases. In other words, the majority of these methods are applicable only to the specific case for which they were designed. The proposed codification solution clearly conveys the filiation of all components of a given entity, and has the advantage of making it possible to generate or delete a code without disrupting the codes of the other elements. Indeed, the most widely used methods resort to alphanumeric coding of reduced length, which prevents their generalization. However, to materialize the filiation of elements, most methods propose the concatenation of the parent code with that of the son. The codification model proposed here offers a solution that can be generalized to all classical typologies, as confirmed by its


application to a broad range of typologies (more than 100 cases in the same system). However, some aspects could be considered weaknesses of this method. For example, after several insertions of new elements between two successive elements, all positions offered by the incrementation 'step' may be consumed. This situation was solved by reinitializing the codes in the level concerned, using the incrementation 'step'; with this solution, the only risk is that it may generate incompatibility in the case of data exchange with sub-bases using the same codes. Moreover, by multiplying the number of levels and the number of elements in the same level, it is possible to end up with tables of several million records, which is likely to slow down query response time, although the proposed codification offers ways to reduce this time (e.g. via filters that restrict work to a part of the table). However, certain large tables may contain data that are not manipulated with the same frequency or with the same tools (such as a table including flora and fauna), and thus the modules end up processing only part of the table, with the other part becoming cumbersome.

6. Conclusion

This article proposed a codification method applicable to all kinds of hierarchical tables. Its implementation in a large database on ‘wetlands and waterbirds’ containing numerous hierarchical tables made it possible to verify that it introduces great flexibility in both the design and the management of complex databases. However, the proposed model has not yet shown all its advantages, which is why we think that its use opens up a promising path of research. The first implementation of the codification system has already permitted the standardization of some data management modules and a significant optimization of the use of complex databases. Other promising lines of research remain at their starting point: automating the creation of hierarchical tables and the modification of their structure, increasing efficiency in the processing of large data tables, using the model in the processing of geospatial data, etc. Some weaknesses of the model also have to be addressed. First of all, the method can handle only typologies (tables) in which each element has a single ‘parent’, whereas in a few hierarchical typologies (of both living and inert objects) elements may have multiple ancestries. This seems to admit an easy solution, on which future work could focus.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Please cite this article as: A. Altaf, D. Mohamed, Z. Soumia et al., Model for data codification in hierarchical classifications with application to the biodiversity domain, Journal of King Saud University – Computer and Information Sciences, https://doi.org/10.1016/j.jksuci.2019.08.010