File Formats - Characterization and Validation


17th IFAC Conference on Technology, Culture and International Stability
October 26-28, 2016. Durrës, Albania
Available online at www.sciencedirect.com

ScienceDirect

IFAC-PapersOnLine 49-29 (2016) 253–258

Lavdërim Shala*; Ahmet Shala**

*University of Freiburg / Technical Faculty-Computer Science, Freiburg, DE 79110 Germany, e-mail: [email protected]
**University of Prishtina / Faculty of Mechanical Engineering, Prishtina, XXK 10000 Kosovo, e-mail: [email protected]

Abstract: Nowadays, most information is stored digitally. From a high-level view, digital information is just an array of bits. Figuring out its real meaning requires special software that interprets it. If, through the evolution of technology, this software can no longer be executed, there is a risk that the data it interprets also becomes useless. The goal of Digital Preservation is to prevent this. Data is commonly stored in files, and each file has a specific format or structure; by knowing it, the user can figure out the real meaning of the raw data stored in the file as an array of bits. Digital Preservation considers a valid file format a prerequisite for a file to be in usable form, where valid means that a file is structured in conformance with its declared file format. In this paper we throw a spotlight on the accuracy and capability of these file validation tests. We present some open source software tools that are able to automatically identify and verify the file format. We focus on the file types they can identify and on how they work on large scale data sets.

Keywords: File Format, Digital Preservation, JHove, Droid, Exiftool.

© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

1. INTRODUCTION

The evolution of computer systems over the last two decades, and especially their wide use all over the world by people of every profile, has changed the way people store information. At the time of writing (2016), people prefer to store their information digitally on their electronic devices [1]. In electronic storage, information is saved in objects called files, where each file consists of an array of bits (0s and 1s). This holds regardless of what the information means: whether it is text, a photo, audio or something else, it is stored digitally as an array of bits. Computer software makes these arrays of bits meaningful by converting them to their real interpretation and vice versa. Without such software the information is meaningless; it is just an array of bits.

One important characteristic of the world of computers is its continuous evolution and the wide range of computer software and hardware. Currently, a large number of operating systems and programs for manipulating different kinds of information (e.g. text) are available to everybody. This number is increasing rapidly and, moreover, existing software is regularly updated. Hardware implementations and architectures are also changing in pursuit of better performance. In summary, the variety and evolution of computer systems lead to multiple ways of storing and interpreting digital data. This leads to cases like the one mentioned above, where the user still has the information but is unable to figure out its real meaning. This can happen because the way the information is structured is not supported by the software and hardware the user currently has, which may be an updated successor of the one used to create the data or a completely new one. This is where Digital Preservation comes into consideration. Its main goal is to keep digital information usable over time; it therefore tries to make the information neutral with respect to continuous technological evolution. Digital Preservation seeks methods, strategies, and activities that help achieve this goal [1].

In order to view the content of a piece of digital information (a file), the user should know which software to use to open the file. In turn, in order to show the user the real meaning of the file, the software should know how the array of bits inside the file is structured and where in the array the specific information it needs is located. The concept of a file format addresses both issues.

A file format defines the structure in which information is ordered in the array of bits stored in digital storage (disk). Normally, operating systems declare the file format as an extension at the end of the file name, preceded by a dot. Identifying a file format therefore seems to be an easy job. It is not as easy as it looks, however, since the extension can be modified at any time by the user. From the software's point of view, it is therefore important not only to figure out which format the file has by looking at the extension, but also that the file itself represents a valid structure of the format it is thought to be; if it does not, the file is considered not useful [2].

Because the goal of Digital Preservation is to keep files in a useful form, it also addresses the problem mentioned above by providing mechanisms which identify and verify the format of a file. Once a file is verified by such a mechanism, it is guaranteed to be structured in such a way that software dedicated to its format can open it. In this paper we shed light on how file formats are treated by Digital Preservation.

This paper is organized as follows. In the second chapter we briefly explain the analysis that Digital Preservation

2405-8963 © 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2016.11.062


conducts on file formats in order to prove that a file is usable. Next, in the third chapter we introduce file format registries, which are public databases where file format specifications are stored. In the fourth chapter we present a brief description of three tools, JHOVE, DROID, and Exiftool, which use the information in the databases explained in the previous chapter to identify, retrieve characteristics of, and validate the file format of files whose format specification is available in these databases. In chapters five and six we present our experiment, run against a data set of 927 files using the tools mentioned above. Chapter five presents, for each tool, the individual results of the analysis on a large scale data set. In the sixth chapter we try to merge the results of the individual tools from the previous chapter, and we throw a spotlight on the conflicting results from different tools. In this chapter we also present a framework proposed by K., B. [3] for merging results from different tools, including the tools presented in chapter four; this framework tries to resolve conflicts as well as it can. Finally, in the last chapter we briefly discuss the current situation in file format identification, characterization, and verification, and give suggestions on what is important for the future.

2. DIGITAL PRESERVATION APPROACH TOWARDS FILE FORMAT

All the work done by digital preservation to ensure that a file is useful in terms of its format can be grouped into three processes [1]:

* Identification - The process of identifying what format the given file is likely to be.
* Characterization - The process of extracting specific information from the file.
* Verification - The process of verifying whether the file matches the structure given by the specification.

2.1 Identification

Identification, as described above, has to do with determining what format a file is likely to have. This can be of interest to a human or to software, and there are two approaches to achieve it. The first one relies on the fact that when a file is stored in a file system, its name also records the format as an abbreviation called the extension. More precisely, the extension is the text after the last dot ('.') [4]. A file named foo.txt in the file system is therefore declared to be a text file. The problem with this approach is that extensions are not verified by the system by default. As a result, a user can store a file on disk and nothing stops them from declaring any file format for it. An extension declared in the file system therefore does not guarantee that the content of the file matches the structure of the declared file format: a file named foo.txt is likely to contain text, but this is not guaranteed.

To solve this issue, a second approach to identification is used. This more advanced method ignores the declared file format and tries to determine the format by analysing the structure of the file and checking whether it matches any of the known file format specifications [1]. These specifications are stored in so-called file format registries, databases which contain structural specifications for different file formats; they are discussed later in this paper. Analysing the structure also requires extracting some information from the file; this is known as Characterization and is described below in this chapter [2].
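The two identification approaches can be contrasted in a few lines of Python. This is only a minimal sketch under stated assumptions: the `SIGNATURES` table is a small hand-written stand-in for a file format registry, and the function names (`declared_format`, `detected_format`, `identify`) are hypothetical, not part of any tool discussed in this paper.

```python
import os

# Hypothetical, hand-written signature table for illustration only;
# a real tool would load its signatures from a file format registry.
SIGNATURES = {
    "tif": (b"II*\x00", b"MM\x00*"),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
    "pdf": (b"%PDF-",),
}


def declared_format(file_name):
    """First approach: the text after the last dot, or None if there is none."""
    extension = os.path.splitext(file_name)[1]
    return extension[1:].lower() or None


def detected_format(head):
    """Second approach: compare the leading bytes with known signatures."""
    for name, signatures in SIGNATURES.items():
        if head.startswith(signatures):
            return name
    return None


def identify(file_name, head):
    """Report both views and flag the case where the content contradicts the name."""
    declared = declared_format(file_name)
    detected = detected_format(head)
    conflict = detected is not None and detected != declared
    return {"declared": declared, "detected": detected, "conflict": conflict}
```

In use, `head` would be the first few bytes of the file, for example `open(path, "rb").read(16)`. A file named foo.txt whose content starts with `%PDF-` is then reported as a conflict, which is exactly the situation the extension-only approach cannot detect.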

2.2 Characterization

Characterization is the process of extracting specific characteristics from the file. The extracted information is used for two main purposes. First, software uses it to check whether the file contains the information needed to match the file format it is thought to be; this is called file format validation and is explained later in this chapter. Second, once validity has been established, the extracted data is used to construct the meaningful interpretation of the file and present it to the user [1].

Extracting specific information from files is computationally expensive, since each file has to be loaded into memory and analysed. Digital libraries, where huge amounts of data are stored, therefore usually do this extraction only once and save the extracted data separately from the original file but in the same repository. When the information is needed again, the saved data is read instead of repeating the more expensive extraction from the original file \cite{preservation Thesis}.

Ford [2] explains characterization with an example which we also show below. He takes a well-known image format named TIFF, which is widely used to store image files. The extension for TIFF files is .tif. Fig. 1 below shows the raw representation of part of a random TIFF file, with the array of bits written in hexadecimal form.
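The kind of extraction Ford's TIFF example illustrates can be sketched in Python as follows. This is a minimal sketch, not the code of any tool discussed here: the function names are hypothetical, the layout assumed is the baseline TIFF header (two byte-order bytes, the number 42 stored in that byte order, then the offset of the first image file directory), and only the width and height tags are read. Truncated or corrupt input is not handled beyond the header check.

```python
import struct

# Tag numbers for the two properties read here (ImageWidth, ImageLength).
TAG_NAMES = {256: "image_width", 257: "image_height"}


def read_tiff_header(data):
    """Return (struct byte-order prefix, first IFD offset), or None if not TIFF."""
    if len(data) < 8:
        return None
    prefix = {b"II": "<", b"MM": ">"}.get(data[:2])  # 49 49 or 4d 4d
    if prefix is None:
        return None
    magic, ifd_offset = struct.unpack_from(prefix + "HI", data, 2)
    if magic != 42:  # bytes 2a 00 (little endian) or 00 2a (big endian)
        return None
    return prefix, ifd_offset


def characterize_tiff(data):
    """Extract byte order, width and height from the first image file directory."""
    header = read_tiff_header(data)
    if header is None:
        return None
    prefix, ifd_offset = header
    properties = {"byte_order": "little" if prefix == "<" else "big"}
    (entry_count,) = struct.unpack_from(prefix + "H", data, ifd_offset)
    for index in range(entry_count):
        position = ifd_offset + 2 + 12 * index  # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack_from(prefix + "HHI", data, position)
        # Field type 3 is a 16-bit SHORT, type 4 a 32-bit LONG.
        if tag in TAG_NAMES and count == 1 and field_type in (3, 4):
            code = "H" if field_type == 3 else "I"
            (value,) = struct.unpack_from(prefix + code, data, position + 8)
            properties[TAG_NAMES[tag]] = value
    return properties
```

A repository that wants to avoid repeating this work could store the returned dictionary next to the original file, in line with the extract-once practice described above.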

Fig. 1. Hexadecimal Values of a TIFF file [2]

The data of interest to be extracted from the file is marked in red in the figure above. In order to show the image, TIFF viewer software first validates whether the file it is asked to open is structured in conformance with the TIFF file format. The TIFF format specification states that the first two bytes of a TIFF file indicate the endianness: 49 49 for little endian or 4d 4d for big endian. The next two bytes hold the number 42 in that byte order, so in the little-endian case the third byte has value 2a and the fourth 00 (00 2a for big endian). Together, the first four bytes of a TIFF file make up what is called the magic number. By checking whether the values in the magic number and their positions conform to the specification, the software determines the file format. In addition, some more data is marked red in the figure; from the information it carries (image height, image width, and colour model) one can conclude that the software uses this data to present the true meaning of the file to the user [2].

2.3 Verification

Verification is the third step that Digital Preservation conducts on a file format. This step tries to figure out whether a given file is structured in compliance with its file format specification, and whether software designed to open this file format will be able to open it successfully, that is, to interpret the real meaning of the file. A strict view would be that these two goals of verification imply each other: if one is achieved, then the other is achieved as well. In practice this assumption does not hold, which creates some trouble for verification. This is due


to the fact that software used to interpret file formats, especially complex ones, has some degree of flexibility, so it is able to interpret files which do not strictly comply with the file format specification. One thing is guaranteed, however: if the file is structured in strict conformance with its file format specification, the software designed to open it will open it. If it is not, further analysis is needed to see whether the deviations from the structure fall within the boundaries of this flexibility, so that the software can still interpret the file, or not. Once a file format is verified, it is noted as a valid file format. Many tools can perform file format verification; in the coming chapters we show the results of an experiment done with some of them and discuss their performance.

3. FILE FORMAT REGISTRIES

In order to validate the format of a given file, the software tool which does this should have at its disposal the specification of the exact structure of the given file format, and then analyse whether the file is built in conformance with it. For this purpose, databases have been built which contain the specifications of different file formats. When a tool wants to validate a file format, it can query one of these databases to retrieve the specification of the format in question, compare it with the actual file, and make the decision. File format registries also contain additional information which may be used by Digital Preservation but is outside the scope of this paper, such as relationships between different file formats or between different versions of the same file format; this can be useful when migrating between two file formats or two versions of the same file format. Below we present some well-known file format registries.
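The query-then-compare step described in this chapter can be sketched as follows. This is a minimal sketch under loud assumptions: the record layout, the `demo/...` identifiers and the function names are hypothetical and do not reflect the schema of any registry presented below; the only real facts used are the two GIF version signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical registry entry: one signature plus a link to an older version."""
    format_id: str
    name: str
    signature: bytes
    offset: int = 0
    supersedes: Optional[str] = None


# Toy in-memory registry; identifiers are made up for this example.
REGISTRY = {
    "demo/gif-87a": FormatRecord("demo/gif-87a", "GIF 87a", b"GIF87a"),
    "demo/gif-89a": FormatRecord(
        "demo/gif-89a", "GIF 89a", b"GIF89a", supersedes="demo/gif-87a"
    ),
}


def query(format_id):
    """Retrieve the specification record for a format, or None if unknown."""
    return REGISTRY.get(format_id)


def conforms(record, data):
    """Compare the actual bytes with the signature at its expected position."""
    end = record.offset + len(record.signature)
    return data[record.offset:end] == record.signature


def version_chain(format_id):
    """Follow the version relationship from a format back to its predecessors."""
    chain = []
    record = query(format_id)
    while record is not None:
        chain.append(record.format_id)
        record = query(record.supersedes) if record.supersedes else None
    return chain
```

The `supersedes` link stands in for the relationship information mentioned above; a migration between versions would start from a chain like the one `version_chain` returns.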
PRONOM - Created by The National Archives of the United Kingdom, it consists of a database of file formats in which the characteristics of each file format are recorded. Relationships between different file formats are recorded as well. Since it contains information on a large number of file formats, the amount of information stored for each file format is kept as small as possible [5].

GDFR - The Global Digital Format Registry was developed by the Harvard University Digital Library, which is also one of the pioneers in the field of Digital Preservation. The Online Computer Library Center (OCLC) and the US National Archives and Records Administration (NARA) later contributed to building this format registry [6].

UDFR - The Unified Digital Format Registry is another file registry, created by the University of California Digital Library. It aims to combine the information of the two previous big file registries, PRONOM and GDFR, and uses RDF to store its data [7].

4. FILE FORMAT CHARACTERIZATION AND VALIDATION TOOLS

In addition to file format registries, the digital preservation community has created specialized tools which use the data in these registries to extract characteristic properties and validate the file format of a specific file. Usually, the teams that developed the file format registries have also developed tools that use them, but there are also tools that use registry data and were developed independently. There are plenty of tools, and most of them are open source, so they are freely available under a public license. One of the most critical weaknesses of digital preservation is that

255

there are still not available file format validation tools which are giving satisfactory results for most of the file formats which are commonly used in digital libraries. As a result, nowadays exist tools which are very specialized into identifying, gathering characteristics and verifying some specific file formats but these tools give bad results for some other file formats. Therefore, in order to be able to validate many different file formats one should use multiple tools and then merge these results. In the following part of this chapter we will present three most popular tools for file format characterization and verification. They are named JHOVE, DROID and Exfitools. For each tool a brief description of tools mostly who created and major specifications are presented. 4.1 JHOVE JHove was developed by JSTOR (http://www.jstor.org/) and Harvard university libraries. It was firstly released in 2004. As all other file format analysis tools it is meant to be able to identify, retrieve significant properties, and verify the validity of the file format for a given file. What differs this tool from other tools and is considered to be its significant characteristic is that it is organized in modular format [2]. Therefore, it is meant to be extensible. As a result, JHOVE has a module for each file format that it can identify, retrieve characteristics, and validate. There are some built in modules which have been developed by JHove developers. This group consists modules for the following file formats: TIFF, GIF, JPEG, PNG, JPEG-2000, AIFF, WAV, XML, HTML, UTF8, ASCII, PDF, and a generic bytestream. In addition to them there are other modules which are developed by users for their needs typical examples of this group are modules for MP3, ZIP, and GZIP. Everyone who wants to add an additional file format to be analysed by JHOVE can develop a new module for that file format and integrate it to JHOVE. 
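The modular organization described above can be pictured with a short sketch. The following Python is an illustration only and is not JHOVE's actual API (JHOVE itself is a Java application): the names `FormatModule`, `REGISTRY`, `register`, and `identify` are hypothetical, and each check looks only at a well-known leading signature, which is far less than real validation. It shows how one module per format can be registered and how a generic bytestream module can serve as a fallback.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FormatModule:
    """Hypothetical stand-in for a per-format module."""
    name: str
    matches: Callable[[bytes], bool]  # True if the data looks like this format


# Modules are tried in registration order.
REGISTRY: List[FormatModule] = []


def register(module: FormatModule) -> None:
    """Add support for a new format without touching existing modules."""
    REGISTRY.append(module)


def identify(data: bytes) -> str:
    """Return the name of the first matching module, or the generic fallback."""
    for module in REGISTRY:
        if module.matches(data):
            return module.name
    return "bytestream"  # any sequence of bytes is a valid bytestream


# "Built-in" modules, using well-known leading signatures only.
register(FormatModule("PNG", lambda d: d.startswith(b"\x89PNG\r\n\x1a\n")))
register(FormatModule("GIF", lambda d: d[:6] in (b"GIF87a", b"GIF89a")))
register(FormatModule("PDF", lambda d: d.startswith(b"%PDF-")))

# A user-contributed module can be plugged in later in the same way.
register(FormatModule("GZIP", lambda d: d.startswith(b"\x1f\x8b")))
```

With these registrations, `identify(b"%PDF-1.4")` returns `"PDF"`, while an empty file matches no specific module and falls through to `"bytestream"`.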
This makes JHOVE very flexible in terms of the file formats it can analyse; on the other hand, it causes problems when the same data set is analysed with different versions of JHOVE, because different results are generated [8]. The output of this tool can be plain text or XML.

4.2 DROID

DROID (Digital Record Object Identification) is another software tool, which is used only for file format identification. Although the topic of this paper is tools which go beyond identification to characterization and validation, we decided to include DROID in our experiments because the digital preservation research community considers it one of the most precise tools for file format analysis, and because it was also used in a similar study [3], whose results we show in chapter six. It was developed by The National Archives of the United Kingdom, which also developed the file format registry PRONOM described in chapter three. DROID is therefore meant to work on top of PRONOM: it uses the information about file formats in PRONOM to conduct its analysis [4]. DROID and PRONOM communicate via a file format signature file, which is regularly updated from PRONOM to DROID. This signature file contains the characteristics which DROID uses to identify the file format. Usually these characteristics consist of file extensions, magic numbers, etc. [2]. DROID comes with a graphical user interface which lets ordinary users with no programming skills use it easily. In addition, it has a command line interface, which is designed for advanced users and lets them automate processes inside DROID. The results of an analysis done with DROID are output in XML format [9].

4.3 EXIFTOOL

Exiftool is another software tool which performs file format identification and, furthermore, is specialized in retrieving significant properties from file formats. It does not perform file format validation. The tool is mainly designed for extracting and modifying metadata in the EXIF (Exchangeable Image File Format) format, which is specialized for storing the metadata of digital camera and scanner output. In addition to EXIF, however, Exiftool works with a huge variety of file formats, including most of the popular formats commonly used to store information. In the context of digital preservation it is therefore used to identify different file formats and extract their significant properties. It provides numerous output formatting options, including tab-delimited, HTML, XML, and JSON [10].

5. EXPERIMENT

In order to assess the capabilities of each of the tools presented in the previous chapter, we conducted an experiment with 927 files of different file formats, including DOC, GIF, ZIP, HTML, PNG, PPT, XLS, SWF, etc. These files were taken from a public file library named govdocs1 (http://digitalcorpora.org/corpora/govdoc), from which we took only the first 927 of the 1 million files the library contains. Our experiment consists of two phases. The first phase, presented in this chapter, uses each tool presented in the previous chapter to conduct an individual analysis of the data set mentioned above. The second phase tries to combine the results from the individual tools into an overall result.
The goal of this phase is to exploit the strengths of each individual tool as much as possible and reduce the impact of their weaknesses on the result. To achieve this, we tried different merging strategies, which lead to different results that are presented in the next chapter. The remainder of this chapter describes the results of the first phase of the experiment and is organized as follows: for each tool, the results in terms of file format identification, characterization, and validation are presented, whenever the tool supports the respective analysis.

5.1 Characterization of file format with JHOVE

The table below summarizes the results of the characterization process using JHOVE. For the most popular file formats it notes which properties the tool extracted.

Table 1. Characterization of file format with JHOVE

File Format | Properties Extracted
PNG, DOC, XLS, PPT, TXT, GZ, HTML, ZIP | Last Modified, Size, Format, MIMEtype
PDF | Last Modified, Size, Format, MIMEtype, Version, Profile, PDF metadata, Image, Fonts etc.
JPG | Last Modified, Size, Format, MIMEtype, Version, Profile, JPEG etc.
GIF | Last Modified, Size, Format, MIMEtype, Version, Profile, GIF metadata etc.
XML | Size, Format, Last Modified, MIMEtype, Version, XML…

Identification and Validation of file format with JHOVE: Fig. 2 represents the summary of file format identification and verification conducted with JHOVE. In total, 927 files were analysed, and the same files were later analysed using Exiftool and DROID. As noted in the figure, JHOVE managed to identify the format of every given file. This is because JHOVE assigns the file format "Bytestream" to a file if it cannot assign any more specific file format. Moreover, unlike the other tools, JHOVE also verifies whether the file structure matches the specification of the file format JHOVE identifies the file to be. As a result, JHOVE reported that 89% of the analysed files were structured according to the specification, so they had a valid file format, while validation failed for 11%. We manually analysed the files which JHOVE found not valid: they were all empty files with only a filename and extension declared, most of them HTML files.

Fig. 2. Summary of file format identification and verification with JHOVE

5.2 Experiment with DROID

As mentioned in chapter four, DROID is not specialized in file format characterization, so it cannot retrieve significant and format-specific properties from different file formats. However, in order to identify the file format, DROID contains a generic module which extracts some parameters that are common to all file formats. These parameters are: Extension, Size, LastModified, Format, MIMEType, and PUID. DROID's characterization capabilities are therefore limited to the properties mentioned above. DROID is thought to be one of the most powerful tools when it comes to file format identification. We ran DROID on the 927 files to identify their file formats; the summarized results are shown in Fig. 3. As can be inferred from the figure, DROID was able to identify the file format for 96.44% of the files. In addition, more file formats are present compared to the previous tool, JHOVE.

[Figure legend: JHOVE format modules bytestream, PDF-hul, JPEG-hul, GIF-hul, XML-hul]

Fig. 3. Summary of file format identification with DROID
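To make the generic properties listed above more concrete, the following Python sketch collects Extension, Size, LastModified, Format, and MIMEType for one file, using leading magic bytes first and the file extension as a fallback. It is a simplified illustration under stated assumptions, not DROID's implementation: the `SIGNATURES` and `EXTENSIONS` tables are a hand-picked sample rather than the PRONOM signature file, the function name `describe` is hypothetical, and the PUID is omitted because it can only come from PRONOM data.

```python
import os
from datetime import datetime, timezone

# Illustrative signature table: (leading magic bytes, format, MIME type).
# A hand-picked sample, NOT the PRONOM signature file.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG", "image/png"),
    (b"\xff\xd8\xff", "JPEG", "image/jpeg"),
    (b"GIF8", "GIF", "image/gif"),
    (b"%PDF-", "PDF", "application/pdf"),
    (b"PK\x03\x04", "ZIP", "application/zip"),
    (b"\x1f\x8b", "GZIP", "application/gzip"),
]

# Fallback for formats without a reliable magic number.
EXTENSIONS = {".html": ("HTML", "text/html"), ".txt": ("TXT", "text/plain")}


def describe(path):
    """Collect generic properties for one file; Format is None if unknown."""
    info = os.stat(path)
    extension = os.path.splitext(path)[1].lower()
    with open(path, "rb") as handle:
        head = handle.read(16)  # enough for every signature in the table

    for magic, name, mime_type in SIGNATURES:
        if head.startswith(magic):
            fmt, mime = name, mime_type
            break
    else:
        # No signature matched: fall back to the (less reliable) extension.
        fmt, mime = EXTENSIONS.get(extension, (None, None))

    return {
        "Extension": extension,
        "Size": info.st_size,
        "LastModified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "Format": fmt,
        "MIMEType": mime,
    }
```

Checking the magic bytes before the extension means a PDF saved under a misleading name is still reported as PDF, while an empty file with an unknown extension is left unidentified.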


5.3 Experiment with Exiftool

Exiftool is found to be the most specialized of the three tools presented in this paper when it comes to file format characterization. This is because Exiftool consists of specialized modules to extract metadata from different file formats and metadata formats. From the name one can conclude that it is specialized in EXIF metadata, but it is also powerful in dealing with other metadata formats such as the IPTC (International Press Telecommunications Council) Information Interchange Model, XMP (Extensible Metadata Platform), and ICC (International Color Consortium) profiles. Table 2 summarizes the results of the characterization process using Exiftool, noting for the most popular file formats which properties the tool extracted. Although Exiftool is considered very powerful at file format characterization, it also handles file format identification well. We ran Exiftool on the 927 files to identify their file formats; the summarized results are shown in Fig. 4.

Table 2. Summary of file format characterization-Exiftool

File Format | Properties extracted
PNG, JPG, GIF | FileName, Directory, FileSize, FileModifyDate, MIMEType, Width, Height, ColorType, BitDepth, Compression, Megapixels.
DOC, XLS, PPT | FileName, Directory, FileSize, FileModifyDate, MIMEType, CodePage, Title, Subject, Author.
HTML | FileName, Directory, FileSize, FileModifyDate, MIMEType, Generator, Keywords.
TXT | Does not recognize this file type!
GZ, ZIP | FileName, Directory, FileSize, FileModifyDate, MIMEType, Compression, Flags, OperatingSystem, ArchivedFileName.
PDF | FileName, Directory, FileSize, FileModifyDate, MIMEType, PDFVersion, PageLayout.
XML | FileName, Directory, FileSize, FileModifyDate, MIMEType, Width, XML Metadata.

Fig. 4. Summary of file format identification with Exiftool

As can be inferred from the figure, Exiftool was able to identify the file format for 80.6% of the files. The number of files whose format could not be identified is therefore higher for Exiftool than for the previous tool, DROID. However, Exiftool is able to identify a higher number of file formats than JHOVE. Compared to DROID, the number of formats identified is almost the same, differing by only one format.


6. MERGING OUTPUTS OF DIFFERENT TOOLS

In the previous chapter we presented the individual results from each tool. Comparing these individual results reveals quite large differences between them. First of all, some tools, like JHOVE, are able to note a file format for every file, while others cannot note the file format for a number of files. On the other hand, other tools can identify more file formats; combining the results from all tools should therefore lead to more accurate results. Based on this, in the second phase of our experiment we merged the outputs of JHOVE, DROID, and Exiftool using different merging techniques, which led to the overall result presented later in this chapter. Below we explain chronologically how the experiment was done and what problems we encountered.

First of all, an important precondition for combining the results is that they are in the same format and, in addition, have the same syntax. In our case, each of the tools described in chapter four supported multiple output formats. We were therefore careful to use the same output format for all of them: we used the XML output format for each tool and imported the XML results into a MySQL database, on which we then did the analysis. What we saved from each tool in MySQL is a key-value table where the key is the file name and the value is the file format. The percentage of conflicts is 61% when agreement among all three tools is required and 19% when agreement between at least two tools suffices.

Once the output was in the same format, we encountered another problem: different tools use different notations. For example, some tools noted file formats in lower case letters and others in upper case, and DROID noted the JPEG file format as JPG while others noted it as JPEG; the same problem was noted with the GZIP file format. We had to solve these issues before merging the data.

We then continued with the actual merging. In order to merge outputs, a rule must be defined for how the merging is done. First, we did a simple merge in which we took only the intersection of the results from the three tools. This strategy did not work well, since there were too many conflicts between the tools' evaluations of file formats, and only a minority of files were present in the intersection: in total, 358 out of 927 files were assigned the same file format by every tool and take part in the merged result. Another way to merge the results is to consider all files for which at least two of the three tools reported the same file format. We also applied this merging strategy and ended up with fewer conflicts: this time, 749 out of 927 files were assigned the same file format by at least two different tools. Even though the second strategy gives better results than the first, nothing guarantees that the tools which agree are right and the dissenting tool is wrong; the opposite is always possible. Therefore, a strategy is needed which maximizes the utility of each tool. Kulmukhametov and Becker [3] developed a strategy to merge results from different tools.
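The notation clean-up and the two simple merging rules described above (intersection of all three tools, and agreement of at least two tools) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MySQL-based procedure used in the experiment: the `ALIASES` table, the function names, and the three sample tables are hypothetical, and the sample data is made up.

```python
from collections import Counter

# Hypothetical alias table; the text only reports the JPG/JPEG and GZIP cases.
ALIASES = {"JPG": "JPEG", "GZ": "GZIP"}


def normalize(label):
    """Upper-case a format label and map known aliases to one spelling."""
    if label is None:
        return None
    label = label.strip().upper()
    return ALIASES.get(label, label)


def merge(reports, min_votes):
    """Merge per-tool {file name: format} tables.

    A file is resolved when at least `min_votes` tools agree on its format;
    otherwise it is counted as a conflict.
    """
    resolved, conflicts = {}, []
    files = sorted(set().union(*(r.keys() for r in reports)))
    for name in files:
        votes = Counter(
            fmt for fmt in (normalize(r.get(name)) for r in reports) if fmt
        )
        winner, count = votes.most_common(1)[0] if votes else (None, 0)
        if count >= min_votes:
            resolved[name] = winner
        else:
            conflicts.append(name)
    return resolved, conflicts


def conflict_rate(agreed, total):
    """Percentage of files left without an agreed format."""
    return 100.0 * (total - agreed) / total


# Made-up sample data, for illustration only.
jhove = {"a.jpg": "JPEG", "b.gz": "bytestream", "c.pdf": "PDF"}
droid = {"a.jpg": "jpg", "b.gz": "GZIP", "c.pdf": "pdf"}
exiftool = {"a.jpg": "JPEG", "b.gz": "gz", "c.pdf": None}
tools = [jhove, droid, exiftool]

unanimous, _ = merge(tools, min_votes=3)  # intersection of all three tools
majority, _ = merge(tools, min_votes=2)   # at least two tools agree
```

In this toy data only `a.jpg` survives the intersection, whereas all three files are resolved when two agreeing tools suffice. Applied to the counts reported above, `conflict_rate(358, 927)` is about 61.4 and `conflict_rate(749, 927)` is about 19.2, which matches the 61% and 19% conflict percentages.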


6.1 A strategy to merge results from different tools

Kulmukhametov and Becker [3] used a tool called C3PO (Clever, Crafty, Content Profiling Tool), which serves as a front end for different digital preservation tools; the tools used in this paper are among those supported by C3PO. Besides the intersection of results, their strategy also considers the cases shown in Fig. 5, with which they try to take the specialized parts of each tool into consideration. By applying the rules shown in Fig. 5 in their experiment on 100,000 files, they lowered the number of file format conflicts from 15529 (16.41% of total files) to 3838 (3.98% of total files), a reduction by roughly a factor of four.

Rules of merging results, Kulmukhametov and Becker [3].

Fig. 5. Summary of experiment results: Discovered file formats (left), conflict ratio (right)

6.2 Merging results with the strategy of Kulmukhametov and Becker

We also used the strategy of Kulmukhametov and Becker [3] to merge the results of the tools described in chapter four. As presented in Fig. 5, by considering the intersection of results, 358 out of 927 files are assigned the same file format by every tool. We then added to this result the files resolved by applying each rule proposed by Kulmukhametov and Becker [3]; in the end, we have 452 files whose format is identified either by the intersection or by these rules. In our experiment on 927 files, these rules therefore lowered the number of conflicts from 569 (61.38% of total files) to 475 (51.24% of total files). In our case the strategy lowered the percentage of conflicts by about 10 percentage points but, as presented in the previous section, at larger scale it gives much better results. Fig. 5 shows the final results of our experiment.

7. SUMMARY

This paper gives a general overview of digital preservation. It informs the reader what digital preservation is and what it deals with. Furthermore, the paper focuses on file format analysis conducted as a part of digital preservation. It puts a spotlight on file format identification, characterization, and verification, the three analyses that digital preservation conducts on file formats. First, the paper introduces the reader to the world of file formats by explaining what a file format is and how file formats work. Next, it presents some of the most powerful open source tools designed for file format analysis. The results of each individual tool are presented to the reader. Here we would like to highlight that, although there are many tools for file format analysis, there is no stable tool which covers a large number of formats and performs acceptably on large-scale data.

When working with large-scale data, each of the tools we used seemed to have performance issues; some of them even crashed after a short time and could not give results. We therefore had to limit our experiment to a small number of files (927 files). Furthermore, the paper highlights another problem of file format analysis in digital preservation: merging the outputs of different tools to obtain better overall results. Here we have put a spotlight on the lack of standardization among tools, as a result of which merging and combining the results becomes more difficult than it should be. One good thing to note is that all the tools support the XML output format, which makes merging easier. On the other hand, one big issue is that tools use different notations for file formats, e.g. some tools identify the JPEG file format as JPG and others as JPEG; for software to recognize that they refer to the same format, the format must be noted identically. In addition, letter case is not standardized, but this issue is solved by using merging tools that are not case sensitive. Also, the results from different tools differ a lot; as a result, a simple merge that takes only their intersection is not a good idea. Therefore, this paper presents the work of another paper in which a clever strategy to merge and combine the output of different tools is proposed. Finally, this paper uses this strategy to merge the outputs of three tools, JHOVE, DROID, and Exiftool, from tests conducted on the same data set, and presents the results of this merge, highlighting the impact of the merging strategy. In summary, by evaluating different tools and strategies for merging their results, this paper presents the current situation of file format analysis in digital preservation and highlights the need for new, more standardized, and more stable tools to be developed.
REFERENCES

[1] Abrams, S., et al. (2009). "What? So What": The Next-Generation JHOVE2 Arch. for Format-Aware Charac. Journal IJDC 4(3), pp. 123-136. Edinburgh, UK.
[2] Ford, K.M. (2011). The Application of File Identification, Validation, and Characterization Tools in Digital Curation. Univ. of Illinois, USA.
[3] Kulmukhametov, A., Becker, C. (2014). Content profiling for preservation: Improving scale, depth and quality. The EDL-RP 8839 (1), pp. 1–11. Thailand.
[4] Lechich, R. (2014). File format identification and validation tools. http://www.library.yale.edu/iac/DPC/FileIDandValidate.pdf [last accessed 14.06.'14].
[5] Brown, A. (2005). Pronom 4 information model. Technical report, The National Archives, UK.
[6] Goethals, A. (2010). The unified digital formats registry. ISQ 22(2), pp. 26–29.
[7] Frisch, P., Heino, N., Tramp, S. (2012). Unified digital format registry (udfr). Univ. of California, USA.
[8] Abrams, S. (2004). The role of format in digital preservation. VINE 34(2), pp. 49–55. Emerald…, USA.
[9] Brown, A. (2005). The droid application programming interface. The National Archives, UK.
[10] Raymond, R. (2016). Exiftool documentation. http://www.sno.phy.queensu.ca/~phil/exiftool online, [last accessed on 14.06.2016].