Mass image classification


digital investigation 3 (2006) 190–195

available at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/diin

Mass image classification*

Paul Sanderson, Sanderson Forensics Ltd

article info

Article history: Received 14 September 2006; Revised 23 October 2006; Accepted 24 October 2006

abstract

Investigations involving massive quantities of graphics files require more robust tools to facilitate the review and categorisation of such large numbers of files. Traditional forensic tools are not equipped for large-scale categorisation tasks; this paper introduces some techniques that can streamline the process. © 2006 Published by Elsevier Ltd.

Keywords: Mass image classification; Hash; MD5; Indecent images of children; Thumbnails; Skin tone analysis

1. Case overview

As part of the Operation Ore national investigation, a UK police force opened an investigation into a suspect named Smith. Smith allegedly used his credit card to enter a site called nudeteens in the USA and download paedophile images. He paid $19.95 for 90 days' access, which expired in August 2000. Smith was already known to Customs & Excise. In 1995 he was intercepted at an airport, having flown from Europe, and an examination revealed two video tapes and magazines containing bestiality material. The subsequent house search revealed additional pornographic material. Smith admitted copying and supplying videos to a friend. At that time a total of approximately 80 videos, 15 magazines, 10 indecent covers and 4 computer discs were seized, and he paid a £1000 compound penalty.

In December 2003, a search warrant was executed at Smith's home address. The premises, a small barn conversion, were awash with computers, hard drives and associated paraphernalia. There were eight computers, 39 separate hard drives, over 400 CDs/DVDs, more than 200 floppies, in excess of 100 Zip disks, and approximately 100 videos. The computer hard drives alone contained 800 GB of digital evidence. It was also clear from the setup that he was producing films of some kind on a fraudulent commercial basis, as there was a mass of copying, labelling and boxing-up equipment. Sex stories relating to adults and children were also found on the premises. The computers were seized and then viewed at the police station. It became apparent that the police team could not manage the massive numbers and sizes of files. Even a preliminary inspection of one hard drive revealed thousands of child pornographic images, ranging from category 1 to category 5 on the UK Sentencing Advisory Panel's charging guidelines (discussed below).

* This case study report was submitted in part fulfilment of the degree of MSc in Forensic Computing and Cyber Crime Investigation at Dublin University. E-mail address: [email protected]

1742-2876/$ – see front matter © 2006 Published by Elsevier Ltd. doi:10.1016/j.diin.2006.10.010


The police server could not physically hold a job of this magnitude and still allow other officers to work on different cases. The 800 GB of data did not even take account of the CDs, DVDs and floppies, which had not yet been imaged; the contents of these media would have added around another terabyte (1000 GB) of data to the case. Faced with this massive amount of evidence, the Senior Investigation Officer sought the assistance of this author and his associates to perform the following tasks:

1. Recover indecent images.
2. Establish what distribution offences had taken place.
3. Seek evidence to prove or disprove Smith's involvement.
4. Establish if he was using peer-to-peer software to download images.
5. Negate any possible defences he might raise.

2. Methods and tools used in the investigation

The initial investigation was attempted using EnCase version 5.0.3 (www.encase.com). The forensic duplicates of all of the hard disk drives were added to the case and processed together, and the gallery view was used to review the graphics files present on the disks. It quickly became apparent that there were a number of problems with this approach, despite the case residing on a high-specification machine: the forensic duplicates sat on a RAID 4 array attached to a 3 GHz Pentium computer with 2 GB RAM.

Firstly, the sheer number of files on Smith's machine meant that EnCase was very slow to respond to user input. Simply tagging a file by applying a blue check mark, via the mouse or the space bar, caused EnCase to freeze temporarily for approximately 1 s per file tagged. Given that there were 800,000 JPG files, a delay of this magnitude made this line of attack unrealistic: one person working 8 h a day, tagging files at a rate of one per second, would have taken over 27 days – clearly not a practicable approach to the problem.

Secondly, EnCase at this time was not sufficiently stable to allow the large-scale investigation of deleted files on the subject media without crashing. Although regular saves would minimise the amount of data lost, the time spent reloading the case file was considerable (30+ minutes) on each occasion, and saving the case took between 2 and 3 min.

Thirdly, there were obviously a large number of duplicate graphics files on Smith's computer. Using EnCase would mean that each duplicate would have to be separately categorised according to the five-point Sentencing Advisory Panel charging guidelines (Sentencing Guidelines), shown below, possibly resulting in the same file being placed into more than one category. EnCase does support the use of hash sets, but the process is not dynamic, i.e. when manually categorising files, categorising one file does not result in all files with the same hash being categorised with it.

Level   Description
1       Images depicting nudity or erotic posing, with no sexual activity
2       Sexual activity between children, or solo masturbation by a child
3       Non-penetrative sexual activity between adult(s) and child(ren)
4       Penetrative sexual activity between child(ren) and adult(s)
5       Sadism or bestiality

An alternative approach needed to be found that would allow the large-scale categorisation of files, initially by means of known hash values, followed by a manual process on those not categorised by hash. Experience has shown that only a few tenths of a percent of files can be categorised using known hash values, so extensive manual categorisation is generally required.

A brief review of existing tools was carried out. Only one seemed suitable: PicaPro from Geek Ltd (http://www.geek.ltd.uk/). A demonstration copy of the PicaPro program was obtained, but it failed to run on either of two test machines. Third-party feedback also indicated that Pica, at this time, had a number of limitations when it came to reviewing previously categorised files. PicaPro was therefore discarded and it was decided to write an in-house utility. Developing an in-house tool for a specific case also opens the door to further development of a commercial tool.

2.1. Design and implementation

The following design goals were decided upon:

- Must provide thumbnail views of multiple files at one time.
- Must be able to categorise all displayed thumbnails with one mouse click/keystroke.
- Files must also be viewable at full screen size.
- Any duplicate file would be categorised at the same time as the first occurrence of the file.
- The results of this process would need to be re-imported back into EnCase.
- We would need to restrict, or filter, the uncategorised files based on various criteria.
- We would need to be able to easily review previously categorised files and re-categorise as necessary.
- The officer/agent categorising each file would need to be logged.
- The time at which each file was categorised would need to be recorded.
- Third-party hash sets would be implemented to reduce the size of the uncategorised material.

For various reasons the program/project was given the internal code name Moggy. Given the above requirements it was decided to write a bespoke graphics viewing package built upon a third-party graphics library, in conjunction with a MySQL (http://www.mysql.com/) database to record various file properties.
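The paper does not give Moggy's schema or implementation language; the following is a minimal sketch, in Python with the mysql-connector-python package, of the kind of table and hash-propagation update the design goals imply. All table and column names are assumptions.

```python
import mysql.connector

# Assumed schema: one row per exported file, with indexes on the columns the
# filters query, plus the audit fields the design goals require.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    path           VARCHAR(1024) NOT NULL,  -- full path on the local disk
    md5            CHAR(32) NOT NULL,       -- hex MD5 of the file contents
    size_bytes     INT NOT NULL,
    width_px       INT,
    height_px      INT,
    colours        INT,                     -- number of unique colours
    category       TINYINT,                 -- NULL = uncategorised
    categorised_by VARCHAR(64),             -- officer/agent, per design goals
    categorised_at DATETIME,
    INDEX (md5), INDEX (colours), INDEX (size_bytes),
    INDEX (width_px, height_px), INDEX (category)
)
"""

def categorise_by_hash(conn, md5, category, officer):
    # Categorising one file categorises every duplicate with the same hash --
    # the dynamic behaviour that EnCase's hash sets lacked.
    cur = conn.cursor()
    cur.execute(
        "UPDATE files SET category=%s, categorised_by=%s, categorised_at=NOW() "
        "WHERE md5=%s",
        (category, officer, md5),
    )
    conn.commit()
```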


There is no facility to interface with EnCase directly, so the graphics filter built into EnCase was used to export the graphics files from each of the hard disk drives in the case to a folder structure that reflected their original locations on Smith's hard disks. The export was limited to the major Internet graphics formats, i.e. jpg, gif, tiff, png and bmp. AOL ART files were not supported by the viewing engine chosen (although there were no AOL ART files on Smith's computer). Once the files were extracted, a routine within the Moggy program recursed the directory structure and recorded the following information for each file:

- An MD5 hash of the file.
- The size of the file.
- The dimensions of the image in pixels.
- The number of unique colours in the file.
- The full file path (on the local disk).
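As an illustration of this step, here is a hedged sketch in Python using the Pillow imaging library; the paper does not say what language or graphics library Moggy was built on, and the export folder path is hypothetical.

```python
import hashlib
import os

from PIL import Image  # Pillow; a stand-in for the unnamed graphics library

def scan_tree(root):
    """Recurse an export tree and yield the properties recorded per file."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                md5 = hashlib.md5(f.read()).hexdigest()
            size = os.path.getsize(path)
            try:
                with Image.open(path) as im:
                    width, height = im.size
                    # getcolors() returns None once more than maxcolors
                    # distinct colours are seen; 4096 is ample, since the
                    # colour-depth filter only cares about counts under 200.
                    colours = im.convert("RGB").getcolors(maxcolors=4096)
                    n_colours = len(colours) if colours else 4096
            except OSError:
                width = height = n_colours = None  # not a decodable image
            yield path, md5, size, width, height, n_colours

for row in scan_tree("/cases/smith/export"):  # hypothetical export folder
    print(row)
```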

2.2. Data reduction and file categorisation

Based on previous investigations, rather than on any specific research, it had been noted that the following generally held true:

1. Files with a small colour depth were rarely indecent images of children. Indecent images of children would normally have been taken with a digital camera (for modern images) or scanned (for older images), and the colour depth would normally be many hundreds if not thousands of unique colours.
2. Small files (small in byte size) almost by definition cannot contain much unique data and were therefore rarely indecent images of children.
3. Files of unusual dimensions were usually not photographs: a picture of a person or object tends to have the general proportions of a photograph. An image 100 pixels wide by 500 high is unlikely to be an IIoC; a picture 10 pixels wide by 100 high would almost certainly not be one.

Some examples of files with unusual dimensions, and examples of files with different colour depths, are displayed in Fig. 1. It should be emphasised that the rules referred to above are not designed to detect indecent images of children, or to exclude them, but merely to present the categorising officer with a set of images that are unlikely to be indecent images of children.

Given the above, the following first-stage filter processes were designed (a sketch of the heuristic checks is given below):

1. Utilise known hash databases, obtained from a UK police FTP server, to identify those images that were known to be IIoC.
2. Utilise known hash databases such as the NSRL hash set to exclude those images that were known not to be IIoC.
3. Manually review all images with fewer than 200 unique colours.
4. Manually review all images that were less than 1.5 KB in size.
5. Manually review all images where one side of the image was greater than four times the length of the other, i.e. review all images of unusual proportions.

Fig. 1 – Examples of files with different colour depth.
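A minimal sketch of the three heuristic checks (filters 3–5), using the properties recorded earlier; the thresholds are those given in the list above, and the function name is illustrative only.

```python
def likely_non_iioc(size_bytes, width, height, n_colours):
    """True if a file should be queued for quick manual review as probably
    not an IIoC; the operator always makes the final decision."""
    if n_colours is not None and n_colours < 200:  # filter 3: low colour depth
        return True
    if size_bytes < 1536:                          # filter 4: under 1.5 KB
        return True
    if width and height:                           # filter 5: aspect ratio > 4:1
        long_side, short_side = max(width, height), min(width, height)
        if long_side > 4 * short_side:
            return True
    return False
```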

Note that in the above description, except for those files identified by hash, each of the reviews is performed manually: Moggy was used to identify candidate files, but the final decision as to whether to 'tag' a file as non-IIoC was made by the operator.

The process starts with each file being assigned a category, or left uncategorised. Moggy by default displays only uncategorised files. The examiner chooses which filter to apply (say, files with fewer than 200 colours) and which category he 'expects' the files to fall into (for files under 200 colours we would expect the content not to be IIoC, so a category of non-pornographic would be entered). Moggy then displays a page of thumbnail images of the first uncategorised files that have fewer than 200 colours. These would initially have the category 'non-pornographic' assigned; this category would be selected and displayed in the drop-down 'combo' box below each picture, as shown in Fig. 2. The examiner could see at a glance whether all of the images fell into the non-pornographic category, and could accept this classification for all files by clicking the 'categorise' button at the top of the page. However, if one or more files were found to be IIoC, they could individually be re-categorised by means of the drop-down combo box as appropriate; the 'categorise' button would then be used as before. In practice there were very few images that deserved more than a cursory glance to determine whether they were IIoC, and it was found that by using these methods approximately 30,000 files could be categorised in an hour, i.e. 12 pages of thumbnails every minute. Approximately 50,000 files were quickly categorised as non-pornographic in this manner (a combination of images with <200 colours, smaller than 2 KB, and of unusual dimensions), reducing the total number of files remaining by about 6%.

It was further noted during the processing of the files that the defendant had already categorised many of the files himself; for example, there was one folder structure named 'Aria' whose content was restricted entirely to one female adult model. A change was made to Moggy to allow the display of just the files in and below a specified folder (sketched below). This change turned out to be one of the most significant in streamlining the categorisation process, as in general graphics files in the same folder had the same or similar themes, which allowed the use of Moggy's ability to categorise large numbers of thumbnails quickly. Processing the 'Aria' folder alone allowed the categorisation of approximately 38,000 files, including duplicates found on some of the other disks in the case.
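A sketch of that folder-scoped filter against the schema assumed in Section 2.1: restrict the uncategorised view to files in and below one folder. The table and column names remain assumptions.

```python
def uncategorised_under(conn, folder):
    """Return the uncategorised files in and below the given folder."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, path FROM files "
        "WHERE category IS NULL AND path LIKE CONCAT(%s, '%%')",
        (folder.rstrip("/") + "/",),  # prefix match on the folder path
    )
    return cur.fetchall()
```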

Fig. 2 – Thumbnail display of images for efficient review, and drop down menu for categorisation.


Of course, the process of exclusion could only be taken so far, and eventually the throughput dropped to a much slower pace. Depending upon the type of material seen, with the occasional 'run' of IIoC images or the occasional single image appearing on a page of thumbnails of adult images, the categorisation rate dropped to about 5000–10,000 images an hour. In conjunction with the police force instructing us, and in an attempt to keep costs under control, we decided that it would be more cost effective for police officers to continue the categorisation process, and the input of this author and his associates became purely one of support.

3. Technical problems and lessons learned

3.1. AOL ART files

ART files were potentially a problem. The graphics engine used is extremely powerful but does not, as yet, support AOL ART files; the vendor has indicated that this feature could become available soon.

3.2. Speed of MySQL

MySQL was initially chosen as it is a mainstream, high-performance database engine. In practice, though, it was found to be too slow. Consider a case where there are 800,000 images and therefore 800,000 entries in the database. In order for the database to function at any reasonable speed, indexes were created on:

- Hash;
- Colours;
- Size;
- Dimensions;
- Category.

When a file is categorised, an appropriate SQL statement must run, such as (in pseudo-SQL): UPDATE files SET category = 2 WHERE hash = '12345etc'. The effect of this statement is to identify each file with the specified hash, update the category of each of these files, and update the category index for each file. In practice this took approximately 0.25 s to process. This is slow in itself, but when categorising a page of, say, 24 thumbnails, a 6 s delay between pages almost defeats the object of mass categorisation.

The ideal solution would have been to create a customised database system optimised just for this task. Whilst this sounds complex, in reality it would just be a flat-file database with a few indexes using simple hashing techniques. However, the timescales involved prevented the coding and testing of such a solution, and a practical alternative was sought. The solution we chose was to create a second MySQL database that was purely a repository for MySQL commands. Rather than issue an SQL statement directly to MySQL, Moggy simply pushed the plain-text command onto a job queue and then immediately displayed the next image or page of thumbnails. A separate program was written to service the job queue, simply reading SQL statements in order from the queue and submitting them to the MySQL database.

This technique had a number of benefits. First, it was fail-safe: if for some reason Moggy crashed on a corrupt file (unusual, as the graphics library used was very stable), the commands were still queued and would continue to be processed by the job engine. Second, if the job engine failed, the job queue would simply grow in length until the job engine was restarted.

There was one unforeseen shortcoming to the solution. If Moggy failed and was restarted, it would first display any uncategorised files; if there were un-serviced commands still in the job queue, Moggy would effectively display previously categorised images a second time.
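A sketch of this write-behind queue, assuming a simple job_queue table (its name and structure are assumptions): the viewer enqueues the plain-text statement and returns immediately, while a separate process drains the queue in order. Jobs are deleted only after they execute, which gives the fail-safe behaviour described above.

```python
import time

def push_job(conn, sql_text):
    """Called by the viewer: enqueue the statement and return at once."""
    cur = conn.cursor()
    cur.execute("INSERT INTO job_queue (stmt) VALUES (%s)", (sql_text,))
    conn.commit()

def service_queue(conn, poll_seconds=1.0):
    """The separate job engine: execute queued statements in order.
    If this process dies, the queue simply grows until it is restarted."""
    cur = conn.cursor()
    while True:
        cur.execute("SELECT id, stmt FROM job_queue ORDER BY id LIMIT 50")
        rows = cur.fetchall()
        for job_id, stmt in rows:
            cur.execute(stmt)  # the real categorisation UPDATE
            cur.execute("DELETE FROM job_queue WHERE id=%s", (job_id,))
            conn.commit()      # a job survives a crash until it has run
        if not rows:
            time.sleep(poll_seconds)
```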

3.3. Deleted files

The graphics engine used was very good at displaying partial images. Approximately 20–30% of the images on Smith's computers were deleted (in that the allocation for the files had been lost, but the file names, sizes and starting clusters were still present in the directories). Needless to say, a good portion of these images were almost completely intact, and a good number were so corrupt that there was little visual content. A filter that could determine how much of a file was displayable was considered, but not until too much of the job had been completed to make its development worthwhile.

3.4. Skin tone analysis

After the above filtering processes had been carried out, there was still a massive number of files left to be categorised. We would estimate that approximately 5–10% of these files were not images of people, or were images of dressed individuals. An additional filter that would calculate the amount of skin tone in a particular image was considered, but the development time for such a program was judged excessive given the constraints on the job. This filter would have been used as before, i.e. to present the categorising officer with a page of thumbnails that were unlikely to contain naked bodies; the officer would then decide whether this was the case and categorise all or some of them appropriately.
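The paper only considered such a filter; as an illustration, here is a crude sketch of one common approach, thresholding in YCbCr colour space with Pillow. The Cb/Cr ranges are textbook skin-tone values, not figures from the paper.

```python
from PIL import Image

def skin_fraction(path):
    """Fraction of pixels whose Cb/Cr values fall in a typical skin range."""
    with Image.open(path) as im:
        pixels = list(im.convert("YCbCr").getdata())
    skin = sum(1 for _y, cb, cr in pixels
               if 77 <= cb <= 127 and 133 <= cr <= 173)
    return skin / max(len(pixels), 1)

# Pages of thumbnails with a low skin fraction would be presented to the
# categorising officer as unlikely to contain naked bodies.
```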

3.4.1. Adult and bestiality images

Moggy also contained an option to categorise images that were not illegal in the UK but could provide relevant statistics to the prosecution and defence, as well as being useful in categorising future jobs; the intention was to build a database of hashes for future work. Whilst the in-house staff were quite rigorous in their approach to classifying the images, the police officers subsequently involved were not so thorough. At the time the first of these files was miscategorised, Moggy was not yet logging the categorising person, and so it was not possible to use the fraction of the database that contained these hashes on future work.


3.4.2. Hash weighting

A number of the hashes obtained from third-party sources for indecent images of children were found to have been miscategorised. The majority of the hashes and categorisations were, in our opinion, correct; it nevertheless followed that each file categorised by hash had to be re-evaluated to see whether the categorisation was correct. A system such as the following, giving a reliability score to each hash, could be used:

- Each hash (and correspondingly the image file and category) is weighted as unreliable when it is first added to the database.
- As the file is resubmitted by a different force or a different categorising officer, the weight is adjusted upwards if the categorisation is the same, or downwards if it differs.
- The process is repeated every time a hash is submitted.

It would then be trivial to create a filter allowing a reviewing officer to review just those files whose hash weighting was below a given threshold.
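A minimal sketch of this proposed scheme; the in-memory structure and step sizes are assumptions, since the system was only ever outlined.

```python
weights = {}  # md5 -> (category, weight)

def submit(md5, category, step=1):
    """Record one force's or officer's categorisation of a hash."""
    if md5 not in weights:
        weights[md5] = (category, 0)  # first sighting: weighted unreliable
        return
    known_category, weight = weights[md5]
    if category == known_category:
        weights[md5] = (known_category, weight + step)  # agreement
    else:
        weights[md5] = (known_category, weight - step)  # conflict

def needs_review(threshold=2):
    """The trivial filter: hashes still below the reliability threshold."""
    return [h for h, (_cat, w) in weights.items() if w < threshold]
```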

4. Summary

Given the tools, resources and budget for this case, it was clear that the only sensible route forward was a bespoke development program. Moggy was developed on time and on budget for the case, and its design goals and actual functionality were more than fit for purpose. The intention to market Moggy further, however, was overtaken by events: Trevor Fairchild of the Ontario Provincial Police had developed C4P (Categoriser for Pictures) (E-crime), which was now being released free of charge to police forces and investigators worldwide. The decision was taken not to compete with a free software package, and development of Moggy has now stopped.
