Information Processing and Management 43 (2007) 740–751 www.elsevier.com/locate/infoproman
Information redundancy across metadata collections Muriel Foulonneau
*
University of Illinois at Urbana-Champaign, Grainger Library, 1301 West Springfield Avenue, Urbana, IL 61801, United States Received 3 March 2006; received in revised form 23 June 2006; accepted 28 June 2006 Available online 28 August 2006
Abstract
Metadata records made available by content providers often lack the implicit information of their original use environment. Metadata aggregators therefore tend to emphasize completeness as a primary quality for shareable metadata. However, when adding implicit information to item-level records, data providers increase the redundancy of information contained in records from the same collection. The present paper reports on an effort to assess the extent and potential impact of information redundancy in metadata collections aggregated using the Open Archives Initiative Protocol for Metadata Harvesting. The first experiment quantifies the resemblance of metadata records on a collection-by-collection basis across 176 metadata collections aggregated for the CIC metadata portal. A second experiment measures the tendency of items from the same collection to appear together in results lists generated for a set of user queries. Results of the analyses correlate and suggest that within some collections item-level metadata records are not sufficiently differentiated to support certain digital library functions well. Metadata collections have a distinct role when included in larger aggregations, and in that role a minimum level of descriptive granularity is required to support digital library functions implemented by service providers. The experiments suggest possible ways to deal simultaneously with metadata record completeness, consistency, and redundancy.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Metadata; Collections; Open Archives Initiative Protocol for Metadata Harvesting; Similarity; Digital libraries
1. Introduction – information redundancy, collection identity and granularity issues in a metadata aggregation Digital library systems based on metadata harvesting aggregate heterogeneous collections of metadata records created by various actors. The coherent integration of metadata records is a major challenge for such systems. Shareable metadata guidelines (see for example DLF/NSDL, 2005) tend to emphasize completeness of information, contextualization (making explicit information left implicit in local usage), and the use of standard terminologies. Data providers are encouraged to make explicit (a best practice) the resource context (e.g., its collection title). Such contextual information is often identical for all the records that are members of a *
Tel.: +1 217 244 7809; fax: +1 217 244 7764.
0306-4573/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2006.06.004
given collection. Therefore, adding contextual information, while improving completeness, also tends to increase the rate of non-unique information in the item-level records of a collection.
A service provider (aggregator) typically will not use all metadata properties for all purposes. Items in a list of results usually include only a subset of properties from each metadata record. For example, only Title, Creator/Author and Description may be provided to the user Selecting records from a list of results (IFLA, 1998). Therefore, information in common across multiple metadata records within a collection may affect the various digital library functions implemented by a service provider in different ways. For several functions, the sub-record used may be duplicative ("approximately duplicate records") (Monge & Elkan, 1997).
Approaches to metadata record comparisons mostly aim to identify duplicate records as indicators of "resource equivalence" in the overall aggregation (Harrison, Elango, & Bollen, 2004; Khan, Maly, & Zubair, 2005; Lagoze et al., 2006; Nahm, Bilenko, & Mooney, 2002). However, information redundancy in metadata records does not automatically imply duplicate resources. Two resources may be accessed through the same URL but still be distinct. Conversely, two different URLs may point to the same resource. Generally, information redundancy across records from the same collection cannot be considered a definitive indicator of duplication of resources. Redundancy may only be an indicator of completeness of records, of consistency in cataloging practices and/or of a low level of item-level record customization in a collection (limited information included to differentiate one item from another).
The CIC metadata portal was confronted with evident cases of near-duplicate records inside some collections of harvested metadata.
The issue, however, was not that the same resource appeared in multiple collections but that the metadata records contained in some collections were not sufficiently differentiated from one another. Implementing an aggressive de-duplication algorithm would have meant removing access to valuable resources indexed in the CIC metadata portal. Instead, it was decided to demonstrate to data providers the actual impact of exposing near-duplicate records, then let them assess whether the representation of their resources in the aggregation was optimal. With this information, data providers will also be better able to judge the usability of each individual collection for particular functions or tasks performed on the aggregation.
The collections described in the present paper are sets of items defined by data providers. Franconi (2000) suggests that to adequately represent knowledge in digital library applications, collections should be considered not only as sets of items but as aggregate objects (collections represented by inherent vs. contextual properties in Hill, Janée, Dolin, Frew, & Larsgaard, 1999). Item-based digital library applications gather sets of items without consideration for the original aggregate objects created by content providers. However, similarity among metadata records can re-create the original sets of items within the applications, because similar records respond in a similar pattern to the requests of the digital library applications. They create "natural" clusters inside the aggregation.
Information redundancy inside collections does raise the question of which descriptive granularity (the aggregate object or its components) is best suited to specific user needs. If all records of a collection are about Illinois and a user requests resources about Illinois, the system can display all the records of that collection.
Alternatively, it is possible to display a collection-level description rather than item-level descriptions of all records of the collection.
The present paper reports on the results of a series of quantitative studies to assess information redundancy across item-level records belonging to individual collections in the CIC metadata aggregation. The first experiment quantifies the extent of metadata record similarity in the CIC metadata collections. The second experiment provides an indication of one potential impact of similarity on retrieval. Both analyses apply common methodologies used to measure similarity and assess collections in information retrieval systems. The experiments presented are designed to help in understanding the impact of the structure of metadata collections on their usability in a digital library aggregation.

2. The CIC metadata collections
The CIC metadata portal aggregates more than 500,000 metadata records from 11 universities, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH [http://www.openarchives.org/]). Represented are 187 collections, containing between 1 and 85,000 items each.
Each item in the aggregation belongs to one and only one collection. Aggregation structure is based on collections defined by the way in which data providers chose to share their metadata using OAI-PMH, i.e., according to how they organized their OAI repositories and sets. Collections defined by data providers are assumed to reflect a certain homogeneity as to the content type, the topicality and/or the provenance of items within each collection (see the discussion in Powell, Heaney, & Dempsey (2000) about collection identity). They are also assumed to have been developed in a coherent context and therefore to reflect consistent metadata creation practices. Sometimes they are the result of a digitization project, and the same person has created all the corresponding metadata. Sometimes the metadata had been created by a cataloger when the resource was only analog. Sometimes the collection is based on contributors creating metadata themselves using a common data entry interface that constrains metadata record structure and data values (e.g., as for institutional repositories).
One hundred and seventy-six collections, representing 533,378 records, were considered for the experiments reported in the present paper. Collections with fewer than 10 records and collections usually harvested in MODS format [http://www.loc.gov/standards/mods/] were excluded. All the experiments were run against as-harvested Simple Dublin Core metadata records, before any reprocessing. Harvested Qualified Dublin Core records were dumbed down to Simple Dublin Core by transforming all qualified refinements into their associated Simple Dublin Core property.

3. Similarity of metadata records
Redundancy in metadata records can be analyzed by identifying exactly matching metadata property values (e.g., Stvilia, Gasser, Twidale, Shreeves, & Cole, 2004). However, it may also be desirable to identify "nearly duplicative" records.
Similar metadata records may be effectively redundant for many purposes (e.g., retrieval). If records were created for every single page of a digitized book, then the difference between two metadata records from this collection might be only a URL (Identifier field) and a page number (Description field in this case). The metadata properties used are identical and the lengths of the metadata values are similar; only the values themselves differ, and the Description property value differs by a single character (the page number). Our hypothesis was that the presence of such similar records could have an impact on retrieval and selection of records in the context of a digital library system built on top of a large aggregation.

3.1. Methodology
Strategies to compare the similarity of two metadata records for deduplication are usually based on iterative similarity comparisons of element values, i.e., the strings contained in each metadata field (see for example Khan et al., 2005). Harrison et al. (2004) use alternative metrics (vector space representation) which measure the direct impact on the performance of full-text retrieval algorithms. There are different approaches available to measure string similarity (Chapman, 2006). For the present assessment, the algorithm selected needed to allow for comparing metadata practices applied to related item-level records rather than for calculating an average similarity between Description fields over the whole database.
A metadata record Resemblance measure, based on a q-grams approach (Broder, 1997; Broder, 2000), was calculated pairwise for sample records taken from each metadata collection. The Resemblance was defined using 4-term shingles (four-shingling). In this model a shingle is a sequence of four consecutive terms (punctuation excluded). All possible unique shingles are defined for each metadata record (S1 for record R1, S2 for record R2).
\[ \mathrm{Resemblance}(R_1, R_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} \]
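The Resemblance measure defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenizer (lowercasing, splitting on word characters) and the treatment of records shorter than four terms are not specified in the paper, and the function names are hypothetical.

```python
import re


def shingles(text, size=4):
    """Return the set of unique shingles (tuples of `size` consecutive terms)."""
    # Assumption: terms are lowercased runs of word characters, which drops
    # punctuation. The paper does not describe its tokenizer.
    terms = re.findall(r"\w+", text.lower())
    return {tuple(terms[i:i + size]) for i in range(len(terms) - size + 1)}


def resemblance(text1, text2, size=4):
    """Shared shingles divided by all distinct shingles of the two records."""
    s1, s2 = shingles(text1, size), shingles(text2, size)
    union = s1 | s2
    if not union:
        # Assumption: records too short to yield a shingle count as dissimilar.
        return 0.0
    return len(s1 & s2) / len(union)


record_a = "the collected works of abraham lincoln volume 1"
record_b = "the collected works of abraham lincoln volume 2"
print(resemblance(record_a, record_b))
```

The two sample strings share four of their six distinct shingles; only the shingle containing the final term differs, which mirrors the page-number case described above.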
Table 1. Data recorded to measure records similarity inside CIC collections

Record 1 identifier (R1) | Record 2 identifier (R2) | # Shingles record 1 (S1) | # Shingles record 2 (S2) | # Similar shingles (S1 ∩ S2)
7404 | 7403 | 209 | 209 | 196
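As a worked check, the counts in Table 1 can be substituted into the Resemblance definition. This assumes that "# Similar shingles" is the size of the intersection of the two shingle sets, so that the size of the union follows by inclusion-exclusion.

```latex
\begin{align*}
|S_1 \cup S_2| &= |S_1| + |S_2| - |S_1 \cap S_2| = 209 + 209 - 196 = 222 \\
\mathrm{Resemblance}(R_1, R_2) &= \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{196}{222} \approx 0.88
\end{align*}
```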
The measure results in a value between 0 and 1. For example, the data in Table 1 were recorded for the comparison of two records, R1 and R2. For purposes of this analysis, the homogeneity of a collection was defined as the average Resemblance (mean) found as the result of pairwise comparisons performed over a representative sample of all records in the collection. A similar analysis was made using an edit distance methodology (Levenshtein, described in Chapman, 2006). Though not described here, this analysis provided equivalent results for the collections analyzed.
In order to assess the similarity of metadata records as a whole, property assignments were ignored. (An analysis keyed to similarity measures relative to a specific function might analyze properties more selectively.) All original property values from each metadata record were concatenated into a database field of a new database table (Metablob). Metablob contains the recordID and the concatenation of all its metadata property values. When concatenating, metadata elements were sorted by property then value. As a result, a similarity algorithm sensitive to term ordering could be applied usefully. One Metablob table record exists for each original metadata record. A Colls table associates every record with a collection (each record being a member of a single collection). Then a program was written to randomly select up to 100 records in each collection and compare each of those records to all others in the sample (pairwise comparison). This represents a maximum of 100² (10,000) comparisons for each collection having 100 records or more.

3.2. Resemblance of metadata records in CIC collections
The Thesis and Dissertations of Ohio State University (etd.ohiolink.edu.osu) collection, having very different item-level records, has a Resemblance of 0.04, while the much more homogeneous Borobudur collection of artifacts from the Borobudur temple in Indonesia (lib.umich.edu.borobudurbib) has a Resemblance of 0.82. In the Knight's American Mechanical Dictionary collection (lib.umich.edu.kdimgbib), the difference between the two records presented in Fig. 1 is a URL and a number, as highlighted, although several terms vary in other records. Its Resemblance is 0.8. At the opposite end of the spectrum, Fig. 2 presents two dissimilar records from the Ohio State institutional repository, the Knowledge Bank.
Fig. 3 presents the Resemblance values for 176 collections. The results show the prevalence of collections with a low Resemblance between records. However, a small number of collections have a high Resemblance measure (15 collections above 0.7). Choosing a Resemblance threshold on which to build a filter for "near-duplicate" records is not the objective of the present paper and depends on other parameters, for example the actual properties used for certain functions. However, a collection with 0.97 Resemblance (such as lib.umich.edu.brutbib) would be likely to be considered a collection of "nearly-duplicate" records in any system implementing a deduplication
Concatenation of properties for two records of the lib.umich.edu.kdimgbib collection http://name.umdl.umich.edu/IC-KDIMG-X-BBN7481-UND0002-UND-001-UND00000470]BBN7481_0002_001_00000470 Knight's American Mechanical Dictionary jpeg These pages may be freely searched and displayed. Permission must be received for subsequent distribution in print or electronically. Please go to http://www.umdl.umich.edu/ for more information. Knight's American mechanical dictionary image Industrial arts, Dictionaries, Mechanical engineering, Technology, Inventions 2; Knight's American mechanical dictionary : being a description of; tools, instruments, machines, processes, and engineering : history of; inventions : general technological vocabulary : and digest of mechanical; appliances in science and the arts; 3 v., 75 leaves of plates; Illustrated with upwards of five thousand engravings. UND Knight, Edward Henry
http://name.umdl.umich.edu/IC-KDIMG-X-BBN7481-UND0003-UND-001-UND00000035]BBN7481_0003_001_00000035 Knight's American Mechanical Dictionary jpeg These pages may be freely searched and displayed. Permission must be received for subsequent distribution in print or electronically. Please go to http://www.umdl.umich.edu/ for more information. Knight's American mechanical dictionary image Industrial arts, Dictionaries, Mechanical engineering, Technology, Inventions 3; Knight's American mechanical dictionary : being a description of; tools, instruments, machines, processes, and engineering : history of; inventions : general technological vocabulary : and digest of mechanical; appliances in science and the arts; 3 v., 75 leaves of plates; Illustrated with upwards of five thousand engravings. UND Knight, Edward Henry
Fig. 1. Comparison between two records of the lib.umich.edu.kdimgbib collection.
Fig. 2. Metadata records from the Ohio State University Knowledge Bank.
[Figure: histogram. y-axis: Number of collections (0 to 80); x-axis: Resemblance, in ten intervals of width 0.1 from 0 to 1.]
Fig. 3. Skewed distribution of resemblance measure in CIC collections.
algorithm. Indeed, the very small differences between item-level records from certain collections may indicate that the granularity of descriptions (item level) is too fine for efficient usage of the metadata records in the
context of the CIC metadata aggregation. A single record or a collection-level description may be sufficient to represent the collection in the CIC aggregation.

3.3. Similarity of sub-records
Most functions implemented in the CIC metadata portal do not use complete metadata records. It is therefore important to compare the similarity not only of entire records but also of sub-records, which may be used for example in the display of search results to enable users to Select resources. The Title, Subject, Description and Creator properties were chosen to evaluate the potential differentiation among sub-records, since these properties are most often used for Finding and Selecting resources from a list of results. They are also typically the properties whose values vary most (large entropy) in aggregated metadata collections, as demonstrated in Stvilia et al. (2004).
For sub-record analysis, values of these four properties were concatenated into a database field of a new database table (MetablobSub), similar to the Metablob table described above. As before, when concatenating, values were sorted by property then value. The similarity of the sub-records was then evaluated using the same methodology as described in Section 3.1 above. The intention was to assess whether selecting a sub-record for display in a list of results or for searching decreased the Resemblance of the various CIC collections. Intuitively, use of the sub-record was expected to decrease the calculated Resemblance of most collections.
The comparison for 176 collections (Fig. 4) shows that for most collections the sub-records analyzed were noticeably less similar than full records within the same collection. In a minority of cases (11 collections), the Resemblance of the collection was higher (differentiation decreased) for sub-records than for full records from the same collection. In one other case, it remained the same.
The collections having a higher Resemblance for the sub-records than for the full records have a high full-record Resemblance (between 0.78 and 0.97). It is noticeable, however, that records from the scout.wisc.edu.scout collection have such a low Resemblance that the sub-record is only slightly less similar (Resemblance of 0.01 for the full record and 0 for the sub-record). Since most elements are different, the comparison to the sub-record is less significant.
Fig. 4 shows that even collections with very similar records can have sub-records with a low Resemblance (below 0.2). Other collections with non-duplicate records have fully duplicate sub-records. The lib.umich.edu.kdimgbib collection has a Resemblance of 0.8 for the full record but 1.0 for the sub-record.
The above analyses of metadata record similarities demonstrate that a significant minority of collections include records with extremely similar content. This, however, does not tell the full story as to the actual impact the presence of similar records can have on a digital library system. For instance, the true impact of information redundancy on retrieval is the result of an interaction between data and user queries in the retrieval framework designed in a specific digital library system (e.g., the retrieval algorithm). It is necessary to assess whether, despite high Resemblance, records still bear enough differentiated information to allow users to efficiently retrieve individual items in the collection.
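The sampling procedure of Section 3.1, which Section 3.3 reuses for sub-records, might be sketched as follows. This is a hedged illustration rather than the program used in the study: records are assumed to be dictionaries mapping a property name to a list of values, the tokenizer is the same assumption as before, and all function and variable names are hypothetical. The sketch averages over unordered pairs of distinct records, which gives the same mean as comparing each sampled record to all others.

```python
import random
import re
from itertools import combinations

# Properties used for the sub-record analysis in Section 3.3.
SUB_RECORD = {"title", "subject", "description", "creator"}


def metablob(record, properties=None):
    """Concatenate property values, sorted by property then value."""
    pairs = sorted(
        (prop, value)
        for prop, values in record.items()
        if properties is None or prop in properties
        for value in values
    )
    return " ".join(value for _, value in pairs)


def resemblance(text1, text2, size=4):
    """Shared 4-term shingles divided by all distinct shingles."""
    def shingles(text):
        terms = re.findall(r"\w+", text.lower())  # assumed tokenizer
        return {tuple(terms[i:i + size]) for i in range(len(terms) - size + 1)}

    s1, s2 = shingles(text1), shingles(text2)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0


def homogeneity(records, properties=None, sample_size=100, seed=0):
    """Mean pairwise Resemblance over a random sample of one collection."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    blobs = [metablob(record, properties) for record in sample]
    scores = [resemblance(a, b) for a, b in combinations(blobs, 2)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical page-level records that differ only in a page number and a URL.
pages = [
    {
        "title": ["Knight's American mechanical dictionary"],
        "description": [f"page {n}"],
        "identifier": [f"http://example.org/{n}"],
    }
    for n in range(1, 6)
]
print(homogeneity(pages))                         # full records
print(homogeneity(pages, properties=SUB_RECORD))  # sub-records
```

Passing a set of property names restricts the concatenation to a sub-record, so the same function covers both the Metablob and the MetablobSub analyses.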
[Figure: "Resemblance of collection for sub-records and full records". y-axis: Resemblance (0 to 1.2); x-axis: Collections; series: Resemblance sub-record, Resemblance full record.]
Fig. 4. Comparison of Resemblance indicators between records of the same collections (sub-records and full records).
4. Impact of information redundancy on search and discovery
Ideally, in order to optimize the task of Finding records (IFLA, 1998), a system must guarantee on the one hand that the records of a given collection are all retrieved if they all match the query (e.g., if the query is for pictures, and all items in a collection are pictures). On the other hand, for a more discriminating query, item-level descriptions should be sufficiently differentiated to return only the one or two items from a collection relevant to the query, rather than the whole collection, if the collection as a whole is less relevant to the user's information need.
In order to optimize the performance of the Select task, the normalization of several fields (e.g., the Type) might allow using symbols to succinctly label resources discovered (e.g., a speaker icon for an audio recording) (Foulonneau & Cole, 2005). However, a list of results should also display information that differentiates one resource from another. An effective sub-record for Selection should include both types of information (i.e., normalized and differentiated).
In order to assess the impact of information redundancy on search and discovery, a test was conducted on the CIC aggregation with real user queries.

4.1. Methodology of the collection discovery experiment
User queries from OAIster service (http://www.oaister.org/) transaction logs were launched against the CIC metadata aggregation. OAIster is a well-known generalist OAI-based system. Most resources included in the CIC aggregation are also indexed in OAIster. A sample of 10,473 logs gathered over 13 months was analyzed. From those logs, 5192 unique queries (not considering stop words and redundant terms) were extracted. For this experiment, the query field selected by the user was ignored.
Whatever field(s) the original user query targeted (whole record, Title only, Subject only or Author only), the query terms were launched against complete records (Metablob table) of the CIC aggregation. A full-text index was created for the Metablob table. The retrieval algorithm used was the standard full-text search algorithm integrated within Microsoft SQL Server 2000. For each query which generated one or more item-level matches (4612 of the 5192 unique queries), the analysis application recorded a query identifier, the record Identifier and the Type of Match (full or partial).

4.2. Tendency of records from the same collection to appear together in a list of results
In order to assess the impact of the various indicators of information redundancy, the experiment identified collections whose records consistently appear together in lists of results. Fig. 5 shows for each collection the percentage of records from the collection returned for each query having at least one record match in the collection. The vertical bars (one per collection) represent the total (100%) of matching queries for each of the collections. Each vertical bar is subdivided (indicated by color change) to show the proportion of matching queries which retrieve: less than 10% of the items in the collection (bottom of
[Figure: stacked bars, one per collection. y-axis: % queries with matches (0% to 100%); x-axis: Collections; series: % queries with 0-10% matches, % queries with 10-99% matches, % queries with =100% matches.]
Fig. 5. Collection discovery experiment: percentage of matching queries and percentage of corresponding matching records (full matches only) for each collection.
[Figure: "Affinity of collection records for retrieval". Histogram. y-axis: Number of collections (0 to 120); x-axis: Affinity measure, in ten intervals of width 0.1 from 0 to 1.]
Fig. 6. Distribution of collections records affinity for retrieval.
bar, darkest color); between 10% and 99% of the records in the collection (middle section, intermediate color); and 100% of records in the collection (top, lightest color section of each bar). It is clear that for five collections, queries always retrieved either the whole collection (100% of items) or none of the items of the collection (clear bars on the right-hand side of Fig. 5). For other collections, the dark part (less than 10% of items in collection retrieved together) clearly dominates. A few collections have significant intermediary results (10–99% of item-level records matching). According to the sample of queries tested, a small number of collections contain mostly or wholly "redundant" item records for the purposes of search/retrieval. For 166 collections out of 176, at least one query matched all item-level records in the collection. For a typical collection, 80% of the queries matching at least one item match less than 10% of items.
In order to quantify the tendency of the items of a collection to appear together in a list of results, an indicator was created, which was named "Affinity for retrieval" (Af). This was calculated for each collection (C) as the sum of the proportions of items (i) matching each query in its entirety (Mi) divided by the number of queries for which the collection had at least one item with a full match (q). The Af indicator is always between 0 and 1.
\[ \mathrm{Af}(C) = \frac{\sum (M_i / i)}{q} \]
Fig. 6, depicting the distribution of Af values for the collections analyzed, shows again that a small number of collections are redundant for retrieval purposes as measured with the sample queries considered (Af close to 1). Moreover, it shows a substantial difference between a vast majority of collections which have a very low Af indicator and those few collections having extremely high values of Af (i.e., near 1.0). The median of Af is 0.08.

5. Correlation between information redundancy and retrieval
The measure of the impact of information redundancy on search and discovery (Af) was compared to the indicators of similarity based on similarity analyses of the records themselves. Table 2 gives Af and Resemblance values for a subset of collections representing the full range of Resemblance values observed, from the very similar records of lib.umich.edu.brutbib to the very dissimilar ones from scout.wisc.edu.scout. Three collections were extracted from the collections having the highest Resemblance values, three others from the most dissimilar ones, and finally four were added from the other collections to represent collections in between the extremes. The first and second columns represent the percentage of the queries having at least one item match in a collection which retrieved respectively 0–10% of the collection items and 100% of the collection items. Table 2 illustrates that CIC collections include extreme cases of both redundancy and diversity of records.
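The Af indicator from Section 4.2 and the per-collection query breakdown plotted in Fig. 5 can be sketched as below. The input layout (one count of fully matching items per query, for a single collection) and the function names are assumptions made for illustration; they are not taken from the analysis application used in the study.

```python
def affinity_for_retrieval(collection_size, matches_per_query):
    """Af(C): mean fraction of the collection's items matched per query.

    `matches_per_query` holds, for each query, the number of items in the
    collection that fully matched it. Queries with no match in the
    collection are ignored, so q counts only matching queries.
    """
    fractions = [m / collection_size for m in matches_per_query if m > 0]
    if not fractions:
        return None  # Af is undefined when no query matched the collection
    return sum(fractions) / len(fractions)


def match_profile(collection_size, matches_per_query):
    """Shares of matching queries retrieving <10%, 10-99% and 100% of items."""
    matching = [m for m in matches_per_query if m > 0]
    if not matching:
        return None
    low = sum(1 for m in matching if m * 10 < collection_size)
    full = sum(1 for m in matching if m == collection_size)
    middle = len(matching) - low - full
    q = len(matching)
    return (low / q, middle / q, full / q)


# Hypothetical collection of 200 items and five queries.
counts = [200, 200, 3, 0, 50]
print(affinity_for_retrieval(200, counts))
print(match_profile(200, counts))
```

In this made-up example four queries match the collection, two of them retrieving every item, so Af is pulled well above the median value reported for the CIC collections.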
Table 2. Indicators of similarity for a sample of collections

Collection | % Queries with 0–10% items retrieved | % Queries with =100% items retrieved | Affinity for retrieval | Resemblance of full record
lib.umich.edu.brutbib | 0 | 1 | 1 | 0.97
digital.lib.umn.edu.gpgovman | 0 | 1 | 1 | 0.86
lib.umich.edu.borobudurbib | 0 | 1 | 1 | 0.82
lib.umich.edu.ppotpusbib | 0.3 | 0.48 | 0.66 | 0.51
lib.umich.edu.cjsbib | 0.75 | 0.09 | 0.46 | 0.31
digital.lib.umn.edu.shm | 0.85 | 0.06 | 0.1 | 0.26
lib.umich.edu.postidbib | 0.88 | 0.1 | 0.12 | 0.19
etd.ohiolink.edu.osu | 0.99 | 0 | 0.01 | 0.04
iubio.bio.indiana.edu.biosoft.OAI2.iubio | 0.98 | 0 | 0.01 | 0.03
scout.wisc.edu.scout | 0.96 | 0 | 0.02 | 0.01
Average (mean) in the CIC repository | 0.82 | 0.13 | 0.16 | 0.39
Table 3. Correlation (Pearson) between Resemblance and Affinity indicators for the CIC collections

 | Affinity for retrieval | % Queries with 0–10% matches | % Queries with =100% matches
Resemblance full record | 0.6 | 0.55 | 0.64
Resemblance sub-record | 0.73 | 0.68 | 0.77
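For readers who want to reproduce this kind of comparison, the product-moment correlation behind Table 3 can be computed as sketched below. The data used here are only the ten sample collections of Table 2, which were deliberately chosen to span the extremes, so the value this sketch prints should not be expected to match the coefficients reported in Table 3 for all CIC collections. The function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Affinity for retrieval and full-record Resemblance, copied from Table 2.
affinity = [1, 1, 1, 0.66, 0.46, 0.1, 0.12, 0.01, 0.01, 0.02]
resemblance_full = [0.97, 0.86, 0.82, 0.51, 0.31, 0.26, 0.19, 0.04, 0.03, 0.01]
print(pearson(affinity, resemblance_full))
```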
The product-moment correlation (Pearson coefficient) was calculated between the various indicators (Table 3) for all CIC collections. The Af metric for the CIC collections for the purposes of search and discovery clearly correlates with string similarity measures for those same collections, but the strength of the correlation is only moderate. The Scout collection at the University of Wisconsin clearly includes very different information inside its metadata records. As a consequence, when a record matches a query, the rest of the collection usually does not match the same query. On the other hand, five collections always appear en masse in results lists. For those five collections, the similarity measure showed a very low difference between metadata records. The Af measure for these collections was less evenly distributed across the range than the Resemblance measure (skewness of the distribution).
Apart from the measure of similarity between complete metadata records, additional factors may impact the relative performance of item-level records in facilitating the discovery of a collection as a whole. As an example, the properties which contributed to the match may or may not be similar, depending on the collections considered. The higher correlation between Af values and sub-record Resemblance values suggests a possible extension of this statistical analysis based on the potential of various metadata properties to contribute to the discovery and differentiation of records.
While this experiment has inherent limitations due to the methodology adopted for assessing impact on retrieval (full matches on any property, set of queries used), it clearly suggests major differences in the behavior of metadata collections when included in a large metadata aggregation.
Differences manifest themselves in retrieval (all the records are always found together), ranking (if the matches are the same and the metadata all nearly the same, then a ranking algorithm based on metadata content and features will be ineffectual) and selecting (a user may have little information on which to pick one record rather than another from the same collection).

6. Strategies to handle information redundancy in metadata aggregations
When aggregating metadata records, service providers themselves often normalize certain properties in order to represent data in a consistent manner. This can increase the relative similarity of records from the same collection. Record completeness, consistency of cataloging practices and differentiation of resources in the same metadata collections have the potential to be contradictory requirements.
When creating metadata records, catalogers may add the same information to a portion or all of the records of a collection. Several fields tend to be more constant than others across collections (e.g., when using controlled terminologies). Information redundancy also depends on the collection development policy and other influences leading to association of items with a particular collection. Common features of records across collections reflect metadata practices and collection definition practices. These can help determine the identity of the collection within the aggregation. User queries may match a property for which collection common values are:
Essential: the raison d'être of the collection (e.g., if the collection was built as a collection of pictures by Charles W. Cushman, all items may naturally have the same Creator and the same Type properties).
Incidental: common features of the collection (the pictures are all published in JPG format; however, if a new picture is added in PNG format, it can still be part of the collection).
Marginal: applying only to a subset of the items or even a single item in the collection. If two items in the collection represent a dog, a collection-level representation will not allow discovering the collection. All item-level descriptions are necessary to discover that the collection includes a photo of a dog, for instance.
Some users performing queries on topicality might be best served with a full collection, rather than with 1000 item-level records. A query on "Abraham Lincoln" matches all the item-level records of the University of Michigan collection named "The Collected Works of Abraham Lincoln". In this case a collection-level description would be sufficient.
As stated above, item ranking may be arbitrary if there are insufficient differentiating features available to the search engine (on the other hand, ranking algorithms based on the popularity of a resource could in certain cases succeed in differentiating items that have highly similar metadata descriptions). In other cases, a collection may contain only a few items relevant to the request. A single poster representing Abraham Lincoln is part of the "Social Welfare History Archives (Hygiene Posters)" collection of the University of Minnesota; in that case, the collection as a whole is not relevant. When a large number of items in a collection match a query, the application may avoid displaying 30 pages of results containing similar snippets.

A number of strategies can be considered to contain the negative effect of information redundancy in metadata records. Adding a downgraded version of the resource, such as an extract of the full text or a thumbnail, may help differentiate records in a list of results (Foulonneau et al., 2006). Contextual information can also remain in collection-level records, limiting the amount of redundant information in the item-level records, while still being available and useful for discovery of both item-level and collection-level resources. To identify whether a collection rather than any specific item is relevant, it may be possible either to find matches in a collection-level description or to compute the percentage of item-level matches in the collection (Foulonneau, Cole, Habing, & Shreeves, 2005). Using collection-level descriptions only, or as the primary entry point (as in the IMLS digital collections registry [http://imlsdcc.grainger.uiuc.edu/collections/]), does not allow discovering, representing or retrieving items on properties which are marginal in the collection.
In such circumstances, enriching collection-level access with constituent item-level descriptions can be a useful way to protect against overlooking relevant resources.

Digital library systems built on top of large metadata aggregations must consider the nature and identity of the metadata collections being aggregated. If item-level descriptions contain a large proportion of essential and incidental features common to all items in the collection, then the differences between item-level descriptions may not allow adequate differentiation between items. Conversely, if item-level descriptions contain no common attributes (as in the collections with very low Resemblance), then the content of those items may not be adequately represented by a collection-level description. Multiple levels of granularity must be dealt with to support various digital library functions. The CIC aggregation is based on item-level record sharing. Other types of aggregations, such as the IMLS digital collections, will appear in the next few years, allowing for work with more complete collection descriptions and with collections constituted and shared according to different criteria. They will allow a better interaction between different levels of granularity in digital library systems.
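The idea of using the percentage of item-level matches to decide whether a whole collection is relevant can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function names `match_ratio` and `suggest_granularity`, and the 0.8 cut-off are hypothetical and are not taken from the CIC portal or from the cited work.

```python
from collections import Counter

# Hypothetical item-level records: only an identifier and the
# collection each record belongs to are needed for this sketch.
RECORDS = [
    {"id": "l1", "collection": "lincoln"},
    {"id": "l2", "collection": "lincoln"},
    {"id": "l3", "collection": "lincoln"},
    {"id": "p1", "collection": "posters"},
    {"id": "p2", "collection": "posters"},
    {"id": "p3", "collection": "posters"},
    {"id": "p4", "collection": "posters"},
]


def match_ratio(records, matched_ids):
    """Return, per collection, the share of its records that matched a query."""
    totals = Counter(r["collection"] for r in records)
    hits = Counter(r["collection"] for r in records if r["id"] in matched_ids)
    return {coll: hits[coll] / totals[coll] for coll in totals}


def suggest_granularity(ratio, threshold=0.8):
    """Suggest a display level for one collection.

    The threshold is an arbitrary illustrative value, not a figure
    reported in the study.
    """
    return "collection" if ratio >= threshold else "items"


if __name__ == "__main__":
    # Assume a query matched every "lincoln" record and one poster.
    ratios = match_ratio(RECORDS, {"l1", "l2", "l3", "p2"})
    for coll, ratio in ratios.items():
        print(coll, ratio, suggest_granularity(ratio))
```

With this kind of measure, a collection in which nearly every record matches could be presented through its collection-level description, while a collection with a single matching poster would still be presented item by item. Where the cut-off should lie is an open design question that depends on the service.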
7. Conclusions

Metadata usability assessment was performed by analyzing the interaction between user input (e.g., user queries) and the responses of the system. The study of the CIC aggregation suggests a "normal" behavior of metadata collections. This model could be generalized through future studies in order to identify collections that are potentially ill-adapted to their new environment and therefore present a risk of low performance. The study demonstrates that, even though item-level based digital library applications ignore collections as aggregate objects, metadata records from the same collection (as defined by content providers) often naturally behave in a similar manner in response to the tasks performed. Future digital libraries such as those suggested by Franconi (2000) and Lagoze et al. (2006) will have to coherently represent collections as both sets of items and aggregate objects.

The level of granularity of descriptions best adapted to the user tasks in the digital library system may differ according to the context in which the metadata records are being used. In particular, the level of granularity of description can be well adapted to a specialized service but not to a generalist service. In this context, duplicate filtering mechanisms based on metadata record features may be ineffective in dealing with the information redundancy issue. If the collection behavior is atypical for all functions, this suggests that the granularity of description or the record structure is not well adapted to the service. Accordingly, metadata creation or sharing practices have to be reconsidered by the data provider. Alternatively, if the behavior of a collection is atypical for a particular function, for instance selecting records from a list of results, then the service provider must implement a mechanism which allows using each individual record for retrieval but a collective representation of the matching records for the list of results.
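One possible shape for such a mechanism is sketched below: retrieval still operates on individual records, and only the results list is collapsed into one entry per collection. The function name `group_by_collection`, the input format and the `max_examples` parameter are assumptions made for illustration; they do not describe the CIC metadata portal implementation.

```python
def group_by_collection(ranked_hits, max_examples=2):
    """Collapse a ranked hit list into one display entry per collection.

    ranked_hits: list of (record_id, collection_id) pairs, best match first.
    Each entry keeps the total number of matching records and a few
    example record identifiers. Entries are ordered by the rank of the
    best hit of each collection (dicts preserve insertion order).
    """
    groups = {}
    for record_id, collection_id in ranked_hits:
        entry = groups.setdefault(
            collection_id,
            {"collection": collection_id, "count": 0, "examples": []},
        )
        entry["count"] += 1
        if len(entry["examples"]) < max_examples:
            entry["examples"].append(record_id)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical ranked hits for a single query.
    hits = [
        ("l1", "lincoln"),
        ("p2", "posters"),
        ("l2", "lincoln"),
        ("l3", "lincoln"),
    ]
    for group in group_by_collection(hits):
        print(group)
```

A results page built this way would show one line per collection with a match count and a couple of representative items, instead of many pages of near-identical snippets. The full set of matching records remains available if the user expands a group.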
A collection-level description is not always available to represent the collection. It is, however, possible to partially infer collection-level features from the information redundancy found in item-level metadata records in order to represent the set of items. Collection-level descriptions are enriched in the CIC metadata portal based on item properties. Further study of the generation of collection-level representations based on record similarity will be carried out in order to improve the manner in which the system takes advantage of item-level information to handle collections.

The customization of item-level records and their subsequent behavior in a metadata aggregation are the result of metadata creation practices as well as collection identity. The collections represented in the CIC metadata portal are defined by data providers, most of the time based on content-related criteria. Cross-collection retrieval strategies such as those of Rasolofo, Abbaci, and Savoy (2001) take into account both the relevance of items and the relevance of the collection. However, in that context, collections are often defined according to technical criteria. They are equivalent to search targets (see Hill et al., 1999). They often do not reflect a similar custodial history and consistent metadata practices. The development of applications using collections as defined by their content and provenance rather than by technical criteria can allow content providers to correlate the collections' discoverability with specific metadata creation practices. Based on the analysis of metadata collections in an aggregation, it is then possible to improve the efficiency of digital library systems through the content provider's metadata sharing practices and the service provider's digital library implementations. User testing suggests that users dislike seeing too many items from the same collection in a list of results (e.g., Shreeves & Kirkham, 2004).
Clearly, information redundancy across item-level records within a collection contributes to the problem. The threshold of how much information redundancy is too much depends on context and on user perceptions. The present results should be refined by more directly measuring the perception of users regarding the problems created by aggregating collections of relatively undifferentiated item-level metadata.

Acknowledgements

This work was supported by a grant from the Committee on Institutional Cooperation's Center for Library Initiatives. We acknowledge the libraries of the following participating CIC member institutions for providing metadata and participating in the discussions on metadata sharing: University of Chicago, University of Illinois at Chicago, University of Illinois at Urbana-Champaign, Indiana University, University of Iowa, University of Michigan, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Pennsylvania State University, Purdue University and the University of Wisconsin-Madison. We acknowledge the University of Michigan for sharing the transaction logs of the OAIster service. The author thanks Timothy W. Cole, Thomas H. Habing and William H. Mischo for discussions and editorial support.

References

Broder, A. Z. (1997). On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997, Proceedings, 11–13 June 1997 (pp. 21–29). http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=666900.

Broder, A. Z. (2000). Identifying and filtering near-duplicate documents. In R. Giancarlo & D. Sankoff (Eds.), Combinatorial pattern matching: 11th Annual symposium, CPM 2000, Montreal, Canada, June 21–23, 2000. Lecture notes in computer science (Vol. 1848/2000). Springer-Verlag GmbH, ISSN: 0302-9743, p. 1. http://www.springerlink.com/openurl.asp?genre=article&issn=03029743&volume=1848&spage=1.

Chapman, S. (2006). Sam's string metrics. http://www.dcs.shef.ac.uk/~sam/stringmetrics.html.

Digital Library Federation/National Science Digital Library (2005). Best practices for shareable metadata. http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?PublicTOC.

Foulonneau, M., & Cole, T. W. (2005). Strategies for reprocessing aggregated metadata. In Ninth European conference on digital libraries, ECDL 2005, September 18–23, 2005, Vienna, Austria. Proceedings series: Lecture notes in computer science. Heidelberg: Springer. http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/11551362_26.

Foulonneau, M., Cole, T. W., Habing, T. G., & Shreeves, S. L. (2005). Using collection descriptions to enhance an aggregation of harvested item-level metadata. In Proceedings of the fifth ACM/IEEE-CS joint conference on digital libraries 2005. New York: ACM. http://portal.acm.org/citation.cfm?doid=1065385.1065393.
Foulonneau, M., Habing, T. G., & Cole, T. W. (2006). Automated capture of thumbnails and thumbshots for use by metadata aggregation services. D-Lib Magazine. http://www.dlib.org/dlib/january06/foulonneau/01foulonneau.html.

Franconi, E. (2000). Knowledge representation meets digital libraries. 1st DELOS (Network of Excellence on Digital Libraries) workshop on "Information Seeking, Searching and Querying in Digital Libraries", December 2000, Zurich, Switzerland.

Harrison, T. L., Elango, A., Bollen, J., & Nelson, M. (2004). Initial experiences re-exporting duplicate and similarity computations with an OAI-PMH aggregator. arXiv Report, cs.DL/0401001.

Hill, L. L., Janée, G., Dolin, R., Frew, J., & Larsgaard, M. (1999). Collection metadata solutions for digital library applications. Journal of the American Society for Information Science, 50(13), 1169–1181.

IFLA study group on the functional requirements for bibliographic records (1998). Functional requirements for bibliographic records. IFLA publications (Vol. 19). http://www.ifla.org/VII/s13/frbr/frbr.pdf.

Khan, H. M., Maly, K., & Zubair, M. (2005). In Ninth European conference on digital libraries, ECDL 2005, September 18–23, 2005, Vienna, Austria. Proceedings series: Lecture notes in computer science (pp. 531–532). Heidelberg: Springer-Verlag.

Lagoze, C., Krafft, D., Cornwell, T., Dushay, N., Eckstrom, D., & Saylor, J. (2006). Metadata aggregation and "automated digital libraries": A retrospective on the NSDL experience. In Sixth ACM/IEEE-CS joint conference on digital libraries 2006. New York: ACM Press.

Monge, A. E., & Elkan, C. P. (1997). An efficient domain independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery. http://www-cse.uscd.edu/users/elkan/approxdup.ps.

Nahm, U. Y., Bilenko, M., & Mooney, R. J. (2002). Two approaches to handling noise variation in text mining. In Proceedings of the ICML2002 workshop on text learning (pp. 18–27), Sydney, Australia.

Powell, A., Heaney, M., & Dempsey, L. (2000). RSLP collection description. D-Lib Magazine. http://www.dlib.org/dlib/september00/powell/09powell.html.

Rasolofo, Y., Abbaci, F., & Savoy, J. (2001). Approaches to collection selection and results merging for distributed information retrieval. In H. Paques, L. Liu, & D. Grossman (Eds.), Proceedings of the tenth international conference on information and knowledge management (Atlanta, Georgia, USA, October 5–10, 2001). CIKM '01 (pp. 191–198). New York, NY: ACM Press. doi:10.1145/502585.502618.

Shreeves, S. L., & Kirkham, C. M. (2004). Experiences of educators using a portal of aggregated metadata. Journal of Digital Information, 5(3). http://jodi.ecs.soton.ac.uk/Articles/v05/i03/Shreeves/.

Stvilia, B., Gasser, L., Twidale, M., Shreeves, S. L., & Cole, T. W. (2004). Metadata quality for federated collections. In ICIQ04, 9th international conference on information quality. http://www.isrl.uiuc.edu/~gasser/papers/metadataqualitymit_v4-10lg.pdf.

Muriel Foulonneau is visiting assistant professor and project coordinator at the University of Illinois at Urbana-Champaign for the CIC-OAI metadata harvesting project. She is part of the American Digital Library Federation and National Science Digital Library best practices group on the Open Archives Initiative and shareable metadata.