Segmentation of visiting patterns on web sites using a sequence alignment method


Journal of Retailing and Consumer Services 10 (2003) 145–153

Birgit Hay*, Geert Wets, Koen Vanhoof
Universitaire Campus, Limburg University, Gebouw D, Diepenbeek B-3590, Belgium

Abstract

In this article, a new method is illustrated for segmentation of visiting patterns on a web site. Instead of clustering users by means of a Euclidean distance measure, in our approach users are partitioned into clusters using a non-Euclidean distance measure, called Sequence Alignment Method (SAM). This method ensures that sequential relationships, represented by the order of elements, are taken into account. In experiments using real traffic data on the web site of a Belgian telecom provider, the performance of SAM is compared with the results of a method based on Euclidean distance measures. Empirical results show that SAM identifies segments presenting behavioral characteristics not only with regard to content but also considering the order of pages that are visited on a web site. © 2003 Elsevier Science Ltd. All rights reserved.

Keywords: Web (usage) mining; Market segmentation; Sequence analysis

*Corresponding author. Tel.: +32-11-26-86-38; fax: +32-11-26-87-00. E-mail address: [email protected] (B. Hay).
0969-6989/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S0969-6989(03)00006-7

1. Introduction and background

Marketing strategies have evolved from a manufacturing-oriented approach earlier in the twentieth century to a market-oriented approach (Wedel and Kamakura, 2000). The emphasis has moved from reducing production costs to satisfying customers. Competitive advantage is attained by identifying the specific needs of groups of customers and developing the right offer for the right group of consumers, or market segment. Smith (1956) stated: "Market segmentation involves viewing a heterogeneous market as a number of smaller homogeneous markets, in response to differing preferences, attributable to the desires of consumers for more precise satisfaction of their varying wants." Analysis of market segments may lead to different managerial actions and strategies that improve business results and, in general, strengthen competitive advantage.

Today, within marketing research, market segmentation studies are performed in many different ways, using various techniques to analyze market segments. Brijs et al. (2001) present a new methodology for behavior-based customer segmentation. The method of latent class mixture modeling is used to discover hidden customer segments on the basis of the contents of their shopping baskets. Furthermore, customer segments are analyzed with regard to socio-demographic characteristics. These characteristics are used to target different segments with more relevant product offers, in order to improve the satisfaction of consumers' wants and needs, which will ultimately strengthen competitive advantage. In Turner and Reisinger (2001), structural equation models are used to divide domestic tourists into segments based on different demographic/social characteristics. Each segment requires different product attributes and services, which results in separate dimensions of retailing satisfaction. This becomes a measure of retailing performance as perceived by domestic tourists. In addition, it can indicate directions for change in retailing provision in order to increase satisfaction and, finally, domestic tourist visitation. Furthermore, Fernie and Staines (2001) present a taxonomy of European grocery distribution networks obtained through cluster analysis on 18 logistics-related variables applied to 10 country markets. The study shows a high degree of variability between country clusters on three sub-sets of variables: market structure, trading format and physical/socio-economic infrastructure. The information provided through segmentation of European grocery distribution makes it possible to understand similarities and differences between individual country markets. Subsequently, Aurifeille (2000) performs a market segmentation study through a regression algorithm (Typren) for partitioning a population into clusters characterized by homogeneous models and predictions. Typren is based on the hybridization of a genetic algorithm with linear regression. The method can be used for targeting groups of customers based on the objectives of the firm.

While market strategies evolve towards a more market-oriented approach and ultimately become a one-to-one strategy, the World Wide Web offers ideal opportunities for this tendency. Companies' web sites have created an important environment for information, distribution and selling of products and services. Marketing concepts and a clear understanding of the visiting behavior of customers and prospects on a company's web site have become important for supporting commercial business through the Internet.

Within web mining research, segmentation studies have proven to provide valuable knowledge as well. Web mining discovers and analyzes useful information from the World Wide Web (Cooley et al., 1997). A variety of techniques are used to divide visiting patterns on the web into segments. In Mobasher et al. (2000), segmentation studies are performed through cluster analysis. By clustering user transactions and page-views, overlapping aggregate profiles are discovered that can be used effectively by recommender systems for real-time web personalization. Furthermore, Cadez et al. (2000) use model-based clustering to visualize navigation patterns on a web site. The clustering approach is model based, as opposed to distance based; it handles the problem of clustering sequences of different lengths and partitions users according to the order in which they request web pages. The authors analyzed data from the msnbc.com site and concluded that the new method reveals interesting insights suggesting improvements in the web site.

Yet, concerning the problem of clustering users based on their web navigation patterns using a measure that incorporates the order of elements, no prior work has been found except for Wang and Zaïane (2002) and Cadez et al. (2000). However, within web mining research, segmentation of navigation patterns based on the order of visited web pages reveals important information for supporting and increasing customer satisfaction (e.g. optimizing the layout of the web site through the structuring of page links). Consequently, the purpose of this study is to identify segments of visiting behavior on a web site using a method that reflects structural information (through the order of elements) within sequences. To this end we introduce the Sequence Alignment Method (SAM), a distance-based technique that reflects the operations necessary to equalize sequences. The higher the distance measure or score, the less equal the sequences are, and vice versa. The method is applied to real user-traffic data on the web site of a Belgian telecom provider. Furthermore, the method is compared to a commonly used distance measure within cluster analysis called Association Distance, which does not incorporate structural information. The focus of our paper is to provide segments of visiting behavior at a web site, giving a general view of visited pages and the order in which pages are visited.

The article is organized as follows. First, two techniques for measuring distances between sequences are described: SAM, which incorporates structural information, and Association Distance, which does not. Then we illustrate our experiments. Finally, the article is concluded and topics for further research are given.

2. Basic concepts

2.1. Sequence alignment method

Definition. SAM is a non-Euclidean distance measure reflecting the order of elements. Basically, a non-Euclidean distance measure is a similarity measure that goes beyond the Euclidean straight line drawn between two objects (Hair et al., 1998). SAM is used within several research domains. The method, also called string edit distance, is used for sequence comparison within molecular biology and speech recognition (Sankoff and Kruskal, 1983). A sequence is defined as a number of elements arranged or coming one after the other in succession. Likewise, in Mannila and Ronkainen (1997) edit distance is described as a distance measure between event sequences in telecom studies. The method is also used in traffic analysis studies to discover navigation patterns (Joh et al., 2001, 1999). In this section we give a short overview of the algorithm. A more detailed explanation can be found in Sankoff and Kruskal (1983).

In general, the distance (or similarity) between sequences is reflected by the number of operations necessary to convert one sequence into the other. As a result, the SAM distance measure is represented by a score. The higher the score, the more effort it takes to equalize the sequences and the less similar the sequences are; the lower the score, the less effort it takes and the more similar they are. SAM scores the following operations during the equalization process: insertion and deletion operations are applied to unique elements of the source (first) and target (second) sequences; reordering operations are applied to common elements. Common elements appear in both compared sequences whereas unique elements appear in only one of them. Finally, SAM represents the minimum cost for equalizing two sequences. In particular, the SAM distance measure between two sequences $S_1$ and $S_2$ is calculated using the following formula (Joh et al., 2001; Sankoff and Kruskal, 1983):

$d_{SAM}(S_1, S_2) = (w_d D + w_i I) + \eta R$    (1)

where $d_{SAM}$ is the distance between two sequences $S_1$ and $S_2$, based on SAM; $w_d$ the weight value for the deletion operations, a positive constant determined by the researcher ($w_d > 0$); $w_i$ the weight value for the insertion operations, a positive constant determined by the researcher ($w_i > 0$); $D$ the number of deletion operations; $I$ the number of insertion operations; $R$ the number of reordering operations; $\eta$ the reordering weight, a positive constant determined by the researcher ($\eta > 0$).

Eq. (1) indicates that the score, represented by the SAM distance measure between two sequences, consists of the costs for deleting and inserting unique elements and the costs for reordering common elements. Calculation of SAM distance measures between sequences is a combinatorial problem. In practical applications, dynamic programming algorithms are used to solve such problems (Joh et al., 2001; Mannila and Ronkainen, 1997; Wilson, 1998). A summarization of the SAM algorithm is given in Fig. 1.

  begin
    read source sequence    //first sequence of sequence pair//
    read target sequence    //second sequence of sequence pair//
    calculate maximum identity    //identity = common elements occurring in the same order//
    define other common elements
    define unique elements
    calculate SAM cost between source and target sequence
  end

Fig. 1. Summarization of SAM algorithm.

Example. To illustrate SAM, consider the following sequences $s_1$ (source sequence) and $s_2$ (target sequence). Both sequences represent sequentially ordered visited pages on a web site. Each page of the web site is represented by an identification number. Suppose $w_d = w_i = 1$ and $\eta = w_d + w_i$:

$s_1 = \{1, 4, 7, 8\}$, $s_2 = \{1, 2, 3, 4, 8, 7\}$.

First, the maximum number of identical elements having the same order of occurrence is determined. In the example above, page 1 is accessed before page 4, followed by page 8 (or page 7) in both $s_1$ and $s_2$. Then, in order to equalize $s_1$ with $s_2$, the unique elements 2 and 3 of $s_2$ are inserted into $s_1$, which gives the following sequences:

$s_1 = \{1, 2, 3, 4, 7, 8\}$, $s_2 = \{1, 2, 3, 4, 8, 7\}$.

The equalization process continues with reordering common element 7 (or 8) in $s_1$ or $s_2$:

$s_1 = \{1, 2, 3, 4, 8, 7\}$, $s_2 = \{1, 2, 3, 4, 8, 7\}$.

Finally, equalizing $s_1$ with $s_2$ took 2 insertion operations and 1 reordering operation, which gives $d_{SAM}(s_1, s_2) = (1 \cdot 0 + 1 \cdot 2) + 2 \cdot 1 = 4$.

2.2. Association distance

Definition. A commonly used distance measure between sequences for segmentation studies is Association Distance (Everitt, 1980). The method is Euclidean-based and does not take into account the order of elements within sequences. A Euclidean-based distance measure is a similarity measure calculating the length of a straight line drawn between two objects (Hair et al., 1998). A simple form of association measure for analyzing data in non-metric terms transforms each sequence into a vector and counts the number of dissimilarities at each position of the sequence. A missing value in either one of the compared sequences is treated as a dissimilarity. In particular, the distance between two sequences based on Association Distance is given by the following formula:

$d_{ASS}(S_1, S_2) = \sum_{i=1}^{n} f_i$    (2)

with $f_i = 1$ if $S_1(i) \neq S_2(i)$, and $f_i = 0$ otherwise,

where $d_{ASS}$ is the distance between two sequences $S_1$ and $S_2$, based on Association Distance; $\sum_{i=1}^{n} f_i$ the sum of dissimilarities between sequences $S_1$ and $S_2$ over positions 1 to $n$; $n$ the number of positions of $S_1$ or $S_2$ if the sequences are of equal length, otherwise the number of positions of the longest sequence.

Example. The distance between sequences $s_1$ and $s_2$, given in the previous example, is 5 based on Association Distance: only the first position matches, positions 2 to 4 differ, and positions 5 and 6 are missing in $s_1$.
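The two worked examples above can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the "maximum identity" of Fig. 1 is taken to be the longest common subsequence, common elements are counted as a multiset intersection, and every common element outside the longest common subsequence is counted as one reordering. The function names `lcs_length`, `sam_distance` and `association_distance` are hypothetical.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sam_distance(source, target, w_d=1, w_i=1, eta=None):
    """Simplified SAM score following Eq. (1): (w_d*D + w_i*I) + eta*R.

    Assumption: eta defaults to w_d + w_i, as in the worked example.
    """
    if eta is None:
        eta = w_d + w_i
    # Common elements: multiset intersection of the two sequences (assumption).
    common = sum((Counter(source) & Counter(target)).values())
    identity = lcs_length(source, target)   # common elements already in order
    deletions = len(source) - common        # unique elements of the source
    insertions = len(target) - common       # unique elements of the target
    reorderings = common - identity         # common elements out of order
    return w_d * deletions + w_i * insertions + eta * reorderings


def association_distance(s1, s2):
    """Eq. (2): count position-wise mismatches; missing positions count as 1."""
    shared = min(len(s1), len(s2))
    mismatches = sum(1 for i in range(shared) if s1[i] != s2[i])
    return mismatches + (max(len(s1), len(s2)) - shared)


s1 = [1, 4, 7, 8]
s2 = [1, 2, 3, 4, 8, 7]
print(sam_distance(s1, s2))          # 2 insertions + 1 reordering
print(association_distance(s1, s2))  # position-wise mismatches
```

With the weights of the example, the sketch gives 4 for SAM and 5 for Association Distance, matching the values in the text. For other weight settings or sequences with repeated pages, the counting rule used here is only one possible reading of the operations.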

3. Experiments

3.1. Proposed approach

In order to get a general view of what is going on at a web site, the purpose of the experiments is to identify segments of visiting behavior that provide general information about visited pages and the order in which pages are visited. Fig. 2 presents the five processing steps of our approach.

Fig. 2. Proposed approach. Step 1: read the input file of sequences S1, S2, S3, ..., Sk and calculate the distance d(Si, Sj) between each sequence pair, using (1) SAM Distance and (2) Association Distance. Step 2: build the distance matrices (S1 to Sk by S1 to Sk, with 0 on the diagonal), one based on SAM and one based on Association. Step 3: apply Ward clustering to the distance matrices. Step 4: define the number of clusters (R-squared, index of Calinski & Harabasz, C-index) and cluster the sequences (server sessions). Step 5: for each cluster, calculate open sequences and alignment scores. Cluster description: open sequences (order), alignment scores (order), frequency graphs (content).

First, pairwise distances between sequences are calculated using SAM, which incorporates the order of elements, as distance measure. To compare the results, the same sequences are used to calculate pairwise distances based on Association, a measure that does not incorporate the order of elements. In the following step, a distance matrix is built for each distance measure. The matrices hold the distance scores between each sequence pair, with a score of 0 on the diagonal. In step 3, Ward clustering is applied to each distance matrix. Because this study is focused on SAM, no special attention is paid to the clustering method. Therefore, a simple hierarchical clustering algorithm like Ward (Hair et al., 1998) is used. Further research will include the use of other clustering methods.

In order to define an optimal solution for the number of clusters, $r^2$, the index of Calinski & Harabasz and the c-index are used in step 4. $r^2$ is one of the most commonly used stop criteria (De Pelsmacker and Van Kenhove, 1999); it equals the proportion of variation explained by the model (Hair et al., 1998) and ranges in value from 0 to 1. The level of $r^2$ increases with the number of clusters. On the other hand, cluster analysis becomes economically unfeasible because of computational demands if too many clusters need to be analyzed. Therefore, a trade-off between the number of clusters and model fit must be made. Small values of $r^2$ indicate that the model does not fit the data well, whereas values of 0.6 or higher are considered acceptable (Hair et al., 1998). The index of Calinski & Harabasz has proven to be the best stop criterion in clustering (Milligan and Cooper, 1985); it maximizes a normalized ratio of between-cluster and within-cluster distances as a means of choosing the optimal number of clusters (Calinski and Harabasz, 1974). The second best stop criterion in clustering is the c-index, which minimizes normalized values (De Pelsmacker and Van Kenhove, 1999). Ultimately, all three criteria are evaluated and used to define the number of clusters during clustering based on SAM and Association Distance measures.

Finally, the resulting clusters are analyzed and described by examining the pages and the sequential order of the pages. To this end, in step 5, open sequences and alignment scores are calculated for each cluster. In Büchner et al. (1999), open sequences are used for discovering structural information, represented by the order of elements, within navigation patterns. Sequences with the same elements occurring in the same order, irrespective of the positions of the elements, are called open sequences. For example, open sequence (1, 3, 5) occurs in sequences (4, 1, 2, 3, 6, 5), (1, 2, 3, 4, 2, 5) and (3, 1, 3, 5, 2). In addition, the concepts of support and confidence are used to quantify open sequences. Support is defined as the number of sequences grouped into a particular cluster that contain the open sequence, divided by the total number of sequences in that cluster. Confidence is equal to the number of sequences within a cluster that contain the open sequence, divided by the total number of sequences in the input file that contain the open sequence. Ultimately, the support value measures the frequency of an open sequence, while the confidence value measures how typical that open sequence is for a particular cluster.

Besides open sequences, another way to validate clusters is the calculation of alignment scores. In Wang and Zaïane (2002), sequence alignment scores are introduced in web mining studies in order to measure similarity between sequences of page accesses. An alignment score is calculated as the length of the longest common subsequence between two sequences divided by the average length of the sequences. This means that the maximum number of common (equal) elements occurring in the same order in the two sequences is measured. The score ranges from 0 to 1. An alignment score of 0 indicates complete lack of similarity, while a score of 1 is obtained for equal sequences. For example, the alignment score between sequences {1, 3, 4, 6, 2} and {1, 4, 5, 2} equals 0.66: the longest common subsequence {1, 4, 2} has length 3 and the average sequence length is 4.5. For each cluster, alignment scores between sequences within that cluster are presented in a three-dimensional graph. Sequences are arranged in the same order on two axes; the third axis shows the alignment score of each sequence pair.

In order to give a content presentation of each cluster, without considering the order of elements, frequency graphs are given. Each graph shows horizontally the code of the web page and vertically the visiting frequency, i.e. the number of times the web page is accessed.
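The validation measures of step 5 can be sketched in a few lines of Python. This is a minimal illustration of the definitions given above, not the code used in the experiments; the names `lcs_length`, `alignment_score`, `contains_open_sequence` and `support_and_confidence` are hypothetical, and the treatment of empty inputs is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def alignment_score(a, b):
    """Longest common subsequence divided by the average sequence length."""
    average_length = (len(a) + len(b)) / 2
    if average_length == 0:
        return 0.0  # assumption: two empty sessions score 0
    return lcs_length(a, b) / average_length


def contains_open_sequence(session, pattern):
    """True if the pages of `pattern` occur in `session` in the same order,
    regardless of their positions."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)


def support_and_confidence(pattern, cluster, all_sessions):
    """Support: share of the cluster's sessions containing the pattern.
    Confidence: share of all sessions containing the pattern that fall
    into this cluster."""
    in_cluster = sum(contains_open_sequence(s, pattern) for s in cluster)
    overall = sum(contains_open_sequence(s, pattern) for s in all_sessions)
    support = in_cluster / len(cluster) if cluster else 0.0
    confidence = in_cluster / overall if overall else 0.0
    return support, confidence


print(alignment_score([1, 3, 4, 6, 2], [1, 4, 5, 2]))
print(contains_open_sequence([3, 1, 3, 5, 2], (1, 3, 5)))
```

For the example sequences the score is 3/4.5, roughly 0.667, which the text reports as 0.66. The open-sequence check relies on consuming a single iterator, so each page of the pattern has to be found after the previous one.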

3.2. Data

For this project, log files of a Belgian telecom provider collected over a 1-week period are used. In order to analyze visiting behavior on a web site, sequences are represented by sessions of web click-stream data. A server session or visit is defined as the click-stream of page-views for a single visit of a user to a web site (Cooley, 2000). In this article, we use server session and visit interchangeably.

First, the data stored in the log files are cleaned in such a way that only URL page requests of the form 'GET ... html' are retained. Then a unique code is given to each distinct IP address and URL. Third, sessions are identified using a threshold of 30 min viewing time. This means that, for the same IP address, a new session starts when the time between two page requests is more than 30 min. In general, a session is created when a new IP address is met in the log file. Furthermore, a filtering method is invoked on the sessions in order to identify visitors using the same IP address. Finally, server sessions are built in the form (session-id, {page-ids}), representing consecutive pages requested by the same user. For example, a visit $(s_1, \{1, 2, 45, 27, 28, 112\})$ tells us that a user enters the web site through page 1, then visits pages 2, 45, 27 and 28, and finally exits the web site after page 112 has been viewed. A total of 773 sessions are composed from the data in the log files. We remark that, in this study, we focus on information retrieval based on SAM rather than on the various engineering issues with regard to accurate sessionizing and user identification.
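The 30-minute session rule can be illustrated with a short Python sketch. It assumes the log has already been cleaned and coded into (IP address, timestamp in seconds, page code) tuples; that input format, the function name `build_sessions` and the constant `SESSION_GAP_SECONDS` are assumptions for illustration, and the additional filtering step for visitors sharing an IP address is not covered.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # 30 min threshold described in the text


def build_sessions(requests, gap=SESSION_GAP_SECONDS):
    """Group page requests into server sessions.

    `requests` is an iterable of (ip, timestamp_seconds, page_id) tuples
    (hypothetical input format). A new session starts for each new IP
    address, and whenever more than `gap` seconds pass between two
    consecutive requests from the same IP address.
    """
    by_ip = defaultdict(list)
    for ip, timestamp, page_id in requests:
        by_ip[ip].append((timestamp, page_id))

    sessions = []
    for ip in sorted(by_ip):
        hits = sorted(by_ip[ip], key=lambda hit: hit[0])  # order by time
        current = [hits[0][1]]
        last_timestamp = hits[0][0]
        for timestamp, page_id in hits[1:]:
            if timestamp - last_timestamp > gap:
                sessions.append(current)  # close the previous session
                current = []
            current.append(page_id)
            last_timestamp = timestamp
        sessions.append(current)
    return sessions


log = [
    ("10.0.0.1", 0, 1),
    ("10.0.0.1", 60, 2),
    ("10.0.0.1", 60 + 1801, 45),  # more than 30 min later: new session
    ("10.0.0.2", 10, 7),
]
print(build_sessions(log))
```

Each returned list is the ordered sequence of page codes of one session, which is the form used as input for the distance calculations.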


3.3. Results

Following our proposed approach, three stop criteria for defining the optimal number of clusters are evaluated: $r^2$, the index of Calinski & Harabasz and the c-index. In Fig. 3 the values of $r^2$ with the corresponding number of clusters, based on SAM and Association Distance, are outlined. The number of clusters is fixed at the point where $r^2$ reaches a minimum level of 0.60. This means that, when SAM is used, three clusters are formed with $r^2$ equal to 0.64. For Association Distance, three clusters are built with $r^2$ reaching a level of 0.73. The index of Calinski & Harabasz and the c-index define the same number of clusters for both distance measures. The clusters represent segments of visiting behavior on a web site, showing visited pages and the order in which pages are visited. In Table 1 the number of sessions grouped into each cluster based on SAM and Association is given.

Fig. 3. Ward's minimum variance cluster analysis: model fit when SAM and Association are used as distance measures. (Axes: number of clusters, 1 to 773, against r-squared, 0.0 to 1.0; one curve for SAM Distance and one for Association Distance.)

Table 1
Partitioning sessions in clusters, based on SAM and Association as distance measures

Cluster   SAM   Association
1         252   301
2         225   353
3         296   119

For validation purposes, open sequences are presented in Table 2. When SAM is used as a distance measure, open sequences are found with a high confidence value in clusters 1 and 3, indicating the typicality of the open sequence for that particular cluster. For example, open sequence (281, 280, 355) with confidence 93.49% in cluster 3 states that if page 281 is accessed, followed by page 280, the chance that page 355 will be accessed is 93.49%. Some examples of typical order-based visiting behavior are 28, 27, 109, shown by cluster 1, and 281, 280, 355, shown by cluster 3. Yet, in the second cluster open sequences are found with a relatively low confidence value. We discovered that server sessions grouped into cluster 2 are very short, with page accesses occurring rarely, as indicated by low support values. The average length of the sessions is equal to 5.06, compared to 22.60 for cluster 1 and 21.33 for cluster 3. This means that, if we compare the sessions of cluster 2 with those of clusters 1 and 3 using SAM, many operations are necessary due to the large difference in length between the sequences. Consequently, a high distance measure is obtained. On the other hand, a relatively low SAM distance measure is calculated between sequences within cluster 2 due to their short length.

Table 2
Open sequences, representing sequentially ordered visited web pages, found in clusters 1, 2 and 3 when SAM and Association are used as distance measures

                                  Support   SAM confidence (%)      Association confidence (%)
Open sequence                     (%)       Cl. 1   Cl. 2   Cl. 3   Cl. 1   Cl. 2   Cl. 3
(281, 280)                        31.95      2.43    5.26   92.31   42.91   34.41   22.68
(281, 280, 355)                   27.81      2.33    4.18   93.49   41.86   33.49   24.65
(28, 27)                          27.17     91.90    5.24    2.86   56.19   25.71   18.10
(492, 491)                        25.87      3.00    3.00   94.00   45.50   30.50   24.00
(281, 280, 355, 492)              25.09      2.06    1.03   96.21   43.30   31.44   25.26
(28, 27, 109)                     22.90     93.78    3.95    2.27   58.19   23.73   18.08
(28, 27, 109, 250)                20.70     97.50    0.62    1.88   60.62   21.25   18.13
(281, 280, 355, 492, 491)         19.79      2.00    0.00   98.00   45.10   30.06   24.84
(28, 27, 109, 250, 113)           17.85     98.55    0.00    1.45   60.15   19.56   20.29
(28, 27, 109, 250, 113, 249)      11.50     97.75    0.00    2.25   52.80   22.47   24.73
(281, 280, 355, 492, 491, 358)    11.00      1.18    0.00   98.82   45.88   22.35   31.77
(113, 250, 249)                    4.78     97.30    2.70    0.00   35.13   10.81   54.06
(305, 286, 317)                    2.45      0.00   43.37   56.63   15.79   52.63   31.58
(196, 186, 194)                    1.94     66.67   26.67    6.66   26.67   26.67   46.66
(109, 192)                         0.77     83.33   16.67    0.00   33.33   16.67   50.00
(224, 242, 230)                    0.39     66.66   33.34    0.00   33.33   33.33   33.34

However, when Association Distance is used, finding typical visiting behavior is not as straightforward as when SAM is used. In Table 2, all open sequences having support of 11% and above have their highest confidence values in cluster 1. Moreover, no confidence value reaches 90%, which means that the clusters do not strongly represent typical order-based visiting behavior.

A three-dimensional graphical validation technique for analyzing order-based information in each cluster is given by the alignment scores in Fig. 4. For each cluster, alignment scores are calculated between each sequence pair. Higher alignment scores indicate stronger similarities between visitor sessions with regard to identical pages and the order in which pages are visited. The figure shows the highest alignment scores for clusters 1 and 3, based on SAM. The lower alignment scores for cluster 2, based on SAM, are due to sessions containing pages that are rarely accessed. This means that common sub-strings between sequences are very small, which leads to lower alignment scores. In general, lower alignment scores are found for the clusters based on Association Distance.

Fig. 4. Alignment scores for server sessions clustered using SAM and Association as distance measures.

In order to analyze clusters with regard to content information, without considering the order of visited pages, frequency graphs are provided in Fig. 5. Based on SAM, three segments are discovered. The first shows visiting behavior towards web page codes ranging from 1 to 250. This group of pages is presented in French. On the other hand, visiting behavior towards page codes ranging from 251 to 492 is shown by the third cluster, representing visits to Dutch pages of the web site. Finally, another group of server sessions represents visiting behavior towards web pages in both languages. When Association is used as distance measure, server sessions are not clustered based on content information. Typical content-based information for each cluster is hard to find.

Fig. 5. Visited web pages within server sessions clustered using SAM and Association as distance measures. (Six panels: Cluster 1, Cluster 2 and Cluster 3 for SAM, and Cluster 1, Cluster 2 and Cluster 3 for Association; axes: Page_id, 1 to 476, against Frequency, 0 to 400.)

We also examined solutions having one cluster more and one cluster less than $r^2$ indicates (see Fig. 2, step 4). For SAM and Association, this led to a two-segment solution and a four-segment solution. However, the results were not improved.

Finally, in other experiments we analyzed segmentation for server sessions in one of the languages. Instead of using the input file as a whole (see Fig. 2), we first divided the server sessions into three groups. The first group holds server sessions to Dutch pages, the second group holds sessions to French pages, and visits to pages of both languages form a third group. Then, in step 1, distance measures are calculated for each group and the analysis proceeds with the following steps. Ultimately, for each group, the results are the same as those obtained when the input file is not split up into language groups. Again, clusters based on SAM show typical open sequences for cluster description and higher alignment scores between sessions. Likewise, frequency graphs provide typical content-based information for each cluster when SAM is used. When Association is used as a distance measure, typicality of open sequences, alignment scores and content-based information is much less straightforward.

4. Conclusions and further research

In order to provide a general view of visited pages and the order in which pages are visited, the purpose of this study is to provide segments (represented by clusters) of visiting behavior on a web site using SAM. For validation purposes, SAM is compared with another method, called Association Distance, which does not incorporate structural information through the order of elements. Experiments on a real data set of user-traffic data on the web site of a Belgian telecom provider show that clustering server sessions by means of SAM identifies two segments of visiting behavior providing content and order-based information. Another segment identifies visiting behavior of short sessions to pages of the web site that are rarely visited. The segments resulting from clustering based on Association Distance do not provide good content and order-based information.

Three validation techniques are used for analyzing the results. Open sequences and alignment scores measure order-based information within clusters; frequency graphs show how typical the content information of each cluster is. First, typical open sequences, representing sequentially ordered visited web pages with high confidence values (i.e. more than 90%), are found for two clusters based on SAM. When Association is used as distance measure, typical open sequences are not found. Second, alignment scores, indicating order-based similarity between sequences within each cluster, are high for two clusters based on SAM. A third cluster shows low alignment scores because sub-strings are hard to find between sequences due to rare clicks. In general, alignment scores between sequences clustered by means of Association are lower. Third, the three clusters based on SAM provide unique content information, as shown by the frequency graphs. This is not the case for the clusters based on Association.

The segments of visiting behavior obtained with SAM as distance measure between server sessions, which provide general information on visited pages and the order in which they are visited, are described as follows. The first segment represents visits to pages of the web site in the following order: 28, 27, 109, 250, 113, 249. This pattern has support of 11.5% and confidence of 97.75%, indicating typical visiting behavior towards the following French main pages: products/services, news and press release, sales, tariff, promotions, jobs. The third segment represents the following (order of) pages: 281, 280, 355, 492, 491, 358. This pattern has support of 11% and confidence of 98.82%, indicating typical visiting behavior towards the following Dutch main pages: products/services, news and press release, sales, tariff, promotions, jobs. The second segment shows visiting behavior towards pages of both languages. Its patterns have low confidence values, which means that this segment does not represent typical behavior. Instead, it groups sessions that belong neither to the first nor to the third segment because of their occasional page clicks and/or because of exceptionally short sequences.

In terms of market value, the segments obtained from clustering based on SAM show that French-speaking visitors follow the same pattern and order of pages as Dutch-speaking visitors, i.e. products/services, news and press release, sales, tariff, promotions, jobs. In practice this may imply that the structure of the web site should provide links between pages that allow the visitor to follow this common pattern directly, without having to click on other pages that are not typically visited. Conventional segmentation results do not provide such order-based information.

Other experiments clustered sessions into four and two instead of three segments. However, the results were not improved. More tests evaluated segmentation analysis for one of the languages. The same validation techniques were used, indicating that SAM is a better method for content and order-based segmentation of sequences.

Further research should explore the effect on the solution when parameters other than 1 for insertion/deletion and 2 for reordering are used. Sensitivity analyses concerning the parameters used in SAM must be performed. Also, clustering algorithms other than Ward, for example mixture models, should be examined further. Furthermore, other alternatives for comparing the results based on SAM must be evaluated.
Finally, SAM will be extended by means of integration with the web site tree structure (for example using token rings) and inclusion of χ² statistical tests as a factor for measuring interestingness.
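
The sensitivity analysis proposed above could start from a sketch such as the following, which exposes the two weights as parameters. It is a simplified stand-in and not the SAM implementation used in the experiments. It assumes that pages occurring in only one session are inserted or deleted, that shared pages outside the longest common subsequence are reordered, and that a reordering never costs more than one deletion plus one insertion. The names `sam_like_distance`, `w_indel` and `w_reorder` are hypothetical; the defaults of 1 and 2 are the values stated in the text.

```python
from collections import Counter
from typing import Sequence


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sam_like_distance(
    a: Sequence[int],
    b: Sequence[int],
    w_indel: float = 1.0,
    w_reorder: float = 2.0,
) -> float:
    """Weighted count of insertions/deletions and reorderings between sessions.

    Simplifying assumptions (not the authors' implementation):
      - pages present in only one session are inserted or deleted;
      - shared pages that are not part of the longest common subsequence
        are reordered, each at cost min(w_reorder, 2 * w_indel).
    """
    shared = sum((Counter(a) & Counter(b)).values())  # multiset overlap
    in_order = lcs_length(a, b)
    unique = (len(a) - shared) + (len(b) - shared)
    moved = shared - in_order
    return w_indel * unique + min(w_reorder, 2 * w_indel) * moved


if __name__ == "__main__":
    s1 = [28, 27, 109, 250]
    s2 = [28, 109, 27, 250]  # same pages, two of them swapped
    for w in (0.5, 1.0, 2.0):
        print(w, sam_like_distance(s1, s2, w_reorder=w))
```

With the stated weights, a reordering costs exactly as much as one deletion plus one insertion, so in this sketch the distance equals the combined length of the two sessions minus twice the length of their longest common subsequence. Lowering the reordering weight makes sessions with the same pages in a different order appear closer, which is the kind of effect a sensitivity analysis would need to quantify.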

References

Aurifeille, J.-M., 2000. A bio-mimetic approach to marketing segmentation: principles and comparative analysis. European Journal of Economic and Social Systems 14 (1), 93–108.
Brijs, T., Swinnen, G., Vanhoof, K., Wets, G., 2001. Using shopping baskets to cluster supermarket shoppers. Conference Notes of the 12th Annual Advanced Research Techniques Forum (ART), American Marketing Association, Florida, Amelia Island Plantation, June 24–27.
Büchner, A.G., Baumgarten, M., Anand, S.S., Mulvenna, M.D., Hughes, J.G., 1999. Navigation pattern discovery from Internet data. ACM Workshop on Web Usage Analysis and User Profiling (WebKDD), San Diego, August 15, pp. 25–30.

Cadez, I., Heckerman, D., Meek, C., Smyth, P., White, S., 2000. Visualization of navigation patterns on a web site using model based clustering. Technical Report MSR-TR-2000-18, Microsoft Research.
Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communications in Statistics 3, 1–27.
Cooley, R., 2000. Web usage mining: discovery and application of interesting patterns from web data. Ph.D. Thesis, University of Minnesota. http://www-users.cs.umn.edu/~cooley/pubs.html.
Cooley, R., Mobasher, B., Srivastava, J., 1997. Web mining: information and pattern discovery on the World Wide Web. A survey paper. Proceedings of the Ninth IEEE Conference on Tools with Artificial Intelligence (ICTAI '97), Newport Beach, CA, November.
De Pelsmacker, P., Van Kenhove, P., 1999. Marktonderzoek. Methoden en toepassingen. Garant, Leuven/Apeldoorn.
Everitt, B., 1980. Cluster Analysis. Halsted Press, New York.
Fernie, J., Staines, H., 2001. Towards an understanding of European grocery supply chains. Journal of Retailing and Consumer Services 8 (1), 29–36.
Hair, J.F., Anderson, R.E., Tatham, R.L., Black, W.C., 1998. Multivariate Data Analysis. Prentice Hall, New Jersey.
Joh, C.H., Arentze, T.A., Timmermans, H.J.P., Popkowski-Leszczyc, P., 1999. Identifying shopper segments using scanner panel data and sequence alignment methods. Proceedings of the Sixth International Conference on Recent Advances in Retailing and Services Science, Puerto Rico, July 18–21.
Joh, C.H., Arentze, T.A., Timmermans, H.J.P., 2001. A position-sensitive sequence-alignment method illustrated for space-time activity-diary data. Environment and Planning A 33 (2), 313–338.
Mannila, H., Ronkainen, P., 1997. Similarity of event sequences. Fourth International Workshop on Temporal Representation and Reasoning, TIME, IEEE Computer Society, Florida, May 10–11, pp. 136–139.
Milligan, G.W., Cooper, M., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179.
Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., Wiltshire, J., 2000. Discovery of aggregate usage profiles for web personalization. WebKDD Workshop, Boston, August 20.
Sankoff, D., Kruskal, J.B., 1983. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
Smith, W., 1956. Product differentiation and market segmentation as alternative marketing strategies. Journal of Marketing 21, 3–8.
Turner, L.W., Reisinger, Y., 2001. Shopping satisfaction for domestic tourists. Journal of Retailing and Consumer Services 8 (1), 15–27.
Wang, W., Zaïane, O.R., 2002. Clustering web sessions by sequence alignment. Third Workshop on Management of Information on the Web, in conjunction with the 13th International Conference on Database and Expert Systems Applications (DEXA), Aix en Provence, France, September 2–6, pp. 394–398.
Wedel, M., Kamakura, W., 2000. Market Segmentation. Conceptual and Methodological Foundations. Kluwer Academic Publishers, Boston.
Wilson, W.C., 1998. Activity pattern analysis by means of sequence alignment methods. Environment and Planning A 30, 1017–1038.