Advanced Engineering Informatics 18 (2004) 129–142 www.elsevier.com/locate/aei
An adaptive website system to improve efficiency with web mining techniques

Ji-Hyun Lee*, Wei-Kun Shiu

Graduate School of Computational Design, National Yunlin University of Science and Technology, 123, Section 3, University Road, Touliu, Yunlin 640, Taiwan, ROC

Received 25 May 2004; revised 11 September 2004; accepted 14 September 2004
Abstract

The paper proposes an adaptive web system, that is, a website that is capable of changing its original design to fit user requirements. To remedy shortcomings of the website, and to make it easier for users to access information, the system analyzes user browsing patterns from their access records. This paper concentrates on the operating-efficiency of a website, that is, the efficiency with which a group of users browses a website. With high efficiency, users spend less operating cost to accomplish a desired goal. Based on user access data, we analyze each user's operating activities as well as their browsing sequences. With these data, we can calculate a measure of the efficiency of the users' browsing sequences. The paper develops an algorithm to calculate this efficiency accurately and to suggest how to increase the efficiency of user operations. This can be achieved in two ways: (i) by adding a new link between two web pages, or (ii) by suggesting that designers reconsider existing inefficient links so as to allow users to arrive at their target pages more quickly. Using this algorithm, we develop a prototype as a proof of the efficiency concept. The implementation is an adaptive website system that automatically changes the website architecture according to user browsing activities and improves website usability from the viewpoint of efficiency.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Adaptive website; Web mining; Recursive design; Web usability design; Browsing efficiency
1. Introduction

For an enterprise, its website is usually the front door of its advertisement. Important factors for web designers considering the design of a new website include the attractiveness of the design, an effective page structure that delivers information quickly, and the satisfaction of a growing and diverse set of users faced with ever-increasing web content. However, with the development of more and more web-based technologies and the growth in web content, the structure of a website becomes more complex, and web navigation becomes a critical issue for both web designers and users. Moreover, web pages are hard to design in a systematic way. Website architecture, navigation paths, and page contents are often
* Corresponding author. Tel.: +886 5 534 2601x6511; fax: +886 5 531 2169. E-mail address: [email protected] (J.-H. Lee).
1474-0346/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.aei.2004.09.007
intuitively decided. These are some of the reasons that lead users into errors or inconvenient access when browsing a website, thereby giving a negative impression of individuals or companies. In order to deal with this problem, identification of user intention and behavior becomes necessary, and with this necessity in mind, an adaptive website system is conceived. Recent developments in data mining technology can help enterprises determine problems in communication and improve their tactics in response to customers. The main function of data mining is to assist managers in finding useful knowledge in the huge volumes of data stored in databases. Data mining technology has been applied to web issues since 1995. Mining web pages is referred to as 'Web Mining' to distinguish it from general data mining. Web Mining is divided into Web Content Mining, Web Structure Mining, and Web Usage Mining, according to the nature of the data collected, the task objectives, and user needs. Since content cannot easily be inferred from user access records, this paper focuses only on Usage
and Structure Mining. Usage Mining employs user browsing records to analyze user intentions, while Structure Mining tells us about the connectivity of each web page in terms of efficiency. Web users have different purposes and intentions when browsing a website, and each user's intention and purpose can be determined by inspecting their browsing activities. By tracking interactive user behavior, the system can help designers detect usage patterns and can suggest ways to build more efficient websites. One of the important usability factors is efficiency, which measures user behavior when operating an interface. Perkowitz and Etzioni [1] propose adaptive websites that automatically improve their organization and presentation by learning from visitor access patterns. Systems that adapt themselves automatically to a user's current needs, perceived requirements, and current task are called adaptive [2]. Most adaptive systems use some kind of web mining approach to improve the website automatically and efficiently. This paper proposes an adaptive system for easier user access to a website. In particular, the system calculates website efficiency by observing user browsing behaviors and either creates a link between two web pages or provides useful suggestions to web designers on improving the efficiency of a deficient website.

This paper is structured as follows. In Section 2, we discuss related work. In Section 3, we propose an algorithm to calculate efficiency and show how to improve the efficiency of a website by combining design usability with data mining. The implementation of a system, as a proof of concept of this research, is described in Section 4. The discussion and conclusions, with contributions and future work, are given in Section 5.
2. Related work

2.1. Web mining

Web mining has emerged as a specialized field during the last few years and refers to the application of knowledge discovery techniques specifically to web data. Web content and web structure mining refer, respectively, to the analysis of the content of web pages and of the structure of links between them. Web usage mining, on the other hand, is the process of applying data mining techniques to the discovery of usage patterns in web data [3]. Web usage mining involves four steps: user identification, data pre-processing, pattern discovery, and analysis. User access patterns are models of user browsing activity; in most cases they are deduced from web server access logs. An alternative method is client-side logging, using techniques such as cookies. Mining such logs is referred to as web-log mining [2]. Mining activities help us understand the patterns in the data. User patterns extracted from web data have been applied to a wide range of applications. Projects by Spiliopoulou and
Faulstich [6], Wu et al. [7], Zaiane et al. [8], and Shahabi et al. [9] have focused on Web Usage Mining in general, without extensive tailoring of the process towards one of the various sub-categories. The WebSIFT project is designed to perform Web Usage Mining from server logs in the extended NCSA format. Chen et al. [10] introduce the concept of the maximal forward reference to characterize user episodes for the mining of traversal patterns. A maximal forward reference is the sequence of pages requested by a user up to the last page before backtracking occurs during a particular server session. The SpeedTracer project [7] from IBM Watson is built upon the work originally reported in [10]. In addition to episode identification, SpeedTracer makes use of referrer and agent information in its preprocessing routines to identify users and server sessions in the absence of additional client-side information. The Web Utilization Miner (WUM) system [6] provides a robust mining language for specifying characteristics of discovered frequent paths that are interesting to the analyst. Zaiane et al. [8] have loaded web server logs into a data cube structure in order to perform data mining as well as On-Line Analytical Processing (OLAP) activities such as roll-up and drill-down of the data. Their WebLogMiner system has been used to discover association rules and to perform classification and time-series analysis. Shahabi et al. [9] and Zarkesh et al. [11] have developed one of the few Web Usage Mining systems that rely on client-side data collection: a client-side agent sends page request and timing information back to the server every time a page containing the Java applet is loaded or destroyed [3].

2.2. Adaptive websites

Users interact with a website in multiple ways, and their mental models about a particular subject can obviously differ from those of other users and of the web developer. Consequently, improving the interaction between users and websites is important. Raskin [4] introduces various ways of quantifying interface design in his book. In particular, he mentions information-theoretic efficiency, which is defined similarly to the way efficiency is defined in thermodynamics: there, we calculate efficiency by dividing the power coming out of a process by the power going into it. If, during a certain time interval, an electrical generator produces 820 W while it is driven by an engine that has an output of 1000 W, it has an efficiency of 820/1000, or 0.82. Efficiency is also often expressed as a percentage; in this case, the generator has an efficiency of 82%. The same calculation can be applied to information efficiency. Srikant and Yang [5] propose an algorithm to automatically find pages in a website whose location is different from where visitors expect to find them. The key insight is that visitors will backtrack if they do not find the information where they expect it: the point from where they backtrack is
the expected location for the page. They also use a time threshold to distinguish whether a page is a target page or not. Nakayama et al. [12] propose a technique that discovers the gap between website designers' expectations and users' behavior: the former is assessed by measuring inter-page conceptual relevance, and the latter by measuring inter-page access co-occurrence. They also suggest how to apply quantitative data obtained through a multiple regression analysis that predicts hyperlink traversal frequency from page layout features. Most adaptive systems include a web-log mining procedure to understand user behaviors and patterns and to improve the website automatically and efficiently. However, none of them calculates efficiency in order to improve the web structure. We apply the efficiency concept from [4] and develop the efficiency calculation in Section 3.
3. Methodology

3.1. Obtaining the information for mining activities

3.1.1. The website architecture

A website's structure is more a network or lattice than a tree, and its links are the key to that structure. We wrote a robot program (also called a spider) to mine web architectures. The program detects the hyperlinks in .html files; it saves each hyperlink address, goes to the linked page, and detects further hyperlinks in that page. By applying this method recursively, the program eventually obtains the structure of the whole website. To run the spider program, the user simply inputs an IP address; after that, the program automatically grabs the website structure and saves it into the database. The program uses the domain name to distinguish internal hyperlink addresses from external addresses.
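For illustration, a minimal sketch of such a spider in Python, using only the standard library; the crawl is restricted to the input domain, and the function and variable names are ours rather than from the original implementation (the real system also writes the structure to a database, which is omitted here):

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags found on one page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_site(start_url):
    """Recursively follow internal hyperlinks and return the link structure
    as a dict: page URL -> list of internal pages it links to."""
    domain = urlparse(start_url).netloc
    structure, queue, seen = {}, [start_url], {start_url}
    while queue:
        page = queue.pop()
        try:
            html = urlopen(page).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # unreachable page; skip it
        parser = LinkParser()
        parser.feed(html)
        internal = []
        for href in parser.links:
            target = urljoin(page, href)
            if urlparse(target).netloc == domain:  # keep internal links only
                internal.append(target)
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        structure[page] = internal
    return structure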
We use the web server of our graduate school at National Yunlin University of Science and Technology (NYUST) (http://www.compdesign.yuntech.edu.tw/) to acquire user browsing behavioral data and to obtain the website architecture. Fig. 1 shows the result, stored in the database, of running the spider program.

3.1.2. User browsing records

User browsing records can be collected from three different sources: the web server log file, the proxy server log file, and browser cookies. A web server log file records all user access activities on that server. However, it cannot reliably distinguish computers by IP address: if a group of users shares the same proxy, the log file records all of them under the same IP address. Another source is the proxy server log file. In a local network, users sometimes set up a proxy to reduce network bandwidth usage, so the proxy server log file captures all user browsing behavior on the local network. However, the proxy server's log file has the same problem as the web server's log file; moreover, it cannot capture users who do not use the proxy. For these reasons, we decided to use browser cookies to record the data we need. Since the content of cookies is up to the programmer, we wrote a cookie program to obtain the information needed for this study, such as user ID, stay time on each page, browsing sequence, and so on.

3.2. Obtaining efficiency

3.2.1. Calculating operating-efficiency

The operating-efficiency of a website is a quantification of the amount of data conveyed through users' interactive browsing behavior. For an interface requiring input, if the amount of information the user has to supply exceeds the calculated minimum, the user is doing unnecessary work and the interface can be improved.
Fig. 1. The result of the web structure in the database.
Fig. 2. An example of website efficiency.
Information efficiency E is similar in nature to thermodynamic efficiency. The efficiency of an interface is defined as the minimum amount of information necessary to do a task, divided by the amount of information that has to be supplied by the user. As is the case for physical efficiency, E lies between 0 and 1. When no work is required for a task and no work is done, the efficiency is deemed to be 1 [4,6-12].

In this paper, the term efficiency denotes operating-efficiency, that is, the kind of efficiency associated with a group of users browsing a website. Eq. (1) gives the formula used to calculate the efficiency of a path in the website, and Fig. 2 shows an example of the calculation:

Efficiency = Shortest path (the shortest path from the initial page to the target page) / User operating cost    (1)

User browsing sequences reveal behavior patterns that help us determine expectations of user operating behavior and (if any) shortcomings of the website. Client cookies are the basic data source; they are mined to obtain each user's browsing behavior sequence. To calculate the efficiency of a website, each user's operating route, beginning at an initial page and terminating at a target page, needs to be determined. The term operating cost refers to the number of pages visited between the initial (begin) page and the target (end) page.

3.2.2. Finding operating routes from user browsing sequences

Determining precisely the initial and the target pages is critical to correctly calculating efficiency. Time thresholds offer a feasible way of dealing with this problem. In general, when browsing a website, we spend more time on the target page getting information than on the pages en route. Therefore, by setting a time threshold, we can determine whether or not a page is a target page [5]. That is, we can assume that a page on which the time spent exceeds the time threshold is a target page. A target page may also be the initial page of another operating route. The method to acquire operating routes beginning at an initial page and terminating at a target page is given below:

- Set a time threshold to distinguish whether a page is a target page or not.
- Partition each user's browsing sequence so that each operating route terminates on a target page.
- The first page in an operating route is the initial page, and the last is the target page.
Table 1
Duration of a page viewed

Rank (by page views/month)   Country          Page views per month   Page views per surfing session   Duration of a page viewed   Average click rate for top banners
1                            South Korea      2164                   92                               0:00:28                     0.62
2                            Hong Kong        1123                   63                               0:00:37                     0.69
4                            Japan*           788                    52                               0:00:35                     0.37
-                            Global average   774                    43                               0:00:47                     0.41
6                            Singapore        699                    55                               0:00:37                     0.24
7                            US               678                    35                               0:00:54                     0.36
9                            Taiwan           618                    55                               0:00:37                     0.50
15                           Australia        512                    39                               0:00:54                     0.27
20                           New Zealand      414                    30                               0:00:53                     0.23

Source: Nielsen/NetRatings (March 2001) [13].
Table 1 presents the average duration a page is viewed in various parts of the world. This information is helpful in deciding location-dependent values for the time threshold. Fig. 3 illustrates how to obtain the target pages from a user's browsing sequence. As can be seen, the time threshold is set to 10 s. Each letter represents a web page, and the number in parentheses to the right of the letter is the user's stay time. After filtering out pages where the stay time is less than the time threshold, we obtain the target pages. In the figure, browsing sequence BI-3 has two target pages, 'D' and 'E'; consequently, BI-3 is divided into two operating routes, OI-3 and OI-4.
Fig. 3. Partitioning users’ browsing sequences into operating routes.
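To make the partitioning step concrete, here is a minimal Python sketch under the assumptions of Fig. 3: a browsing sequence is represented as a list of (page, stay time in seconds) pairs and a 10 s threshold marks target pages; the names and the toy data are illustrative rather than taken from the system:

def split_into_operating_routes(sequence, threshold=10):
    """Split one browsing sequence into operating routes.

    sequence: list of (page, stay_time_seconds) pairs in visit order.
    A page whose stay time exceeds the threshold is treated as a target
    page and closes the current operating route; the target page also
    starts the next route, as described in Section 3.2.2.
    """
    routes, current = [], []
    for page, stay_time in sequence:
        current.append(page)
        if stay_time > threshold:        # target page found
            routes.append(current)
            current = [page]             # target page begins the next route
    return routes

# Example in the spirit of Fig. 3 (stay times are made up):
bi_3 = [("A", 2), ("B", 3), ("D", 25), ("C", 4), ("E", 40)]
print(split_into_operating_routes(bi_3))
# [['A', 'B', 'D'], ['D', 'C', 'E']]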
Fig. 4. Increasing users’ operating cost.
3.2.3. The relationship between efficiency and distance to target

From the data obtained from all users' browsing sequences, the experimental website is divided into operating routes. The depth of the website is 5. The shortest path is the path with the shortest distance from an initial page to a target page; the smallest shortest path has distance 1 (see Fig. 4). User operating cost increases as the distance of the shortest path grows. In our experiment, when the shortest path has distance 1, the average operating cost is 1.4347 units, and the difference between these two values is small. However, when the distance of the shortest path is longer, say 5, the operating cost is 13.1739, and the difference between shortest path and operating cost grows rapidly. If we calculate efficiency using the ratio shortest path/operating cost, we get the values shown in Fig. 5. Efficiency decreases as the shortest path gets longer; the difference between shortest path and operating cost does not grow linearly. In other words, users spend many more steps (and more time) to reach the target even though the shortest path becomes only slightly longer. As a shortest path gets longer, users have a greater probability of making operating mistakes en route to the target. To reduce user operating mistakes, lowering the necessary operating cost offers a better solution; that is, it is better if we are able to reduce the distance to the target. Then, users will make fewer mistakes in interaction with the website and thus increase their operating-efficiency.
Fig. 5. Decreasing efficiency.
3.2.4. Getting the efficiency of a website

Fig. 6 shows the procedure for extracting critical information from each operating route in order to calculate efficiency. The field 'Shortest Path' is the minimum cost from the initial page to a target page. 'Operating Cost' is each user's actual operating cost in going from the initial page to the target page. Efficiency is calculated as the ratio Shortest Path/Operating Cost. The 'Average Efficiency' is obtained by dividing the sum of the efficiencies by the number of users (IDs). The average efficiency over all operating routes can also be considered a measure of the efficiency of the whole website, i.e. of all operating routes from initial pages to target pages. A website with a good average operating-efficiency lets users get to a specific page rapidly. The following is the algorithm to obtain a website's operating-efficiency.
Get efficiency of a web site
// OR: operating route records; OR_1: the first record of OR
EFF ← 0   // the accumulated efficiency over all operating routes
for (T ← 1) to (last record of OR) begin
    EFF ← EFF + (shortest path of OR_T from initial page to target page / length of OR_T)
end
Return (EFF / the number of records in OR)
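A minimal executable sketch of this algorithm in Python, assuming the website structure is stored as a dictionary mapping each page to the pages it links to and that each operating route is a list of visited pages; all names and the toy data are illustrative, not taken from the paper's implementation:

from collections import deque

def shortest_path_length(structure, start, goal):
    """Breadth-first search: number of link traversals from start to goal."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in structure.get(page, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal not reachable from start

def website_efficiency(structure, operating_routes):
    """Average of shortest path / operating cost over all operating routes (Eq. (1))."""
    total = 0.0
    for route in operating_routes:
        shortest = shortest_path_length(structure, route[0], route[-1])
        cost = len(route) - 1            # pages traversed between initial and target
        if shortest is None or cost == 0:
            continue                     # skip degenerate routes in this sketch
        total += shortest / cost
    return total / len(operating_routes)

# Toy example: pages A..E with a few links, and two observed operating routes.
site = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
routes = [["A", "B", "D"], ["A", "B", "A", "C", "E"]]
print(round(website_efficiency(site, routes), 3))  # 0.75: (2/2 + 2/4) / 2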
3.3. The gap between expectation and actuality

3.3.1. Expected value

'Expected value' refers to the efficiency, in a best-case scenario, of a link in a website. In Fig. 6, the efficiency of OI-001, the path from 'C' to 'H', is 15.3%. If a link from 'C' to 'E' is provided, we can assume that the user will go directly from 'C' to 'E', so that, in the best case, the items between 'C' and 'E' in OI-001 can be expected to be removed. The original operating route in this transaction is C, B, A, C, G, C, B, E, I, E, B, C, D, H; after adding a link from 'C' to 'E', the operating route, in the best case, becomes C, E, B, C, D, H. As a result, the efficiency of the modified OI-001 increases to 40.0%. Adding the link from 'C' to 'E' influences other operating routes and thus their efficiencies, and hence the website efficiency ('average efficiency') after the efficiency of each operating route is re-calculated. In this case, the website efficiency improves from 34.5 to 52.6% (Fig. 7). The modified average efficiency is called the 'expected value'. The higher the expected value, the more the website efficiency can be improved.
Fig. 6. Extracting critical information from each operating route.
In this example, after adding the link from 'C' to 'E', the expected value becomes 52.6%, which means that, in the best case, the efficiency of this website will have improved to 52.6%. Even when a link already exists between two pages, the expected value for the link can still be calculated; in this case, we use the term expected value to refer to the ideal website efficiency for the existing link. If the website efficiency is 30% and the expected value for a specific existing link is 40%, the ideal website efficiency is higher than the actual website efficiency with the link, so there is a gap
between the ideal (expected) and the real (actual) efficiency. We consider such a link to be poorly designed, so it should be removed. Fig. 8 shows examples of calculating expected values for both cases. When we calculate expected values, we also have to know the difference between the expected value and the real website efficiency. This difference helps us decide which links should be added or removed.
Fig. 7. New average efficiency after adding a link from ‘C’ to ‘E’.
Fig. 8. Expected values for both cases.
If a link does not exist but has a higher expected value, this means that adding the link would increase the website efficiency, leading to the suggestion that it ought to be added. On the other hand, if a link is present and the expected value is higher than the real website efficiency, this means that the link does not satisfy our expectation, leading to the suggestion that it ought to be removed.

3.3.2. Gap value

If we want to add a link, we should know whether this link improves web efficiency or not. As we have just seen, most of the time there is a gap (difference) between the actual website efficiency and the expected value. The bigger this difference, the more the link may be worth adding. This difference is referred to as the 'gap value', defined as the ratio of the expected value to the current website efficiency. Eq. (2) shows the calculation of the gap value:

Gap = Expected value / Current website efficiency    (2)
For example, if a website has an expected value of 70% for a link, but the current website efficiency is only 35%, the gap value is 2. A gap value of 1 means that the website efficiency is exactly the same as our expectation. The interpretation of the gap value varies according to the situation. First, assume a gap value higher than 1. If the gap value refers to a link that does not yet exist, the link is worth adding because it would improve the website efficiency. If the gap value refers to an existing link, then this link has problems, because user behavior does not correlate with designer expectation. On the other hand, assume a gap value lower than 1. If the gap value refers to a link that does not yet exist, the link is not worth adding. If the gap value refers to an existing link, then we can consider the existing link to be good, because it augments website efficiency beyond designer expectation. Table 2 summarizes these interpretations of the gap value.

Table 2
Interpretations of two aspects of the gap value

Gap value       Link exists               Link does not exist
Lower than 1    A well-designed link      Not worth building this link
Higher than 1   A problematic link        Worth building this link

Here is the algorithm to calculate the gap value for a link.

Get gap value of a link
// OR: operating route records; OR_1: the first record of OR
// LinkStart: the start page of the link; LinkEnd: the end page of the link
CurrentEfficiency ← call algorithm "Get efficiency of a web site"
If (no link between LinkStart and LinkEnd)
    add the link from LinkStart to LinkEnd
Back up all OR so they can be restored later
For (increase T ← 1) to (last record of OR) begin
    StartPosition ← 0
    EndPosition ← 0
    j ← last item position of OR_T
    For (increase i ← 1) to (last item position of OR_T) begin
        CI ← the item at position i in OR_T   // a page ID
        CJ ← the item at position j in OR_T   // a page ID
        If (CI = LinkStart and StartPosition = 0) StartPosition ← i
        If (CJ = LinkEnd and EndPosition = 0) EndPosition ← j
        j ← j - 1
    end
    Remove the items between StartPosition and EndPosition in OR_T
end
ExpectedValue ← call algorithm "Get efficiency of a web site"
If the link from LinkStart to LinkEnd was added above, remove it
Restore OR
GapValue ← ExpectedValue / CurrentEfficiency
Return (GapValue)
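Continuing that sketch, a hedged Python version of the gap-value calculation might look as follows; it reuses the website_efficiency function from the earlier sketch, and the helper name and copying strategy are our own assumptions rather than the paper's code:

def gap_value(structure, operating_routes, link_start, link_end):
    """Gap value of a (possibly nonexistent) link, following Eq. (2)."""
    current = website_efficiency(structure, operating_routes)

    # Work on copies so the real structure and routes are left untouched.
    trial_structure = {page: list(links) for page, links in structure.items()}
    if link_end not in trial_structure.get(link_start, []):
        trial_structure.setdefault(link_start, []).append(link_end)

    trial_routes = []
    for route in operating_routes:
        new_route = list(route)
        if link_start in new_route and link_end in new_route:
            i = new_route.index(link_start)                            # first occurrence of start page
            j = len(new_route) - 1 - new_route[::-1].index(link_end)   # last occurrence of end page
            if i < j:
                del new_route[i + 1:j]                                 # drop the detour between them
        trial_routes.append(new_route)

    expected = website_efficiency(trial_structure, trial_routes)
    return expected / current

Applied to the example of Figs. 6 and 7, a gap value above 1 flags a nonexistent link as worth adding and an existing link as problematic, mirroring Table 2.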
3.4. Using gap value to determine the web architecture

3.4.1. Sorting gap values per link

The gap values produced for each candidate link have to be sorted to determine the most promising link. Fig. 9 shows the sorted gap values for the links. In Fig. 9, the table on the left illustrates the situation for existing links between each 'Begin' and 'End' page; the most problematic link (with the highest gap value) is the one from 'B' to 'E'. The table on the right depicts the situation for links that do not yet exist, assuming a candidate link between each 'Begin' and 'End' page pair; the most promising link for improving the website efficiency is the one from 'B' to 'H', which has the highest gap value. The following is the algorithm to sort the gap values.
Sort gap values per link
// OR: operating route records; OR_1: the first record of OR
// Table<Start, End, GapValue, LinkExistsOrNot>: format used to save candidate links and their improvement rates
For (increase T ← 1) to (last record of OR) begin
    For (increase i ← first item position of OR_T) to (last item position of OR_T - 2) begin
        For (decrease j ← last item position of OR_T) to (j >= i + 2) begin
            CI ← the item at position i in OR_T   // a page ID
            CJ ← the item at position j in OR_T   // a page ID
            If (no link between CI and CJ) LinkExists ← false
            Else LinkExists ← true
            If (no record matching <CI, CJ> in the Table fields <Start, End>) begin
                GapVal ← call algorithm "Get gap value of a link" (link from CI to CJ)
                Add <CI, CJ, GapVal, LinkExists> to Table<Start, End, GapValue, LinkExistsOrNot>
            end
        end
    end
end
Sort the records in the Table by the field 'GapValue'
Return (Table)
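A compact Python rendering of this enumeration, again building on the gap_value sketch above; the tuple layout mirrors the Table<Start, End, GapValue, LinkExistsOrNot> used in the pseudocode and is an illustration, not the paper's code:

def sort_gap_values(structure, operating_routes):
    """Enumerate candidate (start, end) page pairs seen in operating routes
    and rank them by gap value, highest first."""
    table = {}
    for route in operating_routes:
        for i in range(len(route) - 2):
            for j in range(len(route) - 1, i + 1, -1):   # j runs down to i + 2
                start, end = route[i], route[j]
                if (start, end) in table or start == end:
                    continue                             # already evaluated this pair
                exists = end in structure.get(start, [])
                gap = gap_value(structure, operating_routes, start, end)
                table[(start, end)] = (gap, exists)
    rows = [(s, e, gap, exists) for (s, e), (gap, exists) in table.items()]
    rows.sort(key=lambda row: row[2], reverse=True)      # highest gap value first
    return rows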
3.4.2. Adaptive website to improve web efficiency automatically

Once the gap values have been sorted, we can easily find the best link for improving the website efficiency. This information can be sent to the system, which adds or removes links either automatically or interactively. User behavior may change frequently, so the adaptive website has to reflect user intentions dynamically in order to improve or maintain website efficiency. An adaptive website lets users reach their goals easily and saves them time by automatically revising any shortcomings of the website. In most cases, adding links increases the efficiency of the website. However, when a web page has too many links, it may confuse users and cause access problems. What provides the maximum benefit for users, reducing the number of links or increasing efficiency, can be controversial; in general, web designers have an idea of the number of links that is needed, based on both the content and their own design heuristics. In dealing with this problem, we consider the trade-off between the number of links and efficiency. One solution is to limit the maximum number of links per page, or for the whole website; the other is to set a website efficiency threshold and only add links until the threshold is reached. In this paper, we combine these two methods. The following describes the procedure in detail (a code sketch of the combined strategy follows the list):
Fig. 9. Sorting gap values per link.
Fig. 10. System architecture.
I. Setting the number of links. Designers set a limit on the maximum number of links and then, iteratively, add the links with the highest expected gains until this number is reached. There are two possible ways of doing so: setting a limit on the number of links for the whole website, or for each page.

II. Setting an efficiency threshold that designers want to achieve. Web designers set an efficiency threshold that the website should reach, and the system keeps adding links until the threshold is reached.

III. Combining the limited number of links and the efficiency threshold. Designers consider solutions that first satisfy the efficiency threshold and then satisfy the limits on the number of links.
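A sketch of the combined strategy (III) in Python, building on the sort_gap_values and website_efficiency helpers sketched earlier; the efficiency target, the per-page link budget, and the in-place update of the structure are illustrative assumptions, not values or code from the paper:

def adapt_website(structure, operating_routes,
                  efficiency_target=0.6, max_links_per_page=10):
    """Keep adding the most promising nonexistent links until the efficiency
    target is met, without exceeding the per-page link budget."""
    added = []
    while website_efficiency(structure, operating_routes) < efficiency_target:
        candidates = [row for row in sort_gap_values(structure, operating_routes)
                      if not row[3]                                   # link does not exist yet
                      and len(structure.get(row[0], [])) < max_links_per_page]
        if not candidates:
            break                                                     # nothing left to add
        start, end, gap, _ = candidates[0]
        if gap <= 1.0:
            break                                                     # no remaining candidate is expected to help
        structure.setdefault(start, []).append(end)
        added.append((start, end))
    return added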
4. System implementation

4.1. System architecture

The system includes two parts: one is a server-side system that lets web designers control the website design; the other provides advice to on-line users. Fig. 10 shows the system architecture of the first implementation.

4.2. Server-side control program

This part of the system is the main program on the server side, allowing designers to control the website design.
The program consists of four parts (Fig. 11): (1) the web structure mining program; (2) the web usage mining program; (3) visualization of the website structure; and (4) configurations for the adaptive website.

4.2.1. Web-structure mining program

The first step towards increasing the efficiency of a specific website is to obtain the website structure. The web structure mining program grabs the website structure and saves it into a database. When a designer types the IP address of a website into the program, the program automatically starts to mine the entire website architecture. First, the program downloads a page and analyzes its HTML code. Next, the system seeks out the hyperlinks in the page and repeats these actions for every linked page that belongs to the domain the designer entered. Finally, the program obtains the entire website architecture and saves it into the database. Fig. 12 shows the web structure mining program.

4.2.2. Web usage mining program

This program uses the algorithm described in Section 3 to calculate the operating-efficiency of the website and to improve it by adding and removing links. Fig. 13 is a screenshot of the web usage mining program. The algorithm mainly increases efficiency by finding good links to add and bad links to remove. For the algorithm to work properly, a threshold for the gap value is set.
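As an illustration of this step, a minimal decision rule of the kind the program applies, with hypothetical thresholds in the spirit of those shown later in Fig. 13 (0.7 for existing links, 0.8 for nonexistent ones); the function name and tuple layout are ours:

def decide_link_actions(ranked_links, remove_threshold=0.7, add_threshold=0.8):
    """Split ranked (start, end, gap, exists) rows into links to remove and links to add."""
    to_remove = [(s, e) for s, e, gap, exists in ranked_links
                 if exists and gap > remove_threshold]
    to_add = [(s, e) for s, e, gap, exists in ranked_links
              if not exists and gap > add_threshold]
    return to_remove, to_add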
Fig. 11. The main interface for server-side control program.
Fig. 12. The web structure mining program.
For example, in Fig. 13, the gap-value threshold for existing links was set to 0.7, which means that if the gap value of an existing link is higher than 0.7, the system removes that link. Likewise, the threshold for nonexistent links was set to 0.8, which means that if the gap value is higher than 0.8, the system adds the link.

4.2.3. Visualization for the web structure program

This program displays the web architecture and gives the designer useful information, such as the average
browsing time and the number of users accessing each page, and the gap value for each link between pages, etc. The designer can also use the program to adjust the website architecture to add or remove links directly (Fig. 14). For example, if the designer wants to know how many users access the page and the average staying time of a page, he/she simply needs to click the page icon to obtain the relevant information. Additionally, the system can suggest, to the designer, which links have to be added or removed so as to increase operating-efficiency, based on the results from
Fig. 13. The web usage mining program.
Fig. 14. Visualization of the web structure.
usage mining. When the designer wants to change the website architecture (by adding or removing links), he/she uses the mouse to drag a new link between two items, or to delete the link between two pages.

4.2.4. Configurations for the adaptive website program

This program configures the adaptive website system according to a time schedule. When the designer sets up the time schedule for execution (Fig. 15), the program will, for each scheduled period of time, automatically run the web structure and web usage mining programs, and remove/add links according to the collected user behavior (Fig. 16). The website architecture, therefore, will be different in each such period, reflecting both user access behavior and a possible increase in the operating-efficiency of the website.
4.3. Adaptive website for the client side

This program mainly gives suggestions to web users on how to access pages more efficiently and to acquire the information they want more easily. While a designer browses the website, the system adjusts the website architecture based on the usage mining results, and the website architecture is updated instantly. The changes are based on user browsing behavior, with the aim of finding the most efficiently accessible web architecture. As a result, users become more comfortable and can surf pages in the website efficiently. User browsing behavior is continually recorded in the database for the next website adjustment. Fig. 17 shows the interface for the client side.

The original website was set up on March 23, and an interval was set for a monthly mining of the website. So far, we have two revised versions of the website. Each version is based on the collected records of user browsing behavior, from which we could determine how best to increase user operating-efficiency. The first update was on April 23 (see Fig. 18, which shows the index page of the website); the system added three links and removed one. After the first update, the system deleted all browsing records in the database and continued recording user behavior for the next scheduled update. On May 23, the system changed the website architecture again; this update was based on the browsing records between 4/23 and 5/23, and the system removed two links and added none. This process is repeated according to the time schedule. While the design changes constantly, the designer may select other designs from among all the revisions; the system retains all revisions for designer selection. Should the designer be satisfied with a current design setting, he/she can shut off the system's automatic change function.
Fig. 15. Set the time schedule for the adaptive website system.
Fig. 16. The system executes mining activities according to the time schedule.
Fig. 17. System suggestions for on-line users (on the client side).
5. Discussions and conclusion

The following are the issues raised in this paper:
Fig. 18. Illustrating the process of a change to the website.
- Justification of the assumed target pages. We apply a time threshold to determine whether a page is a target page or not. This is an assumption that may not always hold; however, so far it has been shown to provide feasible solutions.
- Other factors to be considered in constructing the web structure. In this paper, we focus on improving the operating-efficiency of a website. However, there are other design aspects that can be considered when determining the web structure, such as learnability, memorability, and user satisfaction, which have not been considered in this paper.
- Placement of links within the content structure. Web structures are constructed based on content, and links are based on dependencies between the content of one web page and the information in others. Co-occurring web pages may have similar content, but determining this is often done intuitively. This paper uses 'efficiency' as the means to add links, without focusing on the web content architecture.
- Limitations on the number of links. A page with a large number of links will increase the efficiency of access to each page in the site. However, too many links may be bothersome to users and perhaps even confusing. This paper provides a way of finding efficient links within a website that can be added; however, it does not suggest the proper number of links to add. This involves other issues in interface design, and accordingly might be a critical factor in the trade-off between efficiency and the layout of the interface.

Whenever users spend a minimal operating cost to accomplish a piece of work, the efficiency of operation is high. Although the highest efficiency is 1, not all operations in an interface can reach an efficiency of 1, because of other issues that affect web design. For example, increasing the efficiency to 1 means that, for a website with N pages, each page must have N-1 links, so that every page links to every other page in the site. This is extremely efficient, since each page can be accessed from everywhere within the website; however, it almost certainly causes other problems relating to web organization. Therefore, finding a suitable number of links becomes a critical factor in determining the growth in efficiency.

Data mining technology is used to create adaptive websites. In developing the technology, we used concepts of recursive design, which improves design production. The core of this research is an algorithm to increase efficiency by adding or removing links based on user browsing behavior. For future work, we can employ the same conceptual ideas but change the core; for example, employing another (and perhaps better) way to arrange layouts from user behavior, and applying the new approach to change the design repeatedly until most designers/users are satisfied.
References

[1] Perkowitz M, Etzioni O. Adaptive web sites: an AI challenge. In: IJCAI-97; 1997.
[2] Koutri M, Daskalaki S, Avouris N. Adaptive interaction with web sites: an overview of methods and techniques. In: Computer Science and Information Technologies, CSIT; 2002.
[3] Srivastava J, Cooley R, Deshpande M, Tan P-N. Web usage mining: discovery and applications of usage patterns from web data. ACM SIGKDD Explorations; 2000.
[4] Raskin J. The humane interface. 1st ed. Menlo, CA: Stratford Publishing, Inc.; 2000.
[5] Srikant R, Yang Y. Mining web logs to improve website organization. ACM; 2001.
[6] Spiliopoulou M, Faulstich L. WUM: a web utilization miner. In: EDBT Workshop WebDB'98, Valencia, Spain; 1998.
[7] Wu K-L, Yu P-S, Ballman A. SpeedTracer: a web usage mining and analysis tool. IBM Systems Journal 1998;37(1).
[8] Zaiane O, Xin M, Han J. Discovering web access patterns and trends by applying OLAP and data mining technology on web logs. In: Advances in Digital Libraries, Santa Barbara, CA; 1998. p. 19-29.
[9] Shahabi C, Zarkesh A, Adibi J, Shah V. Knowledge discovery from users web-page navigation. In: Workshop on Research Issues in Data Engineering, Birmingham, England; 1997.
[10] Chen M-S, Park J-S, Yu P-S. Data mining for path traversal patterns in a web environment. In: 16th International Conference on Distributed Computing Systems; 1996. p. 385-92.
[11] Zarkesh A, Adibi J, Shahabi C, Sadri R, Shah V. Analysis and design of server informative WWW-sites. In: 6th International Conference on Information and Knowledge Management, Las Vegas, NV; 1997.
[12] Nakayama T, Kato H, Yamane Y. Discovering the gap between web site designers' expectations and users' behavior. Computer Networks 2000;33(1-6):811-22.
[13] Nielsen//NetRatings. Global Internet Index; March 2001.