Text Mining Using STM™, CART®, and TreeNet® from Salford Systems

TUTORIAL J

Text Mining Using STM™, CART®, and TreeNet® from Salford Systems: Analysis of 16,000 iPod Auctions on eBay

Dan Steinberg [email protected]

Mykhaylo Golovnya [email protected]

Ilya Polosukhin [email protected]

CONTENTS
Installing the Salford Text Miner
Comments on the Challenge

Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. DOI: 10.1016/B978-0-12-386979-1.00018-9 © 2012 Dan Steinberg, Mykhaylo Golovnya, Ilya Polosukhin. Published by Elsevier Inc.

This tutorial is based on the 2006 edition of a data mining competition (DMC2006). For the original competition website, visit http://www.data-mining-cup.de/en/review/dmc-2006/. We recommend that you visit this site for information only; the data, and the tools for preparing them, can be downloaded from the URLs given in this tutorial.

Text mining is an important and fascinating area of modern analytics. On the one hand, text mining can be thought of as just another application area for powerful learning machines. On the other hand, text mining is a distinct field with its own dedicated concepts, vocabulary, tools, and techniques. In this tutorial we aim to illustrate some important analytical methods and strategies from both perspectives by introducing tools that are specific to the analysis of text and by deploying general machine learning technology. The Salford Text Mining (STM) utility is the text processing system used in this tutorial to prepare data for machine learning analytics. This is followed by predictive modeling using the Salford Systems CART® decision tree and the TreeNet® stochastic gradient boosting engine. (Evaluation copies of the proprietary technology in CART and TreeNet, as well as the STM, are available from http://www.salford-systems.com.)

To follow along with this tutorial, you may want to have the analytical tools being demonstrated installed on your computer. Everything you need may already be on the CD containing this tutorial and the analytical software; if not, you can download it using the links that follow. Create an empty folder on your computer hard drive named “stmtutor.” This is the root folder where all of the work files related to this tutorial will reside. Download the Salford Predictive Modeler (SPM) from http://www.salfordsystems.com/dist/SPM/SPM680_Mulitple_Installs_2011_06_07.zip. After downloading the package, unzip its contents into “stmtutor,” which will create a new folder named “SPM680_Mulitple_Installs_2011_06_07.”

The Salford Systems software you’ve just downloaded needs to be both installed and licensed. To install the specific version of SPM used in this tutorial, double click on the “Install_a_Transform_SPM.exe” file located in the “SPM680_Mulitple_Installs_2011_06_07” folder and follow the simple installation steps on your screen. Free license codes for a 30-day period are available on request to readers of this tutorial. (Be aware, however, that Salford Systems reserves the right to decline to offer the free license at its discretion.)

For the STM package, prepared data files, and other utilities developed for this tutorial, please visit http://www.salford-systems.com/dist/STM.zip. After downloading the archive, unzip its contents into the “stmtutor” folder. When you launch the SPM, you will be greeted with a License dialog containing the information needed to secure a license via email (Figure J.1).

FIGURE J.1 License dialog.

Send the information shown in the dialog to Salford Systems. An “Unlock Code” will be emailed back to you; entering it in the dialog secures your license. The software will operate for three days without any licensing, but you can secure a 30-day license on request.

INSTALLING THE SALFORD TEXT MINER

In addition to the Salford Predictive Modeler (SPM), you will also work with the Salford Text Miner (STM) software. No installation is needed, and you should already have the “stm.exe” executable in the “stmtutor\STM\bin” folder from unzipping the “STM.zip” package. STM builds upon the Python 2.6 distribution and the NLTK (Natural Language Toolkit), and it makes text data processing for analytics very easy to conduct and manage. Expect to see several folders and a large number of files located under the “stmtutor\STM” folder. It is important to leave these files in the location to which you have installed them. In order for the software to work properly, please do not move or alter any of the installed files other than those explicitly listed as user-modifiable.

NOTE: “stm.exe” will expire in the middle of 2012. Please contact Salford Systems to get an updated version beyond this date (http://www.salford-systems.com/). Their phone number is 619-543-8880, and their email address is [email protected].

The best examples are drawn from real-world data sets, and we were fortunate to locate data publicly released by eBay. Good teaching examples also need to be simple. Unfortunately, real-world text mining could easily involve hundreds of thousands, if not millions, of features characterizing billions of records. Professionals need to be able to tackle such problems, but to learn, we need to start with simpler situations. Fortunately, there are many applications in which text is important but the dimensions of the data set are radically smaller, either because the data available are limited or because a decision has been made to work with a reduced problem. In this tutorial, we use our simpler example to illustrate many useful ideas for beginning text miners, while pointing the way to working on larger problems.

In 2006 the DMC data mining competition (restricted to student competitors only) introduced a predictive modeling problem for which much of the predictive information was in the form of unstructured text. These data sets can be downloaded from http://www.data-mining-cup.de/en/review/dmc-2006/. For your convenience, however, we have repackaged these data and made them somewhat easier to work with. The repackaged data are included in the STM package described at the beginning of this tutorial. The data summarize 16,000 iPod auctions held on eBay from May 2005 through May 2006 in Germany. Each auction item is represented by a text description written by the seller (in German), as well as a number of flags and features available to the seller at the time of the auction. Auction items were grouped into 15 mutually exclusive categories based on distinct iPod features: storage size, type (regular, mini, nano), and color. The competition’s goal was to predict whether the closing price would be above or below the category average.
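
To make the prediction target concrete, the following is a minimal Python sketch of how an above/below-category-average flag could be derived from auction records. It is not the competition's or the STM package's code: the field names ("category", "price"), the category labels, the prices, and the choice to treat a price exactly equal to the average as "not above" are all assumptions made for illustration.

```python
from collections import defaultdict

def label_above_category_average(auctions):
    """Return one 0/1 label per auction: 1 if its closing price is strictly
    above the mean closing price of its own category, otherwise 0."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a in auctions:
        totals[a["category"]] += a["price"]
        counts[a["category"]] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    return [1 if a["price"] > means[a["category"]] else 0 for a in auctions]

# Toy records; field names, categories, and prices are hypothetical.
auctions = [
    {"category": "nano_2GB_black", "price": 120.0},
    {"category": "nano_2GB_black", "price": 150.0},
    {"category": "mini_4GB_silver", "price": 90.0},
    {"category": "mini_4GB_silver", "price": 70.0},
    {"category": "mini_4GB_silver", "price": 80.0},
]
labels = label_above_category_average(auctions)
print(labels)
```

Because the average is computed within each category, an auction is compared only with items of the same storage size, type, and color, which is what turns the problem into a binary classification task.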

COMMENTS ON THE CHALLENGE

One might think that a challenge written in German would not be of general interest outside of Germany. However, working with a language that is essentially unfamiliar to every member of the analysis team helps to illustrate one important point: Text mining via tools that have no “understanding” of the language can be strikingly effective.
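
As a rough illustration of what processing text without any “understanding” of the language can look like, the sketch below turns seller descriptions into word-presence indicator columns using only character-level tokenization. This is a generic bag-of-words example under our own assumptions, not a description of what STM does internally; the function names, the `min_docs` threshold, and the sample listings are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/digits; no stemming, stop lists,
    # or dictionaries are involved, so no knowledge of German is required.
    return re.findall(r"\w+", text.lower())

def build_vocabulary(docs, min_docs=2):
    # Keep only terms that occur in at least `min_docs` documents.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(tokenize(d)))
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)

def to_indicator_rows(docs, vocab):
    # One row per document, one 0/1 column per vocabulary term.
    rows = []
    for d in docs:
        present = set(tokenize(d))
        rows.append([1 if t in present else 0 for t in vocab])
    return rows

# Made-up listings in the style of German auction titles.
docs = [
    "Apple iPod nano 2GB schwarz NEU OVP",
    "iPod mini 4GB silber gebraucht",
    "Apple iPod nano 4GB weiss neu",
]
vocab = build_vocabulary(docs)
rows = to_indicator_rows(docs, vocab)
print(vocab)
print(rows)
```

Indicator columns of this kind can be handed to any general-purpose learner alongside the structured auction fields; the learner, not the text processor, decides which words matter.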

We have no doubt that dedicated tools that embed knowledge of the language being analyzed can yield predictive benefits. We also believe we could have gained further valuable insight into the data if any of the authors spoke German! But our performance without this knowledge is still impressive. In contexts where simple methods can yield more than satisfactory results, or in contexts where the same methods must be applied uniformly across multiple languages, the methods described in this tutorial can provide excellent guidance.  

Figure J.2 summarizes how the four basic models built in this tutorial rank against the 173 official competition entries. The TreeNet (TN) model with text mining processing is among the top 10 winners!

FIGURE J.2 Four models.

To work through this tutorial, please go to the companion website, find “Tutorial J” in the electronic folder, and open the PowerPoint presentation of over 100 slides. The tutorial itself is presented as this PPT, which is extremely thorough and takes you through all of the steps needed to learn how to use the Salford Systems software and its new text mining module.