
Data Mining - Mining Frequent Patterns, Association Rules and Frequent Sequences

This is a combined practical and exercise session with three tasks. First, we use Weka to run the Apriori algorithm on a large dataset to find a set of association rules and analyse them. Second, we run the Apriori algorithm on a small dataset to find a set of association rules and use the knowledge of the Apriori algorithm from the lectures to work out how these association rules are generated; in particular, we want to work out how the frequent itemsets are found. This will help us to familiarise ourselves with the concepts and algorithms we have learned in lectures. Third, we use Weka to run the GSP algorithm to find a set of frequent sequences.

Step 1: Launching Weka and Loading the Transaction Dataset

Launch Weka by clicking on: RunWeka.bat

Select ‘Explorer' from the list of Applications.

Select the ‘Preprocess' tab and click on ‘Open File'. Choose the file ‘supermarket.arff' which contains a large transaction dataset.

This is a point of sale transaction dataset. The data is nominal and each data instance represents a customer transaction at a supermarket, the products purchased and the departments involved.

The dataset contains 4,627 instances and 217 attributes. The data is denormalised. Each attribute is binary and either has a value ("t" for true) or no value ("?" for missing). There is a nominal class attribute called "total" that indicates whether the transaction was less than $100 (low) or greater than $100 (high). We are not interested in creating a predictive model for total. Instead we are interested in what items were purchased together. We are interested in finding useful patterns in this dataset that may or may not be related to the predicted attribute.

Step 2: Exploring the Apriori Algorithm with a Large Dataset

Select the ‘Associate' tab and make sure that "Apriori" is chosen from the associator list. Click ‘Start' and you will see the Weka bird jumping up and down while the algorithm runs, and then the associator output displayed.

Step 3: Analysing Results

The real work for mining association rules is in the interpretation of results.

From looking at the "Associator output" window, you can see that Weka displays the top 10 rules learned from the supermarket dataset:

biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 conf:(0.92)
baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696 conf:(0.92)
baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705 conf:(0.92)
biscuits=t fruit=t vegetables=t total=high 815 ==> bread and cake=t 746 conf:(0.92)
party snack foods=t fruit=t total=high 854 ==> bread and cake=t 779 conf:(0.91)
biscuits=t frozen foods=t vegetables=t total=high 797 ==> bread and cake=t 725 conf:(0.91)
baking needs=t biscuits=t vegetables=t total=high 772 ==> bread and cake=t 701 conf:(0.91)
biscuits=t fruit=t total=high 954 ==> bread and cake=t 866 conf:(0.91)
frozen foods=t fruit=t vegetables=t total=high 834 ==> bread and cake=t 757 conf:(0.91)
frozen foods=t fruit=t total=high 969 ==> bread and cake=t 877 conf:(0.91)

You can see that the rules are presented in antecedent => consequent format. The number next to the antecedent is the count of instances that contain all the items in the antecedent (in this case out of a total of 4,627). The number next to the consequent is the count of instances that contain the items in both the antecedent and the consequent. The number in brackets at the end is the confidence for the rule (the second count divided by the first count). You can also see, as displayed in the "Associator output" window, that a minimum support of 0.15 and a minimum confidence of 0.9 were used in finding frequent itemsets and selecting association rules.
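To make these numbers concrete, here is a minimal Python sketch (not part of the practical; the variable names are our own) that reproduces the support and confidence arithmetic for Rule #1 from the counts printed above.

# Support and confidence for Rule #1, from the counts in the Weka output.
TOTAL = 4627            # instances in supermarket.arff
antecedent_count = 788  # instances containing all antecedent items
joint_count = 723       # instances also containing 'bread and cake'

support = joint_count / TOTAL                 # support of the whole rule
confidence = joint_count / antecedent_count   # P(consequent | antecedent)

print(f"support = {support:.3f}")        # ~0.156, above the 0.15 minimum
print(f"confidence = {confidence:.2f}")  # ~0.92, matching conf:(0.92)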

Q1. What were the supports for Rule #1 and Rule #2?

Q2. Can you explain why the confidence with each of the rules is calculated by dividing the second count by the first count?

The algorithm is configured by a set of default values. You can click on the algorithm name and change any of its default values to re-configure it. For example, the number of top rules to report defaults to 10, the lower bound on minimum support to 0.1, and the minimum confidence to 0.9. Note that Weka does not apply the lower bound directly: it starts from an upper bound on minimum support (1.0 by default) and repeatedly lowers the threshold by a delta (0.05 by default) until it can generate the requested number of rules or reaches the lower bound. In this run it stopped at 0.15, which was already enough to generate the top 10 rules. You can change any of these default values, run the algorithm again with the new configuration, and see the effects these values have on the frequent itemsets found and association rules generated.
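To see how that schedule behaves, here is a simplified Python sketch of the support-lowering loop (an illustration of the documented behaviour, not Weka's actual code); mine is a stand-in for one Apriori pass at a given minimum support, and the demo function at the end is fabricated so the loop has something to run against.

# Simplified sketch of Weka's support-lowering schedule (not Weka's code).
# `mine` is any function returning the rules found at a given minimum support.
def find_top_rules(mine, num_rules=10, lower_bound=0.1,
                   upper_bound=1.0, delta=0.05):
    support = round(upper_bound - delta, 10)
    rules = []
    while support >= lower_bound:
        rules = mine(support)
        if len(rules) >= num_rules:
            return support, rules[:num_rules]  # enough rules: stop here
        support = round(support - delta, 10)   # lower the threshold, retry
    return lower_bound, rules

# Dummy miner: pretends that 10+ rules first appear once support <= 0.15.
demo = lambda s: ["rule"] * (12 if s <= 0.15 else 3)
print(find_top_rules(demo)[0])  # prints 0.15, matching the Weka output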

Q3. If you change the minimum support to 0.2 and run the algorithm again, how many rules are generated? Can you explain why?

Q4. If you keep the minimum support at 0.2, what can you do in order to generate top 10 rules?

Step 4: Discovering the Apriori Algorithm with a Small Dataset

We now focus on discovering how the Apriori algorithm produces this output. We want to familiarise ourselves with the theories and algorithms that we have learned in the lectures by working through an example using Weka.

Select the ‘Preprocess' tab and click on ‘Open File'. Choose the file ‘books.csv', which contains a small transaction dataset, and examine the dataset. The dataset contains transactions of book purchases in a bookshop. Even though this is a transaction dataset, the format of the file cannot be used directly by Weka; it needs to be converted into an ARFF file. An ARFF file represents a relation with a fixed number of attributes, not a transaction dataset in which each transaction consists of a set of items.

The key is to convert each transaction into a fixed-width data instance in ARFF format. This requires encoding each item as an attribute. For the Apriori algorithm the dataset can be encoded in either a dense or a sparse format. For example, suppose that there were only four possible categories of books that could be purchased in a bookshop and that four hypothetical transactions are contained in the following csv file, where the categories of books that were purchased in the same transaction are identified by the same TID.

TID, Category
01, category2
02, category1
02, category3
03, category1
03, category2
03, category3
04, category4

The above transaction dataset can be converted into two ARFF formats:

Dense format (absence from a basket is encoded as a missing value "?" in Weka):

@relation transactions
@attribute category1 {t}
@attribute category2 {t}
@attribute category3 {t}
@attribute category4 {t}

@data
?,t,?,? % TID 01: only category2 is in the basket (all others are absent)
t,?,t,? % TID 02: category1 and category3
t,t,t,? % TID 03: category1, category2 and category3
?,?,?,t % TID 04: only category4

Sparse format (only the items present in a basket are listed; note that in Weka's sparse ARFF an omitted attribute takes the first value declared for it, not a missing value, so "f" is declared first to stand for absence):

@relation book_transactions_sparse
@attribute category1 {f,t}
@attribute category2 {f,t}
@attribute category3 {f,t}
@attribute category4 {f,t}

@data
{1 t} % TID 01: attribute indices start at 0, so index 1 is category2
{0 t, 2 t} % TID 02
{0 t, 1 t, 2 t} % TID 03
{3 t} % TID 04

This issue has already been dealt with in "supermarket.arff" and the same solution can be used for converting books.csv into "books.arff". For converting a large transaction dataset, we would probably need a program, such as the sketch below. Since our dataset is very small, we can just do the conversion manually to save time.
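If you do want to automate the conversion, the following is a minimal Python sketch (assuming a two-column "TID, Category" CSV like the example above; the file names are illustrative). It writes the dense format, using "?" for items absent from a basket, as in supermarket.arff.

# Convert a "TID, Item" CSV into a dense ARFF file.
import csv
from collections import defaultdict

def csv_to_dense_arff(csv_path, arff_path, relation="transactions"):
    baskets = defaultdict(set)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                       # skip the "TID, Category" header
        for tid, item in reader:
            baskets[tid.strip()].add(item.strip())

    items = sorted({i for b in baskets.values() for i in b})
    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n")
        for item in items:
            # Note: item names containing spaces must be quoted in ARFF.
            out.write(f"@attribute {item} {{t}}\n")
        out.write("\n@data\n")
        for tid in sorted(baskets):
            row = ["t" if i in baskets[tid] else "?" for i in items]
            out.write(",".join(row) + "\n")

csv_to_dense_arff("books.csv", "books.arff")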

Q5. Convert books.csv into books.arff in the dense format (you can use any text editor).

We can now run the Apriori algorithm on books.arff to get a set of association rules.

We next work out how the Apriori algorithm generates these association rules.

Q6. Assume that the minimum support is 0.5 (minsup = 0.5). Find all the frequent itemsets. You should illustrate the process of finding these frequent itemsets (refer to slide 24 in the lecture notes).
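If you want to check your hand-worked answer for Q6, here is a minimal level-wise Apriori sketch in Python (a generic illustration, not Weka's implementation); the baskets below are placeholder data, so substitute the actual transactions from books.arff.

from itertools import combinations

baskets = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}, {"B"}]  # placeholder
minsup = 0.5

def frequent_itemsets(baskets, minsup):
    n = len(baskets)
    support = lambda items: sum(items <= b for b in baskets) / n
    # Level 1: frequent individual items.
    singles = sorted({i for b in baskets for i in b})
    level = [frozenset([i]) for i in singles if support(frozenset([i])) >= minsup]
    frequent = list(level)
    k = 2
    while level:
        # Join frequent (k-1)-itemsets into k-item candidates...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ...prune candidates with an infrequent (k-1)-subset (the Apriori
        # property), then keep candidates that meet the minimum support.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= minsup]
        frequent += level
        k += 1
    return frequent

for itemset in frequent_itemsets(baskets, minsup):
    print(set(itemset))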

Q7. Generate all association rules with their confidences greater than 0.7 from the frequent itemsets found in Q6 above.
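Continuing the sketch above, the rules for Q7 can be generated by splitting each frequent itemset of size two or more into an antecedent and a consequent and keeping the splits whose confidence exceeds the threshold.

# Rule generation from the frequent itemsets (uses the definitions above).
def association_rules(baskets, frequent, minconf):
    count = lambda items: sum(items <= b for b in baskets)
    rules = []
    for itemset in frequent:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count(itemset) / count(lhs)  # conf(lhs => itemset - lhs)
                if conf > minconf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

for lhs, rhs, conf in association_rules(baskets, frequent_itemsets(baskets, 0.5), 0.7):
    print(f"{lhs} => {rhs} conf: {conf:.2f}")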

Q8. If a user has bought a book from the Mystery category, from which category will he/she most probably buy another book? Please explain your reasoning.

Step 5: Exploring the Generalised Sequential Patterns (GSP) Algorithm with a Small Dataset

Select the ‘Preprocess' tab and click on ‘Open File'. Choose the file ‘SequentialPatterns.arff', which contains three sequences. Note that the itemsets/elements belonging to the same sequence are identified by a common sequence ID.

Q9. Can you write down the three sequences?

Select the ‘Associate' tab and make sure that "GeneralizedSequentialPatterns" is chosen from the associator list. Click ‘Start' and you will see the associator output displayed. As you can see, four sets of frequent subsequences are displayed. The number at the end of each subsequence represents the number of sequences in the dataset that contain that subsequence.
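The support counting behind those numbers is simply subsequence containment. Here is a minimal Python sketch (a generic illustration, not Weka's code) that checks whether a candidate subsequence is contained in a data sequence, where a sequence is a list of itemsets; the data at the end is made up for demonstration.

# A candidate is contained in a data sequence if each of its elements can be
# matched, in order, to a (superset) element of the data sequence.
def contains(sequence, candidate):
    pos = 0
    for element in candidate:          # each element is a set of items
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1                   # scan forward for a matching element
        if pos == len(sequence):
            return False               # no element left to match against
        pos += 1                       # matched; continue after this element
    return True

# The support of a candidate is the number of data sequences containing it.
data = [[{"a"}, {"b", "c"}, {"d"}],
        [{"a", "b"}, {"d"}],
        [{"b"}, {"c"}]]
print(sum(contains(s, [{"a"}, {"d"}]) for s in data))  # prints 2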

Q10. Can you illustrate the process of finding these frequent subsequences?
