How does value compare to empirical distribution of feature


Problem I

Load the iris sample dataset from sklearn (load iris()) into Python using a Pandas dataframe. Induce a set of binary Decision Trees with a minimum of 2 instances in the leaves, no splits of subsets below 5, and an maximal tree depth from 1 to 5 (you can leave other parameters at their defaults). Which depth values result in the highest Recall? Why? Which value resulted in the lowest Precision? Why? Which value results in the best F1 score? Explain the difference between the micro/macro/weighted methods of score calculation.

Problem II

Load the Breast Cancer Wisconsin (Diagnostic) sample dataset from the UCI Machine Learning Repository (The discrete version at: breast-cancer- wisconsin.data) into Python using a Pandas dataframe. Induce a binary Decision Tree with a minimum of 2 instances in the leaves, no splits of subsets below 5, and a maximal tree depth of 2 (use the default Gini criterion). Calculate the Entropy, Gini, and Misclassification Error of the first split - what is the Information Gain? What is the feature selected for the first split, and what value determines the decision boundary?

Problem III

Load the Breast Cancer Wisconsin (Diagnostic) sample dataset from the UCI Machine Learning Repository (The continuous version at: wdbc.data) into

Assigned: Due: February 09, 2020 Homework 2 February 22, 2020

Python using a Pandas dataframe. Induce the same binary Decision Tree as above (now using the continuous data) but perform a PCA dimensionality reduction beforehand. Using only the first principal component of the data for a model fit, what is the F1, Precision, and Recall of the PCA-based single factor model compared to the original (continuous) data? Repeat using the first and second principal components. Using the Confusion Matrix, what are the values for FP and TP as well as FPR/TPR? Is using continuous data in this case beneficial within the model? How?

Problem IV

Simulate a binary classification dataset with a single feature using a mixture of normal distributions with NumPy (Hint: Generate two data frames with the random number and a class label, and combine them together). The normal distribution parameters (np.random.normal) should be (5,2) and (-5,2) for the pair of samples. Induce a binary Decision Tree of maximum depth 2, and obtain the threshold value for the feature in the first split. How does this value compare to the empirical distribution of the feature?

Request for Solution File

Ask an Expert for Answer!!
Database Management System: How does value compare to empirical distribution of feature
Reference No:- TGS03213168

Expected delivery within 24 Hours