DATA: Breast Cancer Wisconsin (Original) dataset from the UCI Machine Learning Repository
Data Set Characteristics: Multivariate
Number of Instances: 699
Attribute Characteristics: Integer
Number of Attributes: 10
Missing Values: Yes.
Dataset description
- Sample Code Number: ID
- Clump Thickness: 1-10
- Uniformity of cell size: 1-10
- Uniformity of cell shape: 1-10
- Marginal Adhesion: 1-10
- Single Epithelial Cell Size: 1-10
- Bare Nuclei: 1-10
- Bland Chromatin: 1-10
- Normal Nucleoli: 1-10
- Mitoses: 1-10
- Class: 2 for benign, 4 for malignant
This dataset presents a new challenge: it contains missing data. So far in this blog we have never worked with a dataset where data is missing, so this calls for some data exploration. We need to gather information about the data, i.e. some EDA (Exploratory Data Analysis). But first, import the libraries:
Step 1:
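The original import cell is not shown; a typical set for this kind of walkthrough (pandas for the data, NumPy for arrays, with plotting libraries joining in for the later heatmap) would look like:

```python
# Core libraries used throughout this walkthrough
import numpy as np
import pandas as pd

# For the correlation heatmap in Step 6 you would additionally import:
# import matplotlib.pyplot as plt
# import seaborn as sns
```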
Step 2:
Read the data and check for interesting information, i.e. missing data, or rather which columns contain it.
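The original snippet is not shown. A sketch, assuming the UCI file `breast-cancer-wisconsin.data` with no header row; the three inline rows are a stand-in sample (the third carries a '?') so the snippet runs on its own:

```python
import io
import pandas as pd

# Column names follow the dataset description above
cols = ["Sample_code_number", "Clump_thickness", "Uniformity_of_cell_size",
        "Uniformity_of_cell_shape", "Marginal_adhesion",
        "Single_epithelial_cell_size", "Bare_nuclei", "Bland_chromatin",
        "Normal_nucleoli", "Mitoses", "Class"]

# In practice: df = pd.read_csv("breast-cancer-wisconsin.data", names=cols)
# Inline stand-in rows (the third has a '?' in Bare_nuclei):
sample = ("1000025,5,1,1,1,2,1,3,1,1,2\n"
          "1002945,5,4,4,5,7,10,3,2,1,2\n"
          "1057013,8,4,5,1,2,?,7,3,1,4\n")
df = pd.read_csv(io.StringIO(sample), names=cols)

df.info()  # prints non-null counts and dtypes per column
```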
Step 3:
Check the output. While all the other fields show up as integers, as they should, Bare_nuclei is shown as an object, which identifies the culprit.
Step 4:
Here we need some jugglery with a regular expression to find out what exactly sits in place of the missing data. It may be tempting to just open the file and check manually, but that gets messy with a large dataset. So here we go:
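The regex check can be sketched like this (self-contained, using the same three stand-in rows; on the full dataset the count comes out to 16):

```python
import io
import pandas as pd

cols = ["Sample_code_number", "Clump_thickness", "Uniformity_of_cell_size",
        "Uniformity_of_cell_shape", "Marginal_adhesion",
        "Single_epithelial_cell_size", "Bare_nuclei", "Bland_chromatin",
        "Normal_nucleoli", "Mitoses", "Class"]
sample = ("1000025,5,1,1,1,2,1,3,1,1,2\n"
          "1002945,5,4,4,5,7,10,3,2,1,2\n"
          "1057013,8,4,5,1,2,?,7,3,1,4\n")
df = pd.read_csv(io.StringIO(sample), names=cols)

# Flag every Bare_nuclei entry that is NOT a plain run of digits
non_numeric = ~df["Bare_nuclei"].astype(str).str.match(r"^\d+$")
print(df.loc[non_numeric, "Bare_nuclei"].unique())  # -> ['?']
print(non_numeric.sum())  # 1 in this sample; 16 on the full dataset
```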
and here is the output, showing that the missing entries have been populated with '?' and that there are 16 of them in total.
Step 5:
Here we have two options: either drop the rows with missing data, or use the Imputer class from sklearn.preprocessing. But the second option has a problem. If the dataset is skewed in favor of either class and this feature happens to be the most influential, imputation may leave us with too many misclassifications. Also notice that only 16 of the 699 rows are affected, a very low percentage. Hence we go for the first option, then separate the target from the features, i.e. decide on Xs and ys.
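A sketch of dropping the '?' rows and splitting off the target (same stand-in sample as before; on the full dataset this removes 16 of 699 rows):

```python
import io
import pandas as pd

cols = ["Sample_code_number", "Clump_thickness", "Uniformity_of_cell_size",
        "Uniformity_of_cell_shape", "Marginal_adhesion",
        "Single_epithelial_cell_size", "Bare_nuclei", "Bland_chromatin",
        "Normal_nucleoli", "Mitoses", "Class"]
sample = ("1000025,5,1,1,1,2,1,3,1,1,2\n"
          "1002945,5,4,4,5,7,10,3,2,1,2\n"
          "1057013,8,4,5,1,2,?,7,3,1,4\n")
df = pd.read_csv(io.StringIO(sample), names=cols)

# Drop the rows whose Bare_nuclei is the '?' placeholder
df = df[df["Bare_nuclei"] != "?"]

# Features: everything except the ID column and the target
X = df.drop(columns=["Sample_code_number", "Class"]).values
y = df["Class"].values
print(X.shape, y.shape)
```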
Step 6:
Now, just to satisfy ourselves, let's take a look at the correlation between the class and the different features. But there is a problem: Bare_nuclei is still an object, not an integer, so it would never appear in our heatmap. Converting it to an integer is what we do in this part of the code.
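The cast and the correlation can be sketched as follows. The four rows here are hypothetical stand-ins chosen so every column varies; the seaborn call that would render the heatmap is left as a comment so the snippet needs only pandas:

```python
import io
import pandas as pd

cols = ["Sample_code_number", "Clump_thickness", "Uniformity_of_cell_size",
        "Uniformity_of_cell_shape", "Marginal_adhesion",
        "Single_epithelial_cell_size", "Bare_nuclei", "Bland_chromatin",
        "Normal_nucleoli", "Mitoses", "Class"]
sample = ("1000025,5,1,1,1,2,1,3,1,1,2\n"
          "1002945,5,4,4,5,7,10,3,2,1,2\n"
          "1017122,8,10,10,8,7,10,9,7,3,4\n"
          "1057013,8,4,5,1,2,?,7,3,1,4\n")
df = pd.read_csv(io.StringIO(sample), names=cols)
df = df[df["Bare_nuclei"] != "?"]

# Bare_nuclei is still an object column; cast it so it joins the correlation
df["Bare_nuclei"] = df["Bare_nuclei"].astype(int)

corr = df.drop(columns=["Sample_code_number"]).corr()
print(corr["Class"])  # correlation of each feature with the class label
# To draw the heatmap (requires seaborn): sns.heatmap(corr, annot=True)
```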
As the result shows, our fear about Bare_nuclei turns out to be true: it does play a major role in classification.
Now, we can safely put the rest of the work into our template: split the data, standardize it, fit the training data, and finally check the misclassifications.
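The template itself is not shown and the post does not name a classifier, so this is a minimal sketch using scikit-learn with LogisticRegression as a stand-in, and a synthetic matrix in place of the cleaned X and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cleaned data: 60 samples, 9 features in 1-10
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(60, 9)).astype(float)
y = np.where(X.sum(axis=1) > 49, 4, 2)  # hypothetical labels: 2 benign, 4 malignant

# Split, standardize (fit the scaler on training data only), fit, evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

misclassified = (clf.predict(scaler.transform(X_test)) != y_test).sum()
print(f"Misclassified samples: {misclassified} of {len(y_test)}")
```

Fitting the scaler on the training split alone, then reusing it on the test split, keeps test information from leaking into the standardization.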