Friday, May 18, 2018

Logistic Regression

Why is linear regression not useful for classification problems?
In linear regression the Y variable is always continuous (a real-valued quantity). If the Y variable is categorical (e.g. yes or no), a linear regression model is not appropriate. Logistic regression can be used to model and solve such problems, also called binary classification problems.

INTRODUCTION

Logistic regression is a supervised learning method. Under supervised learning techniques, the learning models that are categorized under statistical methods are instance-based learning methods, Bayesian learning methods and regression analysis. Let us focus on regression analysis and other related regression models. Regression analysis is known to be one of the most important statistical techniques. As mentioned, it is a statistical methodology that is used to measure the relationship, and check the validity and strength of the relationship, between two or more variables. Traditionally, researchers, analysts, and traders have used regression analysis to build trading strategies and to understand the risk contained in a portfolio. Regression methods are used to address both classification and prediction problems.

The method serves two purposes: (1) it can predict the value of the dependent variable for new values of the independent variables, and (2) it can help describe the relative contribution of each independent variable to the dependent variable, controlling for the influences of the other independent variables. The four main multi-variable methods used in health science are linear regression, logistic regression, discriminant analysis, and proportional hazards regression. The four multi-variable methods have many mathematical similarities but differ in the expression and format of the outcome variable. In linear regression, the outcome variable is a continuous quantity, such as blood pressure. In logistic regression, the outcome variable is usually a binary event, such as alive versus dead, or case versus control. In discriminant analysis, the outcome variable is a category or group to which a subject belongs. For only two categories, discriminant analysis produces results similar to logistic regression. Logistic regression is the most popular multi-variable method used in health science (Tetrault, Sauler, Wells, & Concato, 2008). In this article logistic regression (LR) will be presented from basic concepts to interpretation.

MATHEMATICAL MODEL



Logistic regression, sometimes called the logistic model or logit model, analyzes the relationship between multiple independent variables and a categorical dependent variable, and estimates the probability of occurrence of an event by fitting data to a logistic curve. There are two models of logistic regression: binary logistic regression and multinomial logistic regression. Binary logistic regression is typically used when the dependent variable is dichotomous and the independent variables are either continuous or categorical. When the dependent variable is not dichotomous and is comprised of more than two categories, a multinomial logistic regression can be employed. As an illustrative example, consider how coronary heart disease (CHD) can be predicted by the level of serum cholesterol. The probability of CHD increases with the serum cholesterol level. However, the relationship between CHD and serum cholesterol is nonlinear and the probability of CHD changes very little at the low or high extremes of serum cholesterol. This pattern is typical because probabilities cannot lie outside the range from 0 to 1. The relationship can be described as an ‘S’-shaped curve. The logistic model is popular because the logistic function, on which the logistic regression model is based, provides estimates in the range 0 to 1 and an appealing S-shaped description of the combined effect of several risk factors on the risk for an event (Kleinbaum & Klein, 2010).

The logistic regression method is used for classification problems. Instead of predicting the response values 0 and 1 directly, as in linear regression, it predicts the probability of the response variable, making sure the fitted curve stays within the range of the response variable:

y ∈ [0, 1]

whereas a linear regression prediction can take any value in R = (−∞, +∞), where y is the response variable.

The term logistic refers to
"logit" = "log odds"
ODDS
Odds of an event are the ratio of the probability that the event will occur to the probability that it will not occur. If the probability of an event occurring is p, the probability of the event not occurring is (1 − p). The corresponding odds are then given by

odds = p / (1 − p)

Let the probability P(Y = 1 | x) = p(x), where "1" denotes the positive class or category rather than a numeric value. While the linear predictor βx can take any value in R, p(x) ∈ [0, 1], where p(x) is the sigmoid function:

p(x) = 1 / (1 + e^(−βx))

Fig. 1 S-curve (sigmoid function)
Pic Courtesy: Wiki
The above equation can be rewritten in logit (log odds) form as log(p(x) / (1 − p(x))) = βx.
Parameter Estimation
The goal is to estimate the right-hand side of the above equation, i.e. the β vector. Logistic regression uses maximum likelihood for parameter estimation. Let us see how it works.

Consider N samples with labels 0 and 1.

For a sample with label 1, estimate β such that p(x) is as close to 1 as possible.
For a sample with label 0, estimate β such that p(x) is as close to 0 as possible.

Therefore, for every sample we can write its contribution mathematically as:

for label 1, the highest possible value = p(xi)
for label 0, the lowest possible value = 1 − p(xi)

where xi is the feature vector for the ith sample.

To estimate β, take the product of these highest and lowest possible responses over all the elements of the database; this likelihood must be maximised:

L(β) = Π(i: yi = 1) p(xi) · Π(i: yi = 0) (1 − p(xi)) = Π(i = 1..N) p(xi)^yi (1 − p(xi))^(1 − yi)

Now take the log of the likelihood function and convert the product into a summation:

l(β) = Σ(i = 1..N) [ yi log p(xi) + (1 − yi) log(1 − p(xi)) ]

Put p(xi) = 1 / (1 + e^(−βxi)) back into the above equation, group the yi terms together and simplify further to get:

l(β) = Σ(i = 1..N) [ yi βxi − log(1 + e^(βxi)) ]

The resulting equation is transcendental, i.e. it has no closed-form solution. The goal is to find the β that maximises this function.
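Since this equation has no closed-form solution, β is usually found numerically. Below is a minimal sketch (not the original post's code) of maximising the log likelihood by gradient ascent with NumPy; the names X, y, beta, lr and n_iter are illustrative assumptions.

import numpy as np

def sigmoid(z):
    # p(x) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    # X: (N, d) feature matrix (include a column of ones for the intercept), y: (N,) labels in {0, 1}
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)            # predicted probabilities p(x_i)
        gradient = X.T @ (y - p)         # gradient of the log likelihood with respect to beta
        beta += lr * gradient / len(y)   # gradient ascent step
    return beta

# Illustrative usage on synthetic data
X = np.hstack([np.ones((100, 1)), np.random.randn(100, 2)])
y = (X[:, 1] + X[:, 2] > 0).astype(float)
print(fit_logistic(X, y))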


Step By Step Execution of Algorithm

  • Choosing a classification algorithm
Choosing an appropriate classification algorithm for a particular problem task requires practice; each algorithm has its own quirks and is based on certain assumptions. To restate the "No Free Lunch" theorem: no single classifier works best across all possible scenarios. In practice, it is always recommended to compare the performance of at least a handful of different learning algorithms to select the best model for the particular problem; these may differ in the number of features or samples, the amount of noise in a dataset, and whether the classes are linearly separable or not. The five main steps that are involved in training a machine learning algorithm can be summarized as follows:

  1. Selection of features.
  2. Choosing a performance metric.
  3. Choosing a classifier and optimization algorithm.
  4. Evaluating the performance of the model.
  5. Tuning the algorithm.

Since the approach of this blog is to build machine learning knowledge step by step, the main focus is on the principal concepts of the different algorithms.

  • Probabilities with logistic regression
Although the perceptron rule offers a nice and easygoing introduction to machine learning algorithms for classification, its biggest disadvantage is that it never converges if the classes are not perfectly linearly separable. Intuitively, the reason is that the weights are continuously being updated, since there is always at least one misclassified sample present in each epoch. Of course, we can change the learning rate and increase the number of epochs, but be warned that the perceptron will never converge on such a dataset. To make better use of our time, we will now take a look at another simple yet more powerful algorithm for linear and binary classification problems: logistic regression. Note that, in spite of its name, logistic regression is a model for classification, not regression.

  • DATA

The dataset comes from the UCI Machine Learning repository, and it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y). The dataset can be downloaded from here.


Fig. 2 Code overview
Photo courtesy: www.quantinsti.com
  • Input variables

  1. age (numeric)
  2. job : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
  3. marital : marital status (categorical: “divorced”, “married”, “single”, “unknown”)
  4. education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
  5. default: has credit in default? (categorical: “no”, “yes”, “unknown”)
  6. housing: has housing loan? (categorical: “no”, “yes”, “unknown”)
  7. loan: has personal loan? (categorical: “no”, “yes”, “unknown”)
  8. contact: contact communication type (categorical: “cellular”, “telephone”)
  9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  10. day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). The duration is not known before a call is performed, also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
  16. emp.var.rate: employment variation rate — (numeric)
  17. cons.price.idx: consumer price index — (numeric)
  18. cons.conf.idx: consumer confidence index — (numeric)
  19. euribor3m: euribor 3 month rate — (numeric)
  20. nr.employed: number of employees — (numeric)
  • Predict variable (desired target):

y — has the client subscribed to a term deposit? (binary: “1” means “Yes”, “0” means “No”)

Step 1

import all necessary libraries as,
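The code for this step appears as an image in the original post; a minimal sketch of the imports and data loading might look like the following (the local file name 'banking.csv' is an assumption):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the UCI bank marketing data; the file name is an assumption
data = pd.read_csv('banking.csv', header=0)
data = data.dropna()
print(data.shape)          # number of rows and columns
print(list(data.columns))  # the input variables and the target y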

output 


Step 2

The education column of the dataset has many categories and we need to reduce them for better modelling. The education column contains the categories listed above (“basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”). Let us group “basic.4y”, “basic.9y” and “basic.6y” together and call them “basic”.
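A possible way to do this grouping with NumPy, assuming the dataframe from Step 1 is called data (a sketch, not the original code):

import numpy as np

# Group the three "basic" education levels into a single category
data['education'] = np.where(data['education'] == 'basic.9y', 'Basic', data['education'])
data['education'] = np.where(data['education'] == 'basic.6y', 'Basic', data['education'])
data['education'] = np.where(data['education'] == 'basic.4y', 'Basic', data['education'])
print(data['education'].unique())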


output 


Step 3

Observations

The average age of customers who bought the term deposit is higher than that of the customers who didn’t. Calculate categorical means for other categorical variables, such as education and marital status, to get a more detailed sense of our data. Now create dummy variables, that is, variables with only two values, zero and one.
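A sketch of how these group means and dummy variables can be computed with pandas (column names are taken from the variable list above; the original code is an image):

# Means of the numeric columns grouped by the target variable y
print(data.groupby('y').mean(numeric_only=True))
# Categorical means for education and marital status
print(data.groupby('education').mean(numeric_only=True))
print(data.groupby('marital').mean(numeric_only=True))

# Create dummy (0/1) variables for the categorical columns
cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan',
            'contact', 'month', 'day_of_week', 'poutcome']
data = pd.get_dummies(data, columns=cat_vars)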


Step 4

To get the final data columns, write:
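A sketch of how the final feature matrix X and target vector y could be assembled after the dummy encoding (assuming the dataframe from the previous step):

# Every column except the target y is used as a feature
final_columns = [col for col in data.columns if col != 'y']
X = data[final_columns]
y = data['y']
print(X.columns.values)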


output of final array


Step 5

Feature Selection


Recursive Feature Elimination (RFE) is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
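A minimal sketch of RFE with scikit-learn; the choice of 18 features to keep is purely illustrative:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(estimator=logreg, n_features_to_select=18)  # illustrative number of features
rfe = rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # ranking of all features (1 = selected)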


output


Step 6

Fit the model with the logistic regression algorithm and predict the test results.
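A sketch of the fitting and prediction step, assuming X and y hold the (selected) features and the target from the previous steps (the 70/30 split is an assumption):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)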


Step 7

Calculate the accuracy
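A sketch of the accuracy calculation, together with a cross-validation check (the 10-fold setting is an assumption), continuing from the Step 6 sketch:

from sklearn.model_selection import cross_val_score

print('Accuracy on the test set: {:.2f}'.format(logreg.score(X_test, y_test)))
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=10, scoring='accuracy')
print('10-fold cross-validation average accuracy: {:.3f}'.format(scores.mean()))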


output
The average accuracy remains very close to the Logistic Regression model accuracy; hence, we can conclude that our model generalizes well.

Step 8

Confusion matrix
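A sketch of computing the confusion matrix with scikit-learn, using the predictions from Step 6:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)   # rows are actual classes, columns are predicted classes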


Output


The result is telling us that we have 10872+254 correct predictions and 1122+109 incorrect predictions

Logistic regression vs. other approaches


Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution P(y | x) is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0, 1) through the logistic distribution function, because logistic regression predicts the probability of particular outcomes rather than the outcomes themselves.
Logistic regression is an alternative to Fisher's 1936 method, linear discriminant analysis. If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis.

EXAMPLE  2

The dataset comes from the UCI Machine Learning Repository and is related to the Iris plant database, containing the flower classes setosa, versicolor and virginica. The attributes are sepal length, sepal width, petal length and petal width.

Step 1

import the required libraries as,
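A sketch of the imports and data loading (the original code is shown as an image):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)   # (150, 4) (150,)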

Step 2

Split the data into training and test sets as,
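A sketch of the split; the 2x2 confusion matrix reported in Step 4 suggests that only two of the three classes were used, so as an assumption the sketch keeps versicolor and virginica, and the 70/30 split is also an assumption:

X = iris.data[50:, :]     # versicolor and virginica samples
y = iris.target[50:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)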

Step 3

apply the logistic regression model
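A sketch of fitting the logistic regression model on the training data from the previous step:

classifier = LogisticRegression()
classifier.fit(X_train, y_train)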
Output

Step 4

Make predictions and get the confusion matrix
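A sketch of the prediction and confusion matrix step:

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)   # diagonal entries are correct predictions, off-diagonal entries are errors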

Output

The result is telling us that we have 13+15 correct predictions and 0+2 incorrect predictions
Reference: Sebastian Raschka, Python Machine Learning
















Tuesday, May 1, 2018

Perceptron Model

A computer program is said to 'learn' from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

PROGRAMMING LANGUAGE
Python 3.6 is the version of the language used in the blog code. The main reason for a new version of Python is to remove all the small problems and nit-picks that have accumulated over the years and to make the language cleaner. If you already have a lot of Python 2.x code, there is a utility to assist you in converting 2.x to 3.x source. In my opinion, the best book for learning the Python language is “Python Crash Course: A Hands-On, Project-Based Introduction to Programming”, written by Eric Matthes and available online.

INTRODUCTION
Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to ‘learn’ (i.e. progressively improve performance on a specific task) from data, without being explicitly programmed. Most sciences are empirical in nature, which means their theories are based on robust empirical phenomena, such as the law of combining volumes and the ideal gas law. A growing number of machine learning researchers are focusing their efforts on discovering analogous phenomena in the behaviour of learning systems, and this is an encouraging sign.

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers (the task of classifying the elements of a given set into two groups on the basis of a classification rule). The perceptron was the first artificial neural network. Artificial neural networks implement a computational paradigm inspired by the anatomy of the brain. The corresponding algorithms simulate simple processing units (called neurons) linked through a complex web of connections. This enables the networks to process separately different pieces of information while taking into account their mutual constraints and relationships.

HISTORY
In considering the history of the perceptron it is important to note that the formal perceptron definition has only recently begun to stabilize. Given this lack of definitional stability (even Rosenblatt himself changed his definitions multiple times), this history section will simply mention a number of main influences on the development process. This is not really satisfactory, but it is probably the best that can be done short of a book length historical treatment (for a more extensive discussion see [Anderson and Rosenfeld, 1998]).

The idea that it might be possible to consider mathematical models of biological neurons goes back many decades [Anderson and Rosenfeld, 1988]. Before 1940, some attempts were made to mathematically model neural function, but none of these models were compelling. The first breakthrough came in 1943 with the seminal paper of McCulloch and Pitts [McCulloch and Pitts, 1943]. They postulated that neurons functioned as Boolean logic devices. While the logic circuit idea of McCulloch and Pitts was soon discredited as a biological theory, their work, and that of several others, led to a model of a neuron as a linear or affine sum followed by an activation function (usually a unit step function with a threshold offset). This neuron model (which, over time, began to be frequently, but incorrectly, attributed directly to McCulloch and Pitts) was widely adopted as a model for neuron behaviour.

The thinking stimulated by McCulloch and Pitts and those they inspired didn’t lead immediately to any important new concrete ideas, but it did generate a widespread feeling among prominent exponents of automated information processing (e.g., Norbert Weiner and John von Neumann, and many others) that building ‘artificial brains’ might become possible someday ‘soon’. This excitement became even greater in 1949 with the publication of Hebb’s hypothesis [Hebb 1949] that neuronal learning involves experience induced modification of synaptic strengths. Incorporating this hypothesis into the formal neuron (i.e. variable weights) created many new avenues of possible investigation.

Although many investigations were launched in the early 1950s into neural networks composed of formal neurons with variable (Hebbian) weights, none of them yielded significant results until about 1957. At that point, two disparate themes emerged which, astoundingly, have still not been reconciled or connected: the learnmatrix and the perceptron. Studies of the learnmatrix associative memory neural network architecture were launched by Karl Steinbuch in about 1956 and led in 1961 to his writing of Automat und Mensch, the first technical monograph on artificial neural networks [Steinbuch, 1961; Steinbuch and Piske, 1963].

The perceptron (a term originally coined to mean a single threshold logic neuron designed to carry out a binary i.e., two-class pattern recognition function) was developed by Frank Rosenblatt beginning in 1956. By 1957 he had developed a learning method for the perceptron and was soon able to prove mathematically that, in the case of linearly separable classes, the perceptron would be able to learn, by means of a Hebb-style supervised training procedure, to carry out its linear pattern classification function optimally. For the first time, a formal neuron with trainable Hebbian weights was shown to be capable of carrying out a useful ‘cognitive’ function.


Because of the limitations of digital computers in the 1950s, Rosenblatt was initially unable to try out his ideas on practical problems. So he and his colleagues successfully designed, built and demonstrated a large analog circuit neurocomputer to enable experiments. His machine (called the Perceptron Mark I) had 512 adjustable weights and a crude 400-pixel camera as its visual sensor. It successfully demonstrated an ability to recognize alphabetic characters. Rosenblatt’s work (which was widely publicized, including numerous popular magazine articles and appearances on major network television programs) inspired a large number of investigators to begin work on neural networks.





  Frank Rosenblatt, c. 1959, photographed
in connection with a television appearance
with the "eye" of the Perceptron Mark I
Pic Courtesy: Wiki
By 1962, Rosenblatt’s book (the second monograph on neural networks), Principles of Neurodynamics [Rosenblatt, 1962], discussed a more advanced sort of perceptron, one similar to that discussed in this book. However, a big unsolved problem remained: an effective method for training the hidden layer neurons.
Another important development of the late 1950s was the development of the ADALINE (ADAptive LINear NEuron, though it was actually affine) by Widrow and Hoff [Widrow and Hoff, 1960]. The learning law for this network was the delta rule used in the output layer neurons of the perceptron. Widrow’s perspicuous mathematical derivation of the delta rule learning law set the foundation for many important later developments. As with Rosenblatt, Widrow turned to analog electronics to build an implementation of the ADALINE (including development of a variable-electrical-resistance weight implementation device called a MEMISTOR [Hecht-Nielsen, 1990]).

A development which was understood only by a few experts at the time, but which is now recognized as a phenomenally prescient insight, was the discovery in 1966 of the essential part of the generalized delta rule by Amari [Amari, 1967]. The only thing missing was how to calculate the hidden neuron errors.

By the mid-1960s many of the researchers who had been attracted to the field in the late 1950s began to become discouraged. No significant new ideas had emerged for some time and there was no indication that any would for a while. Many people gave up and left the field. Contemporaneously, a series of coordinated attacks were lodged against the field in talks at research sponsor headquarters, talks at technical conferences and talks at colloquia. These attacks took the form of rigorous mathematical arguments showing that the original perceptron and some of its relatives were mathematically incapable of carrying out some elementary information processing operations (e.g., the Exclusive-OR logical operation). By considering several perceptron variants and showing all of them to be ‘inadequate’ in this same way, an implication was conveyed that all neural networks were subject to these severe limitations. These arguments were refined and eventually published as a book [Minsky and Papert, 1969] (see [Anderson and Rosenfeld, 1998] for a more thorough discussion). The final result was the emergence of a pervasive conventional wisdom (which persisted until 1985) that ‘neural networks have been mathematically proven to be useless’.

Although the missing piece of the generalized delta rule (i.e., backpropagation) was independently discovered by at least two people during the period 1970-1985 [Anderson and Rosenfeld, 1998], namely Paul Werbos in 1974 and David Parker in 1982, these discoveries had little impact and did not become widely known until after the work of Rumelhart, Hinton, and Williams in 1985 [Rumelhart et al., 1986]. As can be seen by referring to any current text on neural networks (e.g., [Haykin, 1999; Fine, 1999]), enormous progress has occurred since then. The faith that Rosenblatt (who died in a boating accident in 1971) had in the perceptron has been resoundingly vindicated.

Charles Wightman holding a subrack of eight
perceptron adaptive weight implementation units
Pic Courtesy: Veridian Engineering

DEFINITION
Definition: It’s a step function based on a linear combination of real-valued inputs. If the combination is above a threshold it outputs a 1, otherwise it outputs a –1.
Fig. 3 Perceptron Model

O(x1, x2, ..., xn) = 1 or -1
The output is 1 if w0 + w1x1 + w2x2 + ... + wnxn ≥ 0, otherwise -1
where
x1, x2, ..., xn = input vector
w0, w1, w2, ..., wn = weight vector
A perceptron draws a hyperplane as the decision boundary over the (n-dimensional) input space.

Fig. 4 Decision boundary over the input space
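As a small illustration of the decision rule above, here is a minimal NumPy sketch (the weight and input values are arbitrary):

import numpy as np

def perceptron_output(x, w, w0):
    # Returns 1 if w0 + w.x >= 0, otherwise -1
    return 1 if w0 + np.dot(w, x) >= 0 else -1

x = np.array([2.0, -1.0])   # input vector (x1, x2)
w = np.array([0.5, 1.0])    # weight vector (w1, w2)
print(perceptron_output(x, w, w0=0.5))   # 0.5 + 1.0 - 1.0 = 0.5 >= 0, so the output is 1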

A perceptron can learn only examples that are called “linearly separable”. These are examples that can be perfectly separated by a hyperplane.

Fig. 5 Graphical form of linearly separable and non-linearly separable data

Perceptrons can learn many boolean functions: AND, OR, NAND, NOR, but not XOR
However, every boolean function can be represented with a perceptron network that has two levels of depth or more.

Perceptron Learning
Learning algorithms can be divided into supervised and unsupervised methods. Supervised learning denotes a method in which some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected according to the magnitude of the error in the way defined by the learning algorithm. This kind of learning is also called learning with a teacher, since a control process knows the correct answer for the set of selected input vectors. Unsupervised learning is used when, for a given input, the exact numerical output a network should produce is unknown. Supervised learning is further divided into methods which use reinforcement or error correction. Reinforcement learning is used when after each presentation of an input-output example we only know whether the network produces the desired result or not. The weights are updated based on this information (that is, the Boolean values true or false) so that only the input vector can be used for weight correction. In learning with error correction, the magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights, and in many cases we try to eliminate the error in a single correction step.
Fig. 6 classes of learning algorithms

The perceptron learning algorithm is an example of supervised learning with reinforcement. Some of its variants use supervised learning with error correction (corrective learning).

Learning a perceptron means finding the right values for W. The hypothesis space of a perceptron is the space of all weight vectors. The perceptron learning algorithm can be stated as below. The learning algorithm is explained both in sentence form and in mathematical form.
1. Assign random values to the weight vector
2. Apply the weight update rule to every training example
3. Are all training examples correctly classified?
a. Yes. Quit
b. No. Go back to Step 2.
There are two popular weight update rules.
i) The perceptron rule, and
ii) Delta rule
OR

We are now in a position to introduce the perceptron learning algorithm. The training set consists of two sets, P and N, in n-dimensional extended input space. We look for a vector w capable of absolutely separating both sets, so that all vectors in P belong to the open positive half-space and all vectors in N to the open negative half-space of the linear separation
1.      Start : The weight vector w0 is generated randomly, 
Set t = 0
2.      test: A vector x ∈ P ∪ N is selected randomly,
a.       if x ∈ P and w·x > 0 go to test,
b.      if x ∈ P and w·x ≤ 0 go to add,
c.       if x ∈ N and w·x < 0 go to test,
d.      if x ∈ N and w·x ≥ 0 go to subtract.
3.      add: set wt+1 = wt + x and t = t+1, go to test
4.      subtract: set wt+1 = wt − x and t = t+1, go to test
This algorithm makes a correction to the weight vector whenever one of the selected vectors in P or N has not been classified correctly. The perceptron convergence theorem guarantees that if the two sets P and N are linearly separable the vector w is updated only a finite number of times. The routine can be stopped when all vectors are classified correctly. The corresponding test must be introduced in the above pseudocode to make it stop and to transform it into a fully-fledged algorithm
     The Perceptron Rule 
      For a new training example X = (x1, x2, ..., xn), update each weight according to this rule: wi = wi + Δwi

       Where
      Δwi = η(t − o)xi
         t = target output
         o = output generated by the perceptron
         η = constant called the learning rate (e.g., 0.1)
         Comments about the perceptron training rule
• If the example is correctly classified the term (t-o) equals zero, and no update on the weight is necessary.
• If the perceptron outputs –1 and the real answer is 1, the weight is increased.
• If the perceptron outputs a 1 and the real answer is -1, the weight is decreased. 
• Provided the examples are linearly separable and a small value for η is used, the rule is proved to classify all training examples correctly (i.e., it is consistent with the training data).
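A minimal sketch of this training rule in NumPy, using the AND function as a toy linearly separable example (targets in {-1, +1}; the learning rate and number of epochs are arbitrary choices):

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=10):
    # X: (N, d) inputs, t: (N,) targets in {-1, +1}
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            o = 1 if w0 + np.dot(w, xi) >= 0 else -1   # current perceptron output
            w = w + eta * (ti - o) * xi                # delta w_i = eta * (t - o) * x_i
            w0 = w0 + eta * (ti - o)                   # bias (threshold) update
    return w, w0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])   # AND function
w, w0 = train_perceptron(X, t)
print(w, w0)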

Execution of Code Step By Step
The perceptron is a linear (binary) classifier used in supervised learning. It helps to classify the given input data. One can see that it has multiple stages of execution, as below:
1. Input Values
2. Weights and Bias
3. Net sum
4. Activation Function
Why do we need weights and biases? Weights show the strength of the particular node. A bias value allows you to shift the activation function to the left or right. Why do we need an activation function? The activation function is used to map the input to the required range, such as (0, 1) or (-1, 1). Many machine learning algorithms build on ideas from the perceptron model of a neural network, so learning the basics of the perceptron model will help in understanding the rest of the machine learning algorithms. Let us see the code execution in steps.

ASSUMPTIONS
1.      Perceptrons can only converge on linearly separable inputs.
2.      The perceptron learning procedure can only be applied to a single layer of neurons.
3.      Perceptrons can only learn online, because perceptron learning is based on the error of a binary classifier (the error can be -1, 0 or 1).

DATA
Dataset comes from the UCI Machine Learning Repository and it is related to Iris plant database containing flower classes setosa, versicolor and virginica. Attributes are sepal length and sepal width, petal length and petal width.

Step 1
The Iris dataset contains the raw data together with labels. Import this dataset using the sklearn library. Print the iris data and see what the output is. The total number of instances is 150: the first 50 are Iris setosa, the next 50 are Iris versicolor, and the remaining 50 are Iris virginica.
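A sketch of this step (the original code is shown as an image):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data)     # the raw measurements, 150 rows and 4 attribute columns
print(iris.target)   # the labels: 0 = setosa, 1 = versicolor, 2 = virginica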


Output

Step 2
Selection of class and attributes: from the classes we are going to select Iris versicolor and Iris virginica, i.e. the data from rows 50 to 150, and the selected attributes are sepal width and petal length.
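A sketch of this selection; rows 50-150 and the attribute columns for sepal width (index 1) and petal length (index 2) follow the description above:

X = iris.data[50:150, [1, 2]]   # sepal width and petal length
y = iris.target[50:150]         # versicolor (1) and virginica (2)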

Output

Step 3

Training and splitting the data. Usually data is split into training data and test data.

 Fig. 7 splitting of data into train and test datasets. 

The training set contains a known output and the model learns on this data in order to generalize to other data later on. The test dataset (subset) is used to test the model's predictions on unseen data. For splitting the data the sklearn library is used: “from sklearn.model_selection import train_test_split”. Here the data is split into 80% training data and 20% test data, i.e. test_size = 0.2.
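A sketch of the split with test_size = 0.2 as described (the random_state value is an arbitrary choice):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)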


Step 4
In statistics and machine learning, data standardisation is very important. Usually the data is split into two subsets, training and test datasets (sometimes it is split into three: train, validate and test), and the model is fit on the training dataset in order to make predictions on the test dataset. When making predictions two problems can happen: overfitting and underfitting of the model, both of which we want to avoid. We will look at what overfitting and underfitting actually mean, together with data standardisation, later. Now standardise the data: the scaler is fit and used to transform the training data, while the test data is only transformed with the already-fitted scaler. For data standardisation use “from sklearn.preprocessing import StandardScaler”.
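A sketch of the standardisation step; the scaler is fitted on the training data and then used to transform both subsets:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)                     # learn mean and standard deviation from the training data
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)   # apply the same scaling to the test data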



 Step 5
Now use the perceptron model from “from sklearn.linear_model import Perceptron” to check the misclassifications for the chosen classes and attributes.


Where
n_iter = The number of passes over the training data (aka epochs). Defaults to 5.
eta0 = Constant by which the updates are multiplied. Defaults to 1.
Now print ppn.fit and y_pred and check the output.
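A sketch of this step; note that recent scikit-learn versions use max_iter instead of n_iter, and the parameter values below (40 iterations, eta0 = 0.1) are illustrative choices rather than the defaults quoted above:

from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)   # use n_iter=40 on older scikit-learn
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: %d' % (y_test != y_pred).sum())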
Output