Data Science for beginners by beginner: Part 4. Making a first model.

“In theory, theory and practice are the same. In practice, they are not”

Albert Einstein

Greetings! Today we will finally start making some predictions and recommendations. In the previous part we converted our data set to numeric form, so it is ready for most models from the scikit-learn library. In this part we will do the following:

Divide the data set into parts
Try a few simple models from scikit-learn
Try a neural network from the pybrain package

I will use standard algorithms from the scikit-learn and pybrain libraries. If you feel you can implement your own model, you can share it in the comments. So let's do some practical Machine Learning.

For those who are not familiar with Machine Learning, I recommend completing Andrew Ng's course on Coursera. That course gives you an understanding of Machine Learning principles and of how some models are implemented.

Let's return to our first topic: dividing the data set. Why are we going to do this? We need one part to train our model and a second part to evaluate the results. This is a very basic idea in Machine Learning; if you are hearing about it for the first time, I recommend learning some theory first.

In my case I have 356 movies marked as watched in the data set, so I will use 300 for training and the remaining 56 for testing the model. You can also split your data set in an 80% : 20% proportion. To keep the split unbiased, I will randomly pick 300 rows and call them "train", and call the rest "test". Python code below:

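import random
import pandas as pd

# Attach the ratings as column "y" so features and ratings stay aligned
X['y'] = pd.Series(y)
# Randomly pick 300 row labels for training; the rest form the test set
train_idx = random.sample(list(X.index), 300)
test_idx = sorted(set(X.index) - set(train_idx))
# .loc replaces the .ix indexer, which newer pandas versions no longer provide
train = X.loc[train_idx]
test = X.loc[test_idx]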

A short description of the code above: first I added the "y" array with movie ratings to our data set, to avoid a mismatch between features and ratings after shuffling. Then, using the "random.sample" function, I picked 300 indexes from the data set and saved the difference between all indexes and these 300 to "test_idx". Finally, I created two pandas data frames from the rows with the selected indexes.
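The same idea can be tried without pandas. Below is a minimal sketch of a proportional split on plain row indexes, using only the standard library. The helper name "split_indices" and the seed value are my own choices for illustration, not part of the original code; fixing the seed simply makes the split repeatable.

```python
import random


def split_indices(indices, train_fraction=0.8, seed=None):
    """Shuffle the given indices and split them into (train, test) lists."""
    rng = random.Random(seed)      # private generator, so the global state is untouched
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 356 rows, as in the data set described above, split 80% : 20%
train_idx, test_idx = split_indices(range(356), train_fraction=0.8, seed=42)
print(len(train_idx), len(test_idx))
```

Every index ends up in exactly one of the two lists, so no row is used for both training and testing.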

Now that we have divided our data set, we can finally use Machine Learning and predict something. From scikit-learn I will show you three models: Logistic Regression, Decision Tree and Linear Support Vector Classification. I have chosen these models because they are commonly used and because they can handle multi-class classification (remember that we rate our movies using 3 scores). In this part I will show just the standard use of these classifiers with their defaults, so it may look like magic. We will start from the basic one, Logistic Regression:

from sklearn import linear_model

# Columns 0-24 hold the features, column 25 holds the rating "y"
classifier = linear_model.LogisticRegression()
classifier.fit(train.iloc[:, 0:25], train.iloc[:, 25])
predicted = classifier.predict(test.iloc[:, 0:25])
# Confidence scores for the last test sample, one value per class
result = classifier.decision_function(test.iloc[:, 0:25])[-1]
# Mean accuracy on the test set
print(classifier.score(test.iloc[:, 0:25], test.iloc[:, 25]))
print(result)
print(predicted)

The result is as follows:

Score: 0.553571428571 (mean accuracy of the predictions with respect to "y")

Decision function result for the last test sample: [-1.71857373 -0.77146262 -0.07399194] (confidence scores per (sample, class) combination). The higher a value is, the more confident the model is that the sample belongs to the corresponding class in [-1, 0, 1].
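To make the link between these scores and the predicted rating concrete, here is a small sketch that picks the class with the highest score. It assumes the scores are ordered like the sorted class labels [-1, 0, 1], which is how scikit-learn orders "classes_"; the variable names are mine.

```python
# Decision function values for the last test sample, copied from the output above
scores = [-1.71857373, -0.77146262, -0.07399194]
# Assumed class order: sorted labels, as in classifier.classes_
classes = [-1, 0, 1]

# Index of the largest score, then the label stored at that index
best = max(range(len(scores)), key=lambda i: scores[i])
predicted_label = classes[best]
print(predicted_label)  # prints 1
```

So for this sample the model leans towards the rating 1, although all three scores are negative, which suggests it is not very confident about any of the classes.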

Next is Decision Tree Classifier:

from sklearn import tree

classifier = tree.DecisionTreeClassifier()
classifier.fit(train.iloc[:, 0:25], train.iloc[:, 25])
predicted = classifier.predict(test.iloc[:, 0:25])
print(classifier.score(test.iloc[:, 0:25], test.iloc[:, 25]))
print(predicted)

The result is as follows:

Score: 0.464285714286 (mean accuracy of the predictions with respect to "y")

This classifier has no decision function.
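If you still want some measure of confidence from a decision tree, scikit-learn's DecisionTreeClassifier offers "predict_proba", which returns one probability per class. The sketch below uses small synthetic data that I made up only for illustration (it is not the movie data set), and the "max_depth" and "random_state" values are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 60 samples, 4 features, ratings cycling through -1, 0, 1
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 4)
y_demo = np.array([-1, 0, 1] * 20)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_demo, y_demo)

# One row per sample, one column per class, in the order given by clf.classes_
proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)
print(proba)
```

Each row sums to 1, and the column order follows "clf.classes_", so the values can be read as the tree's estimate for the ratings -1, 0 and 1.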

Next is Linear Support Vector Classifier:

from sklearn import svm

classifier = svm.LinearSVC()
classifier.fit(train.iloc[:, 0:25], train.iloc[:, 25])
predicted = classifier.predict(test.iloc[:, 0:25])
print(classifier.score(test.iloc[:, 0:25], test.iloc[:, 25]))
result = classifier.decision_function(test.iloc[:, 0:25])[-1]
print(result)
print(predicted)

The result is as follows:

Score: 0.571428571429 (mean accuracy of the predictions with respect to "y")

Decision function result for the last test sample: [-0.53212502 -0.33535368 -0.1261217 ]
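The score printed for each classifier is the fraction of the 56 test movies (356 minus the 300 used for training) whose predicted rating matches the true one. Here is a minimal sketch of that calculation. The two short rating lists are made up for illustration, and the "accuracy" helper is my own name, not a scikit-learn function.

```python
def accuracy(predicted, actual):
    """Return the share of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


# Made-up ratings, only to show the calculation
actual = [1, 0, -1, 1, 0, 1, -1, 0]
predicted = [1, 0, 0, 1, -1, 1, -1, 1]
print(accuracy(predicted, actual))  # 5 of 8 correct -> 0.625

# Turning the reported scores back into counts of correct predictions out of 56
reported = {"LogisticRegression": 0.553571428571,
            "DecisionTree": 0.464285714286,
            "LinearSVC": 0.571428571429}
correct = {name: round(score * 56) for name, score in reported.items()}
print(correct)
```

Seen this way, the three models got 31, 26 and 32 of the 56 test movies right, so the gap between the best and the worst is only six movies.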

The last model is a neural network from the pybrain package. It is more complicated, so I recommend reading the documentation first:

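import numpy as np
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

# 25 input features, 3 outputs (one per rating: -1, 0, 1)
# NOTE: each sample below supplies a single rating as the target although 3
# outputs are declared; one-hot encoding the rating may be a better match.
ds = SupervisedDataSet(25, 3)
for idx in range(len(train)):
    ds.addSample(train.iloc[idx, 0:25], train.iloc[idx, 25])
# One hidden layer with 5 units
net = buildNetwork(25, 5, 3, bias=True)

trainer = BackpropTrainer(net, ds, learningrate=0.0005, momentum=0.99)
trainer.trainUntilConvergence(ds, maxEpochs=5, verbose=True, validationProportion=0.2)

# Activate the network on every test row
result = []
for idx in range(len(test)):
    result.append(net.activate(test.iloc[idx, 0:25]))

# Map the index of the strongest output (0, 1, 2) back to a rating (-1, 0, 1)
predict = []
for res in result:
    predict.append(np.argmax(res) - 1)

# Share of test samples predicted correctly (mean accuracy)
print(len([i for i, j in zip(predict, test.iloc[:, 25]) if i == j]) / len(predict))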

Mean accuracy is 0.35714285714285715. It is pretty low compared to the other models, but you can spend some time playing with the parameters and get better results.
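One thing that might be worth trying: the network has three outputs and the prediction is decoded with "np.argmax(res) - 1", which suggests one output per rating. A matching way to encode the targets would be one-hot vectors. The sketch below is my assumption about how that could look, not the code used above; "one_hot" and "decode" are hypothetical helpers written in plain Python.

```python
LABELS = [-1, 0, 1]  # the three ratings, in output-unit order


def one_hot(label):
    """Encode a rating as a vector with 1.0 at the position of that rating."""
    return [1.0 if label == candidate else 0.0 for candidate in LABELS]


def decode(outputs):
    """Return the rating whose output unit has the highest activation."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return LABELS[best]


print(one_hot(-1))              # [1.0, 0.0, 0.0]
print(decode([0.1, 0.2, 0.7]))  # 1
```

With targets like these, each call to "ds.addSample" would receive a three-element target that lines up with the three output units. Whether this actually improves the accuracy on the movie data is something I have not measured.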

As a result of this article we have 4 working models that you can actually use as a recommender. To do so, just convert the movies data set from MongoDB to pandas as we did in the previous part and pass it to the "predict" function. Your actual results may differ from mine, but the exact numbers are not what I am after here. In the next part I will show how to evaluate our models correctly, and we will use cross-validation to see how well our model generalizes to a random data set.

Saturday, May 30, 2015