Data Science for beginners by beginner: Part 0. Introduction.

"Docendo discimus"

(by teaching, we learn)

Hello to everybody reading this blog. I am a software engineer from Ukraine. This blog is dedicated to learning Data Science through practice, and it is aimed at beginners. You can find dozens of blogs and articles with a similar description, but the main distinction is that I am a beginner too. So if you are experienced in this area, do not hesitate to point out my mistakes in the comments or just write to me by e-mail.

Why have I chosen Data Science? I simply like solving complicated tasks and puzzles. I also like playing with data: plotting it and making predictions. So fasten your seatbelts, we are taking off!

First, I want to define my objectives:

make a working project using data mining and machine learning principles;
learn something new while achieving the first goal;
get practical experience in working with data;
have fun.

A few words about what I am going to do. I will create a movie recommendation program. Not a new idea? True. But it should recommend a movie for me personally. I'm not going to use other people's ratings or collaborative filtering. This program will be written from scratch: starting with building a movie database and ending with evaluating the results.

Why movie recommendation? How often have you wondered "what to watch this evening"? I have, plenty of times. Why not use a ready-made solution? I haven't found one that consistently recommends me something worthwhile. But the main reason is simply that it is interesting to me. The goal of my project is to create a program that suggests a movie I will like. The process is divided into the following steps:

1. Create a movie database
2. Collect information about movies I've watched
3. Play with this data and select features
4. Make a simple recommendation model
5. Evaluate the results
6. Improve the precision of the algorithm
7. Add NLP
8. Add functionality

To reach the goal I will use Python and MongoDB. Of course, I will share all the code on GitHub. Here is a preliminary list of the packages I'm going to use: scikit-learn, pandas, scipy, numpy, pymongo, beautifulsoup, ipython, matplotlib.

1. To create a movie database I will use either an open API or, in the worst case, web scraping. Note that this DB will hold information in both English and Russian.

2. I will manually mark and rate the movies I have watched.

3. This step is one of the most interesting and important: selecting features, normalizing them and plotting the data. This part is often underestimated.

4. I will use a simple machine learning algorithm to make a recommendation.

5. I will evaluate the accuracy and precision of the model using cross-validation.

6. I will try to improve the precision of the model. The target precision is at least 80%.

7. In this step I will try to use some NLP principles to create tags and cluster the movies.

8. I will try to add some more complicated features, such as "movie by mood" and "movie for watching in a group", and I will also try to implement ideas posted in the comments.
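
To make steps 3 to 5 a bit more concrete, here is a rough sketch of how feature normalization, a simple model and cross-validation might fit together. Everything in it is an assumption for illustration: the movie records are invented, the two features (release year and runtime) are just placeholders, and 1-nearest-neighbour stands in for "some simple algorithm". It uses only the Python standard library so that it runs anywhere; scikit-learn offers ready-made equivalents such as MinMaxScaler, KNeighborsClassifier and cross_val_score.

```python
import random

# Hypothetical watched-movie records: (release year, runtime in minutes, liked?)
# All values are invented for illustration only.
MOVIES = [
    (1994, 142, 1), (1999, 136, 1), (2010, 148, 1), (2008, 152, 1),
    (2014, 169, 1), (1972, 175, 1), (2003, 95, 0), (2011, 88, 0),
    (2005, 101, 0), (1998, 92, 0), (2012, 90, 0), (2007, 97, 0),
]


def min_max_normalize(rows):
    """Scale every feature column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def predict_1nn(train_x, train_y, sample):
    """Return the label of the closest training sample (1-nearest neighbour)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], sample))
    return train_y[best]


def cross_validate(features, labels, k=4, seed=0):
    """Run k-fold cross-validation and return (accuracy, precision)."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    true_pos = false_pos = correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in fold:
            guess = predict_1nn(train_x, train_y, features[i])
            correct += int(guess == labels[i])
            if guess == 1:  # the model recommended this movie
                true_pos += int(labels[i] == 1)
                false_pos += int(labels[i] == 0)
    accuracy = correct / len(features)
    recommended = true_pos + false_pos
    precision = true_pos / recommended if recommended else 0.0
    return accuracy, precision


features = min_max_normalize([row[:2] for row in MOVIES])
labels = [row[2] for row in MOVIES]
accuracy, precision = cross_validate(features, labels)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Precision here means the share of recommended movies that I actually liked, which is the number the 80% target in step 6 refers to. Strictly speaking, the scaling parameters should be computed on the training folds only; the sketch scales everything once up front just to stay short.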

I will try to update my blog at least once a week. I hope you will enjoy it!


Thursday, April 2, 2015
