Data Science for beginners by beginner: Part 2. Collect information about watched movies

"You can have data without information, but you cannot have information without data."

Daniel Keys Moran

Hello again. In this part I will extend the movie database with the movies I have watched. Later this information will be used to create a training set for a machine learning algorithm. This part is short but still important. We will cover the following topics:

How to collect information about watched movies
How to rate movies

After finishing these two topics you can start analyzing the data and looking for patterns among your favorite movies. If you find any interesting observations, post them in the comments and I will include them in the next part. Let's do the job.

In the previous part I pointed out a few approaches to creating a movie database. In this part we will use it, so if you do not have one, you won't be able to write your own implementation. If you are interested only in the theoretical part, the next article should be more exciting.

As always, I will try to describe a few ways of reaching the goal. I know two basic methods of collecting information about watched movies: manual and semi-automatic. Why not fully automatic? Because I will use an uncommon way of rating movies. But we will talk about rating later.

So what do I mean by manual collection of information? I created a simple form that picks a random movie and asks me two questions:

Have I watched this movie?
If yes, did I like it?

For this purpose I used IPython because I did not want to spend time on creating a GUI. But a fully random approach is very inefficient: the database contains over 100,000 movies, and the chance that the random generator will pick one you've seen is close to zero (depending on how many movies you've seen). To increase this probability I searched using different attributes such as year, director, actors, and even keywords in the title. Also, if you remember the films you have watched, you can find them by title. Another approach is to get a list of popular movies and search for them in the database one by one. Here is my implementation:

from pymongo import MongoClient
from random import randint
from IPython.display import Image

client = MongoClient()
db = client.movies_db
watched_movies = db.watched_movies
movies = db.movies

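# randint is inclusive on both ends, so the last valid index is count - 1;
# count_documents replaces Collection.count(), which was removed in PyMongo 4
random_movie = movies.find()[randint(0, movies.count_documents({}) - 1)]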
print(random_movie['title'], random_movie['year'])
print(random_movie['countries'])
print(random_movie['genres'])
print(random_movie['actors'])
print(random_movie['description'])
print(random_movie['imdb'])
Image(url=random_movie['poster_link'])

A better way is to pick a random film from a specific subset, for example from the movies produced in 2014:

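query = {'year': '2014'}
new_movies = movies.find(query)
# Cursor.count() is gone in PyMongo 4; count_documents takes the same filter
random_movie = new_movies[randint(0, movies.count_documents(query) - 1)]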
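
The year filter above extends naturally to the other attributes mentioned earlier, such as actors or keywords in the title. Here is a minimal sketch, not my original code: build_query and sample_pipeline are hypothetical helpers, I assume that "actors" is stored as a list of names, and the $sample stage needs MongoDB 3.2 or newer. With $sample the server picks the random document itself, so no counting is needed.

```python
import re


def build_query(year=None, actor=None, title_keyword=None):
    """Build a MongoDB filter from optional search attributes."""
    query = {}
    if year is not None:
        query['year'] = str(year)  # years are stored as strings, as in '2014'
    if actor is not None:
        # assumption: 'actors' is a list, so this matches any list containing the name
        query['actors'] = actor
    if title_keyword is not None:
        # case-insensitive substring match; escape regex metacharacters
        query['title'] = {'$regex': re.escape(title_keyword), '$options': 'i'}
    return query


def sample_pipeline(query, size=1):
    """Aggregation pipeline that lets the server pick random matching documents."""
    return [{'$match': query}, {'$sample': {'size': size}}]


# Usage (requires a running MongoDB with the movies collection):
# random_movie = next(movies.aggregate(sample_pipeline(build_query(year=2014))), None)
```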

But the best way is to pick some popular movies. You can use the TMDb API for this. I will show just a fragment of code that gets the list of popular movie titles:

import json
import urllib.parse
import urllib.request


def get_params(page):
    # fill parameters for request
    api_key = 'Your API key'    
    params = {
        'Accept': 'application/json',        
        'api_key': api_key,        
        'page': page
    }
    return params


def request_api(page):
    # request the API for specific page and return results    
    service_url = 'http://api.themoviedb.org/3/movie/popular'    
    url = service_url + '?' + urllib.parse.urlencode(get_params(page))
    response = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))
    return response['results']


def get_movies():
    # get titles of popular movies    
    movie_titles = []
    for page in range(1, 21):  # first 20 pages (the end of range is exclusive)
        for movie in request_api(page):
            movie_titles.append(movie['title'])
    return movie_titles

Once you have this list, you can easily iterate over it and search for these movies in the database. I used the same IPython form as above, but replaced the "random_movie" selection with the following:

import sys
sys.path.append('Path to your TMDb API implementation')
from tmdb_api import get_movies

popular_movies = get_movies()  # Do it only once

title = popular_movies.pop(0)
random_movie = movies.find_one({'title': title})
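
Keep in mind that find_one returns None when no document has exactly that title, which can happen if TMDb and your database spell a title differently. The sketch below shows one way to skip such titles; match_titles is a hypothetical helper, and fake_db is a made-up stand-in for movies.find_one so the example runs without MongoDB.

```python
def match_titles(titles, find_one):
    """Yield (title, document) for every title found in the database.

    find_one is any callable that takes a filter dict and returns a
    document or None, for example movies.find_one.
    """
    for title in titles:
        movie = find_one({'title': title})
        if movie is None:
            continue  # title is missing or spelled differently in the database
        yield title, movie


# Stand-in for movies.find_one, with illustrative data only
fake_db = {'Interstellar': {'title': 'Interstellar', 'year': '2014'}}
found = list(match_titles(['Interstellar', 'Unknown title'],
                          lambda query: fake_db.get(query['title'])))
```

With the real collection you would call match_titles(popular_movies, movies.find_one) and show each returned document in the form.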

Another approach is to get a list of movies you have already watched. Where can you find it? For example on Facebook or another service with an open API where you have rated movies. I will show only how to get the list using the Facebook Graph API. First you need to get an access token and grant the user_actions.video permission (you can easily do it in the Facebook Graph API Explorer). Below you can find my Python code:

import json
import urllib.parse
import urllib.request


def get_params():
    # fill parameters for request
    access_token = 'Your access token'    
    params = {
        'access_token': access_token,        
        'limit': 500    
    }
    return params


def request_api():
    # request the API and return results    
    service_url = 'https://graph.facebook.com/me/video.watches'    
    url = service_url + '?' + urllib.parse.urlencode(get_params())
    response = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))
    return response['data']


def get_movies():
    # get titles of watched movies    
    movie_titles = []
    for movie in request_api():
        movie_titles.append(movie['data']['movie']['title'])
    return movie_titles

Now you know how to get a list of watched movies or just popular movies. The next step is to mark them as watched and rate them.

A few words about rating. There are a lot of services with movie ratings, such as IMDb or Kinopoisk, and these services often compute their ratings with different formulas. You can also find different measurement scales (5 stars, 10 stars, etc.). I won't say which is the best, but in my opinion you either liked a movie or you didn't. In my system I will use the following ratings:

1 - I liked the movie
-1 - I didn't like the movie
0 - no impressions, some average film
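
The two questions of the form map directly onto these three values. A minimal sketch of that mapping, assuming the answers are typed as 'y' or 'n'; the rate function and the constant names are hypothetical, not part of my original form:

```python
LIKED, AVERAGE, DISLIKED = 1, 0, -1


def rate(watched, impression=None):
    """Translate the two form answers into a rating.

    watched: bool, the answer to "have I watched this movie?"
    impression: 'y' (liked), 'n' (didn't like), anything else means no impression
    Returns None for movies that were not watched, so they are not stored.
    """
    if not watched:
        return None
    if impression == 'y':
        return LIKED
    if impression == 'n':
        return DISLIKED
    return AVERAGE
```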

To store information about watched movies I will use the same MongoDB database but another collection (I called it watched_movies). It will hold the same information as our main collection, plus a "rating" attribute. To add movies, simply extend the IPython code I showed above with the following:

random_movie['rating'] = 1
watched_movies.insert_one(random_movie)  # insert() was removed in PyMongo 4
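
One caveat: the copied document keeps the _id of the original, so inserting the same movie a second time (for example, to change its rating) raises a duplicate key error. A possible workaround, sketched here with a hypothetical save_rating helper, is PyMongo's replace_one with upsert=True, which inserts a new document or overwrites the stored one:

```python
def save_rating(collection, movie, rating):
    """Store a rated copy of movie, overwriting any earlier rating."""
    rated = dict(movie, rating=rating)  # copy, so the original document stays untouched
    # upsert=True inserts the document if its _id is new
    # and replaces the stored document otherwise
    collection.replace_one({'_id': rated['_id']}, rated, upsert=True)
    return rated


# Usage: save_rating(watched_movies, random_movie, 1)
```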

As a result of this part you will get a second collection in your movie database. Now everything is ready to start our analysis. The next part is one of the most interesting: I will try to describe all my thoughts about feature selection. After I upload the sources (from this and the previous part) to GitHub, I will update this post. Do not miss the next part!

UPD: uploaded code to GitHub

Thursday, April 9, 2015
