Sunday, April 5, 2015

Part 1. Creating a movie database.


“It is a capital mistake to theorize before one has data”
Sherlock Holmes

 
 
    Welcome back to my blog. In this part I will try to describe one of the complicated tasks I encountered, one that is mostly not mentioned in courses and literature: the process of creating the data warehouse that will be used for our analysis and recommendations.
 
    For myself I divided this part into three sections, each answering one of three simple questions:
  1. What data do I need?
  2. Where can I find it?
  3. How shall I use it?
 
    Next I will try to answer these questions and mention the problems I faced while creating my movie database. I will describe my thoughts about data, movies and technologies, but your opinion may differ, so I will show several approaches and you can choose any of them or use your own. I will also be grateful if you share your thoughts in the comments. So let's start!
 

 
    First I want to say a few words about data as I imagine it. In my opinion, before starting any analysis you should know at least some basics of the subject domain. In my case: how can I recommend a movie if I haven't seen any? How can I rate one movie and compare it with another if I do not know their characteristics (title, description, starring, etc.)? So if you haven't seen any movies yet, I strongly recommend you watch some. But if you have, let's talk about movies more concretely.
 
    There are a lot of films nowadays. Film technology has been evolving rapidly since the end of the 19th century, and we can observe elaborate special effects and computer graphics in modern movies. But how can we measure that? Do we even need to measure it for our recommendations? It depends on your taste, but I've chosen some simpler properties to describe movies:
  • title (the most commonly used property to identify a film) 
  • genre
  • actors who are starring in a movie
  • director
  • description
  • year of release
    There are many more features that could describe a movie, but these are the main ones I've chosen. That's not all, though: I will also put some constraints on our data source. My native language is Russian, so it's much easier for me to recognize a movie by its translated title. That is why I will store both the original title and the title in Russian. I will also store posters to simplify manual rating of watched movies. You can decide on your own constraints.
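    To make this more concrete, below is a purely hypothetical example of the kind of record I have in mind (the field names and values are my own, not from any real data source):
# a hypothetical movie record with the properties chosen above;
# 'alt_title' holds the Russian title, 'poster_link' points to the poster image
movie = {
    'title': 'The Matrix',
    'alt_title': 'Матрица',
    'genres': ['action', 'sci-fi'],
    'actors': ['Keanu Reeves', 'Laurence Fishburne', 'Carrie-Anne Moss'],
    'director': 'The Wachowskis',
    'description': 'A hacker learns the true nature of his reality.',
    'year': 1999,
    'poster_link': 'http://example.com/posters/matrix.jpg'
}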
 
    Now that we have found out what we are looking for, we should easily find and store it. Oh really? Not so fast. Of course, there are hundreds of websites with such information, but only a few of them contain enough of it and are free for extraction (non-commercial use). There are a few approaches to getting the needed data:
  • manually enter all needed information
  • search some websites and use data scraping
  • find completed database
  • use open APIs 
    I am not going to describe all the difficulties you will face with manual data entry, so I will start with data scraping. This approach can give good results, but beware of breaking the law: read the terms of use of the website carefully, because such actions may be prohibited. Even if you find a suitable website, scraping can take a lot of time, since web pages have to be loaded and processed to extract the data. In short, a program should load the web page of a movie and find the needed attributes in its HTML. Depending on your algorithm and Internet connection, this can take some time. Below you can find my code for data scraping using Python:
__author__ = 'i.yesilevsky'
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'website url'

def load_page(movie_id):
    # load the web page for the given movie_id and return the parsed information
    try:
        http = urlopen(url+str(movie_id))
        charset = http.info().get_param('charset')
        soup = BeautifulSoup(http.read(), 'html.parser', from_encoding=charset)
        return parse_page(soup, movie_id)
    except Exception:
        print('Movie with id', movie_id, 'was not found')
        return None

def parse_page(soup, movie_id):
    # parse the BeautifulSoup object to extract the needed information
    # and return a dict with it
    title = soup.find(attrs={'itemprop': 'name'}).string
    alt_title = soup.find(attrs={'itemprop': 'alternateName'}).string
    year = soup.find(name='small').a.string
    genres = list(genre.string for genre in soup.find_all(attrs={
        'itemprop': 'genre'}))
    countries = list(a.string for a in soup.find(attrs={
        'class': 'main'}).find_all('a') if not a.get('itemprop'))
    description = soup.find(attrs={'itemprop': 'description'}).contents[0].strip()
    director = soup.find(id='directors').find(attrs={'class': 'person'}).string
    actors = list(actor.string for actor in soup.find(id='actors').find_all(
        attrs={'class': 'person'}))
    imdb = soup.find(attrs={'class': 'rating'}).string
    tags = 'No tags'
    if soup.find(id='tags'):
        tags = list(tag.string for tag in soup.find(id='tags').find_all('a'))
    poster_link = soup.find(attrs={'class': 'posterbig'}).find('img').get('src')

    movie_info = {'movie_id': movie_id, 
                  'title': title, 
                  'alt_title': alt_title, 
                  'year': year, 
                  'genres': genres, 
                  'countries': countries, 
                  'description': description, 
                  'director': director, 
                  'actors': actors, 
                  'imdb': imdb, 
                  'poster_link': poster_link}

    if tags != 'No tags':
        movie_info['tags'] = tags

    return movie_info
 
    The code above simply loads a web page and searches for the mentioned information in HTML tags (another website could use totally different tags). Before implementing the Python script, you should first manually find those attributes in the page source. One way to do that is sketched below.
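    For instance, you can dump the parsed HTML of a single page and look through it by eye (a minimal sketch; the URL placeholder is the same as in the code above):
from urllib.request import urlopen
from bs4 import BeautifulSoup

# print the prettified HTML of one movie page, then search it manually
# for markers like itemprop='name' before hard-coding them in parse_page()
url = 'website url'
soup = BeautifulSoup(urlopen(url + '1').read(), 'html.parser')
print(soup.prettify())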

    Now let's look at the second approach. Many websites provide an API to their services or databases. You can easily find APIs for Facebook, Twitter and so on. In short, an API returns a structured response to your request. For example, if you want to find some information about a movie, you send the title of the movie and receive a JSON/XML response with the found information. Sounds easy? Mostly it is, but again you can encounter some constraints on using the API, such as a limit on queries per hour, a limit on results, and so on.

    If you search for movie databases with an API, you'll certainly find TMDb, OMDb and Rotten Tomatoes. If they suit your demands, everything is perfect. In my case TMDb returns only 1000 pages, or 20000 results, so to build a full database you would need more complicated scripts that split the results across multiple queries. To use the OMDb API you have to request a specific movie ID or title, as in the sketch below.
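    For illustration, a single OMDb lookup by title takes just one HTTP request. Here is a minimal sketch using the documented t (title) parameter (note that newer versions of the API also require an apikey parameter):
import json
import urllib.parse
import urllib.request

# look up one movie on OMDb by title and print a few fields of the JSON reply
params = urllib.parse.urlencode({'t': 'The Matrix'})
response = urllib.request.urlopen('http://www.omdbapi.com/?' + params)
movie = json.loads(response.read().decode('utf-8'))
print(movie.get('Title'), movie.get('Year'), movie.get('Genre'))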

    I've chosen Google's Freebase. It has some nuances, such as using MQL for queries and storing data in an RDF graph. But its database is not perfect either: Freebase contains 266 565 movies, but only 24 190 of them have a title in Russian. Nevertheless, you can find my Python code below:
__author__ = 'i.yesilevsky'

import json
import urllib.parse
import urllib.request


def get_params(cursor_value):
    # build the MQL query and request parameters for the Freebase mqlread API
    api_key = 'Your API key'
    query = [{
        "id": None,        
        "name": None,        
        "ru:name": {
            "lang": "/lang/ru",            
            "value": None        
        },        
        "initial_release_date": None,        
        "genre": [],        
        "country": [],        
        "directed_by": [],        
        "starring": [{
            "actor": None        
        }],        
        "type": "/film/film",        
        "limit": 500    
    }]

    params = {
        'key': api_key,        
        'query': json.dumps(query),        
        'cursor': cursor_value
    }
    return params


def request_api(cursor):
    # send one MQL request to the Freebase API and return the parsed JSON
    service_url = 'https://www.googleapis.com/freebase/v1/mqlread'
    url = service_url + '?' + urllib.parse.urlencode(get_params(cursor))
    response = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))
    return response


def get_movies(response_result):
    # convert the raw Freebase response into a list of movie dicts
    movies_info = []
    for movie in response_result:
        movie_info = {
            'movie_id': movie['id'],            
            'title': movie['name'],            
            'alt_title': movie['ru:name']['value'],            
            'year': movie['initial_release_date'],            
            'genres': movie['genre'],            
            'countries': movie['country'],            
            'director': movie['directed_by'],            
            'actors': list(stars['actor'] for stars in movie['starring'])
        }
        movies_info.append(movie_info)
    return movies_info

    Using an API is much faster than web scraping, but it takes more time to find a proper service. If you find a better API, post it in the comments.

    The third approach is to get a complete database. You may find an SQL database or download the Freebase Dump. While I was searching for a proper source for this project, I didn't find any complete free database that is downloadable. You can try to extract information from the Freebase Dump, but beware of its size (28GB compressed): you need enough computing capability to process it.
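    If you do go that way, note that you do not need to unpack the dump: it is a gzipped file of tab-separated RDF triples that can be streamed line by line. A rough sketch of collecting movie ids from it (the file name is a placeholder, and the exact URI spellings should be checked against a real dump):
import gzip

# stream the compressed Freebase dump and collect ids of /film/film topics;
# each line is a triple: subject, predicate, object (plus a trailing dot)
film_ids = set()
with gzip.open('freebase-rdf-latest.gz', 'rt', encoding='utf-8') as dump:
    for line in dump:
        subj, pred, obj = line.split('\t')[:3]
        if pred.endswith('type.object.type>') and obj.endswith('film.film>'):
            film_ids.add(subj)
print(len(film_ids), 'movies found')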

    I hope I have answered the first two questions, so only one is left: how shall I use this data? First of all, you should decide how to store it. I've decided to try the document-oriented NoSQL database MongoDB. It perfectly suits my demands: it is fast for inserting and reading, it can store documents (movies) with some attributes missing, and it is easy to use. Of course you could manage with a relational SQL database, but I like to explore something new.

    To start with MongoDB, you just download the distribution, start the MongoDB server (mongod) and use it. Below you can find my Python code for inserting data into MongoDB:
__author__ = 'i.yesilevsky'
from pymongo import MongoClient


def add_to_mongo(movies):
    try:
        client = MongoClient()           # connect to the local MongoDB server
        db = client.movies_freebase      # our movie database
        collect = db.movies              # collection of movie documents
        mov_id = collect.insert(movies)  # bulk-insert the list of movie dicts
    except Exception as exp:
        print(exp)

    Once you have implemented it, you can just call the add_to_mongo() function to insert the movies you extracted using the API:
def make_db():
    response = request_api('')
    movie_list = []
    while response['cursor']:
        movie_list.extend(get_movies(response['result']))
        print(len(movie_list))
        # flush to MongoDB in batches of 10000 movies
        if len(movie_list) >= 10000:
            add_to_mongo(movie_list)
            movie_list = []
        try:
            response = request_api(response['cursor'])
        except Exception as exp:
            print(exp)
            break  # stop instead of retrying the same cursor forever
    # insert whatever is left after the last full batch
    if movie_list:
        add_to_mongo(movie_list)

Or with web scraping:
def make_db():
    movie_list = []
    for idx in range(1, last_idx):  # last_idx: the highest movie id to try
        movie = load_page(idx)
        print(movie)
        if movie is not None:
            movie_list.append(movie)  # reuse the page we already loaded
    add_to_mongo(movie_list)
    print(len(movie_list))
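    Once make_db() has finished, a quick sanity check against the database might look like this (reusing the database and collection names from add_to_mongo() above):
from pymongo import MongoClient

# connect to the same database and collection used in add_to_mongo()
client = MongoClient()
collect = client.movies_freebase.movies

# how many movies were stored, and what does one of them look like?
print(collect.count())
print(collect.find_one())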


    That's all for now. I will upload the source code to GitHub and update this post. Next week I will show how I marked the movies I have watched.

UPD: uploaded code to GitHub
