Tuesday, May 12, 2015

Part 3. Data analysis.

"Not everything that can be counted counts,
and not everything that counts can be counted"
Albert Einstein
 
 
 
    Hi there! By now we have a database that contains information about movies, as well as a collection of watched films. In this part we will start analyzing the data. I will show how I chose features to describe a movie, how I made them measurable, and what I got as a result. As always I will use the "divide and conquer" tactic and split this article into the following sections:
  1. Converting words to numbers
  2. Converting numbers to features
  3. Normalizing and scaling features
    I will try to describe all the actions I performed to reach the goal, but mostly I relied on my intuition. If your opinion differs, feel free to post it in the comments. So let the analysis begin!



    First I want to say a few words about the importance of this process. I think it is the most important part of creating a machine learning application, because depending on the input data the results could be totally different. After implementing our algorithm we will return to the features to increase the precision of the model, but for now we should build our foundation.
    The first problem I encountered is the absence of numbers. The only numbers in our data are the rating we gave and the year of release. I'm pretty sure that such information is not enough to recommend a movie. So our first goal is to convert the string descriptions into numbers.
    But we are not magicians, so how can we do it? I used the following approach: scoring each genre/actor/director and so on by its presence in movies, accumulating the rating. It sounds complicated, but the implementation is very easy. I just iterated over the watched movies and incremented the score of each attribute by the movie's rating (as you remember, I used 1 to indicate that I liked a movie, -1 for dislike and 0 for the rest). My Python code is below:
def get_scores():
    # Create a dictionary per attribute and accumulate each value's
    # score by the movie's rating.
    genres = {}
    actors = {}
    countries = {}
    directors = {}
    years = {}
    tags = {}
    movies = get_data()
    for movie in movies:
        for g in movie['genres']:
            genres[g] = genres.get(g, 0) + movie['rating']
        for a in movie['actors']:
            actors[a] = actors.get(a, 0) + movie['rating']
        for c in movie['countries']:
            countries[c] = countries.get(c, 0) + movie['rating']
        directors[movie['director']] = \
            directors.get(movie['director'], 0) + movie['rating']
        years[movie['year']] = years.get(movie['year'], 0) + movie['rating']
        for t in movie.get('tags', []):
            tags[t] = tags.get(t, 0) + movie['rating']

    return genres, actors, countries, directors, years, tags

    As a result we obtain six Python dictionaries with scores. In this part I will use only five of them and leave "tags" for the NLP part. So now we have values — what's next? The first approach is to simply use them as features and implement an algorithm that evaluates a movie from these values without machine learning. I will show that next time; for now we will try to convert these dictionaries into something reasonable.

    First let's look at "genres". I will copy the code and results from my IPython notebook here. I marked 356 movies as watched, so I will use that information. If you plot the "genres" dictionary, you will observe something similar to my result:

    As you can see, only 20 genres are needed to describe the movies I watched. Also I want to mention that one movie can have several genres. I decided to make a feature for each genre (e.g. "is_fantastic", "is_action"...). After this manipulation we get 20 binary features with values of either 0 or 1.
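A minimal sketch of this one-hot conversion (the genre names here are made up for illustration; in practice the column set comes from the keys of the "genres" dictionary):

```python
# Turn a movie's genre list into binary columns: 1 if present, 0 if not.
all_genres = ['Action', 'Comedy', 'Drama']  # in practice: list(genres.keys())

def genre_features(movie_genres, all_genres):
    return {g: int(g in movie_genres) for g in all_genres}

print(genre_features(['Action', 'Drama'], all_genres))
# {'Action': 1, 'Comedy': 0, 'Drama': 1}
```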

    Next let's look at the year of release. It's already numerical, so it should be easy. Again I plotted the number of movies I liked by year of release:


    This feature will just be normalized; there is no need for other transformations. To normalize it I will subtract the minimum value and divide by the difference between the maximum and minimum values, so I get a value between 0 and 1. Now let's look at more complicated attributes, such as actors and directors. In my case there are 2510 actors who starred in movies I watched. I think it would be useless to create a feature for each actor, because it would increase computing time significantly, but not accuracy (an actor may occur in only one film).
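Before moving on to the actors, the year scaling described above boils down to one line; a small sketch (the sample years are invented for illustration):

```python
def normalize_year(year, min_year, max_year):
    # Shift by the minimum and divide by the range, mapping year into [0, 1].
    return (year - min_year) / (max_year - min_year)

# The oldest year maps to 0, the newest to 1, everything else in between.
print(normalize_year(1950, 1950, 2010))  # 0.0
print(normalize_year(2010, 1950, 2010))  # 1.0
```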

    For the actors I decided to keep only the liked ones (with a high score from the "get_scores" method) or the disliked ones. To achieve this, I first rearranged the dictionary of actors into a dictionary keyed by score: the key is now the score and the value is the number of actors with that score. Then I took the median of those counts and kept the scores whose count is at or below the median. By doing these steps I picked the actors whose score is either pretty high or pretty low. Python code below:
rate_actors = {}

for key, value in sorted(actors.items()):
    rate_actors.setdefault(value, []).append(key)

for key in rate_actors.keys():
    print(key, ':', len(rate_actors[key]))  
0 : 604
1 : 1200
2 : 201
3 : 105
4 : 45
5 : 19
6 : 4
7 : 5
8 : 5
9 : 3
10 : 1
11 : 1
13 : 2
15 : 1
-2 : 3
-3 : 1
-1 : 310
 
pdata = pd.DataFrame({'rating': list(rate_actors.keys()),
                      'quantity': [len(v) for v in rate_actors.values()]})
selected_ratings = list(
    pdata[pdata['quantity'] <= pdata['quantity'].quantile(0.5)]['rating'])
liked_actors = []
disliked_actors = []
for actor, rating in actors.items():
    if rating > 0 and rating in selected_ratings:
        liked_actors.append(actor)
    elif rating < 0 and rating in selected_ratings:
        disliked_actors.append(actor)
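To make the median filter concrete, here is a tiny self-contained toy run of the same selection (the actor names and scores are invented). Most actors pile up at common scores near zero, so those score buckets are large; the buckets whose size is at or below the median are the rare, extreme ones:

```python
import pandas as pd

# Toy scores: many actors at common scores (0, 1), a few at rare extremes.
actors = {'A': 0, 'B': 0, 'C': 0, 'D': 1, 'E': 1, 'F': 5, 'G': -3}

rate_actors = {}
for actor, score in actors.items():
    rate_actors.setdefault(score, []).append(actor)

pdata = pd.DataFrame({'rating': list(rate_actors.keys()),
                      'quantity': [len(v) for v in rate_actors.values()]})
selected = list(
    pdata[pdata['quantity'] <= pdata['quantity'].quantile(0.5)]['rating'])
print(sorted(selected))  # [-3, 5]
```

Only the extreme scores 5 and -3 survive, because their buckets (one actor each) are smaller than the median bucket size.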

    I performed the same steps for the directors:
rate_directors = {}

for key, value in sorted(directors.items()):
    rate_directors.setdefault(value, []).append(key)
    
pdata = pd.DataFrame({'rating': list(rate_directors.keys()),
                      'quantity': [len(v) for v in rate_directors.values()]})

selected_ratings = list(
    pdata[pdata['quantity'] <= pdata['quantity'].quantile(0.5)]['rating'])
liked_directors = []
disliked_directors = []
for director, rating in directors.items():
    if rating > 0 and rating in selected_ratings:
        liked_directors.append(director)
    elif rating < 0 and rating in selected_ratings:
        disliked_directors.append(director)

    We have converted our attributes to numerical features. All of them are binary except one, the "year" feature, but its value is also between 0 and 1. Now we need to convert our data into a set ready for machine learning algorithms (I used pandas):
col_names = ['year', 'has_liked_actors', 'has_disliked_actors',
             'madeby_liked_director', 'madeby_disliked_director']
X = pd.DataFrame(columns=col_names + list(genres.keys()))
y = []
min_year = int(min(years.keys()))
max_year = int(max(years.keys()))
for mov in movies.find():
    y.append(mov['rating'])
    mx = {}
    # Binary genre features: 1 if the movie has the genre, 0 otherwise.
    for key in genres.keys():
        mx[key] = 1 if key in mov['genres'] else 0
    # Min-max normalization of the release year.
    mx['year'] = (int(mov['year']) - min_year) / (max_year - min_year)
    mx['has_liked_actors'] = 0
    mx['has_disliked_actors'] = 0
    mx['madeby_liked_director'] = 0
    mx['madeby_disliked_director'] = 0
    for act in mov['actors']:
        if act in liked_actors:
            mx['has_liked_actors'] = 1
        if act in disliked_actors:
            mx['has_disliked_actors'] = 1
    # 'director' is a single value, so check it directly
    # (iterating over it would loop over its characters).
    if mov['director'] in liked_directors:
        mx['madeby_liked_director'] = 1
    if mov['director'] in disliked_directors:
        mx['madeby_disliked_director'] = 1
    X = X.append(mx, ignore_index=True)

    As a result we get a dataset with 25 features (the number could vary depending on the number of genres), which is ready for training. As you may have noticed, I skipped one attribute: "country". I haven't come up with a method to convert it yet, because it could be either a single string or a list of several countries. If you have any ideas, feel free to post them or write me via mail.

    That's all for now, but I will return to this section when we start improving the accuracy of our program. In the next article I will show how to divide our dataset into training, validation and test sets, and I will show the implementation of a simple machine learning model using the scikit-learn library.
