"You can have data without information, but you cannot have information without data."
Daniel Keys Moran
Hello again. In this part I will complete movie database with watched movies. Later this information will be used to create training set for machine learning algorithm. This part is short but still important. We will consider next topics:
- How to collect information about watched movies
- How to rate movies
In previous part I pointed out few approaches how to create movie database. In this part we will use it, so if you do not have it, you won't be able to write your implementation. If you are interested only in theoretical part, next article should be more exciting.
As always I will try to describe few approaches of reaching the goal. I know two basic methods of collecting information about watched movies: manual and semi-automatic. Why not fully automatic? Because in my case I will use not popular way to rate movies. But we will talk about rating later.
So what do I mean saying manual collecting of information? I created a simple form, which randomly picks movie and I should answer two questions:
- have I watched this movie?
- if yes did I like it?
from pymongo import MongoClient from random import randint from IPython.display import Image
client = MongoClient() db = client.movies_db watched_movies = db.watched_movies movies = db.movies
random_movie = movies.find()[randint(0, movies.count())] print(random_movie['title'], random_movie['year']) print(random_movie['countries']) print(random_movie['genres']) print(random_movie['actors']) print(random_movie['description']) print(random_movie['imdb']) Image(url=random_movie['poster_link'])
Better way to pick random film from specific collection, for example pick movie produced in 2014 year:
new_movies = movies.find({'year': '2014'}) random_movie = new_movies[randint(0, new_movies.count())]
But the best way is to pick some popular movies. You can use TMDb API for this case. I will show just fragment of code how to get the list of popular movies titles:
def get_params(page): # fill parameters for request api_key = 'Your API key' params = { 'Accept': 'application/json', 'api_key': api_key, 'page': page } return params def request_api(page): # request the API for specific page and return results service_url = 'http://api.themoviedb.org/3/movie/popular' url = service_url + '?' + urllib.parse.urlencode(get_params(page)) response = json.loads(urllib.request.urlopen(url).read().decode('utf-8')) return response['results'] def get_movies(): # get titles of popular movies movie_titles = [] for page in range(1, 20): # I've chosen first 20 pages for movie in request_api(page): movie_titles.append(movie['title']) return movie_titles
When you get this list, you can easily iterate it and search these movies in database. I also used mentioned above ipython form but instead of "random_movie" you should implement next:
import sys sys.path.append('Path to your TMDb API implementation') from tmdb_api import get_movies
popular_movies = get_movies() # Do it only once
title = popular_movies.pop(0) random_movie = movies.find_one({'title': title})
Another approach is to get list of watched movies. Where can you find it? For example on Facebook or another service, where you have rated movies with open API. I will show only how to get the list using Facebook Graph API. Before you need to get access token and grant user_actions.video permission (you can easily do it in Facebook Graph API Explorer). Below you can find my Python code:
def get_params(): # fill parameters for request access_token = 'Your access token' params = { 'access_token': access_token, 'limit': 500 } return params def request_api(): # request the API and return results service_url = 'https://graph.facebook.com/me/video.watches' url = service_url + '?' + urllib.parse.urlencode(get_params()) response = json.loads(urllib.request.urlopen(url).read().decode('utf-8')) return response['data'] def get_movies(): # get titles of watched movies movie_titles = [] for movie in request_api(): movie_titles.append(movie['data']['movie']['title']) return movie_titles
Now you know how to get list of watched movies or just popular movies. Next step is to mark them as watched and rate it.
Few words about rating. There are a lot of services with movie rankings such as IMDB or Kinopoisk and often these services use ranks counted with different formulas. Also you can find different measurement scales (5-stars, 10-stars, etc.). I won't say which is the best, but in my opinion either you liked a movie or not. In my system I will use next ratings:
- 1 - I liked the movie
- -1 - I didn't like the movie
- 0 - no impressions, some average film
random_movie['rating'] = 1 watched_movies.insert(random_movie)
As a result of this part you will get second collection in your movie database. Now everything is ready to start our analysis. Next part is one of the most interesting. I will try to describe all my thoughts about feature selection. After I will upload sources to GitHub (from this and previous parts) I will update this post. Do not miss the next part!
UPD: uploaded code to GitHub
No comments:
Post a Comment