Pandas UDF for PySpark

I came across PySpark’s Pandas UDFs while rewriting my Python code in PySpark so that it could scale. I wanted to perform a train/test split within each user group, but there’s no easy way to do that in PySpark because of how Spark represents groups, rows, and columns. Luckily, Pandas UDFs came to my rescue. I found them a very powerful tool, and I’d like to share them in this post.
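
As a rough sketch of the idea (not necessarily the approach the full post takes), a per-user split could be written as a plain pandas function and handed to Spark's grouped `applyInPandas`, available in Spark 3.0 and later. The column names (`user_id`, `item_id`, `rating`), the added `is_test` flag, the 20% test fraction, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumed settings for illustration only.
TEST_FRACTION = 0.2
SEED = 42

# Assumed output schema: the input columns plus a boolean is_test flag.
OUTPUT_SCHEMA = "user_id long, item_id long, rating double, is_test boolean"


def split_user(pdf: pd.DataFrame) -> pd.DataFrame:
    """Mark a random fraction of one user's rows as test rows.

    Spark calls this once per group with that group's rows as a pandas
    DataFrame, so ordinary pandas and NumPy code works here.
    """
    rng = np.random.default_rng(SEED)
    n_rows = len(pdf)
    n_test = int(round(n_rows * TEST_FRACTION))
    # Pick n_test distinct row positions for the test set.
    test_positions = rng.choice(n_rows, size=n_test, replace=False)
    is_test = np.zeros(n_rows, dtype=bool)
    is_test[test_positions] = True
    return pdf.assign(is_test=is_test)


def split_by_user(sdf):
    """Apply split_user to every user group of a Spark DataFrame."""
    return sdf.groupBy("user_id").applyInPandas(split_user, schema=OUTPUT_SCHEMA)
```

Given a Spark DataFrame `ratings` with those three columns, `split_by_user(ratings)` would return the same rows with an `is_test` column, which can then be filtered into train and test sets. Because `split_user` is ordinary pandas code, it can be checked on a small pandas DataFrame before running it on a cluster.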


Content Based Movie Recommendation System

Many of the recommendation systems we see today are based on collaborative filtering, which makes recommendations based on similarities in users’ tastes. The advantage of collaborative filtering is that it doesn’t need content information. It “automatically” discovers what to recommend from users’ behavior, such as ratings or click-through rates. The disadvantage is that it can’t cope with new products, because there’s no behavioral data to start with. This is commonly called the cold start problem. In this case, content-based movie recommendation can be a useful complement.
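
To make the contrast concrete, here is a minimal sketch of the content-based idea, assuming each movie is described by a handful of genre flags and that similarity is measured with cosine similarity. The titles, the genre columns, and the `most_similar` helper are hypothetical and are not taken from the full post.

```python
import numpy as np

# Hypothetical catalogue. Columns are [action, comedy, romance, sci-fi].
titles = ["Movie A", "Movie B", "Movie C", "New Movie"]
features = np.array(
    [
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 1],  # a new release with no ratings yet
    ],
    dtype=float,
)


def most_similar(index, features, top_k=2):
    """Return (row index, cosine similarity) for the top_k closest movies.

    Assumes every row has at least one non-zero feature.
    """
    # Scale each row to unit length so a dot product is a cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    scores = unit @ unit[index]
    scores[index] = -np.inf  # never recommend the movie to itself
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# The new movie has no ratings, but its content still gives it neighbours.
neighbours = most_similar(3, features)
```

Nothing in this calculation depends on ratings or clicks, which is why a content-based approach can still produce suggestions for an item that has just been added.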
