Pandas UDF for PySpark

I came across PySpark’s Pandas UDFs while rewriting my Python code in PySpark so that it could scale. I wanted to perform a train-test split within each user group, but there’s no easy way to do that in PySpark because of the way groups, rows, and columns are represented in Spark. Luckily, Pandas UDFs came to my rescue. I found them a very powerful tool, and I’d like to share them in this post.
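
As a rough illustration of the per-user split, here is a minimal sketch of a function that takes one user's rows as a pandas DataFrame and labels each row as train or test. The column names (`user_id`, `movie_id`, `rating`), the 25% test fraction, and the fixed seed are assumptions made for the example, not details from the post. The function is exercised with plain pandas here; the trailing comment shows how it could be handed to Spark's grouped-map API.

```python
import numpy as np
import pandas as pd

TEST_FRACTION = 0.25  # assumed share of each user's rows held out for testing
SEED = 0              # assumed fixed seed so the sketch is reproducible


def split_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Label every row of one user's data as 'train' or 'test'."""
    rng = np.random.default_rng(SEED)
    n_test = int(round(len(pdf) * TEST_FRACTION))
    # Shuffle row positions, then mark the first n_test of them as test rows.
    order = rng.permutation(len(pdf))
    labels = np.full(len(pdf), "train", dtype=object)
    labels[order[:n_test]] = "test"
    out = pdf.copy()
    out["split"] = labels
    return out


# Local check with plain pandas; the data below is made up for illustration.
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "movie_id": [10, 11, 12, 13, 10, 14, 15, 16],
    "rating": [4.0, 3.5, 5.0, 2.0, 1.0, 4.5, 3.0, 4.0],
})
labeled = pd.concat([split_group(g) for _, g in ratings.groupby("user_id")])

# With PySpark 3.x, the same function could be applied per group, e.g.:
#   spark_df.groupBy("user_id").applyInPandas(
#       split_group,
#       schema="user_id long, movie_id long, rating double, split string",
#   )
```

The function takes a single argument on purpose: Spark's grouped-map API decides how to call the function based on how many positional parameters it declares, so the settings live in module-level constants instead.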

Content-Based Movie Recommendation System

Many of the recommendation systems we see today are based on collaborative filtering, which makes recommendations based on similarities in user tastes. The advantage of collaborative filtering is that it doesn’t need content information: it “automatically” discovers what to recommend from users’ behavior, such as ratings or click-through rates. Its disadvantage is that it can’t cope with new products, because there is no behavioral data for them yet. This is commonly called the cold start problem. In this case, content-based movie recommendation can be a useful complement.

Instacart Reorder Product Analysis

This is a project adapted from the Instacart Market Basket Analysis competition on Kaggle.

Using a Linear Regression Model to Predict Salaries at Startup Companies

When I saw the estimated salaries feature on Glassdoor job posts, I thought it would be cool if I could do something similar: build a model to predict salaries and gain insight into the factors that affect them. So I did.