Churn rate prediction on Sparkify service

Sparkify is a music streaming service similar to Spotify. Every user activity in the Sparkify application is logged and sent to a Kafka cluster. To improve the business, the data team collects this data into a Big Data platform for further processing, analysis, and extraction of insights that can drive action. One focus topic is predicting user churn.

Business understanding

We assume that a Sparkify user has either a Premium or a Free Tier subscription. The Premium plan carries a subscription fee and offers more songs without advertisements.

An end user can:

  • Upgrade from free tier to Premium subscription
  • Downgrade from Premium to free tier
  • Drop their account and leave the service

The goals are:

  • analyse the data
  • extract insights to identify churn indicators
  • build a machine learning model to help predict users who are likely to churn

The notebook, which includes the analysis, feature engineering and model building, can be found in the jupyter folder of the repository below:


  • The dataset is public, named Million Song Dataset, and was prepared by Udacity for download in JSON format

  • It contains 18 columns holding customer information (gender, name, etc.) and API events (login, playing next song, etc.)

  • Experiment period: 2018-10-01 to 2018-12-01

  • Kafka message example:

    "auth":"Logged In",
    "location":"New York-Newark-Jersey City, NY-NJ-PA",
    "userAgent":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "song":"You\\'re Not Alone",
  • Data is ingested into a Hive database, so Spark can query it like a normal database. The work described here is the Jupyter part of the data pipeline diagram below.

Data cleaning and exploration

Data cleaning

There are 543,703 records, of which 15,700 have an empty userid value. To prevent the same issue with any new incoming data, a filter is applied before further steps.

df = spark.sql(f"select * from {target_table} where userid is not null and sessionid is not null and userid <>''")
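To make the filter's behaviour explicit, here is a minimal plain-Python sketch of the same predicate. It is not the Spark code from the notebook; the function name `is_valid_event` and the sample records are hypothetical and only illustrate which rows the SQL condition keeps.

```python
from typing import Mapping, Optional


def is_valid_event(event: Mapping[str, Optional[object]]) -> bool:
    """Mirror the SQL filter: userid and sessionid present, userid not empty."""
    userid = event.get("userid")
    sessionid = event.get("sessionid")
    return userid is not None and sessionid is not None and userid != ""


# Hypothetical sample records, for illustration only.
events = [
    {"userid": "30", "sessionid": 29},   # kept
    {"userid": "", "sessionid": 8},      # dropped: empty userid
    {"userid": None, "sessionid": 12},   # dropped: missing userid
    {"userid": "9", "sessionid": None},  # dropped: missing sessionid
]
clean = [e for e in events if is_valid_event(e)]
```

Only the first record survives, which matches what the SQL `where` clause would return for the same rows.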

Data exploration

  • Churn indicator: users who visited the service and submitted a Cancellation Confirmation are treated as churned.

  • Churn can happen at both the free and paid levels, but we see it more often on free. Users may dislike the fewer features and more ads of the free service but not intend to pay for an account.

  • As shown in the histogram below, loyal users with more registered days churn less.

  • Looking at the average item count in a session, we see that churned users are less interactive, with fewer items per session.

  • There are two interesting findings in the Page Event extraction: churned users have fewer Thumbs Up events, and more Thumbs Down and Roll Advert events. The next step is to feed these features into a machine learning algorithm to identify which contribute most to churn.
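The churn definition above can be sketched in a few lines of plain Python. This is an illustration rather than the notebook's Spark code: it assumes each event is a dict with a `userid` field and a `page` field holding the page event name, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping, Set

# Page event that marks a user as churned, as defined in the text.
CHURN_PAGE = "Cancellation Confirmation"


def churned_users(events: Iterable[Mapping[str, str]]) -> Set[str]:
    """Return the ids of users with at least one churn page event."""
    return {e["userid"] for e in events if e.get("page") == CHURN_PAGE}


def churn_labels(events: List[Mapping[str, str]]) -> Dict[str, int]:
    """Map every user id to 1 (churned) or 0 (not churned)."""
    churned = churned_users(events)
    return {e["userid"]: int(e["userid"] in churned) for e in events}


# Hypothetical sample events, for illustration only.
sample = [
    {"userid": "1", "page": "NextSong"},
    {"userid": "1", "page": "Cancellation Confirmation"},
    {"userid": "2", "page": "Thumbs Up"},
]
labels = churn_labels(sample)
```

User 1 is labelled 1 because of the cancellation event, and user 2 is labelled 0. The label is per user, not per event, so a single cancellation marks the user's whole history.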

Feature engineering

In this step, we engineer a number of features that are available in the provided dataset or were observed during data exploration.

  • Churn indicator, derived from the Cancellation Confirmation response
  • Last level, paid or free
  • Registered days: the duration between registration and the last event the user logged in the music service
  • Gender, kept to see how much it impacts the final model
  • Average song length
  • Average session duration
  • Average daily count of page events
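Two of the features above can be sketched as follows. This is a plain-Python illustration, not the notebook's Spark implementation. It assumes timestamps are epoch milliseconds and that a session's duration is the gap between its first and last event; both are assumptions, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

MS_PER_DAY = 24 * 60 * 60 * 1000
MS_PER_MINUTE = 60 * 1000


def registered_days(registration_ms: int, last_event_ms: int) -> float:
    """Days between registration and the user's last logged event."""
    return (last_event_ms - registration_ms) / MS_PER_DAY


def avg_session_minutes(events: Iterable[Tuple[int, int]]) -> float:
    """Average session duration in minutes.

    `events` is an iterable of (sessionid, timestamp_ms) pairs for one user.
    """
    bounds: Dict[int, Tuple[int, int]] = {}
    for session_id, ts in events:
        first, last = bounds.get(session_id, (ts, ts))
        bounds[session_id] = (min(first, ts), max(last, ts))
    if not bounds:
        return 0.0
    durations = [(last - first) / MS_PER_MINUTE for first, last in bounds.values()]
    return sum(durations) / len(durations)


# Hypothetical user: two sessions lasting 10 and 20 minutes.
user_events = [(1, 0), (1, 600_000), (2, 0), (2, 1_200_000)]
days = registered_days(0, 2 * MS_PER_DAY)
minutes = avg_session_minutes(user_events)
```

In Spark, the same aggregation would typically be a group-by on user and session followed by min and max over the timestamp column.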

Model training and evaluation

The purpose of this step is to build a machine learning model that can predict which customers are likely to churn and which are not. This is a binary classification problem whose classes are churned and not churned.

Model evaluation metrics

Because churned users are a fairly small subset compared to non-churned ones, the F1 score is selected as the model evaluation metric.
This metric balances precision and recall.
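To see why F1 suits an imbalanced problem better than accuracy, here is a small sketch that computes both from confusion-matrix counts. It shows the F1 of the churned class only; the text does not say which averaging the evaluation used, so treat this as an illustration of the metric rather than the exact evaluator. The counts are made up.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (churned) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: 100 users, 10 of them churned, and a model that
# predicts "not churned" for everyone.
lazy_accuracy = accuracy(tp=0, fp=0, fn=10, tn=90)
lazy_f1 = f1_score(tp=0, fp=0, fn=10)

# A model that finds 8 of 16 churners with 2 false alarms:
# precision = 0.8, recall = 0.5.
some_f1 = f1_score(tp=8, fp=2, fn=8)
```

The lazy model reaches 0.9 accuracy while its F1 is 0.0, which is why accuracy alone would be misleading when churned users are rare.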

Trained models and evaluation

We evaluated three common algorithms: Logistic Regression (lr), Random Forest (rf) and Gradient-Boosted Trees (gbt). The models are then compared in terms of F1 score.

Based on the results, Random Forest is selected because it scores better than the others (F1 score = 0.82). With further parameter tuning (numTrees=75, maxDepth=20), Random Forest achieves a better result of 0.87.
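The text does not say how the tuning was carried out. One common approach is a grid search, sketched below in plain Python. The candidate values and the `toy_score` function are hypothetical stand-ins; in a real run the score would be the F1 of a model trained with each parameter set.

```python
from itertools import product
from typing import Callable, Dict, List

Params = Dict[str, int]


def param_grid(candidates: Dict[str, List[int]]) -> List[Params]:
    """Build every combination of the candidate values (Cartesian product)."""
    names = list(candidates)
    value_lists = (candidates[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


def best_params(grid: List[Params], score: Callable[[Params], float]) -> Params:
    """Return the parameter set with the highest score."""
    return max(grid, key=score)


# Hypothetical candidate values for the two parameters named in the text.
grid = param_grid({"numTrees": [25, 50, 75], "maxDepth": [10, 20]})


def toy_score(params: Params) -> float:
    """Synthetic stand-in for a validation F1 score; not a real model result."""
    return -abs(params["numTrees"] - 50) - abs(params["maxDepth"] - 10)


chosen = best_params(grid, toy_score)
```

The grid here has 3 × 2 = 6 combinations. Spark ML offers `ParamGridBuilder` and `CrossValidator` in `pyspark.ml.tuning` for running this kind of search on the cluster.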

As shown in the feature importance chart, the top features contributing to churn are registered_days, rolling advert, etc. This is valuable information for the product team to improve the service for better customer retention.

Conclusion and potential improvement

In this project, we implemented a model to predict customer churn for the Sparkify music streaming service. The whole project was carried out on the Apache Spark data processing engine.

The complete workflow comprises cleaning, exploration, feature engineering, modeling and evaluation steps to select the most suitable model for the churn prediction problem.

As future improvements, we can try other machine learning algorithms and extract more features from the location and useragent columns, which potentially contain useful insights. Furthermore, hyperparameter tuning can be run on Spark to use the engine's power to find better parameters.
