layonsan

Finetuning LLMs using Federated Learning

2025-07-24T00:00:00+00:00

My capstone for my master's in data science centered on fine-tuning large language models (LLMs) using Federated Learning (FL). I explored how the Flower framework can be used to fine-tune LLMs on a finance dataset via FL, a privacy-preserving training paradigm in which multiple parties collaboratively train a model under the coordination of a central server. A pre-trained LLM available on Hugging Face serves as the base model, with instruction tuning as the representative training procedure. Training with FL proceeds through 4 iterative steps: (1) global model updating (server), (2) local model training (client), (3) local model updating (client) and (4) global model aggregating.
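
To make the four steps concrete, here is a minimal sketch of one federated round. It is a toy under stated assumptions: a two-parameter linear regression stands in for the LLM, and the names `local_train` and `run_round` are illustrative, not part of Flower's API or of the capstone code.

```python
import numpy as np

def local_train(weights, x, y, lr=0.1, epochs=5):
    """Client side: a few gradient steps of least-squares regression on local data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def run_round(global_w, clients):
    """One federated round over a list of (x, y) client datasets."""
    updates, sizes = [], []
    for x, y in clients:
        local_w = global_w.copy()             # (1) server shares the current global model
        local_w = local_train(local_w, x, y)  # (2) client trains on its private data
        updates.append(local_w)               # (3) client returns its updated weights
        sizes.append(len(y))
    # (4) server aggregates: average weighted by client dataset size
    return np.average(np.stack(updates), axis=0, weights=sizes)

# Three clients with differently sized private datasets drawn from the same relation
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (40, 60, 100):
    x = rng.normal(size=(n, 2))
    clients.append((x, x @ true_w))

w = np.zeros(2)
for _ in range(20):
    w = run_round(w, clients)
```

Only weights travel between server and clients in this sketch; the raw `(x, y)` data never leaves the client loop, which is the privacy argument for FL.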

Overall Methodology

Data

The dataset is a financial instruction dataset sourced from Hugging Face, specifically the 4DR1455/finance_questions collection. It comprises 53,837 records, providing a solid foundation for fine-tuning Large Language Models (LLMs) with federated learning. Given its size and focus on financial instructions, it offers a wide variety of financial queries and responses, making it well suited for training LLMs to understand and generate finance-related content.

Frameworks for Federated LLMs

There are several emerging frameworks designed to support federated fine-tuning of large language models:

OpenFedLLM: Provides a concise framework for federated instruction tuning and federated value alignment, with support for multiple domains (e.g., finance, education) and techniques like LoRA for parameter-efficient fine-tuning.
FederatedScope-LLM (FS-LLM): An extension of the FederatedScope platform with modules for benchmarks, algorithms, and training workflows, making it easier to evaluate and experiment with federated LLMs.

Both are exciting contributions, but they’re still very new and face challenges like limited community support, unclear backwards compatibility, and adoption barriers in real-world applications.

Why I Used Flower

I chose to build on Flower, an open-source federated learning framework that has gained stronger traction and community adoption.

Proven foundation: Flower focuses on federated learning and privacy-enhancing technologies, with practical use cases demonstrated in both academia and industry
Community & support: Unlike newer frameworks, Flower already has an active developer community, more robust documentation, and backing from venture funding (Felicis Ventures), giving it momentum for long-term sustainability.
Origins: Flower started as a research project at the University of Cambridge and later evolved into Flower Labs, an AI startup.
Practical relevance: Within the federated learning landscape, Flower is increasingly used in real-world implementations, making it a reliable choice for experimentation with federated fine-tuning of LLMs.

Federated Learning Strategies Used

I experimented with five different federated learning strategies to fine-tune large language models (LLMs). Each strategy tackles the challenge of training models across distributed, non-shared datasets in slightly different ways:

FedAvg (Federated Averaging): The classic baseline in federated learning. Each client trains locally, and then the server averages the updates, weighted by data size. It’s simple and communication-efficient, but struggles when client data is very different (non-IID).
FedProx: An improvement over FedAvg. It adds a “proximal term” to the local training objective to keep local updates closer to the global model. This helps reduce instability when client datasets vary a lot.
FedAdam: Brings the popular Adam optimizer into federated learning. Instead of just averaging updates, the server adapts learning rates and uses momentum to speed up and stabilize training, which is intended to help when data is inconsistent across clients.
FedAdaGrad: Adapts the AdaGrad optimizer for federated learning. The server treats the averaged client update as a pseudo-gradient and scales its step for each parameter by the accumulated squared updates from past rounds, so frequently updated parameters take smaller steps while rarely updated ones keep larger steps. Useful when clients have very different data characteristics.
FedAvgM (FedAvg with Momentum): Adds a momentum term to FedAvg. Rather than averaging updates naively, the server considers past updates too, which smooths training and reduces oscillations. This makes it more stable in challenging federated setups.
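
The differences between these strategies are easiest to see in the update rules. The sketch below is a simplified NumPy illustration of three of them, assuming flat weight vectors. The function names and the default values for `momentum`, `server_lr` and `mu` are hypothetical, and this is not Flower's implementation.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: average of client weights, weighted by local dataset size."""
    return np.average(np.stack(client_weights), axis=0, weights=client_sizes)

def fedavgm_step(global_w, client_weights, client_sizes, velocity,
                 momentum=0.9, server_lr=1.0):
    """FedAvgM: treat (global - average) as a pseudo-gradient and apply server momentum."""
    pseudo_grad = global_w - fedavg(client_weights, client_sizes)
    velocity = momentum * velocity + pseudo_grad
    return global_w - server_lr * velocity, velocity

def fedprox_grad(task_grad, local_w, global_w, mu=0.1):
    """FedProx: local gradient plus the proximal pull mu * (w_local - w_global)."""
    return task_grad + mu * (local_w - global_w)

# Two clients holding 1 and 3 examples respectively
global_w = np.zeros(2)
client_weights = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
client_sizes = [1, 3]

averaged = fedavg(client_weights, client_sizes)
new_w, velocity = fedavgm_step(global_w, client_weights, client_sizes,
                               velocity=np.zeros(2))
prox = fedprox_grad(np.array([0.5]), np.array([1.0]), np.array([0.0]))
```

With a zero initial velocity, the first FedAvgM step coincides with plain FedAvg; the strategies only diverge from the second round on, when the stored velocity carries information from earlier rounds.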

Federated Instruction Tuning

To make the project feasible within limited time and compute resources, I used smaller T5 models instead of larger LLMs. While compact, T5 still captures many of the behaviors of bigger models, making it a practical stand-in (proxy model) for testing federated learning in financial instruction tasks. This approach demonstrates that meaningful research can still be done responsibly, without requiring massive infrastructure.

Training Setup

To ensure a fair comparison between the baseline (centralized training) and federated learning, both were trained for a total of 9 epochs:

Baseline (centralized): 9 epochs straight
Federated models: 3 epochs × 3 rounds = 9 epochs total

Key training settings:
Batch size: 64 (with auto-adjust to fit hardware)
Optimizer: Adafactor + cosine learning rate scheduler
Learning rate: 2e-5
Weight decay: 0.01 (to reduce overfitting)
Max sequence length: 512 tokens
Precision: bfloat16 (bf16) for efficiency
Packing: Enabled (to improve efficiency with variable-length inputs)
Evaluation & logging: Every 50 steps (tracked with Weights & Biases)
Hardware: Google Colab A100 GPU, optimized with 10 dataloader workers
Checkpoints: Saved in safetensors format for compatibility and security
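
As a rough sketch, the settings above can be captured in a single config object so that the baseline and federated runs differ only in how the 9 epochs are split. The `TrainConfig` class and its field names are hypothetical and are not the arguments of any particular training library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the settings listed above
    batch_size: int = 64
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    max_seq_length: int = 512
    bf16: bool = True
    packing: bool = True
    eval_steps: int = 50
    dataloader_workers: int = 10
    # Epoch budget: epochs per round times number of rounds
    local_epochs: int = 9
    rounds: int = 1

    @property
    def total_epochs(self) -> int:
        return self.local_epochs * self.rounds

baseline = TrainConfig()                            # 9 epochs straight
federated = TrainConfig(local_epochs=3, rounds=3)   # 3 epochs x 3 rounds
```

Keeping every other field identical between `baseline` and `federated` is what makes the comparison fair: only the epoch split changes.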

This setup was tuned to strike a balance between efficiency, performance, and generalization on the financial instruction dataset.

Evaluation Metrics

Since the models generate text, I evaluated them using ROUGE (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-SUM) and BLEU. These metrics measure how closely the model’s output matches the reference answers, focusing on overlap of words, phrases, and sequences.
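
To show what "overlap of words" means in practice, here is a simplified ROUGE-N computation in plain Python. It is a sketch, assuming whitespace tokenization and lowercasing only; the reported scores came from a standard ROUGE implementation, not from this function, and the example sentences are made up.

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: clipped n-gram overlap between reference and candidate."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # each n-gram counts at most as often as in both
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the bond pays a fixed coupon"
candidate = "the bond pays a coupon"
rouge1 = rouge_n(reference, candidate, n=1)
rouge2 = rouge_n(reference, candidate, n=2)
```

Here every candidate word appears in the reference (unigram precision 1.0), but the missing word "fixed" lowers recall to 5/6 and breaks two of the five reference bigrams, so ROUGE-2 drops faster than ROUGE-1.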

Results

Finetuning Training Evaluation

The experiments revealed a clear performance hierarchy:

Baseline (centralized training): As expected, this model performed best across all metrics. It served as the benchmark for comparison.
FedAvgM (with momentum): Consistently the top-performing federated strategy. It achieved the lowest loss (0.160) and the highest ROUGE scores, outperforming FedProx (0.175 loss), FedAvg (0.182), and FedAdaGrad (0.216).
FedAdam: Surprisingly, this strategy lagged far behind the others. Its performance dropped significantly across both loss and ROUGE metrics.

In short: while federated models still trailed the centralized baseline, FedAvgM showed strong promise as a practical strategy for stabilizing training and improving text generation quality in federated setups.

Predictions Evaluation

When evaluating predictions with ROUGE metrics, the results aligned closely with the training phase but revealed some interesting nuances:

Baseline (centralized training): Maintained its lead, confirming its role as the strongest benchmark.
FedAvgM, FedProx, and FedAvg: Performed very similarly during prediction, with ROUGE-SUM scores clustered at 0.046 (FedAvgM), 0.045 (FedProx), and 0.046 (FedAvg). This suggests that, in practice, their performance differences are marginal.
FedAdaGrad: While competitive during training, its performance dipped slightly in prediction evaluation, falling behind the leading strategies.
FedAdam: Consistently underperformed, with a ROUGE-SUM score of just 0.003, far below all other strategies.

Takeaways

The experiments highlighted several important findings about federated optimization strategies for fine-tuning language models on financial data:

FedAvgM consistently leads: Across all metrics, FedAvgM outperformed other federated strategies. Its use of momentum helped speed up convergence and reduce instability during training, making it especially effective in decentralized settings. This mirrors findings from other research showing that momentum improves stability in distributed optimization.
Domain-specific challenges: Because the dataset consisted of finance-related chats and responses, models had to capture precise terminology and context. Federated learning complicates this further, since each client may have slightly different distributions of financial text. FedAvgM’s stability appears to help the model generalize better across these variations while preserving terminological accuracy.
Impact of model size: The project used T5-small, which has far fewer parameters than large models like GPT. While this made training feasible, smaller models are generally more sensitive to data heterogeneity in federated setups. This may explain why some strategies struggled: a small model simply doesn’t have the capacity to absorb highly diverse updates. The weak performance of FedAdam further supports this idea, as Adam-based optimizers can struggle in federated environments with non-IID client data.
ROUGE scores matter in finance: Since financial conversations require both accuracy and fluency, ROUGE scores were especially relevant. Higher ROUGE indicates the model captured the right terminology while staying coherent. FedAvgM’s strong ROUGE results suggest it not only reduced loss but also generated more consistent, domain-appropriate responses.
FedAdam underperformed: One unexpected finding was how poorly FedAdam performed compared to other strategies. While Adam is a go-to optimizer in centralized deep learning, it didn’t translate well to federated learning here. Its adaptive moment updates seem too unstable when client data is non-IID, reinforcing the need to carefully re-evaluate optimizers before applying them in decentralized contexts.

Implementing an end-to-end ML system using batch-serving architecture

2024-01-16T00:00:00+00:00

This post walks through the development of an end-to-end machine learning (ML) platform built around a batch-serving architecture. It is part of my 2023 goal to expand my engineering capabilities into AI/ML deployments, and it draws on Paul Iusztin’s comprehensive Full Stack MLOps Guide. Rather than duplicating his project, I used a different dataset: taking advantage of the Singapore context, I used the Open Government Application Programming Interface (API) to extract PM2.5 data. As a result, although the infrastructure stack and logic align closely with the reference guide, the components responsible for preprocessing, prediction, and inference differ notably. The source code can be found in this GitHub repository.

Overall Architecture

1. Feature Pipelines

The first component of the ML system is to extract and perform feature engineering on the data before loading the transformed data into a feature store.

1.1. Data

I decided to use a real-time API from Data Gov as the data source. The API allows us to query hourly recorded PSI data for various regions in Singapore.

An extraction script pulls the data using an HTTP GET request.

1.2. Feature Engineering

A fair amount of preprocessing is required to prepare the data as features. The payload schema needs to be flattened and transformed to get the relevant records - timestamp, update_timestamp, readings_. The regions comprise north, south, east, west and central. For instance, a target variable reading_average is created by averaging the hourly PSI across the regions.
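
The flattening step can be sketched as follows. The payload shape, the `psi_twenty_four_hourly` metric key, the per-region column names and the sample values are all assumptions made for illustration; the actual response and the pipeline's column names may differ.

```python
REGIONS = ("north", "south", "east", "west", "central")

# Hypothetical payload with one hourly item (values are made up)
sample_payload = {
    "items": [
        {
            "timestamp": "2024-01-01T10:00:00+08:00",
            "update_timestamp": "2024-01-01T10:05:00+08:00",
            "readings": {
                "psi_twenty_four_hourly": {
                    "north": 50, "south": 54, "east": 48, "west": 52, "central": 56,
                }
            },
        }
    ]
}

def flatten_item(item, metric="psi_twenty_four_hourly"):
    """Turn one nested payload item into a flat feature row."""
    readings = item["readings"][metric]
    row = {
        "timestamp": item["timestamp"],
        "update_timestamp": item["update_timestamp"],
    }
    for region in REGIONS:
        row[f"readings_{region}"] = readings[region]
    # Target variable: mean of the regional readings for this hour
    row["reading_average"] = sum(readings[r] for r in REGIONS) / len(REGIONS)
    return row

rows = [flatten_item(item) for item in sample_payload["items"]]
```

Each row is a plain dict, so the list can be handed to `pandas.DataFrame(rows)` before loading into the feature store.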

1.3. Hopsworks Feature Store

Hopsworks is a flexible and modular feature store that provides seamless integration for existing pipelines, superior performance for any SLA, and increased productivity for data and AI teams.

The feature pipelines section focuses on using APIs to extract data and performing some feature engineering before loading the results into a feature store (Hopsworks).

2. Training Pipelines

The second component is a series of training pipelines that handle the heavy lifting of model training. Data is first pulled from the feature store, with metadata logged to wandb. The data then goes through a series of model training runs, with the output artifacts rendered and uploaded to wandb.

2.1. Model Training

A baseline model using naive bayes will serve as a benchmark. Next, a more sophisticated model combining sktime and LightGBM is tuned and then trained using the best configs.

The best model is also loaded into the Hopsworks model registry.

2.2. Weights & Biases (wandb)

Weights & Biases helps AI developers build better models faster. Quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, and manage your ML workflows end-to-end.

For each of the runs, we can track the experimental output and performance, as well as the various model metrics.

3. Batch Prediction Pipelines

The third component, batch prediction, is relatively straightforward. Data is retrieved in batches from the Hopsworks feature store and passed through model inference to produce predictions, and the outputs are then written to cloud storage, which acts as a cache.

3.1. Google Cloud Storage (GCS)

The Google Cloud Storage (GCS) serves as the repository for diverse data files stored in parquet formats, encompassing X and y features, predictions, and monitoring data. Although several tools, like Redis, are adept at caching predictions, incorporating such tools would have introduced complexity to the components, which falls outside the primary scope of this project.

To connect to a GCS bucket from the Python scripts, I’ll create a GCP service account with the appropriate access credentials.

3.2. Batch Prediction

Each run involves extracting a batch of data within a specified datetime range, streamlining the batch inference process. The most recent and optimal model is loaded into memory by downloading the artifact from the model registry. Subsequently, the model predicts PSI values for the upcoming 24 hours, and these predictions are then stored in the Google Cloud Storage (GCS) bucket.

4. Scheduling and Orchestration using Airflow

4.1. PyPI Server

A PyPI registry is a server where you can host Python packages. Only people with access to the server can install packages from it. A private PyPI server is configured to host the feature, training and batch prediction pipelines.

Poetry is used to package the feature, training and batch prediction pipelines as individual packages before uploading to the server.

4.2. Airflow

Airflow is used to schedule and orchestrate the pipelines using DAGs. Here’s an overview of how the flow and branching of DAGs are configured in Airflow.

5. Continuous Monitoring for Model Performance

5.1. Great Expectations (GE) Suite

A GE suite is a collection of verifiable assertions about data integrity. Hopsworks integrates GE support, so a GE validation suite can be attached in Hopsworks to define the expected behavior of new data.

An expectation is a verifiable assertion about data.

The suite includes expectations such as:

Ensuring that table columns align with a predefined ordered list.
Verifying that the total number of columns is 7.
Affirming that timestamp columns cannot be null.
Specifying that readings columns are of type int32 and possess a minimum and maximum value of 0 and 500, respectively.
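
To illustrate what these assertions check, here is a sketch written as plain pandas checks. It deliberately does not use the Great Expectations API, and the column names in `EXPECTED_COLUMNS` are assumptions chosen to match the seven-column layout described above.

```python
import pandas as pd

# Hypothetical column layout: 2 timestamp columns + 5 regional readings = 7 columns
TIMESTAMP_COLUMNS = ["timestamp", "update_timestamp"]
READING_COLUMNS = [f"readings_{r}" for r in ("north", "south", "east", "west", "central")]
EXPECTED_COLUMNS = TIMESTAMP_COLUMNS + READING_COLUMNS

def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable failures; an empty list means the batch passes."""
    failures = []
    if list(df.columns) != EXPECTED_COLUMNS:
        failures.append("columns do not match the expected ordered list")
    if df.shape[1] != 7:
        failures.append(f"expected 7 columns, got {df.shape[1]}")
    for col in TIMESTAMP_COLUMNS:
        if col in df.columns and df[col].isna().any():
            failures.append(f"{col} contains nulls")
    for col in READING_COLUMNS:
        if col not in df.columns:
            continue
        if df[col].dtype != "int32":
            failures.append(f"{col} is {df[col].dtype}, expected int32")
        elif not df[col].between(0, 500).all():
            failures.append(f"{col} has values outside [0, 500]")
    return failures

good = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 10:00"]),
    "update_timestamp": pd.to_datetime(["2024-01-01 10:05"]),
    **{col: pd.Series([50], dtype="int32") for col in READING_COLUMNS},
})
bad = good.copy()
bad["readings_north"] = pd.Series([600], dtype="int32")  # out-of-range reading
```

Calling `validate(good)` returns an empty list, while `validate(bad)` reports the out-of-range reading. A real GE suite expresses the same rules declaratively and stores the validation results alongside the feature group.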

5.2. ML Monitoring

Ensuring the consistent and expected performance of the production system over time is crucial. Implementing a machine learning monitoring process establishes a mechanism to address any issues that may arise, facilitating the adaptation of the system and retraining the model in response to changes in the environment.

For instance, the Mean Absolute Percentage Error (MAPE) metric is continuously computed. A spike in this metric serves as an alarm, prompting actions such as fine-tuning the model or adjusting model configurations as necessary.
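
For reference, MAPE is the mean of |actual - predicted| / |actual| over the evaluated points. The sketch below computes it in plain Python; the `needs_attention` helper and its threshold of 0.2 are hypothetical and only illustrate how a spike could be flagged.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error as a fraction; skips points where actual is 0."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is 0")
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)

def needs_attention(mape_history, threshold=0.2):
    """Flag the latest monitoring window if its MAPE exceeds the (assumed) threshold."""
    return mape_history[-1] > threshold

actual = [50, 100]
predicted = [45, 110]
score = mape(actual, predicted)              # both points are off by 10%
alarm = needs_attention([0.08, 0.1, 0.35])   # latest window spikes above 0.2
```

Skipping zero actuals is one possible convention; since MAPE divides by the actual value, any monitoring job has to decide how to treat such points.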

6. FastAPI and Streamlit

FastAPI and Streamlit serve as the backend and frontend for retrieving model outputs (predictions and monitoring metrics) and rendering them as a dashboard. Both applications are dockerised and deployed.

6.1. FastAPI

FastAPI is used as the backend to consume predictions and monitoring metrics from GCS and expose them through a RESTful API. A variety of endpoints are defined to GET the predictions and monitoring metrics.

Endpoints:

/health: Health check
/predictions: GET prediction values
/monitoring/metrics: GET aggregated monitoring metrics

On receiving a request, the API reads the data from storage, validates it against the preconfigured Pydantic schema, and returns the response serialized as JSON.

6.2. Streamlit

Streamlit will be the frontend application that renders the data to visualise 2 dashboards:

predictions
monitoring metrics

7. System Deployment using GCP

Due to cost considerations, I have opted to exclude this section, as it falls outside the project’s defined scope.

In a production environment, the preferred approach is to deploy all machine learning components to a cloud provider (e.g., AWS, GCP, Azure) and set up a Continuous Integration/Continuous Deployment (CI/CD) pipeline using tools such as GitHub Actions or Azure Pipelines.

Data Systems using Azure

2024-01-04T00:00:00+00:00

I have been working with Azure cloud services for the past 1-2 years and have earned two Microsoft certificates along the way: Azure Fundamentals and Azure Data Engineering. In this post, I will highlight a few pivotal projects where I played a central role. It is less a guide than a showcase of how these services were integrated to meet each project’s specific objectives.

SFTP Architecture

The primary aim of this project was to establish a data system capable of automating the encrypted file transfer to a commercial partner through SFTP. It was imperative that the transferred data remained encrypted throughout the process.

Database (Sink)

Data essential for batch transfer underwent various transformations for business use within the data warehouse before being loaded onto this server at regular batch intervals. This server acted as the source for retrieving data for our data system.

Azure Data Factory (Transformation)

Azure Data Factory (ADF) served as the central pipeline orchestration tool for executing batch data transfers. ADF played a crucial role in integrating multiple services to fulfill the project’s objectives. Its primary functions encompassed:
- Adapting data to align with the format required by our commercial partner’s existing SFTP server infrastructure.
- Transmitting the finalized, encrypted, and formatted files to the destination SFTP server.

Azure Blob Storage (Staging)

Transformed data was stored as blob files in Azure Blob Storage, functioning as an interim staging area.

To maintain a clear demarcation between encrypted and unencrypted data, encrypted data was stored separately, facilitating a more transparent debugging process. Staging data twice allowed pinpointing issues from the unencrypted files onward, bypassing the data transformation process.

Additionally, each batch’s encrypted AES keys were stored within this storage environment.

Azure Function (Encryption)

Adhering to security requirements mandating encryption at rest and in transit, a 2-stage hybrid encryption employing RSA and AES was implemented on the data files themselves.

While Azure ensures encryption at rest and in transit, the intricacies of hybrid encryption demanded a custom solution. The encryption logic was implemented and deployed as an Azure Function using the Python v2 programming model.

Azure Key Vault (Secure keys)

RSA key certificates were securely stored within the Azure Key Vault. The Azure Function accessed these keys solely during the encryption process, guaranteeing the constant protection and security of the RSA key.
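
The two-stage scheme can be sketched with the `cryptography` package: a fresh AES key encrypts the file contents, and the RSA public key encrypts (wraps) that AES key. This is a minimal illustration, not the project's code. AES-GCM, OAEP padding and the key sizes are assumptions, since the post does not state which modes were used, and in the real system the RSA key would come from Key Vault rather than being generated locally.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumed padding scheme for wrapping the AES key
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

def hybrid_encrypt(plaintext: bytes, rsa_public_key):
    """Stage 1: AES encrypts the data. Stage 2: RSA encrypts the per-batch AES key."""
    aes_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(aes_key).encrypt(nonce, plaintext, None)
    wrapped_key = rsa_public_key.encrypt(aes_key, OAEP)
    return wrapped_key, nonce, ciphertext

def hybrid_decrypt(wrapped_key, nonce, ciphertext, rsa_private_key):
    """Receiver side: unwrap the AES key with the RSA private key, then decrypt."""
    aes_key = rsa_private_key.decrypt(wrapped_key, OAEP)
    return AESGCM(aes_key).decrypt(nonce, ciphertext, None)

# Locally generated key pair, for demonstration only
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
message = b"account_id,balance\n42,100.00\n"
wrapped_key, nonce, ciphertext = hybrid_encrypt(message, private_key.public_key())
recovered = hybrid_decrypt(wrapped_key, nonce, ciphertext, private_key)
```

The split exists because RSA can only encrypt small payloads, while AES handles files of any size efficiently; only the short wrapped key needs the asymmetric step, which is why an encrypted AES key is stored per batch.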

dbt Architecture

The project aimed to devise and establish a straightforward architecture supporting the deployment of dbt. This shift was intended to transition away from Alteryx as the primary ETL/ELT tool toward a more adaptable and resilient infrastructure that champions dbt for data transformation.

Container Registry (Containerisation)

Upon containerizing the dbt project into a Docker image, the image is stored within the Container Registry.

Container Instance

Deployment of the containerized application is accomplished via a single Azure Container Instance. Rather than running continuously, the container is started only for the duration of a dbt run and stays stopped otherwise, which saves cost.

Data Factory (Orchestration & Monitoring)

Azure Data Factory operates as the orchestration tool, responsible for scheduling and executing the containerized dbt application. Triggers are utilized to initiate the container and commence the dbt runs.

Furthermore, the REST API is leveraged for monitoring the container’s status. This enables efficient tracking of the container’s state.

2023 Review

2024-01-03T00:00:00+00:00

This reflection and review for 2023 will incorporate a more personal element instead of focusing on purely work topics.

Summary

In 2023, my life was a whirlwind, a rollercoaster of experiences that brought both challenges and exhilarating moments across different aspects. Reflecting on this eventful year, I appreciate the highs and acknowledge the lows. While I aspired to achieve numerous goals, I fell short in several areas. However, this has only fueled my determination to make 2024 a year of growth and achievement. I will instead commit myself to focusing on excelling in a few key areas that matter most to me.

School

Academically, diving into my part-time master’s program in Data Science at NTU exposed me to intriguing modules like data systems, machine learning applications, and the mathematics behind AI. Balancing work and academics was tough, but the rewards were immense, leaving me eagerly anticipating more learning and personal growth in 2024.

Work

Professionally, my focus shifted towards data engineering and Azure cloud computing at work. Engaging in various projects expanded my expertise in data modeling, ETL/ELT pipelines, and developing AI/ML Proof of Concepts within the Azure ecosystem. However, the year ended with some professional turbulence, leaving a bittersweet feeling as I reflected on my growth.

Personal

On a personal note, milestones abounded. My engagement and subsequent wedding planning with my partner, along with our joint venture in buying and renovating a resale flat, marked significant strides in our lives. Despite time constraints limiting my interactions, catching up with close friends and witnessing my best friend’s marriage were treasured moments.

Travel

Travel-wise, the experiences were diverse and enriching. Beginning the year with an enchanting trip to Germany and Poland with my partner set the tone for an adventurous year. A family trip to Krabi and a rejuvenating Bali getaway with friends were refreshing. The last trip of the year was a bachelor adventure to Bangkok, which brought new friendships and unforgettable memories.

Improvements

Looking back on my goal last year

🥅 2023 Goals: Gain a deeper understanding in Causal Inference and engage in more practical application of data/ML engineering and ops.

Causal Inference: Unfortunately, I didn’t make much progress in this area due to the significant influx of events—starting my master’s, the GenAI hype, etc.
DataOps and Data Engineering: My work allowed me to make significant progress in this area. I acquired Microsoft Certificates: Azure Fundamentals and Azure Data Engineer. I became proficient in various Azure cloud services such as Data Factory, Function, Logic App, Stream Analytics, EventHub, Machine Learning, etc.
MLOps and ML Engineering: Although slightly delayed, I am currently working on a project that focuses on implementing an end-to-end ML platform.

2024 Plan

Plan and enjoy a lovely wedding and honeymoon that my wife and I will always remember fondly.
Dive deeper into GenAI projects, aligning with my studies to explore the realms of deep learning, neural networks, and reinforcement learning. This will fortify my grasp on generative models, enhancing my expertise in this dynamic field.
Expand my reading horizons. Although my exploration of stoicism in 2023 impacted my reading pace, I aim to continue delving into diverse genres while striving to maintain a steady reading habit.
Cultivate a consistent writing practice. While I authored two articles in 2022, I didn’t contribute any last year. Writing not only reinforces my understanding but also allows me to share my insights with the community — an endeavor I consider invaluable. I aim to write and share my learnings more frequently in 2024, contributing to the collective growth of knowledge.

On a side note, it’s possible that this portfolio website is due for an exciting upgrade soon!

2022 Project Summary

2022-10-19T00:00:00+00:00

Instead of working on analytical insights projects in 2022, I decided to spin up something different. There are 2 notable projects I have been working on this year: Medium Articles and Churn Models on Streamlit.

1. Medium Articles

I wrote and published 2 articles in Towards Data Science (TDS) on Medium.

2. Churn Models on Streamlit

Another key focus of the year was learning and deploying Streamlit. There are many Python web frameworks for rendering dashboards, such as Dash, Flask and Streamlit. I decided to go with Streamlit as it can turn data scripts into shareable web apps in minutes, all in pure Python, with no front‑end experience required. For the demo topic, I chose churn models to illustrate how supervised classification models such as GLMs and Random Forests can be used to address problems and generate insights.

Feel free to visit my Streamlit app to see how Streamlit can be used to showcase experimental machine learning models for churn prediction. To view the source code, you can also check out the GitHub repo here.

LinkedIn Network Analysis

2021-12-01T00:00:00+00:00

What does your LinkedIn network really look like? I visualized my own connections using NetworkX and Plotly, turning a list of names into a living, breathing graph. Along the way, I explored concepts from network and graph theory—like centrality, clusters, and bridges—that reveal hidden patterns in how people are connected. Dive in to see how data visualization can turn something familiar into something surprisingly insightful.

## Importing libraries

import numpy as np
import pandas as pd
import networkx as nx
from pyvis import network as net
import janitor

import plotly.express as px
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

## Loading dataset
df = pd.read_csv('data/Connections.csv',skiprows=2)
df.info() # summary info

RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   First Name     397 non-null    object
 1   Last Name      397 non-null    object
 2   Email Address  7 non-null      object
 3   Company        390 non-null    object
 4   Position       390 non-null    object
 5   Connected On   400 non-null    object
dtypes: object(6)
memory usage: 18.9+ KB

At a quick glance, I have about 400 connections.

Data Cleaning

I will perform some cleaning, remove unnecessary attributes and remove null values from the data.

new_df = (
        df.clean_names() # remove spacing and capitalisation
        .drop(columns=['first_name','last_name','email_address']) # dropped first, last and email
        .dropna(subset=['company','position']) # remove null values in company and position
        .to_datetime('connected_on', format='%d %b %Y') # convert date column to datetime object
)
new_df.head()

	company	position	connected_on
0	InfoCepts	Talent Acquisition Lead	2021-11-28
1	Yara International	Associate data engineer	2021-11-27
2	Yara International	Lead Recruiter, Digital Ag Solutions	2021-11-25
3	Yara International	Data Scientist	2021-11-25
4	Yara International	Associate Digital Information Specialist	2021-11-25

Data Exploration

Connections at a glance
New connections over time
Top 15 companies my connections work at
Top 15 roles my connections work as

Connections at a glance

new_df1 = new_df[['company','position']].copy() # copy so the new column is not added to a view of new_df
new_df1['My Network'] = 'My Network'

px.treemap(new_df1, path=['My Network', 'company', 'position'], width=1200, height=1200)

New Connections over time

daily_connections = (new_df
                    .groupby(by=['connected_on']) # group by date
                    .size() # count new connections per day
                    .plot() # plot line chart
)

Looking at the number of new connections over time since I joined LinkedIn, the bulk of my connections were made at the start (between the end of 2019 and the start of 2020).

Top 15 companies my connections work at

companies_count = (new_df
                    .groupby(by=['company']) # group by company
                    .size() # count connections per company
                    .to_frame('size') # convert to frame
                    .sort_values(by=['size'],ascending=False) # sort by descending order
                    .reset_index()
)
companies_count.head(15).plot(kind='barh', x='company', y='size').invert_yaxis() # horizontal bars labelled by company, largest on top

Top 15 roles my connections are working in

position_count = (new_df
                    .groupby(by=['position']) # group by position
                    .size() # count connections per position
                    .to_frame('size') # convert to frame
                    .sort_values(by=['size'],ascending=False) # sort by descending order
)
position_count.head(15).plot(kind='barh').invert_yaxis() # convert to horizontal plot

The top 3 companies my connections work at are Yara, Archisen and NTU, which is expected given that I did my undergraduate degree at NTU and worked at Archisen after graduation before joining Yara International.

Most of my connections are Research Assistants, Data Scientists and Software Engineers.

Network Analysis

companies_count.reset_index(inplace=True,drop=True)
companies_count_reduced = companies_count.loc[companies_count['size'] >=2]
print(companies_count_reduced.shape)

(42, 2)

position_count.reset_index(inplace=True)
position_count_reduced = position_count.loc[position_count['size'] >=2]
print(position_count_reduced.shape)

(35, 2)

# Initialise Graph
g1 = nx.Graph()
g1.add_node('root') # initialising myself as the central node

# add one node per company that has at least 2 connections
for _, row in companies_count_reduced.iterrows():

    # store company name and count (cast to a plain int so pyvis can serialise it)
    company = row['company']
    count = int(row['size'])
    
    title = f"{company} - {count}"
    # collect the distinct positions my connections hold at this company
    positions = set(new_df.loc[new_df['company'] == company, 'position'])
    position_list = '\n'.join(sorted(positions)) # one position per line

    # hover text: company and count, followed by the list of positions
    hover_info = f"{title}\n{position_list}"

    g1.add_node(company, size = count*2, title = hover_info, color='#3449eb')
    g1.add_edge('root',company,color='grey')

# Generate the graph
company_nt = net.Network(height='700px', width='700px', bgcolor="grey", font_color='white',notebook=True)
company_nt.from_nx(g1)
company_nt.hrepulsion()

company_nt.show('company_graph.html')
display(HTML(filename='company_graph.html')) # render the saved file rather than the literal string

# initialise graph
g2 = nx.Graph()
g2.add_node('root') # initialise yourself as the central node

# use iterrows to iterate through the data frame
for id, row in position_count_reduced.iterrows():

    count = int(row['size']) # keep the count numeric so it can drive the node size
    position = row['position']

    g2.add_node(position, size=count, color='#3449eb', title=str(count))
    g2.add_edge('root', position, color='grey')

# generate the graph
position_nt = net.Network(height='700px', width='700px', bgcolor="black", font_color='white', notebook = True)
position_nt.from_nx(g2)
position_nt.hrepulsion()

position_nt.show('position_graph.html')
display(HTML('position_graph.html'))

Rule-based Sentiment Analysis on Syfe, Stashaway and Endowus

2021-09-24T00:00:00+00:00

Are all app reviews created equal? I put investment platforms — Syfe, StashAway, and Endowus — under the microscope using three lexicon-based sentiment tools: TextBlob, VADER, and SentiWordNet. Each one tells a slightly different story about what users love (or don’t), and together they reveal the hidden tone behind the feedback.

import numpy as np
import pandas as pd
import regex as re
import warnings
warnings.filterwarnings('ignore')

# import file
app_reviews = pd.read_csv('app_reviews.csv')
app_reviews.head()

	app_name	content
0	Syfe	1. The portfolio “card user interface” can be ...
1	Syfe	This hybrid app is quite buggy compared Stasha...
2	Syfe	The app and website is just a bunch of fake li...
3	Syfe	The app looks fantastic and it’s so fresh with...
4	Syfe	Hi there,\n\nThe app checks for latest version...

1. Data Preprocessing

Data preprocessing steps:

a. Cleaning the text
b. Tokenization
c. Enrichment – POS tagging
d. Stopwords removal
e. Obtaining the stem words

1a. Cleaning the Text

Remove special characters and numbers from the review text using regex.

# Define a function to clean the text
def clean(text):
    # Replace every run of non-letter characters (special characters, digits) with a single space
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text

# Cleaning the text in the review column
app_reviews['cleaned_reviews'] = app_reviews['content'].apply(clean)
app_reviews.head()

	app_name	content	cleaned_reviews
0	Syfe	1. The portfolio “card user interface” can be ...	The portfolio card user interface can be inco...
1	Syfe	This hybrid app is quite buggy compared Stasha...	This hybrid app is quite buggy compared Stasha...
2	Syfe	The app and website is just a bunch of fake li...	The app and website is just a bunch of fake li...
3	Syfe	The app looks fantastic and it’s so fresh with...	The app looks fantastic and it s so fresh with...
4	Syfe	Hi there,\n\nThe app checks for latest version...	Hi there The app checks for latest version dur...
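To make the effect of the regex concrete, here is a minimal, self-contained sketch of the same substitution applied to one made-up string (the `sample` text is an assumption for illustration, not a review from the dataset):

```python
import re

def clean(text):
    # Replace every run of non-letter characters with a single space
    return re.sub('[^A-Za-z]+', ' ', text)

# Made-up sample text, not taken from the scraped reviews
sample = "Hi there,\n\nIt's 5-star & 100% great!"
print(repr(clean(sample)))
```

Two side effects are worth knowing: contractions are split (It's becomes "It s", as in row 3 of the table above), and a trailing space can remain, which a final `.strip()` would remove if needed.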

1b. Tokenisation

Use nltk's word_tokenize() function to perform word-level tokenization.

1c. Enrichment – POS tagging

Use nltk's pos_tag function to perform part-of-speech (POS) tagging, converting each token into a tuple of the form (word, tag). POS tagging preserves the context of each word and is essential for lemmatization.

1d. Stopwords removal

Stopwords in English are words that carry very little useful information, so we remove them as part of text preprocessing. nltk ships stopword lists for a number of languages, including English.

import nltk
from nltk.tokenize import word_tokenize
# Download punkt resource if unavailable
# nltk.download('punkt') 

from nltk.tag import pos_tag
# Download averaged_perceptron_tagger resource if unavailable
# nltk.download('averaged_perceptron_tagger')

from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.corpus import wordnet
# Download wordnet resource if unavailable
# nltk.download('wordnet')

## POS tagger dictionary
# To obtain an accurate lemma, the WordNetLemmatizer requires POS tags in the form of 'n', 'a', etc.
# But the POS tags returned by pos_tag are Penn Treebank tags such as 'NN', 'JJ', etc.
# To map pos_tag tags to WordNet tags, we create a dictionary pos_dict keyed on the first letter of the tag.
# Any pos_tag that starts with J is mapped to wordnet.ADJ, any pos_tag that starts with R is mapped to wordnet.ADV, and so on.
# Our tags of interest are Noun, Adjective, Adverb, Verb. Anything outside these four is mapped to None.
pos_dict = {'J':wordnet.ADJ, 'V':wordnet.VERB, 'N':wordnet.NOUN, 'R':wordnet.ADV}

stop_words = set(stopwords.words('english')) # build the stopword set once instead of once per token

def token_stop_pos(text):
    tags = pos_tag(word_tokenize(text)) # tokenise the review and POS-tag the tokens
    newlist = [] # list of (word, WordNet POS tag) tuples
    for word, tag in tags: # iterate through the (word, POS tag) tuples
        if word.lower() not in stop_words: # skip stop words
            newlist.append((word, pos_dict.get(tag[0]))) # map the tag's first letter to a WordNet tag (None if not of interest)
    return newlist

app_reviews['pos_tagged'] = app_reviews['cleaned_reviews'].apply(token_stop_pos) # apply token_stop_pos function to the reviews
app_reviews.head()

	app_name	content	cleaned_reviews	pos_tagged
0	Syfe	1. The portfolio “card user interface” can be ...	The portfolio card user interface can be inco...	[(portfolio, n), (card, n), (user, None), (int...
1	Syfe	This hybrid app is quite buggy compared Stasha...	This hybrid app is quite buggy compared Stasha...	[(hybrid, a), (app, n), (quite, r), (buggy, a)...
2	Syfe	The app and website is just a bunch of fake li...	The app and website is just a bunch of fake li...	[(app, n), (website, n), (bunch, n), (fake, a)...
3	Syfe	The app looks fantastic and it’s so fresh with...	The app looks fantastic and it s so fresh with...	[(app, n), (looks, v), (fantastic, a), (fresh,...
4	Syfe	Hi there,\n\nThe app checks for latest version...	Hi there The app checks for latest version dur...	[(Hi, n), (app, n), (checks, n), (latest, a), ...
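The tag mapping can be checked in isolation. The sketch below assumes hand-written Penn Treebank tags rather than real pos_tag output, and spells out the WordNet constants as their single-letter values (wordnet.ADJ is 'a', wordnet.VERB is 'v', wordnet.NOUN is 'n', wordnet.ADV is 'r') so it runs without any nltk downloads; `to_wordnet_tags` is a hypothetical helper name:

```python
# WordNet's POS constants are single letters: ADJ = 'a', VERB = 'v', NOUN = 'n', ADV = 'r'
pos_dict = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_tags(tagged_tokens):
    """Map (word, Penn Treebank tag) pairs to (word, WordNet tag or None)."""
    return [(word, pos_dict.get(tag[0])) for word, tag in tagged_tokens]

# Hand-written tags for illustration; pos_tag may tag real text differently
tagged = [("app", "NN"), ("looks", "VBZ"), ("fantastic", "JJ"),
          ("really", "RB"), ("could", "MD")]
print(to_wordnet_tags(tagged))
```

Only the first letter of the tag is used, so every noun tag (NN, NNS, NNP) collapses to 'n', and any tag outside the four groups of interest, such as the modal MD, becomes None.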

1e. Obtaining the stem words

A stem is a part of a word responsible for its lexical meaning. The two popular techniques of obtaining the root/stem words are Stemming and Lemmatization.

The key difference is that stemming often produces meaningless root words because it simply chops off characters at the end of a word. Lemmatization produces meaningful root words, but it requires the POS tag of each word.
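The contrast can be illustrated with a toy example. This is only a sketch: `toy_stem` is a crude suffix stripper standing in for a real stemmer, and `toy_lemmas` is a hypothetical three-entry lookup standing in for WordNetLemmatizer, so neither reflects what nltk does internally.

```python
def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma lookup standing in for WordNetLemmatizer: (word, pos) -> lemma
toy_lemmas = {("studies", "n"): "study", ("checks", "n"): "check", ("looks", "v"): "look"}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the lookup has no entry
    return toy_lemmas.get((word, pos), word)

for word, pos in [("studies", "n"), ("checks", "n"), ("looks", "v")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
```

The stripper turns "studies" into "stud", which is not a word, while the lookup returns "study". The lookup needs the POS tag as part of its key, which is why the tagging step comes first.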

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(pos_data):
    lemma_rew = " " # start from a single space; lemmas are appended to this string
    for word, pos in pos_data: # iterate through the (word, POS tag) tuples
        if not pos: # no WordNet tag: keep the word unchanged
            lemma = word
        else:
            lemma = wordnet_lemmatizer.lemmatize(word, pos=pos)
        lemma_rew = lemma_rew + " " + lemma
    return lemma_rew

app_reviews['Lemma'] = app_reviews['pos_tagged'].apply(lemmatize)
app_reviews.head()

	app_name	content	cleaned_reviews	pos_tagged	Lemma
0	Syfe	1. The portfolio “card user interface” can be ...	The portfolio card user interface can be inco...	[(portfolio, n), (card, n), (user, None), (int...	portfolio card user interface inconvenient m...
1	Syfe	This hybrid app is quite buggy compared Stasha...	This hybrid app is quite buggy compared Stasha...	[(hybrid, a), (app, n), (quite, r), (buggy, a)...	hybrid app quite buggy compare Stashaway How...
2	Syfe	The app and website is just a bunch of fake li...	The app and website is just a bunch of fake li...	[(app, n), (website, n), (bunch, n), (fake, a)...	app website bunch fake lie Starting onboardi...
3	Syfe	The app looks fantastic and it’s so fresh with...	The app looks fantastic and it s so fresh with...	[(app, n), (looks, v), (fantastic, a), (fresh,...	app look fantastic fresh different color muc...
4	Syfe	Hi there,\n\nThe app checks for latest version...	Hi there The app checks for latest version dur...	[(Hi, n), (app, n), (checks, n), (latest, a), ...	Hi app check late version launch alert user ...

2. Rule-Based Sentiment Analysis

a. TextBlob
b. VADER
c. SentiWordNet

# Creating a new data frame with the review, Lemma columns 
fin_data = pd.DataFrame(app_reviews[['app_name','cleaned_reviews', 'Lemma']])

2a. Sentiment Analysis using TextBlob

Polarity – how positive or negative the opinion is

Polarity ranges from -1 to 1 (1 is most positive, 0 is neutral, -1 is most negative)

Subjectivity – how subjective the opinion is

Subjectivity ranges from 0 to 1 (0 being very objective and 1 being very subjective)

from textblob import TextBlob

# function to calculate subjectivity
def getSubjectivity(review):
    return TextBlob(review).sentiment.subjectivity

# function to calculate polarity
def getPolarity(review):
    return TextBlob(review).sentiment.polarity

# function to analyze the reviews
def analysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

# Apply the above functions
fin_data['subjectivity'] = fin_data['Lemma'].apply(getSubjectivity) 
fin_data['polarity'] = fin_data['Lemma'].apply(getPolarity) 
fin_data['textblob-analysis'] = fin_data['polarity'].apply(analysis)

fin_data.head()

	app_name	cleaned_reviews	Lemma	subjectivity	polarity	textblob-analysis
0	Syfe	The portfolio card user interface can be inco...	portfolio card user interface inconvenient m...	0.436364	0.236364	Positive
1	Syfe	This hybrid app is quite buggy compared Stasha...	hybrid app quite buggy compare Stashaway How...	0.500000	0.200000	Positive
2	Syfe	The app and website is just a bunch of fake li...	app website bunch fake lie Starting onboardi...	0.465833	-0.125000	Negative
3	Syfe	The app looks fantastic and it s so fresh with...	app look fantastic fresh different color muc...	0.473333	0.146667	Positive
4	Syfe	Hi there The app checks for latest version dur...	Hi app check late version launch alert user ...	0.551515	-0.154545	Negative

tb_counts = fin_data.groupby(by=['app_name','textblob-analysis']).size()
tb_counts

app_name   textblob-analysis
Endowus    Negative                5
           Neutral                12
           Positive              193
StashAway  Negative               97
           Neutral               157
           Positive             1401
Syfe       Negative               30
           Neutral                42
           Positive              102
dtype: int64

2b. Sentiment Analysis using VADER

positive if compound >= 0.5
neutral if -0.5 < compound < 0.5
negative if compound <= -0.5
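The cut-off matters a great deal for how many reviews end up Neutral. The sketch below isolates the labelling rule so the threshold can be varied; `label_compound` and its `threshold` parameter are hypothetical names. The comparison value of 0.05 is the typical threshold mentioned in the VADER README, while this post uses the stricter 0.5.

```python
def label_compound(compound, threshold=0.5):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Compare the stricter cut-off used here (0.5) with the commonly cited 0.05
for score in (0.8, 0.3, -0.3, -0.8):
    print(score, label_compound(score), label_compound(score, threshold=0.05))
```

With a threshold of 0.5, mildly positive or mildly negative reviews (for example a compound score of 0.3 or -0.3) are labelled Neutral, so a large Neutral class is to be expected.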

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# Return a sentiment label (Positive, Neutral or Negative) based on the VADER compound score of the input text
def calc_vader_sentiment(text):
    vs = analyzer.polarity_scores(str(text))
    compound = vs['compound']
    if (compound >= 0.5):
        sentiment = 'Positive'
    elif(compound <= -0.5):
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'
    return sentiment

fin_data['vader-analysis'] = fin_data['Lemma'].apply(calc_vader_sentiment)
fin_data.head()

	app_name	cleaned_reviews	Lemma	subjectivity	polarity	textblob-analysis	vader-analysis
0	Syfe	The portfolio card user interface can be inco...	portfolio card user interface inconvenient m...	0.436364	0.236364	Positive	Positive
1	Syfe	This hybrid app is quite buggy compared Stasha...	hybrid app quite buggy compare Stashaway How...	0.500000	0.200000	Positive	Positive
2	Syfe	The app and website is just a bunch of fake li...	app website bunch fake lie Starting onboardi...	0.465833	-0.125000	Negative	Negative
3	Syfe	The app looks fantastic and it s so fresh with...	app look fantastic fresh different color muc...	0.473333	0.146667	Positive	Positive
4	Syfe	Hi there The app checks for latest version dur...	Hi app check late version launch alert user ...	0.551515	-0.154545	Negative	Positive

vd_counts = fin_data.groupby(by=['app_name','vader-analysis']).size()
vd_counts

app_name   vader-analysis
Endowus    Neutral             44
           Positive           166
StashAway  Negative            30
           Neutral            466
           Positive          1159
Syfe       Negative            10
           Neutral            103
           Positive            61
dtype: int64

2c. Sentiment Analysis using SentiWordNet

if positive score > negative score, the sentiment is positive
if positive score < negative score, the sentiment is negative
if positive score = negative score, the sentiment is neutral

# Download sentiwordnet resource if unavailable
# nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

def sentiwordnetanalysis(pos_data):
    sentiment = 0
    tokens_count = 0
    for word, pos in pos_data:
        if not pos:
            continue
        lemma = wordnet_lemmatizer.lemmatize(word, pos=pos)
        if not lemma:
            continue
        synsets = wordnet.synsets(lemma, pos=pos)
        if not synsets:
            continue
        # Take the first sense, the most common
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())
        sentiment += swn_synset.pos_score() - swn_synset.neg_score()
        tokens_count += 1
        # print(swn_synset.pos_score(),swn_synset.neg_score(),swn_synset.obj_score())
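        # These checks run inside the loop, so the function returns a label at the
        # first token that has a synset (a review with no such token returns None)
        if not tokens_count:
            return 0
        if sentiment>0:
            return "Positive"
        if sentiment==0:
            return "Neutral"
        else:
            return "Negative"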

fin_data['swn-analysis'] = app_reviews['pos_tagged'].apply(sentiwordnetanalysis)
fin_data.head()

	app_name	cleaned_reviews	Lemma	subjectivity	polarity	textblob-analysis	vader-analysis	swn-analysis
0	Syfe	The portfolio card user interface can be inco...	portfolio card user interface inconvenient m...	0.436364	0.236364	Positive	Positive	Neutral
1	Syfe	This hybrid app is quite buggy compared Stasha...	hybrid app quite buggy compare Stashaway How...	0.500000	0.200000	Positive	Positive	Neutral
2	Syfe	The app and website is just a bunch of fake li...	app website bunch fake lie Starting onboardi...	0.465833	-0.125000	Negative	Negative	Neutral
3	Syfe	The app looks fantastic and it s so fresh with...	app look fantastic fresh different color muc...	0.473333	0.146667	Positive	Positive	Neutral
4	Syfe	Hi there The app checks for latest version dur...	Hi app check late version launch alert user ...	0.551515	-0.154545	Negative	Positive	Neutral

swn_counts = fin_data.groupby(by=['app_name','swn-analysis']).size()
swn_counts

app_name   swn-analysis
Endowus    Negative         12
           Neutral         134
           Positive         63
StashAway  Negative        130
           Neutral         983
           Positive        511
Syfe       Negative         19
           Neutral         117
           Positive         37
dtype: int64

3. Visualise Results

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Convert sentiment results from series into dataframes
tb_counts_df = pd.DataFrame(tb_counts).reset_index().rename(columns={0:'count'})
vd_counts_df = pd.DataFrame(vd_counts).reset_index().rename(columns={0:'count'})
swn_counts_df = pd.DataFrame(swn_counts).reset_index().rename(columns={0:'count'})

Absolute Comparison

sns.set_style( 'darkgrid' )
col = sns.color_palette("Set2")
fig, axes = plt.subplots(1,3,figsize=[30,8])
fig.suptitle('Rule Based Sentiment Analysis on Syfe, Endowus and StashAway')

## Plot 1
sns.barplot(ax=axes[0],data=tb_counts_df,x='app_name',y='count',hue='textblob-analysis', palette=col)
axes[0].set_title('Sentiment Analysis using TextBlob')
axes[0].set_ylabel('Score Count')

## Plot 2
sns.barplot(ax=axes[1],data=vd_counts_df,x='app_name',y='count',hue='vader-analysis', palette=col)
axes[1].set_title('Sentiment Analysis using VADER')
axes[1].set_ylabel('Score Count')

## Plot 3
sns.barplot(ax=axes[2],data=swn_counts_df,x='app_name',y='count',hue='swn-analysis', palette=col)
axes[2].set_title('Sentiment Analysis using SentiWordNet')
axes[2].set_ylabel('Score Count')

Text(0, 0.5, 'Score Count')

Percentage Comparison

# Convert the counts into percentages of each app's total
tb_grouped_df = tb_counts_df.groupby(['app_name','textblob-analysis']).agg({'count':'sum'})
tb_percent_df = 100 * tb_grouped_df / tb_grouped_df.groupby(level=0).transform('sum')
tb_percent_df = tb_percent_df.reset_index().rename(columns={'count':'perc_count'})

vd_grouped_df = vd_counts_df.groupby(['app_name','vader-analysis']).agg({'count':'sum'})
vd_percent_df = 100 * vd_grouped_df / vd_grouped_df.groupby(level=0).transform('sum')
vd_percent_df = vd_percent_df.reset_index().rename(columns={'count':'perc_count'})

swn_grouped_df = swn_counts_df.groupby(['app_name','swn-analysis']).agg({'count':'sum'})
swn_percent_df = 100 * swn_grouped_df / swn_grouped_df.groupby(level=0).transform('sum')
swn_percent_df = swn_percent_df.reset_index().rename(columns={'count':'perc_count'})
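The percentage step simply divides each count by the total for its own app. Here is a plain-Python sketch of that normalisation, using the TextBlob counts printed earlier in this post; the variable names `totals` and `percent` are introduced for illustration only.

```python
from collections import defaultdict

# TextBlob counts per (app, label), copied from the tb_counts output in this post
tb_counts = {
    ("Endowus", "Negative"): 5, ("Endowus", "Neutral"): 12, ("Endowus", "Positive"): 193,
    ("StashAway", "Negative"): 97, ("StashAway", "Neutral"): 157, ("StashAway", "Positive"): 1401,
    ("Syfe", "Negative"): 30, ("Syfe", "Neutral"): 42, ("Syfe", "Positive"): 102,
}

# Total reviews per app (the denominator for that app's percentages)
totals = defaultdict(int)
for (app, _label), count in tb_counts.items():
    totals[app] += count

# Share of each label within its own app, in percent
percent = {key: 100 * count / totals[key[0]] for key, count in tb_counts.items()}

for (app, label), value in percent.items():
    print(f"{app:10s} {label:9s} {value:6.2f}")
```

Normalising within each app is what makes the three apps comparable despite StashAway having far more reviews than the other two.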

sns.set_style( 'darkgrid' )
col = sns.color_palette("Set2")
fig1, axes1 = plt.subplots(1,3,figsize=[30,8])
fig1.suptitle('Rule Based Sentiment Analysis on Syfe, Endowus and StashAway')

## Plot 1
sns.barplot(ax=axes1[0],data=tb_percent_df,x='app_name',y='perc_count',hue='textblob-analysis', palette=col)
axes1[0].set_title('Sentiment Analysis using TextBlob')
axes1[0].set_ylabel('Score Count (%)')

## Plot 2
sns.barplot(ax=axes1[1],data=vd_percent_df,x='app_name',y='perc_count',hue='vader-analysis', palette=col)
axes1[1].set_title('Sentiment Analysis using VADER')
axes1[1].set_ylabel('Score Count (%)')

## Plot 3
sns.barplot(ax=axes1[2],data=swn_percent_df,x='app_name',y='perc_count',hue='swn-analysis', palette=col)
axes1[2].set_title('Sentiment Analysis using SentiWordNet')
axes1[2].set_ylabel('Score Count (%)')

Text(0, 0.5, 'Score Count (%)')

Key Takeaways

Looking purely at absolute numbers, StashAway has the most scores simply because it has the largest number of app reviews. Its positive scores overwhelmingly outnumber its negative scores.
SentiWordNet appears to compress the variance of the scores, with more of them falling in the neutral class. For StashAway, the number of positive reviews drops and the number of neutral scores shoots up.
In percentage terms, Endowus leads, with the highest share of positive reviews under both TextBlob and VADER.
All 3 roboadvisors have a higher percentage of positive than negative scores, with negative reviews making up only a small share.
Hierarchy of choice: Endowus or StashAway > Syfe
Anyone looking to choose one of these roboadvisors can rest assured that all 3 apps have garnered good reviews from users.

Scraping App Reviews for popular roboadvisors in Singapore using Python

2021-09-02T00:00:00+00:00

Behind every app lie thousands of user voices. I used Python to scrape reviews for Syfe, Endowus, and StashAway from both the Apple App Store and Google Play. In this three-part series, I walk through collecting reviews from each platform and then bringing them together into one dataset ready for analysis. This work draws on Apple Store Scraper and Google Play Store Scraper.

## Importing relevant libraries
import pandas as pd
import numpy as np

# for scraping app info from App Store
from itunes_app_scraper.scraper import AppStoreScraper

# for scraping app reviews from App Store
from app_store_scraper import AppStore

# for scraping app reviews from the Google Play Store
from google_play_scraper import app, Sort, reviews

# for pretty printing data structures
from pprint import pprint

# for keeping track of timing
import datetime as dt
from tzlocal import get_localzone

# for building in wait times
import random
import time

Part 1 - Scrape reviews from the Apple App Store

## Read in file containing app names and IDs
apple_app_df = pd.read_excel('app_info.xlsx', sheet_name='apple')
print(f"""
Printing first few rows of app's info in the csv file:
------------------------------------------------------
{apple_app_df.head()}
""")

## Get list of app names and app IDs
apple_app_names = list(apple_app_df['app_name'])
apple_app_ids = list(apple_app_df['iOS_app_id'])

Printing first few rows of app's info in the csv file:
------------------------------------------------------
    app_name                 iOS_app_name  iOS_app_id
0       Syfe           syfe-invest-better  1497156434
1    Endowus  endowus-invest-cpf-srs-cash  1531067679
2  StashAway    stashaway-invest-and-save  1229966330

## Set up App Store Scraper
scraper = AppStoreScraper()
apple_app_store_list = list(scraper.get_multiple_app_details(apple_app_ids))

https://itunes.apple.com/lookup?id=1497156434&country=nl&entity=software
https://itunes.apple.com/lookup?id=1531067679&country=nl&entity=software
https://itunes.apple.com/lookup?id=1229966330&country=nl&entity=software

# Converting list into dataframe
apple_app_info_df = pd.DataFrame(apple_app_store_list)

Given that there are no user rating counts, we can ignore iTunes ratings in our analysis.

Scraping App Reviews from the Apple App Store

# Empty list for storing reviews
apple_app_reviews = []

## Set up loop to go through all apps
for app_name, app_id in zip(apple_app_names, apple_app_ids):
    
    # Get start time
    start = dt.datetime.now(tz=get_localzone())
    fmt= "%m/%d/%y - %T %p"
    
    # Print starting output for app
    print('---'*20)
    print('---'*20)    
    print(f'***** {app_name} started at {start.strftime(fmt)}')
    print()
    
    # Instantiate AppStore for app
    app_ = AppStore(country='sg', app_name=app_name, app_id=app_id)
    
    # Scrape reviews posted since February 28, 2020 and limit to 10,000 reviews
    app_.review(how_many=10000,
                after=dt.datetime(2020, 2, 28),
                sleep=random.randint(20,25))
    
    # Use a name other than `reviews` so the google_play_scraper import is not shadowed
    review_list = app_.reviews
    
    # Add keys to store information about which app each review is for
    for rvw in review_list:
        rvw['app_name'] = app_name
        rvw['app_id'] = app_id
    
    # Print update that scraping was completed
    print(f"""Done scraping {app_name}. 
    Scraped a total of {app_.reviews_count} reviews.\n""")

    # Convert list of dicts to Pandas DataFrame
    review_df = pd.DataFrame(review_list)
    apple_app_reviews.append(review_df)
    
    # Get end time
    end = dt.datetime.now(tz=get_localzone())
    
    # Print ending output for app
    print(f"""Successfully wrote {app_name} reviews to df
    at {end.strftime(fmt)}.\n""")
    print(f'Time elapsed for {app_name}: {end-start}')
    print('---'*20)
    print('---'*20)
    print('\n')
    
    # Wait 5 to 10 seconds to start scraping next app
    time.sleep(random.randint(5,10))

------------------------------------------------------------
------------------------------------------------------------
***** Syfe started at 10/03/21 - 16:20:08 PM



2021-10-03 16:20:09,532 [INFO] Base - Initialised: AppStore('sg', 'syfe', 1497156434)
2021-10-03 16:20:09,534 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/sg/app/syfe/id1497156434
2021-10-03 16:20:30,811 [INFO] Base - [id:1497156434] Fetched 20 reviews (20 fetched in total)
2021-10-03 16:21:13,458 [INFO] Base - [id:1497156434] Fetched 57 reviews (57 fetched in total)
2021-10-03 16:21:13,755 [INFO] Base - [id:1497156434] Fetched 67 reviews (67 fetched in total)


Done scraping Syfe. 
    Scraped a total of 67 reviews.

Successfully wrote Syfe reviews to df
    at 10/03/21 - 16:21:13 PM.

Time elapsed for Syfe: 0:01:05.326312
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** Endowus started at 10/03/21 - 16:21:21 PM



2021-10-03 16:21:23,187 [INFO] Base - Initialised: AppStore('sg', 'endowus', 1531067679)
2021-10-03 16:21:23,188 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/sg/app/endowus/id1531067679
2021-10-03 16:21:44,489 [INFO] Base - [id:1531067679] Fetched 20 reviews (20 fetched in total)
2021-10-03 16:22:27,085 [INFO] Base - [id:1531067679] Fetched 60 reviews (60 fetched in total)
2021-10-03 16:22:27,500 [INFO] Base - [id:1531067679] Fetched 74 reviews (74 fetched in total)


Done scraping Endowus. 
    Scraped a total of 74 reviews.

Successfully wrote Endowus reviews to df
    at 10/03/21 - 16:22:27 PM.

Time elapsed for Endowus: 0:01:05.738188
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** StashAway started at 10/03/21 - 16:22:35 PM



2021-10-03 16:22:36,806 [INFO] Base - Initialised: AppStore('sg', 'stashaway', 1229966330)
2021-10-03 16:22:36,807 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/sg/app/stashaway/id1229966330
2021-10-03 16:22:59,209 [INFO] Base - [id:1229966330] Fetched 17 reviews (17 fetched in total)
2021-10-03 16:23:43,855 [INFO] Base - [id:1229966330] Fetched 50 reviews (50 fetched in total)
2021-10-03 16:24:28,497 [INFO] Base - [id:1229966330] Fetched 82 reviews (82 fetched in total)
2021-10-03 16:25:13,169 [INFO] Base - [id:1229966330] Fetched 119 reviews (119 fetched in total)
2021-10-03 16:25:57,920 [INFO] Base - [id:1229966330] Fetched 148 reviews (148 fetched in total)
2021-10-03 16:26:42,669 [INFO] Base - [id:1229966330] Fetched 180 reviews (180 fetched in total)
2021-10-03 16:27:27,424 [INFO] Base - [id:1229966330] Fetched 213 reviews (213 fetched in total)
2021-10-03 16:28:12,182 [INFO] Base - [id:1229966330] Fetched 244 reviews (244 fetched in total)
2021-10-03 16:28:56,825 [INFO] Base - [id:1229966330] Fetched 273 reviews (273 fetched in total)
2021-10-03 16:29:41,572 [INFO] Base - [id:1229966330] Fetched 309 reviews (309 fetched in total)
2021-10-03 16:30:26,278 [INFO] Base - [id:1229966330] Fetched 337 reviews (337 fetched in total)
2021-10-03 16:31:10,974 [INFO] Base - [id:1229966330] Fetched 366 reviews (366 fetched in total)
2021-10-03 16:31:55,623 [INFO] Base - [id:1229966330] Fetched 391 reviews (391 fetched in total)
2021-10-03 16:31:55,944 [INFO] Base - [id:1229966330] Fetched 391 reviews (391 fetched in total)


Done scraping StashAway. 
    Scraped a total of 391 reviews.

Successfully wrote StashAway reviews to df
    at 10/03/21 - 16:31:55 PM.

Time elapsed for StashAway: 0:09:20.442317
------------------------------------------------------------
------------------------------------------------------------

# Concatenate the list of per-app DataFrames into a single DataFrame
apple_reviews = pd.concat(apple_app_reviews)

Part 2 - Scrape reviews from the Google Play Store

## Extracting data and relevant app names + Ids
google_app_df = pd.read_excel('app_info.xlsx',sheet_name='google')
print(google_app_df.head())

## Get list of app names and app IDs
google_app_names = list(google_app_df['app_name'])
google_app_ids = list(google_app_df['app_id'])

    app_name                 app_id
     Syfe               com.syfe
  Endowus  com.endowus.mobileapp
StashAway      com.awp.stashaway

## Loop through app IDs to get app info
google_app_info = []
for i in google_app_ids:
    info = app(i)
    del info['comments']
    google_app_info.append(info)

## Pretty print the data for the first app
pprint(google_app_info[0])

google_app_infos = pd.DataFrame(google_app_info)
# app_infos_df.to_csv('apps.csv', index=None, header=True)
# google_app_infos

{'adSupported': None,
 'androidVersion': None,
 'androidVersionText': None,
 'appId': 'com.syfe',
 'containsAds': False,
 'contentRating': None,
 'contentRatingDescription': None,
 'currency': None,
 'description': None,
 'descriptionHTML': None,
 'developer': None,
 'developerAddress': None,
 'developerEmail': None,
 'developerId': None,
 'developerInternalID': None,
 'developerWebsite': None,
 'editorsChoice': False,
 'free': None,
 'genre': None,
 'genreId': None,
 'headerImage': None,
 'histogram': [0, 0, 0, 0, 0],
 'icon': None,
 'inAppProductPrice': None,
 'installs': None,
 'minInstalls': None,
 'offersIAP': False,
 'originalPrice': None,
 'price': None,
 'privacyPolicy': None,
 'ratings': None,
 'recentChanges': None,
 'recentChangesHTML': None,
 'released': None,
 'reviews': None,
 'sale': False,
 'saleText': None,
 'saleTime': None,
 'score': None,
 'screenshots': [],
 'size': None,
 'summary': None,
 'summaryHTML': None,
 'title': None,
 'updated': None,
 'url': 'https://play.google.com/store/apps/details?id=com.syfe&hl=en&gl=us',
 'version': [None,
             [[[[[None,
                  [None,
                   [[None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/_TcrYZaOKkM12SLSZyKWO4l_QgHSkhvXi1m0tm7OnwyxzAY3YrTUKYSpmhp5QM1gf-zF'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/_TcrYZaOKkM12SLSZyKWO4l_QgHSkhvXi1m0tm7OnwyxzAY3YrTUKYSpmhp5QM1gf-zF'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/_TcrYZaOKkM12SLSZyKWO4l_QgHSkhvXi1m0tm7OnwyxzAY3YrTUKYSpmhp5QM1gf-zF'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'SGX Mobile',
                  None,
                  [[['SGX',
                     [None,
                      None,
                      None,
                      None,
                      [None, None, '/store/apps/developer?id=SGX']],
                     True]],
                   [None,
                    [None,
                     [None,
                      'Live market data, news and company announcements of all '
                      'SGX-listed companies']]]],
                  [],
                  [[None, None, [None, ['4.1', 4.0926642]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None, None, '/store/apps/details?id=com.sgx.SGXandroid']],
                  None,
                  ['CAIaWAoaEhgKEmNvbS5zZ3guU0dYYW5kcm9pZBABGAMQADITCKDuzMPsrfMCFbqhSwUdxdUFmnITCNKM1PjrrfMCFfWESwUdtUoDVIoBDQgAEgkKBWVuLVVTEACqAl0aWwgAEhoKGAoSY29tLnNneC5TR1hhbmRyb2lkEAEYA0oTCKDuzMPsrfMCFbqhSwUdxdUFmpoBEwjSjNT4663zAhX1hEsFHbVKA1T6AQ8KDQgAEgkKBWVuLVVTEAA='],
                  ['com.sgx.SGXandroid', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/HJUz_In6O2_dQ-cLMju7pt9qq5xFXzp25xCr_P663EFr3f3C2rcraCvNtrIF9YrX5FxI'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/HJUz_In6O2_dQ-cLMju7pt9qq5xFXzp25xCr_P663EFr3f3C2rcraCvNtrIF9YrX5FxI'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/HJUz_In6O2_dQ-cLMju7pt9qq5xFXzp25xCr_P663EFr3f3C2rcraCvNtrIF9YrX5FxI'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'UOBAM Invest',
                  None,
                  [[['UOB Asset Management Ltd',
                     [None,
                      None,
                      None,
                      None,
                      [None,
                       None,
                       '/store/apps/developer?id=UOB+Asset+Management+Ltd']],
                     True]],
                   [None,
                    [None,
                     [None,
                      'UOBAM Invest is your personal robo-adviser to help you '
                      'build your future wealth.']]]],
                  [],
                  [[None, None, [None, ['3.8', 3.8282828]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None,
                    None,
                    '/store/apps/details?id=com.uobam.uobaminvest']],
                  None,
                  ['CAIaWwodEhsKFWNvbS51b2JhbS51b2JhbWludmVzdBABGAMQATITCKDuzMPsrfMCFbqhSwUdxdUFmnITCNKM1PjrrfMCFfWESwUdtUoDVIoBDQgAEgkKBWVuLVNHEACqAmAaXggBEh0KGwoVY29tLnVvYmFtLnVvYmFtaW52ZXN0EAEYA0oTCKDuzMPsrfMCFbqhSwUdxdUFmpoBEwjSjNT4663zAhX1hEsFHbVKA1T6AQ8KDQgAEgkKBWVuLVNHEAA='],
                  ['com.uobam.uobaminvest', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/BxJeLxjKGNka1wdqF8SF5hXq3gRbDYBDDSJN14T4QwvtsKhqgVgUT4ms9yvtt-O1QPEU'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/BxJeLxjKGNka1wdqF8SF5hXq3gRbDYBDDSJN14T4QwvtsKhqgVgUT4ms9yvtt-O1QPEU'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/BxJeLxjKGNka1wdqF8SF5hXq3gRbDYBDDSJN14T4QwvtsKhqgVgUT4ms9yvtt-O1QPEU'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'StashAway: Invest and save',
                  None,
                  [[['Asia Wealth Platform Pte Ltd',
                     [None,
                      None,
                      None,
                      None,
                      [None,
                       None,
                       '/store/apps/developer?id=Asia+Wealth+Platform+Pte+Ltd']],
                     True]],
                   [None, [None, [None, 'Personal finance and investing']]]],
                  [],
                  [[None, None, [None, ['4.1', 4.1464176]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None, None, '/store/apps/details?id=com.awp.stashaway']],
                  None,
                  ['CAIaVwoZEhcKEWNvbS5hd3Auc3Rhc2hhd2F5EAEYAxACMhMIoO7Mw+yt8wIVuqFLBR3F1QWachMI0ozU+Out8wIV9YRLBR21SgNUigENCAASCQoFZW4tVVMQAKoCXBpaCAISGQoXChFjb20uYXdwLnN0YXNoYXdheRABGANKEwig7szD7K3zAhW6oUsFHcXVBZqaARMI0ozU+Out8wIV9YRLBR21SgNU+gEPCg0IABIJCgVlbi1VUxAA'],
                  ['com.awp.stashaway', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/8606sVmXHRX7HJTtoS8jSuVS7HVl4BXt-SLqVo7tKNEw4dDMP27KvEcd3d2NXH3hkpE'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/8606sVmXHRX7HJTtoS8jSuVS7HVl4BXt-SLqVo7tKNEw4dDMP27KvEcd3d2NXH3hkpE'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/8606sVmXHRX7HJTtoS8jSuVS7HVl4BXt-SLqVo7tKNEw4dDMP27KvEcd3d2NXH3hkpE'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'Tiger Trade-Global Invest&Save',
                  None,
                  [[['TIGER BROKERS',
                     [None,
                      None,
                      None,
                      None,
                      [None, None, '/store/apps/developer?id=TIGER+BROKERS']],
                     True]],
                   [None,
                    [None, [None, 'ETF,Options,Futures&Free Quote']]]],
                  [],
                  [[None, None, [None, ['4.5', 4.451025]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None,
                    None,
                    '/store/apps/details?id=com.tigerbrokers.stock']],
                  None,
                  ['CAIaXAoeEhwKFmNvbS50aWdlcmJyb2tlcnMuc3RvY2sQARgDEAMyEwig7szD7K3zAhW6oUsFHcXVBZpyEwjSjNT4663zAhX1hEsFHbVKA1SKAQ0IABIJCgVlbi1VUxAAqgJhGl8IAxIeChwKFmNvbS50aWdlcmJyb2tlcnMuc3RvY2sQARgDShMIoO7Mw+yt8wIVuqFLBR3F1QWamgETCNKM1PjrrfMCFfWESwUdtUoDVPoBDwoNCAASCQoFZW4tVVMQAA=='],
                  ['com.tigerbrokers.stock', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/pHMOIRJ21PkHLMdk1yjQJPsVnyx-CKgdtjd3VOnGb1JY7inJECHe_o7hFljJa8wcHlA']],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/pHMOIRJ21PkHLMdk1yjQJPsVnyx-CKgdtjd3VOnGb1JY7inJECHe_o7hFljJa8wcHlA']],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/pHMOIRJ21PkHLMdk1yjQJPsVnyx-CKgdtjd3VOnGb1JY7inJECHe_o7hFljJa8wcHlA']]],
                   2,
                   2],
                  'Wahed Invest',
                  None,
                  [[['Wahed Inc.',
                     [None,
                      None,
                      None,
                      None,
                      [None, None, '/store/apps/developer?id=Wahed+Inc.']],
                     True]],
                   [None, [None, [None, 'Ethical Investing Made Simple']]]],
                  [],
                  [[None, None, [None, ['3.7', 3.6796117]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None, None, '/store/apps/details?id=com.wahed.mobile']],
                  None,
                  ['CAIaVgoYEhYKEGNvbS53YWhlZC5tb2JpbGUQARgDEAQyEwig7szD7K3zAhW6oUsFHcXVBZpyEwjSjNT4663zAhX1hEsFHbVKA1SKAQ0IABIJCgVlbi1VUxAAqgJbGlkIBBIYChYKEGNvbS53YWhlZC5tb2JpbGUQARgDShMIoO7Mw+yt8wIVuqFLBR3F1QWamgETCNKM1PjrrfMCFfWESwUdtUoDVPoBDwoNCAASCQoFZW4tVVMQAA=='],
                  ['com.wahed.mobile', 7]]],
                'Similar',
                None,
                [None,
                 None,
                 None,
                 None,
                 [None,
                  None,
                  '/store/apps/collection/cluster?clp=ogoWCBEqAggIMg4KCGNvbS5zeWZlEAEYAw%3D%3D:S:ANO1ljIao8I&gsr=ChmiChYIESoCCAgyDgoIY29tLnN5ZmUQARgD:S:ANO1ljJbwVo']],
                True,
                2,
                None,
                [None,
                 'CjWC0_-4Ay8KJvqegZ0DIAgGEKHj-8YKEIqQ4PEDEOLnka0JEMrhhc4KEL_S040MEI-SyKrELxAFGhmiChYIESoCCAgyDgoIY29tLnN5ZmUQARgD'],
                True],
               None,
               None,
               ['CBSqARUKEwiw4c7D7K3zAhW6oUsFHcXVBZo=']]],
             None,
             [],
             True],
 'video': None,
 'videoImage': None}

Scraping Google Play Store reviews

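# for scraping app reviews from Google Play Store
from google_play_scraper import app, Sort, reviews

# Also used in the loop below (skip if already imported earlier in the notebook)
import datetime as dt
import random
import time
from tzlocal import get_localzone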

# Empty list for storing reviews
google_app_reviews = []

## Loop through apps to get reviews
for app_name, app_id in zip(google_app_names, google_app_ids):
    
    # Get start time
    start = dt.datetime.now(tz=get_localzone())
    fmt= "%m/%d/%y - %T %p"    
    
    # Print starting output for app
    print('---'*20)
    print('---'*20)    
    print(f'***** {app_name} started at {start.strftime(fmt)}')
    print()
    
    # Number of reviews to scrape per batch
    count = 200
    
    # To keep track of how many batches have been completed
    batch_num = 0
     
    # Retrieve reviews (and continuation_token) with reviews function
    rvws, token = reviews(
        app_id,           # found in app's url
        lang='en',        # defaults to 'en'
        country='us',     # defaults to 'us'
        sort=Sort.NEWEST, # start with most recent
        count=count       # batch size
    )
    
    # For each review obtained
    for r in rvws:
        r['app_name'] = app_name # add key for app's name
        r['app_id'] = app_id     # add key for app's id
     
    
    # Add the list of review dicts to overall list
    google_app_reviews.extend(rvws)
    
    # Increase batch count by one
    batch_num +=1 
    print(f'Batch {batch_num} completed.')
    
    # Wait 1 to 5 seconds to start next batch
    time.sleep(random.randint(1,5))
    
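    # Collect the IDs of every review gathered so far. Note that
    # google_app_reviews spans all apps scraped up to this point, so the
    # "unique reviews" total printed below is cumulative across apps
    pre_review_ids = [rvw['reviewId'] for rvw in google_app_reviews]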
    
    # Loop through at most max number of batches
    for batch in range(4999):
        rvws, token = reviews( # store continuation_token
            app_id,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            count=count,
            # using token obtained from previous batch
            continuation_token=token
        )
        
        # Append unique review IDs from current batch to new list
        new_review_ids = []
        for r in rvws:
            new_review_ids.append(r['reviewId'])
            
            # And add keys for name and id to each review dict
            r['app_name'] = app_name # add key for app's name
            r['app_id'] = app_id     # add key for app's id
     
        # Add the list of review dicts to main app_reviews list
        google_app_reviews.extend(rvws)
        
        # Increase batch count by one
        batch_num +=1
        
        # Break loop and stop scraping for current app if most recent batch
          # did not add any unique reviews
        all_review_ids = pre_review_ids + new_review_ids
        if len(set(pre_review_ids)) == len(set(all_review_ids)):
            print(f'No reviews left to scrape. Completed {batch_num} batches.\n')
            break
        
        # all_review_ids becomes pre_review_ids to check against 
          # for next batch
        pre_review_ids = all_review_ids
        
        # Wait 1 to 5 seconds to start next batch
        time.sleep(random.randint(1,5))
      
    
    # Print update when max number of batches has been reached
      # OR when last batch didn't add any unique reviews
    print(f'Done scraping {app_name}.')
    print(f'Scraped a total of {len(set(pre_review_ids))} unique reviews.\n')
    
    # Get end time
    end = dt.datetime.now(tz=get_localzone())
    
    # Print ending output for app
    print(f"""
    Successfully inserted all {app_name} reviews into collection
    at {end.strftime(fmt)}.\n
    """)
    print(f'Time elapsed for {app_name}: {end-start}')
    print('---'*20)
    print('---'*20)
    print('\n')
    
    # Wait 1 to 5 seconds to start scraping next app
    time.sleep(random.randint(1,5))

------------------------------------------------------------
------------------------------------------------------------
***** Syfe started at 09/04/21 - 23:33:25 PM

Batch 1 completed.
No reviews left to scrape. Completed 2 batches.

Done scraping Syfe.
Scraped a total of 110 unique reviews.


    Successfully inserted all Syfe reviews into collection
    at 09/04/21 - 23:33:31 PM.

    
Time elapsed for Syfe: 0:00:05.276724
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** Endowus started at 09/04/21 - 23:33:32 PM

Batch 1 completed.
No reviews left to scrape. Completed 2 batches.

Done scraping Endowus.
Scraped a total of 250 unique reviews.


    Successfully inserted all Endowus reviews into collection
    at 09/04/21 - 23:33:34 PM.

    
Time elapsed for Endowus: 0:00:02.298619
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** StashAway started at 09/04/21 - 23:33:38 PM

Batch 1 completed.
No reviews left to scrape. Completed 8 batches.

Done scraping StashAway.
Scraped a total of 1515 unique reviews.


    Successfully inserted all StashAway reviews into collection
    at 09/04/21 - 23:34:00 PM.

    
Time elapsed for StashAway: 0:00:22.199429
------------------------------------------------------------
------------------------------------------------------------
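
The inner loop above stops once a batch contributes no review IDs that have not been seen before. Below is a minimal sketch of that stopping rule in isolation. `scrape_until_exhausted`, `fetch_batch` and `fake_fetch` are hypothetical names standing in for the `reviews` call; they are not part of google_play_scraper. Unlike the loop above, this version tracks IDs per call and skips duplicates instead of appending them, so its count is per app rather than cumulative.

```python
def scrape_until_exhausted(fetch_batch, max_batches=5000):
    """Collect reviews batch by batch until a batch adds no unseen review IDs.

    fetch_batch is a hypothetical callable: it takes a continuation token
    (None for the first call) and returns (list_of_review_dicts, next_token).
    """
    seen_ids = set()
    collected = []
    token = None
    for _ in range(max_batches):
        batch, token = fetch_batch(token)
        # Keep only reviews whose ID has not been collected yet
        fresh = [r for r in batch if r["reviewId"] not in seen_ids]
        if not fresh:
            break  # nothing new in this batch, so stop
        seen_ids.update(r["reviewId"] for r in fresh)
        collected.extend(fresh)
    return collected


# Toy stand-in for the scraper: two pages of data, then the last page repeats
pages = [
    [{"reviewId": "a"}, {"reviewId": "b"}],
    [{"reviewId": "c"}],
]

def fake_fetch(token):
    index = 0 if token is None else token
    page = pages[min(index, len(pages) - 1)]
    return page, index + 1

toy_reviews = scrape_until_exhausted(fake_fetch)
print(len(toy_reviews))
```

Using a set for the seen IDs keeps each membership check cheap, which matters more as the number of batches grows.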

# Converting output to dataframe
google_reviews = pd.DataFrame(google_app_reviews)

Part 3: Combining both Apple and Google Store reviews

print(f'Apple Store: {np.shape(apple_reviews)[0]} rows and {np.shape(apple_reviews)[1]} columns.')
print(f'Google Play Store: {np.shape(google_reviews)[0]} rows and {np.shape(google_reviews)[1]} columns.')

Apple Store: 524 rows and 9 columns.
Google Play Store: 1515 rows and 12 columns.

# Select app_name and review from the Apple reviews and rename review to content
# (rename returns a new dataframe, which avoids pandas' SettingWithCopyWarning)
new_apple = apple_reviews[['app_name','review']].rename(columns={'review':'content'})
new_google = google_reviews[['app_name','content']] # subset app_name and content from google reviews into new df
total_reviews = pd.concat([new_apple,new_google]) # Concat both dfs into one

# saving reviews into csv file
total_reviews.to_csv('app_reviews.csv', index=False, header=True)

Predicting Housing Prices in Melbourne

2021-01-15T00:00:00+00:00

Predicting housing prices in Melbourne through regression analysis. This notebook walks through the full workflow: data cleaning, exploration, and modelling with linear and multiple regression. I also apply feature selection techniques (correlation and mutual information) and evaluate model performance using MAE, MSE, RMSE, and R². This notebook is adapted from Price Analysis and Linear Regression on Kaggle.

Melbourne Housing Market

Housing clearance data from Jan 2016

When did the Melbourne housing market cool off?
Could you see it slowing down? What were the variables that showed the slowing down (was it overall price, amount sold vs unsold, change in more rentals sold and less housing, changes in which CouncilArea or Region, more houses sold in distances further away from Melbourne CBD and less closer)?
Could you have predicted it?
Should I hold off even longer in buying a two bedroom apartment in Northcote??

Some Key Details

Suburb: Suburb

Address: Address

Rooms: Number of rooms

Price: Price in Australian dollars

Method:

S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

Type:

br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from CBD in Kilometres

Regionname: General Region (West, North West, North, North east …etc)

Propertycount: Number of properties that exist in the suburb.

Bedroom2 : Scraped # of Bedrooms (from different source)

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land size in square metres

BuildingArea: Building size in square metres

YearBuilt: Year the house was built

CouncilArea: Governing council for the area

Lattitude: Self-explanatory

Longtitude: Self-explanatory

## Import libraries

# Data wrangling
import pandas as pd
import numpy as np
from datetime import date # Usage: Determine days from start

# Data Visualisations
%matplotlib inline
import matplotlib.pyplot as plt
import pylab as pl
import seaborn as sns

# Model Development and Evaluation
from sklearn.model_selection import train_test_split # For Model Development
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Reading source files

# df_houseprice = pd.read_csv("data/MELBOURNE_HOUSE_PRICES_LESS.csv")
df_housingfull= pd.read_csv("data/Melbourne_housing_FULL.csv")

Linear Regression

Data Cleaning

Convert arguments in Date column to datetime
Filter out data that are not housing types

I will only be focusing on housing data.

# Data Cleaning
df_housingfull = df_housingfull.rename(columns={'Lattitude':'Latitude'}) # Fix the misspelled column name

# Remove irrelevant columns
df_housingfull = df_housingfull.drop(['Suburb', 'Address', 'SellerG','Regionname', 'CouncilArea'],axis=1)

# Convert date column to datetime
df_housingfull['Date'] = pd.to_datetime(df_housingfull['Date'],dayfirst=True)
print("There are {} rows and {} columns in this dataframe" .format(df_housingfull.shape[0],df_housingfull.shape[1]))

# Create new dataframe with only housing data
df = df_housingfull[df_housingfull['Type']=='h']
print("After filtering data that are not housing types, there are {} rows and {} columns in this new dataframe" .format(df.shape[0],df.shape[1]))

There are 34857 rows and 16 columns in this dataframe
After filtering data that are not housing types, there are 23980 rows and 16 columns in this new dataframe

Data Exploration using Visualisations

Histogram plot for each variable
Pair plots
Observe average price change per quarter over the years
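
The quarterly average mentioned in the last item can be sketched as follows. This is only an illustration on made-up values; `toy_sales` and `quarterly_mean` are hypothetical names, and the plots below actually average per sale date rather than per quarter.

```python
import pandas as pd

# Toy sale records (made-up values, not taken from the Melbourne dataset)
toy_sales = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-14", "2017-02-25", "2017-04-08", "2017-06-17"]),
    "Price": [1_000_000, 1_200_000, 900_000, 1_100_000],
})

# Group by calendar quarter and average the sale price within each quarter
quarterly_mean = toy_sales.groupby(toy_sales["Date"].dt.to_period("Q"))["Price"].mean()
print(quarterly_mean)
```

Averaging per quarter smooths out the noise of individual auction dates, which should make a slowdown easier to spot than in a per-date scatterplot.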

# Plot Relationships between price and features
sns.set_style( 'darkgrid' )
fig, axes = plt.subplots(3,2,figsize=[20,20])

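# Plot 1: Scatterplot of Average Price against Date
# numeric_only=True skips the text columns (Type, Method), which newer pandas versions refuse to average
mean_df = df.sort_values('Date',ascending=False).groupby('Date').mean(numeric_only=True).reset_index()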
axes[0,0].scatter(x='Date',y='Price',data=mean_df,edgecolor='b' )
axes[0,0].set_xlabel( 'Date' )
axes[0,0].set_ylabel( 'Price' )
axes[0,0].set_title( 'Price vs Date')

# Plot 2: Diagonal Correlation Matrix
# Compute the correlation matrix on the numeric columns only
corr = df.select_dtypes(include='number').corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},ax=axes[0,1])
axes[0,1].set_title('Correlation Matrix')

# Plot 3: Boxplot of Price against number of Bathrooms
sns.boxplot(x='Bathroom',y='Price',data=df ,ax=axes[1,0] )
axes[1,0].set_xlabel( 'Bathroom' )
axes[1,0].set_ylabel( 'Price' )
axes[1,0].set_title( 'Price vs Bathroom')

# Plot 4: Boxplot of Price against number of Bedrooms
sns.boxplot(x='Bedroom2',y='Price',data=df ,ax=axes[1,1] )
axes[1,1].set_xlabel( 'Bedroom' )
axes[1,1].set_ylabel( 'Price' )
axes[1,1].set_title( 'Price vs Bedroom')

# Plot 5: Regression plot of Average Distance against Average Price (averaged per sale date)
sns.regplot(x='Distance',y='Price',data=mean_df,scatter_kws={"color": "black"}, line_kws={"color": "red"},ax=axes[2,0])
axes[2,0].set_xlabel('Average Distance')
axes[2,0].set_ylabel('Average Price')
axes[2,0].set_title('Average Price vs Average Distance')

# Plot 6: Regression plot of Distance against Price
sns.regplot(x='Distance',y='Price',data=df,scatter_kws={"color": "black"}, line_kws={"color": "red"},ax=axes[2,1])
axes[2,1].set_xlabel('Distance')
axes[2,1].set_ylabel('Price')
axes[2,1].set_title('Price vs Distance')

These visualisations can help to answer the first 2 questions:

Housing prices in Melbourne appear to begin cooling off sometime between April and July 2017.
Based on the correlation matrix, the three features most strongly related to price are the number of bathrooms, the number of bedrooms and the distance (in kilometres) from the CBD. I plotted boxplots to visualise how price varies with the number of bedrooms and bathrooms. The boxplot for the number of bedrooms indicates that there is quite a lot of variability. For distance, I used a regression plot to see how price varies. The plot shows a negative relationship between the two, which is logical since housing near the CBD is usually priced higher than housing in the outer regions.

Linear Regression Model with all Features

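In this part, I evaluate a linear regression model that uses all the available features. The data is split into training and test sets with a 2:1 ratio. The coefficient of each predictor is then ranked. In the run shown below, longitude, number of bathrooms and number of rooms have the largest positive coefficients, while latitude has the largest coefficient in absolute terms. Since the features are not standardised, a coefficient's size reflects the feature's units as much as its importance, so this ranking should not be read as a ranking of significance.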

## Further data cleanup
# Remove missing values
df1 = df.dropna().sort_values('Date')

###########
## Find out days since start (vectorised: subtract the earliest date from every date)
df1['Days'] = (df1['Date'] - df1['Date'].min()).dt.days

# Convert Categorical Variables to dummy/indicator variables
df2_dummies = pd.get_dummies(df1[['Type','Method']])
df2 = df1.drop(['Type','Date','Method'],axis=1).join(df2_dummies)

# Determine x (independent variables or predictor variables) and y (dependent variables) 
y = df2['Price'] # Price being the dependent variable
x = df2.drop(['Price'],axis=1) # Remove price from the independent variables

# Split into training and test set
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.33)

# Fit the model
model = LinearRegression()
model.fit(x_train,y_train)

# Predict on the test set
ypredictions = model.predict(x_test)

# Ranking the coefficients
coeff_df = pd.DataFrame(model.coef_,x.columns,columns=['Coefficient'])
ranked_coeff = coeff_df.sort_values("Coefficient", ascending = False)
print(ranked_coeff)

                Coefficient
Longtitude     5.404310e+05
Bathroom       2.050054e+05
Rooms          8.073899e+04
Car            5.194845e+04
Method_VB      4.878232e+04
Method_S       3.758929e+04
Bedroom2       3.111879e+04
BuildingArea   1.683848e+03
Method_PI      1.355322e+03
Postcode       1.044804e+03
Days           1.491648e+02
Landsize       6.700705e+01
Propertycount  1.272204e+00
Type_h         1.164153e-10
YearBuilt     -3.213159e+03
Method_SP     -3.698939e+04
Method_SA     -5.073755e+04
Distance      -5.161010e+04
Latitude      -1.537221e+06
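
The coefficients above are expressed per unit of each feature (dollars per degree, per room, per day and so on), so their sizes are not directly comparable. Here is a minimal sketch, on synthetic data rather than the housing data, of how standardising the features first puts the coefficients on a common scale. All variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Two synthetic features on very different scales (illustrative only)
small_scale = rng.normal(0, 0.05, n)   # e.g. something measured in degrees
large_scale = rng.normal(0, 200, n)    # e.g. something measured in square metres
X = np.column_stack([small_scale, large_scale])
# Both features contribute about equally to the variation in the target
y = 1000 * small_scale + 0.25 * large_scale + rng.normal(0, 1, n)

# Coefficients on the raw features: dominated by the small-scale feature
raw_coefs = LinearRegression().fit(X, y).coef_
# Coefficients after standardising: change in y per standard deviation of each feature
scaled_coefs = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_

print("raw:", raw_coefs)
print("scaled:", scaled_coefs)
```

On the raw features the first coefficient is far larger even though both features matter about equally. After standardising, the two coefficients come out at a similar size.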

Scatter Plot of Actual vs Predicted

fig_lm,axes_lm = plt.subplots(1,1,figsize=[15,10]) # Create a custom size figure

# # ax1 = fig_lm.add_subplot() # Add subplot
sns.regplot(x=ypredictions,y=y_test,line_kws={"color":"red"},ax=axes_lm)
axes_lm.set_xlabel("Predicted") # Add x label
axes_lm.set_ylabel("Observed") # Add y label
axes_lm.set_title("Observed vs Predicted")

Distribution plot: difference in actual price and predicted price

sns.displot(data=(y_test-ypredictions),bins=50)

Evaluating the Raw Linear Regression model

print("------Evaluated predictions for a raw Linear Regression Model------")
print("MAE: ", metrics.mean_absolute_error(y_test,ypredictions))
print("MSE: ", metrics.mean_squared_error(y_test,ypredictions))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test,ypredictions)))
print("R^2 ", metrics.r2_score(y_test,ypredictions))

------Evaluated predictions for a raw Linear Regression Model------
MAE:  303135.5223904289
MSE:  211634647505.68866
RMSE:  460037.6587907654
R^2  0.5857898755940139
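
For reference, the four metrics printed above can be reproduced by hand. The sketch below uses made-up observed and predicted prices (not values from the dataset) and checks the manual formulas against `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Toy observed and predicted prices (made-up values)
y_true = np.array([500_000, 750_000, 1_000_000, 1_250_000], dtype=float)
y_pred = np.array([550_000, 700_000, 1_100_000, 1_150_000], dtype=float)

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the same units as price
# R^2: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The manual values agree with scikit-learn's implementations
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_pred))
assert np.isclose(r2, metrics.r2_score(y_true, y_pred))
print(mae, mse, rmse, r2)
```

MAE and RMSE are both in dollars, which makes them easier to interpret than MSE. RMSE penalises large misses more heavily than MAE does.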

Multiple Regression

Feature Selection

from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.pipeline import Pipeline
from numpy import mean
from numpy import std
from matplotlib import pyplot
from sklearn.model_selection import GridSearchCV

Mutual Information Statistics

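This approach uses mutual information to determine which variables are the most relevant. Mutual information measures how much knowing one variable reduces the uncertainty about another, and unlike Pearson's correlation it can also capture non-linear relationships.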

# Create a function that can implement feature selection for the input training and test data
def select_features_mis(X_train,Y_train, X_test):
    # Keep the 16 highest-scoring features (k=16), as ranked by mutual information
    features = SelectKBest(score_func=mutual_info_regression, k = 16)
    # Learn relationship from training data
    features.fit(X_train,Y_train)
    # Transform training data
    X_train_feats = features.transform(X_train)
    # Transform test data
    X_test_feats = features.transform(X_test)
    return X_train_feats,X_test_feats,features

# Running the regression model that applies feature selection (mutual information statistics)
# Feature selection
x_train_feats_mis, x_test_feats_mis, features_mis = select_features_mis(x_train,y_train,x_test)

# Scores for the features
for feature in range(len(features_mis.scores_)):
    print('Feature %d: %f' % (feature, features_mis.scores_[feature]))

# Fit the model
model_feats_mis = LinearRegression()
model_feats_mis.fit(x_train_feats_mis,y_train)

# Evaluate the model
ypredictions_feats_mis = model_feats_mis.predict(x_test_feats_mis)

Feature 0: 0.085276
Feature 1: 0.379063
Feature 2: 0.535257
Feature 3: 0.083888
Feature 4: 0.111936
Feature 5: 0.028387
Feature 6: 0.061553
Feature 7: 0.143656
Feature 8: 0.147641
Feature 9: 0.300453
Feature 10: 0.259006
Feature 11: 0.328668
Feature 12: 0.039872
Feature 13: 0.011723
Feature 14: 0.014065
Feature 15: 0.040096
Feature 16: 0.000000
Feature 17: 0.005040
Feature 18: 0.056839
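
The scores above are listed by column position, which makes them hard to read. Below is a sketch of one way to pair the scores with column names and to see which columns `SelectKBest` keeps. It runs on a small synthetic frame; `toy_X`, `toy_y` and `mi_score` are hypothetical names, and the wrapper only fixes `random_state` so that the scores are repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
n = 400
# Toy frame: column names are placeholders, values are synthetic
toy_X = pd.DataFrame({
    "informative": rng.uniform(-3, 3, n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
# The target depends non-linearly on the first column only
toy_y = toy_X["informative"] ** 2 + rng.normal(0, 0.1, n)

def mi_score(X, y):
    # Fix the estimator's random_state so repeated runs give the same scores
    return mutual_info_regression(X, y, random_state=0)

selector = SelectKBest(score_func=mi_score, k=2).fit(toy_X, toy_y)

# Label each score with its column name and rank from most to least informative
scores = pd.Series(selector.scores_, index=toy_X.columns).sort_values(ascending=False)
# get_support() returns a boolean mask of the columns that were kept
kept = list(toy_X.columns[selector.get_support()])
print(scores)
print("kept:", kept)
```

Applied to the notebook's data, the same pattern with `x_train.columns` and `features_mis` would show which three of the 19 features were dropped.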

Correlation Statistics

This approach uses correlation to determine which variables are the most relevant. The most common measure is Pearson's correlation, which captures linear relationships; scikit-learn's f_regression converts it into an F-statistic for each feature.

# Create a function that can implement feature selection for the input training and test data
def select_features_cs(X_train,Y_train, X_test):
    # Keep the 16 highest-scoring features (k=16), as ranked by the F-statistic
    features = SelectKBest(score_func=f_regression, k = 16)
    # Learn relationship from training data
    features.fit(X_train,Y_train)
    # Transform training data
    X_train_feats = features.transform(X_train)
    # Transform test data
    X_test_feats = features.transform(X_test)
    return X_train_feats,X_test_feats,features

# Running the regression model that applies feature selection (correlation statistics)
# Feature selection
x_train_feats_cs, x_test_feats_cs, features_cs = select_features_cs(x_train,y_train,x_test)

# Scores for the features
for feature in range(len(features_cs.scores_)):
    print('Feature %d: %f' % (feature, features_cs.scores_[feature]))

# Create model
model_feats_cs = LinearRegression()
# Fit the model
model_feats_cs.fit(x_train_feats_cs,y_train)
# Evaluate the model
ypredictions_feats_cs = model_feats_cs.predict(x_test_feats_cs)

Feature 0: 624.218533
Feature 1: 831.558520
Feature 2: 1.104028
Feature 3: 559.223771
Feature 4: 1039.871719
Feature 5: 52.724016
Feature 6: 6.962535
Feature 7: 918.872851
Feature 8: 354.376184
Feature 9: 356.552179
Feature 10: 228.442479
Feature 11: 11.192402
Feature 12: 75.345978
Feature 13: nan
Feature 14: 18.467231
Feature 15: 15.835652
Feature 16: 0.421705
Feature 17: 54.838265
Feature 18: 118.419765
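
Feature 13 scores nan. If the column order follows the code above, index 13 would be Type_h, which is constant because the data was filtered to houses only. A zero-variance column has an undefined correlation with the target. Here is a sketch, on a toy frame with made-up values, of how such columns could be dropped before scoring using scikit-learn's `VarianceThreshold`. This is a suggestion, not a step the notebook performs.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame mimicking the situation: one dummy column is constant
toy_X = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5],
    "Distance": [5.0, 12.5, 8.2, 20.1, 3.3],
    "Type_h": [1, 1, 1, 1, 1],   # constant after filtering to a single type
})

# threshold=0.0 (the default) removes features whose variance is zero
vt = VarianceThreshold(threshold=0.0).fit(toy_X)
kept_columns = list(toy_X.columns[vt.get_support()])
print(kept_columns)
```

Dropping constant columns up front also removes a dummy variable that carries no information for the regression.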

Visualising Regression Models

fig_lm,(axes_lm_mis,axes_lm_cs) = plt.subplots(1,2,figsize=[15,10]) # Create a custom size figure

# Creating plot for Mutual Information Statistics
sns.regplot(x=ypredictions_feats_mis,y=y_test,line_kws={"color":"red"},ax=axes_lm_mis)
axes_lm_mis.set_xlabel("Predicted") # Add x label
axes_lm_mis.set_ylabel("Observed") # Add y label
axes_lm_mis.set_title("Linear Regression: Mutual Information Statistics for Observed vs Predicted")

# Creating plot for Correlation Statistics
sns.regplot(x=ypredictions_feats_cs,y=y_test,line_kws={"color":"red"},ax=axes_lm_cs)
axes_lm_cs.set_xlabel("Predicted") # Add x label
axes_lm_cs.set_ylabel("Observed") # Add y label
axes_lm_cs.set_title("Linear Regression: Correlation Statistics for Observed vs Predicted")

Model Evaluation

print("------Evaluated predictions for a raw Linear Regression Model------")
print("MAE: ", metrics.mean_absolute_error(y_test,ypredictions))
print("MSE: ", metrics.mean_squared_error(y_test,ypredictions))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test,ypredictions)))
print("R^2: ", metrics.r2_score(y_test,ypredictions))

print("------Evaluated predictions for a Linear Regression Model with Correlation Statistics Feature Selection------")
print("MAE: ", metrics.mean_absolute_error(y_test,ypredictions_feats_cs))
print("MSE: ", metrics.mean_squared_error(y_test,ypredictions_feats_cs))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test,ypredictions_feats_cs)))
print("R^2: ", metrics.r2_score(y_test,ypredictions_feats_cs))

print("------Evaluated predictions for a Linear Regression Model with Mutual Information Statistics Feature Selection------")
print("MAE: ", metrics.mean_absolute_error(y_test,ypredictions_feats_mis))
print("MSE: ", metrics.mean_squared_error(y_test,ypredictions_feats_mis))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test,ypredictions_feats_mis)))
print("R^2: ", metrics.r2_score(y_test,ypredictions_feats_mis))

------Evaluated predictions for a raw Linear Regression Model------
MAE:  301077.07816341706
MSE:  235171170575.45062
RMSE:  484944.50257266616
R^2:  0.5415862801118618
------Evaluated predictions for a Linear Regression Model with Correlation Statistics Feature Selection------
MAE:  309316.00912912446
MSE:  246586130384.74255
RMSE:  496574.39561937
R^2:  0.5193353631489246
------Evaluated predictions for a Linear Regression Model with Mutual Information Statistics Feature Selection------
MAE:  301072.41752915125
MSE:  235167756102.42468
RMSE:  484940.9820817629
R^2:  0.541592935865105

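Comparing the two feature selection techniques, the metrics indicate that mutual information selection gives the more accurate model: it has a higher R^2 and lower error metrics (MAE, MSE and RMSE) than correlation-based selection. Its results are almost identical to those of the model using all features (R^2 of about 0.5416 in both cases), so dropping three features costs essentially nothing here. The raw model's figures differ from those reported earlier most likely because train_test_split was run without a fixed random_state, so each run produces a different split.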
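
The comparison above rests on a single train/test split and a fixed k = 16. RepeatedKFold, Pipeline and GridSearchCV are imported earlier but not used. One way they could be combined to choose k by cross-validation is sketched below, on a synthetic regression problem rather than the housing data. The step names `select` and `lr` and the grid of k values are my own assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline

# Synthetic regression problem standing in for the housing features
X_toy, y_toy = make_regression(n_samples=300, n_features=10, n_informative=4,
                               noise=10.0, random_state=1)

# Feature selection happens inside the pipeline, so it is refit on each training fold
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("lr", LinearRegression()),
])

# 5-fold cross-validation repeated twice, scored by (negated) mean absolute error
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
search = GridSearchCV(
    pipeline,
    param_grid={"select__k": list(range(1, 11))},
    scoring="neg_mean_absolute_error",
    cv=cv,
)
search.fit(X_toy, y_toy)

best_k = search.best_params_["select__k"]
best_mae = -search.best_score_
print(f"best k: {best_k}, cross-validated MAE: {best_mae:.2f}")
```

Putting the selector inside the pipeline keeps the held-out fold out of the feature scoring. Averaging over repeated folds makes the comparison less dependent on one particular split.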